Developing A Big Data Search Engine - Where we have gone. Where we are going: Presented by Mark Miller, Cloudera
1. October 13-16, 2016 • Austin, TX
2. Developing a Big Data Search Engine 2.0
Where we have gone, where we are going.
Mark Miller
Software Engineer, Cloudera
3. Who am I?
I’m Mark Miller
I’m a Lucene junkie (2006)
I’m a Lucene committer (2008)
And a Solr committer (2009)
And a member of the ASF (2011)
And a former Lucene PMC Chair (2014-2015)
I co-created SolrCloud (????)
4. A Quick Tour Through History
First there was Lucene.
It took a little while, but soon it was “good enough” to
replace most search engines. And faster. And more
efficient.
Lots of Search Engines built on Lucene (I made one!)
Then there was Solr.
And then there were Others.
6. What Search Engines Matter?
Lucene search engines lead the pack.
How can you tell?
I like to look at db-engines.com - its rankings reflect the skills devs
have, what users are talking about, and what employers are hiring
for.
Also, plenty of anecdotal evidence that others are using
Lucene for the core.
Classic Enterprise Search Engines matter a little bit.
8. Enterprise Search Engines
Oracle buys Endeca (2011 - $1.075B), Microsoft buys FAST
(2008 - $1.2B), HP buys Autonomy (2011 - $10.3B)
World Happiness Decreases.
The old leaders.
2006: Autonomy, FAST, Endeca tops in Gartner site search
study
The new leaders.
2015 Leaders: Coveo, HP, Sinequa, Attivio, Lexmark
You can bet any large company using one of these also uses a
Lucene-based solution.
9. But Search is a general tool and Open Source is the core of the Future and a virtuous cycle
Open source has become the default approach for
software with more than 66 percent of respondents
saying they consider OSS before other options.
(2015 BlackDuck)
It’s an open-source world: 78 percent of companies
run open-source software. (2015 BlackDuck)
Less than 3% DON’T USE OSS IN ANY WAY.
11. “It is hopeless to talk to both of you, you don't understand virtual memory.”
Uwe Schindler @thetaph1 @uwesays
12. What is the future of Search?
More NoSQL
More SQL
More Realtime Analytics
More System of Record
More Graph
More Scale
Search will eat away at the stack.
Search focuses on preprocessing and efficient in-memory
data structures for fast responses.
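As a rough illustration of "preprocess once, answer fast from memory", here is a minimal inverted-index sketch. It is not Lucene's implementation; the function names and the whitespace tokenizer are assumptions made for this example only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Preprocess documents into a term -> sorted list of doc ids mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted posting lists keep results deterministic.
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return the doc ids that contain every term in the query."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
    3: "Hadoop stores big data",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "lucene search"))
```

All the expensive work happens in build_inverted_index; a query only looks up and intersects small in-memory lists.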
13. The Solr Beginnings - Single node, then DIY distributed
Solr started as a single node solution, followed by master/
slave replication, followed by simple distributed search.
This was ‘good enough’ for a long time.
Classic ‘innovator’s dilemma’ problem.
Scaling out was super important, but not as soon as some
thought and sooner than others thought.
The challenge was, how do we evolve the system and can we
move the Solr user base to this evolution without disrupting
current users?
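The "simple distributed search" step can be pictured as scatter-gather: send the query to every shard, take each shard's top hits, and merge them. The sketch below is a toy under stated assumptions, not Solr's code: the shards are plain dictionaries, the score is a raw term count, and all names are hypothetical.

```python
import heapq


def search_shard(shard, term, k):
    """Return this shard's k best (score, doc_id) hits for a term."""
    hits = [(text.lower().split().count(term), doc_id)
            for doc_id, text in shard.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k):
    """Scatter the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, term, k))
    # Each shard already returned at most k hits, so the merge stays small.
    return heapq.nlargest(k, partial)


shards = [
    {"a1": "solr solr search", "a2": "hadoop"},
    {"b1": "solr", "b2": "solr solr solr"},
]
print(distributed_search(shards, "solr", 2))
```

In the DIY era the caller had to know the shard list itself; nothing here handles a shard that is down or slow, which is part of why this was only ‘good enough’.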
15. Solr Meets Hadoop
First Class Solr Integrations
HDFS
MapReduce
Spark
Flume
HBase
Sentry
Etc
SolrCloud was already built on ZooKeeper
16. Now it’s all about scale and correctness.
The search features for the big data world are here and
rapidly advancing.
The next step is being able to handle Hadoop scale in the
‘general’ case.
And to be able to handle that correctly ‘enough’ of the
time.
17. “In my opinion the whole code is a bug by itself.”
Uwe Schindler @thetaph1 @uwesays
18. The Call Me Maybe Tests
https://aphyr.com/tags/jepsen
Some basic testing around how systems live up to their CAP
promises. Heavy focus on partitions.
Most systems fail pretty badly. ZooKeeper rocked it.
SolrCloud did pretty darn well*.
Kyle
Kingsbury
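To give a feel for the kind of check such partition tests end with, here is a small sketch that compares what clients attempted and what the cluster acknowledged against a final read. It is not Jepsen's code; the function and the category names are assumptions made for this example.

```python
def check_history(attempted, acknowledged, final_read):
    """Classify writes after a partition test has healed and been read back."""
    attempted = set(attempted)
    acknowledged = set(acknowledged)
    final_read = set(final_read)
    # Acknowledged but missing: the system broke its promise.
    lost = acknowledged - final_read
    # Attempted, never acknowledged, yet present: acceptable, outcome was unknown.
    recovered = (final_read & attempted) - acknowledged
    # Present but never attempted: data appeared from nowhere.
    phantom = final_read - attempted
    return {
        "lost": sorted(lost),
        "recovered": sorted(recovered),
        "phantom": sorted(phantom),
        "ok": not lost and not phantom,
    }


result = check_history(
    attempted=range(10),
    acknowledged=[0, 1, 2, 3, 5],
    final_read=[0, 1, 3, 5, 7],
)
print(result)
```

Here write 2 was acknowledged and then lost, so the run fails even though every other acknowledged write survived.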
19. Call Me … Maybe ??
Passing is really just a bare-minimum bar. It doesn’t at all
mean your system is correct.
Your system could be complete crap and still pass.
In fact, in the general case, all the current best search
engines are still flakey at scale.
20. Search at Scale is still Flakey?
Yes, yes it is. Most systems at scale are still flakey. Most systems
don’t deliver on their promises. It’s a matter of degree.
How does search in particular get away with it?
Users are already used to not considering it the system of record.
It’s easier to scale a specialized system than a general one - the project
has to scale in the general case, but massive users can scale their own specialized setups.
We want the project to easily scale generally - no expertise
needed. You can already scale pretty large, but it takes a
‘vertical’ and expertise.
21. Search in Particular is HARD
The search engine is a many-faceted beast.
There is a lot of surface area.
You need many very different features to all integrate well
together, usually in near realtime.
It sounds a lot easier than it is.
22. "Lucene is maybe the world's most tested open source project."
Uwe Schindler @thetaph1 #bbuzz 2014
23. The Lucene Testing Framework
Lucene regularly finds bugs in new Java releases.
Seriously. Regularly.
Many of those bugs are fixed and fixed quickly. Many are
not.
Randomized testing, reproducible master seeds.
“Test Beasting” and SETI@home-type resource requirements.
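The idea behind randomized testing with a reproducible master seed can be sketched in a few lines. This is not the Lucene test framework or the randomizedtesting library; run_randomized, fails_with_seed, and the deliberately buggy function are hypothetical names used only to show the pattern.

```python
import random


def fails_with_seed(test_fn, seed):
    """Run one iteration from a seed; True if the test fails."""
    try:
        test_fn(random.Random(seed))
    except AssertionError:
        return True
    return False


def run_randomized(test_fn, master_seed, iterations=100):
    """Derive per-iteration seeds from one master seed; return the first failing seed."""
    master = random.Random(master_seed)
    for _ in range(iterations):
        seed = master.getrandbits(32)
        if fails_with_seed(test_fn, seed):
            return seed  # report the seed so the failure can be replayed
    return None


def buggy_sort(values):
    """Deliberately broken: drops duplicates while sorting."""
    return sorted(set(values))


def sort_test(rng):
    """Check buggy_sort against the built-in sort on random input."""
    values = [rng.randint(0, 5) for _ in range(rng.randint(0, 8))]
    assert buggy_sort(values) == sorted(values)


failing_seed = run_randomized(sort_test, master_seed=42)
print("failing seed:", failing_seed)
```

Because every random choice flows from the master seed, rerunning with the same seed hits the same failure, and the reported per-iteration seed replays just that one case. “Beasting” then amounts to running this loop many times with different master seeds.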
24. The Lucene Testing Framework
Code checkers and build enforcers galore, as well as test
level checkers and enforcers.
Who is policing the policeman?
You need a vibrant community that gives a damn.
25. “The stack trace is only impossible if you look at the code.”
Uwe Schindler @thetaph1 @uwesays
26. Testing is the Key and the Answer
Just because your tests don't normally fail doesn't mean
they are great. You probably just don’t normally see the
problems.
Our test framework exposes the problems - quickly.
This has pluses and minuses, but the pluses greatly
outweigh the minuses!
27. More on Testing
Integration and unit tests are equally important.
Integration tests are a little more important.
Testing, testing, and more testing is your best friend.
Communities grow, communities change, one or two can’t
hold the code together.
28. More on Testing
Distributed testing takes more.
You want testing down to the hardware level, not just as part
of a simple test framework.
You want to test on large, expensive clusters.
Debugging grows as an issue.
Companies are taking on this work and the results are and
will be funneled back into the project.
Age is a virtue.
30. RAW
TBD: At Cloudera we are building HTrace, chaos
monkey w/ fault injection, etc - higher-level testing is
important, beast test cluster
31. The Race for Scalable search is on!
My approach will be to leverage Hadoop as much as possible!
Many companies are focused on Solr - there will be many
approaches!
It’s still early in the game.
32. Leverage Hadoop
A distributed filesystem is a beautiful crutch to lean on!
Consider a single index shared by all replicas.
ETL at scale is not Solr’s strength.
A marriage with Hadoop is natural and has been long
ongoing
Hadoop will push Solr to its limits and
beyond.
33. "As a good policeman I have all open source ‘guns’ for code checking available."
Uwe Schindler @thetaph1 @uwesays
https://code.google.com/p/forbidden-apis/
http://labs.carrotsearch.com/randomizedtesting.html