ApacheCon Europe 2012 - Big Search 4 Big Data

Got hundreds of millions of documents to search? DataImportHandler blowing up while indexing? Random thread errors thrown by Solr Cell during document extraction? Query performance collapsing? Then you're searching at Big Data scale. This talk will focus on the underlying principles of Big Data and how to apply them to Solr. It isn't a deep dive into SolrCloud, though we'll talk about it, and it isn't meant to be a talk on traditional scaling of Solr.

Slide notes:
  • Search was the original big data problem. Now search is back, but with a new, cooler name, “Big Data”, and search is the dominant metaphor for exposing big data sets to business users so they can make actual decisions. Big Data is rapidly changing fields such as healthcare, and I maintain that the next revolution in healthcare won't be via a doctor wielding a scalpel, but via a doctor wielding a mouse.
  • SOLR-284, back in July '07, was a first cut at a content extraction library before Tika came along.
  • And I love agile development processes. I think of agile as spanning business -> requirements -> development -> testing -> systems administration.
  • And I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today building large, scaled-out ingestors.
  • Notice our property style? It made it easy to read in properties in both Bash and Java!
  • Try sharding at different sizes using mod. Try sharding by month, week, or hour, depending on your volume of data.
  • We had huge left-over “enterprise” boxes with ginormous amounts of RAM and CPU. We were I/O bound.
  • The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
  • Again, horse-racing two slaves can help. You can also pass the connection information to jconsole on the command line, which makes it easier to monitor a set of Solrs.
  • I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
  • You have many fewer Solrs than you do indexer processes.
  • Jukka did a great presentation yesterday.
  • Dollar Tree makes crap; the stores are always empty or missing items. You don't want your indexing to be like that. The Space Shuttle cost $500 million to launch every time. You don't want your indexing process to be like launching the Space Shuttle, either.
  • Runs every hour. Looks at the log files to determine if a Solr cluster is misbehaving.
  • HAL 9000 misbehaved. Runs every hour; looks at the log files to determine if a Solr cluster is misbehaving. Especially important if you are on a cloud platform, since they implement their servers on the cheapest commodity hardware.
  • Kaa the snake from The Jungle Book hypnotizing Mowgli. Danah Boyd, among others, has said that Big Data sometimes throws out thousands of years...
  • Nathan Marz

    1. Big Search w/ Big Data Principles ApacheCon Euro 2012 Sinsheim Eric Pugh | epugh@o19s.com | @dep4b
    2. What is Big Search?
    3. Who am I? • Principal of OpenSource Connections - Solr/Lucene Search Consultancy • Member of Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
    4. CO-AUTHOR (Working on 4.0!)
    5. Telling some war stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    6. Not an intro to SolrCloud/ElasticSearch! • Great round table discussion yesterday led by Mark Miller • SolrCloud 4 Architecture talk in this room NEXT! • Solr4 vs Elastic Search at 4:45 PM TODAY!
    7. Background for Client X's Project • Big Data is any data set that is primarily at rest due to the difficulty of working with it. • 100's of millions of documents to search • Aggressive timeline. • All the data must be searched per query. • Limited selection of tools available. • On Solr 3.x line
    8. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    9. Boy meets Girl Story [diagram: Content Files + Metadata -> Ingest Pipeline -> Solr x4]
    10. Boy meets Girl Story [diagram repeated]
    11. Boy meets Girl Story [diagram repeated]
    12. Boy meets Girl Story [diagram repeated]
    13. Boy meets Girl Story [diagram repeated]
    14. Bash Rocks
    15. Bash Rocks • Remote Solr stop/start scripts • Remote Indexer stop/start scripts • Performance Monitoring • Content Extraction scripts (+Java) • Ingestor Scripts (+Java) • Artifact Deployment (CM)
    16. Make it easy to change approach
    17. Make it easy to change sharding

    public void run(Map options, List<SolrInputDocument> docs)
        throws InstantiationException, IllegalAccessException, ClassNotFoundException {
      IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
          "com.o19s.solr.ModShardIndexStrategy").newInstance();
      indexStrategy.configure(options);
      for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
      }
    }
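The IndexStrategy interface and ModShardIndexStrategy class above come from the project's own code and aren't shown in the deck. A minimal sketch of what a mod-based strategy might look like, assuming the interface has configure() and addDocument() methods and that the options map carries a list of shard URLs (both assumptions, not from the slides):

    import java.net.MalformedURLException;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Assumed shape of the IndexStrategy interface referenced on the slide.
    interface IndexStrategy {
      void configure(Map options);
      void addDocument(SolrInputDocument doc);
    }

    // Hypothetical mod-based strategy: hash the document id to pick a shard.
    class ModShardIndexStrategy implements IndexStrategy {

      private SolrServer[] shards;

      public void configure(Map options) {
        // Assumes options carries the shard URLs, e.g.
        // http://search1.o19s.com:8983/solr/shard1, ...
        List<String> urls = (List<String>) options.get("shardUrls");
        shards = new SolrServer[urls.size()];
        try {
          for (int i = 0; i < urls.size(); i++) {
            shards[i] = new CommonsHttpSolrServer(urls.get(i));
          }
        } catch (MalformedURLException e) {
          throw new RuntimeException(e);
        }
      }

      public void addDocument(SolrInputDocument doc) {
        // Mod of the id's hash decides which shard receives the document.
        String id = (String) doc.getFieldValue("id");
        int shard = (id.hashCode() & 0x7fffffff) % shards.length;
        try {
          shards[shard].add(doc);
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    }

Swapping in a time-based strategy (by month, week, or hour, as the speaker notes suggest) then only means changing the class name passed to Class.forName().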
    18. Separate JVM from Solr Cores • Step 1: Fire up empty Solr's on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create Cores (curl http://search1.o19s.com:8983/solr/admin?action=create&name=run2) • Step 4: Create an "aggregator" core, passing in the urls of the Cores. (&property.shards=)
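The slide drives core creation through curl and the HTTP CoreAdmin API; a rough SolrJ equivalent might look like the following (a sketch, assuming SolrJ 3.x, an empty Solr already running from Step 1, and made-up core/instanceDir names):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class CreateRunCore {
      public static void main(String[] args) throws Exception {
        // Point at the container-level /solr URL, not at an individual core.
        SolrServer admin = new CommonsHttpSolrServer("http://search1.o19s.com:8983/solr");
        // Create a fresh core for this indexing run.
        CoreAdminRequest.createCore("run2", "run2", admin);
      }
    }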
    19. Go Wide Quickly
    20. [diagram: shards spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com, with multiple Solr instances per host on ports 8983-8985, each holding several shards]
    21. Why so many pipelines?
    22. Simple Pipeline • Simple pipeline • mv is atomic
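A small sketch of the "mv is atomic" hand-off in Java (assuming Java 7 NIO and a shared local directory layout; the paths are illustrative):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class HandOff {
      public static void main(String[] args) throws Exception {
        // Write the batch file somewhere private first...
        Path staged = Paths.get("/data/pipeline/staging/batch-0001.csv");
        Path ready  = Paths.get("/data/pipeline/ready/batch-0001.csv");
        // ...then move it into the directory the next stage watches in one step,
        // so a half-written file is never picked up (atomic on the same filesystem).
        Files.move(staged, ready, StandardCopyOption.ATOMIC_MOVE);
      }
    }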
    23. Don't Move Files • SCP across machines is slow/error prone • NFS share, single point of failure. • Clustered file system like GFS (Global File System) can have "fencing" issues • HDFS shines here. • ZooKeeper shines here.
    24. Can you test your changes?
    25. JVM tuning is black art -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
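The speaker notes suggest grepping the GC output for the ratio of partial to full collections; a rough Java equivalent of that grep (assuming the -verbose:gc / -XX:+PrintGCDetails output has been captured to a file, for example with -Xloggc:gc.log):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class GcCounter {
      public static void main(String[] args) throws Exception {
        int full = 0, partial = 0;
        BufferedReader in = new BufferedReader(new FileReader("gc.log"));
        String line;
        while ((line = in.readLine()) != null) {
          // With PrintGCDetails, full collections are tagged "Full GC";
          // the remaining "[GC" lines are partial (young generation) collections.
          if (line.contains("Full GC")) {
            full++;
          } else if (line.contains("[GC")) {
            partial++;
          }
        }
        in.close();
        System.out.println("partial=" + partial + ", full=" + full);
      }
    }

A rising share of full collections is the signal to revisit heap size or collector settings (or, as the notes mention, to roll back a Solr upgrade).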
    26. Run, don’t Walk
    27. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    28. Using Solr as key/value store [diagram: Content Files + Metadata -> Ingest Pipeline -> Solr x4, with a Solr key/value cache]
    29. Using Solr as key/value store • thousands of queries per second without real time get: http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • how fast with real time get? http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
    30. Push schema definition to the application • Not "schema less" • Just a different owner of the schema! • Schema may have a common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
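A minimal sketch of letting the application own the schema through dynamic-field suffixes (this assumes the stock schema.xml dynamic fields such as *_i and *_s, and SolrJ 3.x; the core URL matches the earlier slides, the field values are made up):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class StreetlampIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr/run2_enrichment");

        // The application decides the fields; *_i and *_s land on dynamic fields
        // already defined in schema.xml, so no schema change is required.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "LAMP-0042");
        doc.addField("TYPE", "streetlamp_monitor");  // mirrors the fq on the slide
        doc.addField("intensity_i", 70);
        doc.addField("neighborhood_s", "Sinsheim");

        solr.add(doc);
        solr.commit();
      }
    }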
    31. Don't do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
    32. Don't do expensive things in Solr • Tika content extraction aka Solr Cell • UpdateRequestProcessorChain
    33. Beware JavaBin [diagram: Content Files + Metadata -> Ingest Pipeline -> Solr x4, with a Solr key/value cache]
    34. Beware JavaBin [same diagram, now labeled Solr 3.4 on the key/value cache]
    35. Beware JavaBin [same diagram, labeled Solr 3.4 and Solr 4]
    36. Beware JavaBin [same diagram] Which SolrJ version do I use?
    37. No JavaBin • Avoid Jarmaggeddon • Reflection? Ugh.
    38. Avro! • Supports serialization of data readable from multiple languages • It's smart XML, w/o the XML! • Handles forward and reverse versions of an object • Compact and fast to read.
    39. Avro! [diagram: Content Files + Metadata -> Ingest Pipeline -> Solr x4, with .avro files feeding the Solr key/value cache]
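A minimal sketch of writing one extracted document as an Avro record, using the standard Avro Java API (the two-field schema here is made up for illustration):

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriterExample {
      public static void main(String[] args) throws Exception {
        // Illustrative schema: one record per extracted document.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"html\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", "DOC45242");
        rec.put("html", "<p>extracted content...</p>");

        // The container file carries the schema with the data, which is what
        // makes the forward/backward versioning on the previous slide workable.
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("DOC45242.avro"));
        writer.append(rec);
        writer.close();
      }
    }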
    40. Tika as a pipeline? • Auto detects content type • Metadata structure has all the key/value needed for Solr • Allows us to scale up with Behemoth project.
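A rough sketch of running Tika inside the ingest pipeline rather than inside Solr, turning the extracted text and Tika Metadata key/values into a SolrInputDocument (standard Tika and SolrJ calls; the field names are illustrative):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtractor {
      public static SolrInputDocument extract(File file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();     // sniffs the content type
        BodyContentHandler body = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        InputStream in = new FileInputStream(file);
        try {
          parser.parse(in, body, metadata, new ParseContext());
        } finally {
          in.close();
        }

        // Copy the extracted text and every Tika metadata key/value into the document.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getName());
        doc.addField("content", body.toString());
        for (String name : metadata.names()) {
          doc.addField(name, metadata.get(name));
        }
        return doc;
      }
    }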
    41. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    42. Upgrade Lucene Indexes Easily • Don't reindex! • Try out new versions of Lucene based search engines. (David Lyle) java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
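The same upgrade can also be driven programmatically; a small sketch, assuming Lucene 3.2+ on the classpath and the index directory passed as the first argument:

    import java.io.File;

    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.FSDirectory;

    public class UpgradeIndex {
      public static void main(String[] args) throws Exception {
        // Rewrites every segment into the current Lucene format, in place.
        new IndexUpgrader(FSDirectory.open(new File(args[0]))).upgrade();
      }
    }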
    43. Indexing is Easy and Quick
    44. CHEAP AND CHEERFUL < >
    45. NRT versus BigData
    46. The tension between scale and update rate [chart: a "Bad Place" between 10 million and 100's of millions of documents]
    47. Grim Reaper
    48. Delayed Replication

    <requestHandler name="/replication" class="solr.ReplicationHandler" >
      <lst name="slave">
        <str name="masterUrl">http://localhost:8983/solr/replication</str>
        <str name="pollInterval">36:00:00</str>
      </lst>
    </requestHandler>
    49. Enable/Disable • SOLR-3301
    50. Enable/Disable

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="q">MY HARD QUERY</str>
        <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
      </lst>
      <lst name="defaults">
        <str name="echoParams">all</str>
      </lst>
      <str name="healthcheckFile">server-enabled.txt</str>
    </requestHandler>
    51. Provisioning • Chef/Puppet • ZooKeeper • Have you versioned everything to build an index over again?
    52. TRADITIONAL ENVIRONMENT
    53. POOLED ENVIRONMENT (think Cloud!)
    54. Building a Patents Index [chart: machine count vs. indexing time: 1 machine / 5 days, 5 machines / 3 days, 300 machines / 30 minutes] What happens when we want to index 2 million patents in 30 minutes?
    55. Amazon AWS is Good but... • EC2 is costly • Issues of access to internal data • Firewall and security
    56. Can we Cycle Scavenge? • Data Center is heavily used 9 to 5 EST. • Lesser, but significant load 8 to 10 PM EST • Minimal CPU load at night. • Amazon Spot Pricing for EC2 • Seti@HOME • JavaGenes - Genetics processing • Condor Platform (http://research.cs.wisc.edu/condor/)
    57. Balancing Load [chart: production load vs. batch jobs across the day, 1 AM to 11 PM]
    58. Do I need Failover? • Can I build quickly? • Do I have a reliable cluster of servers? • Am I spread across data centers? • Is sooo 90's....
    59. Telling some stories • Prototyping • Application Development • Maintaining Your Big Search Indexes
    60. One more thought...
    61. Measuring the impact of our algorithm changes is just getting harder with Big Data.
    62. Project SolrPanl
    63. Thank you! Questions? • epugh@o19s.com • @dep4b • www.opensourceconnections.com (Nervous about speaking up? Ask me later!)
