Big Search w/ Big Data Principles | LuceneRevolution 2012 | Eric Pugh | epugh@o19s.com | @dep4b

Lessons I learned indexing a "Big Data"-sized dataset using Solr.

Notes
  • Search was the original big data problem. Then Google search came along, and search wandered in the wilderness of internal Enterprise search and ecommerce search. But now search is back, with a new cooler name: “Big Data”. Search interfaces are the dominant metaphor for working with big data sets by non data scientists.
  • SOLR-284 back in July 07 was a first cut at a content extraction library before Tika came along.
  • And I love Agile development processes. And I think of agile as business -> requirements -> development -> testing -> systems administration.
  • And I don’t mean this as a shot against Hadoop, but with the right hardware, you can get a lot done in bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today building large scaled out ingestors.
  • Notice our property style? Made it easy to read in properties in both Bash and Java!
  • Try sharding at different sizes using Mod. Try sharding by month, or week, or hour depending on your volume of data.
  • We had huge left over “enterprise” boxes with ginormous amounts of RAM and CPU.
  • The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 based on this data on one project.
  • Again, horse racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
  • I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don’t forget to store everything!)
  • You have many fewer Solrs than you do Indexer processors.
  • Dollar Tree makes crap. Stores are always empty or missing items. You don’t want your indexing like that. The Space Shuttle cost 500 million dollars every time it launched. You don’t want your indexing process to be like launching the Space Shuttle.
  • Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
  • Hal 9000 misbehaved. Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving. Especially if you are on a cloud platform. They implement their servers on the cheapest commodity hardware.
  • Kaa the snake from The Jungle Book hypnotizing Mowgli. Danah Boyd among others have said that Big Data sometimes throws out thousands of years
  • Nathan Marz

Transcript of "Big Search with Big Data Principles"

    1. Big Search w/ Big Data Principles
       LuceneRevolution 2012
       Eric Pugh | epugh@o19s.com | @dep4b
    2. What is Big Search?
    3. Who am I?
       • Principal of OpenSource Connections - Solr/Lucene Search Consultancy
       • Member of Apache Software Foundation
       • SOLR-284 UpdateRichDocuments (July 07)
       • Fascinated by the art of software development
    4. 2nd edition! AUTHOR
    5. AGILISTA
    6. Telling some stories
       • Prototyping
       • Application Development
       • Maintaining Your Cluster
    7. Not an intro to Cloud Computing
       • See "Indexing Big Data on Amazon AWS" by Scott Stults @ 1:15 Thursday
       • See "How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud" by Seshu Simhadi @ 2:55 Thursday
    8. Not an intro to SolrCloud!
       • See "How SolrCloud Changes the User Experience In a Sharded Environment" by Erick Erickson @ 2:55 Today
       • See "Solr 4: The SolrCloud Architecture" by Mark Miller @ 10:45 Tomorrow
    9. My Assumptions for Client X
       • Big Data is any data set that is primarily at rest due to the difficulty of working with it.
       • Limited selection of tools available.
       • Aggressive timeline.
       • All the data must be searched per query.
       • On Solr 3.x line
    10. Telling some stories
        • Prototyping
        • Application Development
        • Maintaining Your Cluster
    11. Boy meets Girl Story
        [diagram components: Metadata, Content Files, Ingest Pipeline, Solr x4]
    12. Bash Rocks
    13. Bash Rocks
        • Remote Solr stop/start scripts
        • Remote Indexer stop/start scripts
        • Performance Monitoring
        • Content Extraction scripts (+Java)
        • Ingestor Scripts (+Java)
        • Artifact Deployment (CM)
    14. Make it easy to change sharding
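
The speaker notes mention a property style that is easy to read from both Bash and Java, but the deck does not show the file itself. A minimal sketch, assuming plain KEY=value lines with no spaces around the equals sign, so that a shell script can source the file and Java can load it with java.util.Properties. The key names (SOLR_HOST, SHARD_COUNT) and the class name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Loads shell-compatible KEY=value settings. Key names used here are hypothetical. */
class IngestSettings {

    /** Parses KEY=value lines; lines starting with '#' are comments for both Bash and Properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The same text could live in a file that a Bash script reads with `. ingest.properties`.
        String text = "# shared by Bash and Java\n"
                + "SOLR_HOST=search1.o19s.com\n"
                + "SHARD_COUNT=12\n";
        Properties props = parse(text);
        int shardCount = Integer.parseInt(props.getProperty("SHARD_COUNT"));
        System.out.println(props.getProperty("SOLR_HOST") + " has " + shardCount + " shards");
    }
}
```

Keeping values free of spaces and quotes is what lets one file serve both readers; anything fancier would need separate handling on each side.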
    15. Make it easy to change sharding

        public void run(Map options, List<SolrInputDocument> docs)
                throws InstantiationException, IllegalAccessException, ClassNotFoundException {
            IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
                    "com.o19s.solr.ModShardIndexStrategy").newInstance();
            indexStrategy.configure(options);
            for (SolrInputDocument doc : docs) {
                indexStrategy.addDocument(doc);
            }
        }
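
The slide loads a ModShardIndexStrategy by name but does not show how it picks a shard, and the speaker notes suggest trying both mod sharding and sharding by month, week, or hour. The sketch below shows one plausible routing rule for each idea. It is not the author's implementation: the class, method names, and the shard naming scheme are assumptions, and it uses only the JDK so it runs without SolrJ.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Two hypothetical routing rules; neither is the ModShardIndexStrategy from the slide. */
class ShardRouting {

    /** Mod sharding: maps a document id to a 0-based shard index below shardCount. */
    static int modShard(String docId, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Time sharding: one shard per calendar month, named like "run2_2012-05". */
    static String monthShard(String prefix, LocalDate docDate) {
        return prefix + "_" + docDate.format(DateTimeFormatter.ofPattern("uuuu-MM"));
    }

    public static void main(String[] args) {
        System.out.println("mod shard: " + modShard("DOC45242", 12));
        System.out.println("month shard: " + monthShard("run2", LocalDate.of(2012, 5, 9)));
    }
}
```

Because the shard count is just a parameter, re-running an ingest with a different count is a configuration change rather than a code change, which is the point of the slide.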
    16. Separate JVM from Solr Cores
        • Step 1: Fire up empty Solrs on all the servers (nohup &).
        • Step 2: Verify they started cleanly.
        • Step 3: Create Cores (curl 'http://search1.o19s.com:8983/solr/admin?action=create&name=run2')
        • Step 4: Create an “aggregator” core, passing in URLs of Cores. (&property.shards=)
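
Steps 3 and 4 boil down to building two strings: one core-creation URL per shard, and a comma-separated list of core URLs for the aggregator. A minimal sketch, assuming cores are named run2_1, run2_2, and so on (as in the ping handler slide later in the deck). The URL path is copied from the slide as-is; a real Solr install commonly exposes core administration at /solr/admin/cores, so treat the path as an assumption. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the URLs used in steps 3 and 4; names and URL shape are assumptions. */
class CoreUrls {

    /** Core-creation URL in the shape shown on the slide. */
    static String createCoreUrl(String host, int port, String coreName) {
        return "http://" + host + ":" + port + "/solr/admin?action=create&name=" + coreName;
    }

    /** URLs of cores named <run>_1 .. <run>_<count> on one Solr instance. */
    static List<String> coreUrls(String host, int port, String run, int count) {
        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            urls.add("http://" + host + ":" + port + "/solr/" + run + "_" + i);
        }
        return urls;
    }

    /** Comma-separated list, the format of Solr's shards parameter. */
    static String shardsParam(List<String> coreUrls) {
        return String.join(",", coreUrls);
    }

    public static void main(String[] args) {
        System.out.println(createCoreUrl("search1.o19s.com", 8983, "run2_1"));
        System.out.println("shards=" + shardsParam(coreUrls("search1.o19s.com", 8983, "run2", 3)));
    }
}
```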
    17. Go Wide Quickly
    18. [diagram: hosts search1.o19s.com, search2.o19s.com and search3.o19s.com; Solr instances on ports :8983, :8984 and :8985; many shard cores per instance (labels include shard1, shard8, shard12)]
    19. Simple Pipeline
        • Simple pipeline
        • mv is atomic
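
The "mv is atomic" point can be carried into Java code as well: write a file under a temporary name, then rename it into the directory the next stage watches, so the next stage never sees a half-written file. A minimal sketch using the JDK's Files.move with ATOMIC_MOVE. The directory layout and the .tmp suffix are assumptions, not something the slide specifies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hands a finished file to the next pipeline stage with a single rename. */
class AtomicHandoff {

    /** Writes into workDir, then renames into readyDir so readers never see a partial file. */
    static Path publish(Path workDir, Path readyDir, String name, String content) {
        try {
            Path partial = workDir.resolve(name + ".tmp");
            Files.write(partial, content.getBytes(StandardCharsets.UTF_8));
            // ATOMIC_MOVE fails instead of falling back to copy-and-delete,
            // so both directories should live on the same file system.
            return Files.move(partial, readyDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("pipeline");
        Path work = Files.createDirectory(base.resolve("work"));
        Path ready = Files.createDirectory(base.resolve("ready"));
        System.out.println("published " + publish(work, ready, "doc1.xml", "<doc/>"));
    }
}
```

As with mv, the rename is only atomic within one file system, which fits the next slide's advice not to move files across machines.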
    20. Don’t Move Files
        • SCP across machines is slow/error prone
        • NFS share, single point of failure.
        • Clustered file system like GFS (Global File System) can have “fencing” issues
        • HDFS shines here.
        • ZooKeeper shines here.
    21. Can you test your changes?
    22. JVM tuning is a black art
        -verbose:gc
        -XX:+PrintGCDetails
        -server
        -Xmx8G
        -Xms8G
        -XX:MaxPermSize=256m
        -XX:PermSize=256m
        -XX:+AggressiveHeap
        -XX:+DisableExplicitGC
        -XX:ParallelGCThreads=16
        -XX:+UseParallelOldGC
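
According to the speaker notes, the value of -verbose:gc and -XX:+PrintGCDetails is that you can grep the output for the frequency of partial versus full collections. The sketch below does the same counting in Java. It assumes the log marks full collections with "[Full GC" and other collections with "[GC", which is the shape HotSpot JVMs of that era printed; the sample lines in main are illustrative, not taken from a real run.

```java
import java.util.Arrays;
import java.util.List;

/** Counts partial and full collections in verbose GC output (log format is an assumption). */
class GcLogSummary {
    final int partial;
    final int full;

    GcLogSummary(int partial, int full) {
        this.partial = partial;
        this.full = full;
    }

    static GcLogSummary summarize(List<String> lines) {
        int partial = 0;
        int full = 0;
        for (String line : lines) {
            if (line.contains("[Full GC")) {
                full++;
            } else if (line.contains("[GC")) {
                partial++;
            }
        }
        return new GcLogSummary(partial, full);
    }

    public static void main(String[] args) {
        // Illustrative lines only; real logs carry more detail per entry.
        List<String> sample = Arrays.asList(
                "[GC [PSYoungGen: 65536K->10720K(76288K)] 65536K->15000K(251392K), 0.0123 secs]",
                "[Full GC [PSYoungGen: 10720K->0K(76288K)] 15000K->9000K(251392K), 0.2100 secs]");
        GcLogSummary summary = summarize(sample);
        System.out.println("partial=" + summary.partial + " full=" + summary.full);
    }
}
```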
    23. Run, don’t Walk
    24. Telling some stories
        • Prototyping
        • Application Development
        • Maintaining Your Cluster
    25. Grab some Data

        #!/bin/sh
        # Quote the URLs so the shell does not treat & as "run in background".
        SOURCE_SOLR='http://ec2-107-20-92-190.compute-1.amazonaws.com:8983/solr/core0/select?q=*%3A*&start=0&rows=500000&wt=csv'
        TARGET_SOLR='http://localhost:8983/solr/us_patent_grant/update/csv'
        wget -O output.csv "$SOURCE_SOLR"
        curl "$TARGET_SOLR?skipLines=1&commit=true&optimize=true" --data-binary @output.csv -H 'Content-type:text/plain; charset=utf-8'
    26. Using Solr as a key/value store
        • thousands of queries per second without real time get.
          http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
        • ??? with real time get?
          http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
    27. Using Solr as key/value store
        [diagram components: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr x4]
    28. Using Solr as key/value store
        • thousands of queries per second without real time get.
          http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html
        • ??? with real time get?
          http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
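
The two lookups on these slides differ only in the handler and the parameter that carries the key. A minimal sketch that builds both URLs with the JDK alone, so it does not depend on any SolrJ version. Class and method names are hypothetical, values are URL-encoded (so ":" and "," appear as %3A and %2C), and ids containing query-syntax characters would need extra escaping that this sketch skips.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds key lookups against a Solr core; a sketch, not a full client. */
class KeyValueLookup {

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Lookup through the normal search handler: q=id:<key>. */
    static String selectUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/select?q=" + enc("id:" + id) + "&fl=" + enc(fields);
    }

    /** Lookup through the real time get handler: id=<key>. */
    static String getUrl(String coreUrl, String id, String fields) {
        return coreUrl + "/get?id=" + enc(id) + "&fl=" + enc(fields);
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/run2_enrichment";
        System.out.println(selectUrl(core, "DOC45242", "entities,html"));
        System.out.println(getUrl(core, "DOC45242", "entities,html"));
    }
}
```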
    29. Push schema definition to the application
        • Not “schema less”
        • Just different owner of schema!
        • Schema may have common set of fields like id, type, timestamp, version
        • Nothing required.
        q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
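
The field name intensity_i in the example query suggests that the application picks a type suffix and Solr matches it with a dynamic field rule. The sketch below shows how an application might own that mapping. It assumes the schema declares dynamic field patterns such as *_i, *_l, *_f, *_d, *_b, *_dt and *_s, as Solr's example schema does, and that the common fields listed on the slide keep their plain names. The class and the exact suffix table are hypothetical.

```java
import java.util.Arrays;
import java.util.Date;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Lets the application choose Solr field names by value type (suffix table is an assumption). */
class DynamicFields {
    /** Common fields from the slide that keep their plain names. */
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("id", "type", "timestamp", "version"));

    static String fieldName(String base, Object value) {
        if (COMMON.contains(base)) return base;
        if (value instanceof Integer) return base + "_i";
        if (value instanceof Long) return base + "_l";
        if (value instanceof Float) return base + "_f";
        if (value instanceof Double) return base + "_d";
        if (value instanceof Boolean) return base + "_b";
        if (value instanceof Date) return base + "_dt";
        return base + "_s"; // fall back to a string field
    }

    /** Renames every key of an application record to its Solr field name. */
    static Map<String, Object> toSolrFields(Map<String, Object> record) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            fields.put(fieldName(entry.getKey(), entry.getValue()), entry.getValue());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", "lamp-17");
        record.put("type", "streetlamp_monitor");
        record.put("intensity", 70);
        System.out.println(toSolrFields(record));
    }
}
```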
    30. Don’t do expensive things in Solr
        • Tika content extraction aka Solr Cell
        • UpdateRequestProcessorChain
    31. Don’t do expensive things in Solr
        • Tika content extraction aka Solr Cell
        • UpdateRequestProcessorChain
    32. Avro!
        • Supports serialization of data readable from multiple languages
        • It’s smart XML
        • Handles forward and reverse versions of an object
        • Compact and fast to read.
    33. Avro!
        [diagram components: Solr Key/Value Cache, .avro, Metadata, Content Files, Ingest Pipeline, Solr x4]
    34. No JavaBin
        Give me /update/avro!
        • Avoid Jarmageddon
        • Reflection? Ugh.
    35. No JavaBin
        [diagram components: Solr Key/Value Cache, Metadata, Content Files, Ingest Pipeline, Solr x4]
    36. No JavaBin
        [diagram components: Solr Key/Value Cache (Solr 3.4), Metadata, Content Files, Ingest Pipeline, Solr x4]
    37. No JavaBin
        [diagram components: Solr Key/Value Cache (Solr 3.4), Metadata, Content Files, Ingest Pipeline, Solr x4 (Solr 4)]
    38. No JavaBin
        [diagram components: Solr Key/Value Cache (Solr 3.4), Metadata, Content Files, Ingest Pipeline, Solr x4 (Solr 4)]
        Which SolrJ version do I use?
    39. Telling some stories
        • Prototyping
        • Application Development
        • Maintaining Your Cluster
    40. Upgrade Lucene Indexes Easily
        • Don’t reindex!
        • Try out new features!
        David Lyle
        java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
    41. Indexing is Easy and Quick
    42. CHEAP AND CHEERFUL < >
    43. NRT versus BigData
    44. The tension between scale and update rate
        [chart labels: Bad Place to Be; > 100,000,000; < 10,000,000]
    45. The tension between scale and update rate
        [chart labels: 10 million; Bad Place; 100’s of millions]
    46. Grim Reaper
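
The speaker notes describe the Grim Reaper as a job that runs every hour and looks at log files to decide whether a Solr cluster is misbehaving. The notes do not say what counts as misbehaving, so the sketch below assumes one simple rule: count log lines that contain "SEVERE" or "OutOfMemoryError" and compare against a threshold. The markers, the threshold, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Decides from log lines whether a Solr looks unhealthy (rule and markers are assumptions). */
class GrimReaperCheck {
    static final List<String> TROUBLE_MARKERS = Arrays.asList("SEVERE", "OutOfMemoryError");

    /** Counts lines containing at least one trouble marker; each line counts once. */
    static long countTroubleLines(List<String> logLines) {
        long count = 0;
        for (String line : logLines) {
            for (String marker : TROUBLE_MARKERS) {
                if (line.contains(marker)) {
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    /** True when the number of trouble lines exceeds the allowed maximum. */
    static boolean isMisbehaving(List<String> logLines, long maxTroubleLines) {
        return countTroubleLines(logLines) > maxTroubleLines;
    }

    public static void main(String[] args) {
        List<String> lastHour = Arrays.asList(
                "INFO: [run2_1] webapp=/solr path=/select status=0",
                "SEVERE: java.lang.OutOfMemoryError: Java heap space");
        System.out.println("misbehaving: " + isMisbehaving(lastHour, 0));
    }
}
```

What happens after a positive result (restart, removal from rotation, an alert) is a separate decision that the deck leaves open.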
    47. Delayed Replication

        <requestHandler name="/replication" class="solr.ReplicationHandler">
          <lst name="slave">
            <str name="masterUrl">http://localhost:8983/solr/replication</str>
            <str name="pollInterval">36:00:00</str>
          </lst>
        </requestHandler>

    48. Enable/Disable

        <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
          <lst name="invariants">
            <str name="q">MY HARD QUERY</str>
            <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_2</str>
          </lst>
          <lst name="defaults">
            <str name="echoParams">all</str>
          </lst>
          <str name="healthcheckFile">server-enabled.txt</str>
        </requestHandler>
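
With a healthcheckFile configured, taking a server in or out of rotation comes down to creating or deleting that file, assuming the ping handler reports failure while the file is missing and the load balancer polls /admin/ping. A minimal sketch of the toggle. The directory Solr resolves the file against depends on the installation, so the dir argument is an assumption, as are the class and method names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Creates or removes the healthcheck file named in the ping handler config. */
class ServerToggle {
    static final String HEALTHCHECK_FILE = "server-enabled.txt";

    /** Marks the server as enabled; safe to call when it is already enabled. */
    static void enable(Path dir) {
        Path file = dir.resolve(HEALTHCHECK_FILE);
        try {
            if (!Files.exists(file)) {
                Files.createFile(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Marks the server as disabled; safe to call when it is already disabled. */
    static void disable(Path dir) {
        try {
            Files.deleteIfExists(dir.resolve(HEALTHCHECK_FILE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean isEnabled(Path dir) {
        return Files.exists(dir.resolve(HEALTHCHECK_FILE));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("solr-data");
        enable(dir);
        System.out.println("enabled: " + isEnabled(dir));
        disable(dir);
        System.out.println("enabled: " + isEnabled(dir));
    }
}
```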
    49. Enable/Disable
        • Solr-3301
    50. Provisioning
        • Chef/Puppet
        • ZooKeeper
        • Have you versioned everything to build an index?
    51. TYPICAL ENVIRONMENT
    52. FLEXIBLE ENVIRONMENT
    53. Do I need Failover?
        • Can I build quickly?
        • Do I have a reliable cluster?
        • Am I spread across data centers?
        • Is sooo 90’s....
    54. Telling some stories
        • Prototyping
        • Application Development
        • Maintaining Your Cluster
    55. Some Other Thoughts
    56. Don’t be Mesmerized
    57. Scientific Method
    58. Thank you!
        • epugh@o19s.com
        • @dep4b
        • www.opensourceconnections.com