Big Search with Big Data Principles
 

Lessons I learned in indexing "Big Data" size dataset using Solr.



Speaker Notes

  • Search was the original big data problem. Then Google search came along, and search wandered in the wilderness of internal enterprise search and e-commerce search. Now search is back, with a new, cooler name: "Big Data". Search interfaces are the dominant metaphor non-data-scientists use to work with big data sets.
  • SOLR-284, back in July 2007, was a first cut at a content extraction library before Tika came along.
  • I love agile development processes, and I think of agile as business -> requirements -> development -> testing -> systems administration.
  • I don't mean this as a shot against Hadoop, but with the right hardware you can get a lot done in Bash, with a bit of Java or Perl sprinkled in. There is a lot of value in getting started today building large scaled-out ingestors.
  • Notice our property style? It made it easy to read properties in both Bash and Java!
  • Try sharding at different sizes using mod. Try sharding by month, week, or hour, depending on your volume of data.
  • We had huge leftover "enterprise" boxes with ginormous amounts of RAM and CPU.
  • The -verbose:gc and -XX:+PrintGCDetails flags let you grep for the frequency of partial versus full garbage collections. We rolled back from 3.4 to 3.1 on one project based on this data.
  • Again, horse racing two slaves can help. You can also pass in the connection information via the jconsole command line, which makes it easier to monitor a set of Solrs.
  • I love working with CSV and Solr. The CSV writer type is great for moving data between Solrs. (Don't forget to store everything!)
  • You have many fewer Solrs than you do indexer processes.
  • Dollar Tree makes crap; its stores are always empty or missing items. You don't want your indexing to be like that. The Space Shuttle cost $500 million to launch every time. You don't want your indexing process to be like launching the Space Shuttle, either.
  • Runs every hour. Looks at log files to determine if a Solr cluster is misbehaving.
  • HAL 9000 misbehaved. Runs every hour; looks at log files to determine if a Solr cluster is misbehaving. Especially important if you are on a cloud platform, since they implement their servers on the cheapest commodity hardware.
  • Kaa the snake from The Jungle Book hypnotizing Mowgli. danah boyd, among others, has said that Big Data sometimes throws out thousands of years ...
  • Nathan Marz

Presentation Transcript

  • Big Search w/ Big Data Principles LuceneRevolution 2012 Eric Pugh | epugh@o19s.com | @dep4b
  • What is Big Search?
  • Who am I? • Principal of OpenSource Connections - a Solr/Lucene search consultancy • Member of the Apache Software Foundation • SOLR-284 UpdateRichDocuments (July 07) • Fascinated by the art of software development
  • AUTHOR (2nd edition!)
  • AGILISTA
  • Telling some stories• Prototyping• Application Development• Maintaining Your Cluster
  • Not an intro to cloud Computing• See Indexing Big Data on Amazon AWS by Scott Stults @ 1:15 Thursday• See How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud by Seshu Simhadi @ 2:55 Thursday
  • Not an intro to SolrCloud!• See How SolrCloud Changes the User Experience In a Sharded Environment by Erick Erickson @ 2:55 Today• See Solr 4: The SolrCloud Architecture by Mark Miller @ 10:45 Tomorrow
  • My Assumptions for Client X• Big Data is any data set that is primarily at rest due to the difficulty of working with it.• Limited selection of tools available.• Aggressive timeline.• All the data must be searched per query.• On Solr 3.x line
  • Telling some stories• Prototyping• Application Development• Maintaining Your Cluster
  • Boy meets Girl Story [diagram: Metadata and Content Files feed an Ingest Pipeline into a bank of Solrs]
  • Bash Rocks
  • Bash Rocks• Remote Solr stop/start scripts• Remote Indexer stop/start scripts• Performance Monitoring• Content Extraction scripts (+Java)• Ingestor Scripts (+Java)• Artifact Deployment (CM)
  • Make it easy to change sharding
  • Make it easy to change sharding
    public void run(Map options, List<SolrInputDocument> docs)
        throws InstantiationException, IllegalAccessException, ClassNotFoundException {
      IndexStrategy indexStrategy = (IndexStrategy) Class.forName(
          "com.o19s.solr.ModShardIndexStrategy").newInstance();
      indexStrategy.configure(options);
      for (SolrInputDocument doc : docs) {
        indexStrategy.addDocument(doc);
      }
    }
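The strategy-pattern snippet above hinges on how a shard is chosen for each document. A minimal sketch of mod-style routing is below; the class name and hashing scheme here are assumptions for illustration, not the deck's actual com.o19s.solr.ModShardIndexStrategy.

```java
// Hypothetical sketch of mod-based shard routing. The real
// ModShardIndexStrategy from the deck is not shown; this only
// illustrates the "shard = hash(id) mod N" idea.
public class ModShardRouter {

    private final int shardCount;

    public ModShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Route a document to a shard by hashing its unique key.
    // Math.floorMod keeps the result non-negative even when
    // hashCode() is negative.
    public int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ModShardRouter router = new ModShardRouter(8);
        System.out.println("DOC45242 -> shard " + router.shardFor("DOC45242"));
    }
}
```

The same IndexStrategy interface could just as easily shard by month or week, as the speaker notes suggest, by deriving the shard name from a date field instead of a hash.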
  • Separate JVM from Solr Cores • Step 1: Fire up empty Solrs on all the servers (nohup &). • Step 2: Verify they started cleanly. • Step 3: Create cores (curl "http://search1.o19s.com:8983/solr/admin?action=create&name=run2"). • Step 4: Create an “aggregator” core, passing in the URLs of the cores (&property.shards=).
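Steps 3 and 4 above are plain CoreAdmin HTTP calls; a small sketch of building those URLs follows. The host and core names come from the slide, but the helper class itself is hypothetical and nothing here talks to a live Solr.

```java
import java.util.List;

// Hypothetical helpers that build the CoreAdmin URLs used in
// steps 3 and 4 of the slide.
public class CoreAdminUrls {

    // Step 3: create a plain core on one server.
    public static String createCoreUrl(String host, String coreName) {
        return "http://" + host + ":8983/solr/admin?action=create&name=" + coreName;
    }

    // Step 4: create an "aggregator" core that fans queries out to
    // the other cores via the shards property.
    public static String createAggregatorUrl(String host, String coreName,
                                             List<String> shardUrls) {
        return createCoreUrl(host, coreName)
                + "&property.shards=" + String.join(",", shardUrls);
    }

    public static void main(String[] args) {
        System.out.println(createCoreUrl("search1.o19s.com", "run2"));
        System.out.println(createAggregatorUrl("search1.o19s.com", "run2_agg",
                List.of("search1.o19s.com:8983/solr/run2")));
    }
}
```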
  • Go Wide Quickly
  • [diagram: shards 1-12 spread across search1.o19s.com, search2.o19s.com, and search3.o19s.com on ports 8983-8985]
  • Simple Pipeline• Simple pipeline• mv is atomic
  • Don’t Move Files • SCP across machines is slow and error prone • An NFS share is a single point of failure • A clustered file system like GFS (Global File System) can have “fencing” issues • HDFS shines here • ZooKeeper shines here
  • Can you test your changes?
  • JVM tuning is a black art: -verbose:gc -XX:+PrintGCDetails -server -Xmx8G -Xms8G -XX:MaxPermSize=256m -XX:PermSize=256m -XX:+AggressiveHeap -XX:+DisableExplicitGC -XX:ParallelGCThreads=16 -XX:+UseParallelOldGC
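The speaker notes mention grepping these GC logs for the frequency of partial versus full collections. A rough sketch of that tally is below; it assumes the classic -XX:+PrintGCDetails line format, where full collections are marked "Full GC" and minor ones a plain "[GC" (the sample lines are made up, and real logs vary by JVM version and collector).

```java
// Rough tally of minor vs. full collections from a
// -verbose:gc / -XX:+PrintGCDetails log.
public class GcLogTally {

    // Returns {minorCount, fullCount}. Checks "Full GC" first,
    // since full-GC lines also contain "[".
    public static int[] tally(Iterable<String> lines) {
        int minor = 0, full = 0;
        for (String line : lines) {
            if (line.contains("Full GC")) {
                full++;
            } else if (line.contains("[GC")) {
                minor++;
            }
        }
        return new int[] { minor, full };
    }

    public static void main(String[] args) {
        java.util.List<String> sample = java.util.List.of(
            "12.345: [GC [PSYoungGen: 524288K->8192K(611648K)] ...]",
            "15.010: [Full GC [PSYoungGen: 8192K->0K(611648K)] ...]",
            "18.220: [GC [PSYoungGen: 524288K->9000K(611648K)] ...]");
        int[] counts = tally(sample);
        System.out.println("minor=" + counts[0] + " full=" + counts[1]);
        // prints minor=2 full=1
    }
}
```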
  • Run, don’t Walk
  • Telling some stories• Prototyping• Application Development• Maintaining Your Cluster
  • Grab some Data
    #!/bin/sh
    SOURCE_SOLR="http://ec2-107-20-92-190.compute-1.amazonaws.com:8983/solr/core0/select?q=*%3A*&start=0&rows=500000&wt=csv"
    TARGET_SOLR="http://localhost:8983/solr/us_patent_grant/update/csv"
    wget -O output.csv "$SOURCE_SOLR"
    curl "$TARGET_SOLR?skipLines=1&commit=true&optimize=true" --data-binary @output.csv -H "Content-type:text/plain; charset=utf-8"
  • Using Solr as a key/value store • thousands of queries per second without real time get: http://localhost:8983/solr/run2_enrichment/select?q=id:DOC45242&fl=entities,html • ??? with real time get: http://localhost:8983/solr/run2_enrichment/get?id=DOC45242&fl=entities,html
  • Using Solr as key/value store [diagram: the Ingest Pipeline also feeds a Solr key/value cache alongside the search Solrs]
  • Push schema definition to the application • Not “schema less” • Just a different owner of the schema! • Schema may have a common set of fields like id, type, timestamp, version • Nothing required. q=intensity_i:[70 TO 0]&fq=TYPE:streetlamp_monitor
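The intensity_i field in that query relies on Solr's dynamic-field convention: a type suffix (*_i for integer, *_s for string, and so on) selects the field type, so the application rather than schema.xml decides what fields exist. A tiny sketch of that naming convention follows; the suffix map is an assumption modeled on the stock Solr example schema.

```java
import java.util.Map;

// Sketch of the dynamic-field naming convention that lets the
// application, not schema.xml, own the schema. Suffixes are
// modeled on the stock Solr example schema (*_i, *_s, *_dt, *_b).
public class DynamicFieldNames {

    private static final Map<String, String> SUFFIXES = Map.of(
        "int", "_i",
        "string", "_s",
        "date", "_dt",
        "boolean", "_b");

    // Append the type suffix so a matching dynamicField rule
    // in the Solr schema picks the right field type.
    public static String fieldName(String base, String type) {
        String suffix = SUFFIXES.get(type);
        if (suffix == null) {
            throw new IllegalArgumentException("unknown type: " + type);
        }
        return base + suffix;
    }

    public static void main(String[] args) {
        // An integer intensity becomes intensity_i, as in the slide's query.
        System.out.println(fieldName("intensity", "int"));
    }
}
```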
  • Don’t do expensive things in Solr• Tika content extraction aka Solr Cell• UpdateRequestProcessorChain
  • Avro!• Supports serialization of data readable from multiple languages• It’s smart XML• Handles forward and reverse versions of an object• Compact and fast to read.
  • Avro! [diagram: .avro files flow through the Ingest Pipeline to the Solr key/value cache and the search Solrs]
  • No JavaBin (Give me up/date Avro!) • Avoid Jarmaggeddon • Reflection? Ugh.
  • No JavaBin [diagram builds: the Ingest Pipeline feeds both a Solr 3.4 key/value cache and Solr 4 search cores - which SolrJ version do I use?]
  • Telling some stories• Prototyping• Application Development• Maintaining Your Cluster
  • Upgrade Lucene Indexes Easily • Don’t reindex! • Try out new features! (David Lyle)
    java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
  • Indexing is Easy and Quick
  • CHEAP AND CHEERFUL
  • NRT versus BigData
  • The tension between scale and update rate [diagram: the “Bad Place to Be” sits between roughly 10,000,000 and 100,000,000+ documents]
  • Grim Reaper
  • Delayed Replication
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://localhost:8983/solr/replication</str>
        <str name="pollInterval">36:00:00</str>
      </lst>
    </requestHandler>
  • Enable/Disable
    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="q">MY HARD QUERY</str>
        <str name="shards">http://search1.o19s.com:8983/solr/run2_1,http://search1.o19s.com:8983/solr/run2_2,http://search1.o19s.com:8983/solr/run2_3</str>
      </lst>
      <lst name="defaults">
        <str name="echoParams">all</str>
      </lst>
      <str name="healthcheckFile">server-enabled.txt</str>
    </requestHandler>
  • Enable/Disable • SOLR-3301
  • Provisioning• Chef/Puppet• ZooKeeper• Have you versioned everything to build an index?
  • TYPICAL ENVIRONMENT
  • FLEXIBLE ENVIRONMENT
  • Do I need Failover?• Can I build quickly?• Do I have a reliable cluster?• Am I spread across data centers?• Is sooo 90’s....
  • Telling some stories• Prototyping• Application Development• Maintaining Your Cluster
  • Some Other Thoughts
  • Don’t be Mesmerized
  • Scientific Method
  • Thank you!• epugh@o19s.com• @dep4b• www.opensourceconnections.com