Scaling up Solr 4.1 to power big search in social media analytics

Presented by Timothy Potter, Architect, Big Data Analytics, Dachis Group

My presentation focuses on how we implemented Solr 4.1 to be the cornerstone of our social marketing analytics platform. Our platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes. Combined with our Hadoop cluster, we have achieved throughput rates greater than 8,000 documents per second. Our index currently contains more than 500,000,000 documents and is growing by 3 to 4 million documents per day.

The presentation will include details about:

  • Designing a SolrCloud cluster for scalability and high availability using sharding and replication with Zookeeper
  • Operations concerns, such as monitoring and how to handle a failed node
  • How we index big data from Pig/Hadoop, as an example of using the CloudSolrServer in SolrJ and managing searchers for high indexing throughput
  • Example uses of key features like real-time gets, atomic updates, custom hashing, and distributed facets

Attendees will come away from this presentation with a real-world use case that proves Solr 4.1 is scalable, stable, and production-ready. (Note: we are in production on 18 nodes in EC2 with a recent nightly build off the branch_4x.)

Usage Rights

© All Rights Reserved

Presentation Transcript

  • Scaling Solr 4 to Power Big Search in Social Media Analytics
    Timothy Potter, Architect, Big Data Analytics, Dachis Group / Co-author, Solr in Action
  • Audience poll
    ® 2011 Dachis Group. dachisgroup.com
      • Anyone running SolrCloud in production today?
      • Who is running a pre-Solr 4 version in production?
      • Who has fired up Solr 4.x in SolrCloud mode?
      • Personal interest: who has purchased Solr in Action in MEAP?
  • Goals of this talk
      • Gain insights into the key design decisions you need to make when using SolrCloud
        Wish I knew back then ...
      • Solr 4 feature overview in context
      • Zookeeper
      • Distributed indexing
      • Distributed search
      • Real-time GET
      • Atomic updates
      • A day in the life ...
      • Day-to-day operations
      • What happens if you lose a node?
  • About Dachis Group
    Our business intelligence platform analyzes relationships, behaviors, and conversations between 30,000 brands and 100M social accounts every 15 minutes.
  • Solution Highlights
      • In production on 4.2.0
      • 18 shards, ~33M docs per shard, 25GB on disk per shard
      • Multiple collections
      • ~620 million docs in main collection (still growing)
      • ~100 million docs in 30-day collection
      • Inherent parent/child relationships (tweets and retweets)
      • ~5M atomic updates to existing docs per day
      • Batch-oriented updates
      • Docs come in bursts from Hadoop; 8,000 docs/sec
      • 3-4M new documents per day (deletes too)
      • Business intelligence UI, low(ish) query volume
  • Pillars of my ideal search solution
      • Scalability
          Scale-out: sharding and replication
          A little scale-up too: fast disks (SSD), lots of RAM!
      • High availability
          Redundancy: multiple replicas per shard
          Automated fail-over: automated leader election
      • Consistency
          Distributed queries must return consistent results
          Accepted writes must be on durable storage
      • Simplicity (WIP)
          Self-healing, easy to set up and maintain, able to troubleshoot
      • Elasticity (WIP)
          Add more replicas per shard at any time
          Split large shards into two smaller ones
  • Nuts and Bolts
    Nice tag cloud, wordle.net!
  • Zookeeper in a nutshell
      1. Zookeeper needs at least 3 nodes to establish quorum with fault tolerance. Embedded is only for evaluation purposes; you need to deploy a stand-alone ensemble for production.
      2. Every Solr core creates ephemeral "znodes" in Zookeeper, which automatically disappear if the Solr process crashes.
      3. Zookeeper pushes notifications to all registered "watchers" when a znode changes; Solr caches cluster state.
      4. Zookeeper provides "recipes" for solving common problems faced when building distributed systems, e.g. leader election.
      5. Zookeeper provides centralized configuration distribution, leader election, and cluster state notifications.
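The "at least 3 nodes" rule follows from majority quorum arithmetic. The sketch below is illustrative only; the function names are made up for this note and are not part of any Zookeeper API. It shows how many node failures an ensemble of a given size can survive.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("ensemble must have at least one node")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

One or two nodes tolerate no failures at all, which is why three is the smallest ensemble with fault tolerance, and why an even-sized ensemble buys nothing over the next smaller odd size.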
  • Number of shards?
      • Number and size of indexed fields
      • Number of documents
      • Update frequency
      • Query complexity
      • Expected growth
      • Budget
    Yay for shard splitting in 4.3 (SOLR-3755)!
  • Index Memory Management
    We use Uwe Schindler's advice on 64-bit Linux:
        <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>
    See: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
        java -Xmx4g ...
    (Hint: the rest of our RAM goes to the OS to load the index in memory-mapped I/O.)
    Small cache sizes with aggressive eviction: spread the GC penalty out over time vs. all at once every time you open a new searcher.
        <filterCache class="solr.LFUCache" size="50" initialSize="50" autowarmCount="25"/>
  • Leader = Replica + Add'l Work
      • Not a master
      • Leader is a replica (handles queries)
      • Accepts update requests for the shard
      • Increments the _version_ on the new or updated doc
      • Sends updates (in parallel) to all replicas
  • Distributed Indexing
    Don't let your tlogs get too big: use "hard" commits with openSearcher="false"
        <autoCommit>
          <maxDocs>10000</maxDocs>
          <maxTime>60000</maxTime>
          <openSearcher>false</openSearcher>
        </autoCommit>
    8,000 docs/sec to 18 shards
    [Diagram, steps 1-5; labels: CloudSolrServer "smart client"; Zookeeper; View of cluster state from Zk; Get URLs of current leaders?; Hash on docID; Set the _version_; tlog; Node 1, Node 2; Shard 1 Leader, Shard 1 Replica, Shard 2 Leader, Shard 2 Replica; 2 shards with 1 replica each]
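The "hash on docID" step in the diagram can be sketched as follows. This is a simplified illustration, not Solr's implementation: it uses `zlib.crc32` as a stand-in hash and evenly sized ranges over a 32-bit hash space, and the helper names are hypothetical. Solr's own hash function and range layout differ.

```python
import zlib

HASH_SPACE = 2 ** 32  # assumption: a 32-bit hash space split evenly


def shard_ranges(num_shards: int) -> list[tuple[int, int]]:
    """Split the hash space into contiguous, inclusive (lo, hi) ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * step
        # The last shard absorbs any remainder so the space is fully covered.
        hi = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((lo, hi))
    return ranges


def shard_for(doc_id: str, ranges: list[tuple[int, int]]) -> int:
    """Return the index of the shard whose range contains the doc's hash."""
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF  # stand-in hash
    for index, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return index
    raise AssertionError("hash space not fully covered")


if __name__ == "__main__":
    ranges = shard_ranges(18)
    for doc_id in ("tweet-1", "tweet-2", "video-42"):
        print(doc_id, "-> shard", shard_for(doc_id, ranges))
```

Because the mapping depends only on the document ID and the range table, any client holding the cluster state can send a document straight to the right shard leader without an extra hop.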
  • Distributed search
    Send query request to any node
    Two-stage process:
      1. Query controller sends query to all shards and merges results
         One host per shard must be online or queries fail
      2. Query controller sends a 2nd query to all shards with documents in the merged result set to get the requested fields
    Solr client applications built for 3.x do not need to change (our query code still uses SolrJ 3.6)
    Limitations: JOINs / grouping need custom hashing
    [Diagram, steps 1-4; labels: CloudSolrServer; q=*:*; Zookeeper; View of cluster state from Zk; Get URLs of all live nodes; Query controller; Or just a load balancer works too; get fields; Node 1, Node 2; Shard 1 Leader, Shard 1 Replica, Shard 2 Leader, Shard 2 Replica]
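The two-stage process can be sketched with in-memory stand-ins for shards. Everything here is hypothetical scaffolding (the dict layout, `top_ids`, `distributed_search`); real shards are remote Solr cores, and the second stage is issued once per shard rather than once per document.

```python
import heapq


def top_ids(shard_docs: dict, rows: int) -> list[tuple[float, str]]:
    """Stage 1 on one shard: its best `rows` hits as (score, doc_id)."""
    hits = [(doc["score"], doc_id) for doc_id, doc in shard_docs.items()]
    return heapq.nlargest(rows, hits)


def distributed_search(shards: dict, rows: int, fields: list[str]) -> list[dict]:
    """Merge per-shard top hits, then fetch fields only for the winners."""
    # Stage 1: every shard contributes its local top `rows` ids and scores.
    merged = []
    for shard_name, docs in shards.items():
        for score, doc_id in top_ids(docs, rows):
            merged.append((score, doc_id, shard_name))
    winners = heapq.nlargest(rows, merged)

    # Stage 2: go back only to shards that own a winning document.
    results = []
    for score, doc_id, shard_name in winners:
        doc = shards[shard_name][doc_id]
        row = {"id": doc_id, "score": score}
        row.update({name: doc[name] for name in fields})
        results.append(row)
    return results


if __name__ == "__main__":
    shards = {
        "shard1": {"a": {"score": 0.9, "text": "x"},
                   "b": {"score": 0.2, "text": "y"}},
        "shard2": {"c": {"score": 0.5, "text": "z"}},
    }
    print(distributed_search(shards, rows=2, fields=["text"]))
```

Fetching only ids and scores in the first stage keeps the merge cheap; stored fields are read only for documents that make the final page.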
  • Search by daily activity volume
    Drive analysis that measures the impact of a social message over time ...
    Company posts a tweet on Monday; how much activity is there around that message on Thursday?
  • Atomic updates and real-time get
    Problem: Find all documents that had activity on a specific day
      • Tweets that had retweets, or YouTube videos that had comments
      • Use Solr join support to find parent documents by matching on child criteria:
          fq=_val_:"{!join from=echo_grouping_id_s to=id}day_tdt:[2013-05-01T00:00:00Z TO 2013-05-02T00:00:00Z}" ...
      ... But joins don't work in distributed queries and are probably too slow anyway
    Solution: Index daily activity into multi-valued fields. Use real-time GET to look up the document by ID to get the current daily volume fields.
          fq:daily_volume_tdtm(2013-05-02’)
          sort=daily_vol(daily_volume_s,2013-04-01,2013-05-01)+desc
          daily_volume_tdtm: [2013-05-01, 2013-05-02] <= doc has child signals on May 1 and 2
          daily_volume_ssm: 2013-05-01|99, 2013-05-02|88 <= stored-only field; doc had 99 child signals on May 1, 88 on May 2
          daily_volume_s: 13050288|13050199 <= flattened multi-valued field for sorting using a custom ValueSource
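A rough sketch of how the three daily volume fields might be kept in step. It assumes the packing implied by the slide's example (two-digit year, month, and day followed by the unpadded count, newest day first) and that the current `daily_volume_ssm` values have already been fetched, for instance via real-time GET. The helper names are hypothetical; only the `{"set": ...}` modifier is Solr's atomic update syntax, and the custom `daily_vol` ValueSource is not reproduced.

```python
import json


def flatten_daily_volume(entries: list[str]) -> str:
    """Pack 'YYYY-MM-DD|count' entries into one sortable string, newest first."""
    packed = []
    for entry in entries:
        day, count = entry.split("|")
        yymmdd = day[2:4] + day[5:7] + day[8:10]
        packed.append((day, f"{yymmdd}{int(count)}"))
    packed.sort(reverse=True)  # ISO dates sort correctly as strings
    return "|".join(value for _, value in packed)


def build_atomic_update(doc_id: str, current_entries: list[str],
                        day: str, count: int) -> dict:
    """Build an atomic-update document that records `count` signals on `day`."""
    # Replace any existing entry for the same day, then append the new one.
    entries = [e for e in current_entries if not e.startswith(day + "|")]
    entries.append(f"{day}|{count}")
    return {
        "id": doc_id,
        "daily_volume_ssm": {"set": entries},
        "daily_volume_tdtm": {
            "set": [e.split("|")[0] + "T00:00:00Z" for e in entries]
        },
        "daily_volume_s": {"set": flatten_daily_volume(entries)},
    }


if __name__ == "__main__":
    update = build_atomic_update("tweet-1", ["2013-05-01|99"], "2013-05-02", 88)
    print(json.dumps([update], indent=2))
```

Because atomic updates rebuild the document from stored fields, this approach only works when every field is stored.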
  • Lessons learned
    Will it work? Definitely!
    Search can be addictive for your organization: the queries we tested 6 months ago are vastly different from what we have today
    Buy RAM: OOMs and aggressive garbage collection cause many issues
    Give RAM from ^ to the OS (MMapDirectory)
    You need a disaster recovery process in addition to SolrCloud replication; it helps with migrating to new hardware too
    Use Jetty ;-)
    Store all fields! Atomic updates are a life saver
  • Lessons learned cont.
    Schema will evolve: we thought we understood our data model but have since added at least 10 new fields and deprecated some too
    Partition if you can! e.g. 30-day collection
    We don't optimize; segment merging works great
    Size your staging environment so that shards have about as many docs and the same resources as prod. I have many more nodes in prod, but my staging servers have roughly the same number of docs per shard, just fewer shards.
    Don't be afraid to customize Solr! It's designed to be customized with plug-ins
      • ValueSource is very powerful
      • Check out PostFilters:
          {!frange l=1 u=1 cost=200 cache=false}imca(53313,employee)
  • Minimum DevOps Reqts
      • Backups
          .../replication?command=backup&location=/mnt/backups
      • Monitoring
          Replicas serving queries?
          All replicas report same number of docs?
          Zookeeper health
          New searcher warm-up time
      • Configuration update process
          Our solrconfig.xml changes frequently; see Solr's zkCli.sh
      • Upgrade Solr process (it's moving fast right now)
      • Recover failed replica process
      • Add new replica
      • Kill the JVM on OOM (from Mark Miller)
          -XX:OnOutOfMemoryError=/home/solr/on_oom.sh
          -XX:+HeapDumpOnOutOfMemoryError
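The "all replicas report same number of docs?" check reduces to a comparison once the counts are in hand. In this sketch the counts are passed in as a plain dict; how they are collected (for example, by querying each core directly) is left out, and the function name is made up for illustration.

```python
def out_of_sync_shards(counts: dict[str, dict[str, int]]) -> list[str]:
    """Return the shards whose replicas disagree on document count.

    `counts` maps shard name -> {replica name: numDocs}.
    """
    return sorted(
        shard
        for shard, replicas in counts.items()
        if len(set(replicas.values())) > 1
    )


if __name__ == "__main__":
    observed = {
        "shard1": {"node1": 33_000_000, "node2": 33_000_000},
        "shard2": {"node1": 32_999_990, "node2": 33_000_004},
    }
    print(out_of_sync_shards(observed))
```

During heavy indexing, replicas can differ briefly between commits, so a monitor built on this idea would probably alert only when a mismatch persists across several checks.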
  • Node recovery
    Nodes will crash! (ephemeral znodes)
    Or, sometimes you just need to restart a JVM (rolling restarts to upgrade)
    Peer sync via update log (tlog)
    100 updates else ...
    Good ol' Solr replication from leader to replica
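The recovery rule on this slide can be read as a simple decision: replay from the update log when the replica is only slightly behind, otherwise copy the index from the leader. The sketch below encodes that reading. The function name and return labels are hypothetical, and treating exactly 100 missed updates as still eligible for peer sync is an assumption.

```python
PEER_SYNC_WINDOW = 100  # from the slide: "100 updates else ..."


def recovery_strategy(missed_updates: int, window: int = PEER_SYNC_WINDOW) -> str:
    """Pick how a returning replica catches up with its shard leader."""
    if missed_updates < 0:
        raise ValueError("missed_updates cannot be negative")
    if missed_updates == 0:
        return "none"
    if missed_updates <= window:
        return "peer_sync"  # replay recent updates from the tlog
    return "full_replication"  # copy the index from the leader


if __name__ == "__main__":
    for missed in (0, 12, 100, 5000):
        print(missed, "->", recovery_strategy(missed))
```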
  • Roadmap / Futures
      • Moving to a near real-time streaming model using Storm
      • Buying more RAM per node
      • Looking forward to shard splitting, as it has become difficult to re-index 600M docs
      • Re-building the index with DocValues
      • We've had shards get out of sync after a major failure. We resolved it by going back to the raw data, doing a key-by-key comparison of what we expected to be in the index, and re-indexing any missing docs.
      • Custom hashing to put all docs for a specific brand in the same shard
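The key-by-key comparison mentioned above amounts to a set difference between the IDs expected from the raw data and the IDs actually present in the index. A minimal sketch, assuming both ID lists fit in memory (at hundreds of millions of documents the comparison would more likely be done per shard or as a batch job); the function name is hypothetical.

```python
from collections.abc import Iterable


def missing_doc_ids(expected_ids: Iterable[str],
                    indexed_ids: Iterable[str]) -> list[str]:
    """IDs present in the raw data but absent from the index."""
    return sorted(set(expected_ids) - set(indexed_ids))


if __name__ == "__main__":
    expected = ["tweet-1", "tweet-2", "tweet-3"]
    indexed = ["tweet-1", "tweet-3"]
    print(missing_doc_ids(expected, indexed))  # candidates for re-indexing
```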
  • Obligatory lolcats slide
    If you find yourself in this situation, buy more RAM!
  • CONTACT
    Timothy Potter
    thelabdude@gmail.com
    twitter: @thelabdude