The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics



Presented by M.C. Srivas | MapR.

This session addresses the biggest issue facing Big Data: Search, Discovery and Analytics need to be integrated. Creating and maintaining separate Solr and Hadoop clusters is time-consuming, error-prone, and hard to keep in sync, yet most Hadoop installations do not integrate with Solr within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data Search, including how to protect against the silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search, including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and where GFS, BigTable and MapReduce were used extensively.

Published in: Technology
  • Great Slides thanks
  • On slide 11 (Sharded text indexing): it looks like only shard_count workers are doing text-index inversion. Why would it be done this way? The MapReduce job should be fed all documents. The Map stage maps the terms in a document to output lines like (file:shardid)(key:termid)(key:docid). The reducer then compresses those into one index per shard. This puts the greatest number of workers on the expensive part (the index inversion).


  1. The Search Is Over: Integrating SOLR and Hadoop to Simplify Big Data Analytics. ©MapR Technologies - Confidential
  2. Evolution of Search. Documents: models, feature selection. User Interaction: clicks, ratings/reviews, learning to rank, social graph. Content Relationships: Page Rank, etc., organization. Queries: phrases, NLP.
  3. Search, Discovery and Analytics. [Diagram: Search, Analytics, Discovery]
  4. Data is Growing Quickly: Business Analytics Requires a New Approach. Data volume growing 44x: 1.2 Zettabytes in 2010, 35.2 Zettabytes in 2020 (IDC Digital Universe Study 2011). Data is growing faster than Moore’s Law. Source: IDC Digital Universe Study, sponsored by EMC, May 2010.
  5. MapReduce: A Paradigm Shift. Distributed computing platform: large clusters, commodity hardware. Pioneered at Google: Bigtable and Google File System. Commercially available as Hadoop.
  6. Hadoop Explosion.
  7. How does Map/Reduce work? (1) Map: spread data across servers based on key/value pairs; each node independently scans local data. (2) Servers produce Map results. (3) Reduce: combine/merge Map results. (4) Process complete, or Map a new function. Like shuffling multiple decks of playing cards.
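The four steps on slide 7 can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not how Hadoop is implemented; word counting is a task chosen here for the example, and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (key, value) pair for each word in each document."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents)))


counts = word_count(["solr and hadoop", "hadoop and search"])
```

In a real cluster each phase runs in parallel on many nodes and the shuffle moves data over the network; here everything runs in one process so the structure is easy to follow.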
  8. The Cost of Enterprise Storage. SAN Storage: $2 - $10/Gigabyte; $1M gets 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec. NAS Filers: $1 - $5/Gigabyte; $1M gets 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec. Local Storage: $0.05/Gigabyte; $1M gets 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec.
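The capacity figures on slide 8 follow directly from the per-gigabyte prices. The short check below assumes decimal units (1 Petabyte = 1,000,000 Gigabytes) and shows that the SAN and NAS capacities correspond to the low end of their price ranges. The function name is hypothetical.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB


def petabytes_for_budget(budget_dollars, dollars_per_gb):
    """Capacity in petabytes that a budget buys at a given price per gigabyte."""
    return budget_dollars / dollars_per_gb / GB_PER_PB


budget = 1_000_000
san_pb = petabytes_for_budget(budget, 2.00)    # low end of the SAN range
nas_pb = petabytes_for_budget(budget, 1.00)    # low end of the NAS range
local_pb = petabytes_for_budget(budget, 0.05)  # local storage
```

At the high end of the SAN range ($10/Gigabyte), the same budget would buy only 0.1 Petabytes.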
  9. Deep Object Store: Billions and Billions of Files. For some use cases it’s not the storage capacity, it’s the number of objects: messages, attachments, images, recordings. Provides a deep storage pool that is analytic ready: store it until you need it; derive secondary value from analytic processing. Makes more sense to perform analytics on the data and send results over the network.
  10. Problems with Integrating Solr with Hadoop. Simple to integrate with Hadoop as a data source. Difficult to integrate distributed search and scale. SolrCloud simplifies sharding and replication coordination. Integration limitations based on capabilities of large-scale storage: high availability, data protection, ease of access.
  11. Sharded text Indexing. Assign documents to shards. Index text to local disk and then copy index to distributed file store. Copy to local disk typically required before index can be loaded. [Diagram: Input documents → Map → Reducer (local disk) → Clustered index storage → Local disk → Search Engine]
  12. Problems with Solr and Hadoop. Failure of a reducer causes garbage to accumulate in the local disk. Failure of search engine requires another download of the index from clustered storage. [Diagram: Input documents → Map → Reducer (local disk) → Clustered index storage → Local disk → Search Engine]
  13. Limitations of HDFS. HDFS is append only. Data access is through the HDFS API. High availability is a challenge. Single points of failure. Limited to 50-200 million files. Performance bottleneck. [Diagram: NAS appliance (A, B), NameNode, DataNodes]
  14. Logs: Flume aggregates incoming events to Solr. Requires a multi-step batch process. [Diagram: three Application Servers and a Hadoop Cluster]
  15. What’s Required for SDA? Ease of data access through open standards. Large-scale, reliable storage. Ease of integration: management (REST), security (LDAP, NIS, Linux PAM…), analytics (NFS, ODBC, HDFS). [Diagram: Search, Analytics, Discovery]
  16. Ease of Data Access. [Diagram labels: HDFS API, ENTERPRISE, NFS Access]
  17. Multiple Architectures Possible. Export to the world: NFS gateway runs on selected gateway hosts. Local server: NFS gateway runs on local host; enables local compression and checksumming. Export to self: NFS gateway runs on all data nodes, mounted from localhost.
  18. Data Access through Standard Protocols. [Diagram: a Client connected over NFS to four NFS Servers]
  19. NFS Access through a Local server. [Diagram: Client running an Application and an NFS Server, connected to Cluster Nodes]
  20. Universal export to self. [Diagram: Cluster Nodes; a Cluster Node running a Task and an NFS Server]
  21. Nodes are identical. [Diagram: three Cluster Nodes, each running a Task and an NFS Server]
  22. Simplifies Solr Hadoop Integration. Failure of a reducer is cleaned up by the map-reduce framework. Search engine reads mirrored index directly. [Diagram: Input documents → Map → Reducer → Clustered index storage → Search Engine]
  23. How Does this Integration Happen? Elegantly simple. Direct integration is a result of leveraging architectures. Data in the Hadoop cluster is written to a Volume. Solr crawler discovers content being entered into Hadoop. Accesses the data in the cluster through NFS. Builds search index. Users access Solr to find data directly in Hadoop.
  24. Distributed Shard Indexing. [Diagram: Input (doc1, doc2, doc3, …) → Map emits (shard#1, doc1), (shard#2, doc2), (shard#1, doc3), (shard#3, doc4), (shard#3, doc5) → Combine, Shuffle and sort groups shard#1,[doc3,doc1]; shard#2,[doc2]; shard#3,[doc5] → Reduce → Output index/s1, index/s2, index/s3]
  25. How Does this Work at Scale with Distributed Indices? MapReduce jobs analyze distributed, disparate data in a cluster. In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created. Mapper assigns document to shard: shard is usually hash of document id. Reducer indexes all documents for a shard: indexes created on local disk; on success, copy index to DFS. Zookeeper is used to manage Solr instances. A large Solr search is distributed across multiple shards.
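The mapper and reducer roles described on slide 25 can be sketched as follows. This is a single-process illustration under stated assumptions: SHA-256 is one possible choice for the "hash of document id", and a toy in-memory inverted index stands in for the Lucene/Solr index a real reducer would build on local disk and copy to the distributed file system. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id, shard_count):
    """Stable shard number derived from a hash of the document id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def map_documents(documents, shard_count):
    """Mapper: emit (shard, (doc_id, text)) for every input document."""
    for doc_id, text in documents.items():
        yield shard_for(doc_id, shard_count), (doc_id, text)


def build_shard_indexes(documents, shard_count):
    """Shuffle by shard, then run one reducer per shard to build its index."""
    by_shard = defaultdict(list)
    for shard, doc in map_documents(documents, shard_count):
        by_shard[shard].append(doc)

    indexes = {}
    for shard, docs in by_shard.items():
        # Reducer: toy inverted index mapping each term to the doc ids containing it.
        inverted = defaultdict(set)
        for doc_id, text in docs:
            for term in text.lower().split():
                inverted[term].add(doc_id)
        indexes[shard] = dict(inverted)
    return indexes


documents = {
    "doc1": "hadoop cluster storage",
    "doc2": "solr search index",
    "doc3": "distributed search",
    "doc4": "mirrored index",
    "doc5": "hadoop search",
}
indexes = build_shard_indexes(documents, shard_count=3)
```

Because the shard is a pure function of the document id, any mapper can process any chunk of input and still route each document to the same shard, which is why there can be many more chunks than shards.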
  26. What about HA and Data Protection? Cluster capabilities can extend to integrated Search and Discovery. Reliable Compute: automated re-replication; self-healing from HW and SW failures; load balancing; rolling upgrades; no lost jobs or data; 99999’s of uptime. Dependable Storage: business continuity with snapshots and mirrors; recover to a point in time; end-to-end checksumming; strong consistency; mirror across sites to meet Recovery Time Objectives.
  27. MapReduce failure to write the Index. Highly available JobTracker and TaskTracker ensure that any failures are recovered with state to completion. MapReduce will clean up partially written indexes. No administrator intervention required.
  28. Solr Node Fails. Other Solr nodes start serving shards that were being served by the failed node.
  29. Node Containing the Index Fails. Data is already replicated across the cluster. Zookeeper assigns the Solr instance on the replicated node to the replicated shard.
  30. Additional High Availability and Replication. Snapshots are available: administrator sets frequency at the Volume. Snapshots with automatic de-duplication: saves space by sharing blocks. Redirect on write: fast, with no performance or storage penalty; zero performance loss on writing to original. Scheduled or on-demand. Easy recovery with drag and drop.
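The idea behind "redirect on write" and "sharing blocks" on slide 30 can be illustrated with a toy in-memory volume. This is a conceptual sketch only and makes no claim about how MapR implements snapshots; the `Volume` class and its methods are hypothetical.

```python
class Volume:
    """Toy volume: a map from block number to an immutable block of bytes."""

    def __init__(self):
        self.blocks = {}     # live block map
        self.snapshots = {}  # snapshot name -> frozen copy of the block map

    def write(self, block_no, data):
        # Redirect on write: the live map is pointed at a new block object.
        # The old block is never modified, so snapshots that reference it
        # keep seeing the old contents.
        self.blocks[block_no] = bytes(data)

    def snapshot(self, name):
        # Only the block map (references) is copied, not the block contents,
        # so unchanged blocks are shared between the snapshot and live data.
        self.snapshots[name] = dict(self.blocks)

    def read(self, block_no, snapshot=None):
        table = self.blocks if snapshot is None else self.snapshots[snapshot]
        return table[block_no]


volume = Volume()
volume.write(0, b"index-v1")
volume.write(1, b"segment")
volume.snapshot("nightly")
volume.write(0, b"index-v2")  # the snapshot still sees b"index-v1"
```

Taking the snapshot costs one copy of the block map, and later writes cost the same as they would without a snapshot, which is the property the slide describes as zero performance loss on writing to the original.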
  31. Mirroring Support in Hadoop Cluster: Business Continuity and Efficiency. Efficient design: differential deltas are updated; compressed and check-summed. Easy to manage: scheduled or on-demand; WAN, remote seeding; consistent point-in-time. [Diagram: Production in Datacenter 1 mirrored over WAN to Research in Datacenter 2; Production mirrored over WAN to EC2]
  32. Simplified NFS data flows for Distributed Search. Mirroring allows exact placement of index data. Arbitrary levels of replication also possible. [Diagram: Input documents → Map → Reducer → Mirrors → Search Engines]
  33. Improving Search Relevancy Requires a Continuous Feedback Loop. The quality of the search is influenced by the end-user selections. Fully automated process that improves with use. Does not require manual tags or classification. [Diagram: Search, Analytics, Discovery]
  34. Recommendations. Often referred to as collaborative filtering. Actors interact with items: observe successful interaction. We want to suggest additional successful interactions. Observations inherently very sparse.
  35. Examples. Customers buying books (Linden et al). Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix). Internet radio listeners not skipping songs (Musicmatch). Internet video watchers watching >30 s.
  36. Examples. Query for Friends results in links to Seinfeld. Search for kittens, get results for baby otters.
  37. Dyadic Structure. Functional: Interaction: actor -> item*. Relational: Interaction ⊆ Actors x Items. Matrix: rows indexed by actor, columns by item; value is count of interactions. Predict missing observations.
  38. Fundamental Algorithmics. Co-occurrence: K = A’A, where A is actors x items and K is items x items. Product has general shape of matrix. K tells us “users who interacted with x also interacted with y”.
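Since the observations are very sparse, the co-occurrence counts on slide 38 can be computed straight from interaction histories without building a dense matrix. The sketch below assumes binary interactions (an actor either interacted with an item or did not), in which case the off-diagonal entries of K count the actors who interacted with both items. The `cooccurrence` function and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(history):
    """Count, for each ordered item pair (x, y), the actors who touched both.

    `history` maps each actor to the set of items that actor interacted with.
    Diagonal entries (an item with itself) are omitted.
    """
    k = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for x, y in permutations(sorted(items), 2):
            k[x][y] += 1
    return k


history = {
    "user1": {"a", "b"},
    "user2": {"a", "b", "c"},
    "user3": {"c"},
}
K = cooccurrence(history)
```

Reading a row of K gives the "users who interacted with x also interacted with y" list for item x; here two users interacted with both `a` and `b`, and one with both `a` and `c`.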
  39. Why not Expand it? Users enter queries (A): actor = user, item = query. Users view videos (B): actor = user, item = video. A’A gives query recommendation: “did you mean to ask for”. B’B gives video recommendation: “you might like these videos”.
  40. The punch-line. B’A recommends videos in response to a query. (Isn’t that a search engine?) (Not quite, it doesn’t look at content or meta-data.)
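The cross-recommendation product B’A from slide 40 can be worked through on a tiny example. The matrices below are made-up data for three users, two queries, and three videos; the helper names are hypothetical, and plain nested lists are used so the sketch has no dependencies.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    y_t = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_t] for row in x]


# A: rows = users, columns = queries (1 = the user entered that query)
A = [[1, 0],
     [1, 0],
     [0, 1]]

# B: rows = users, columns = videos (1 = the user viewed that video)
B = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]

# B'A: rows = videos, columns = queries. Entry (v, q) counts the users who
# both entered query q and viewed video v.
scores = matmul(transpose(B), A)


def recommend(query_index):
    """Video indices ordered by descending score for one query."""
    column = [row[query_index] for row in scores]
    return sorted(range(len(column)), key=lambda v: -column[v])
```

For query 0 the top-ranked video is video 0, because both users who entered that query viewed it. No video content or metadata is consulted, only behavior, which is the distinction the slide draws from a conventional search engine.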
  41. Real-life example. Query: “Paco de Lucia”. Conventional meta-data search results: “hombres del paco” times 400; not much else. Recommendation-based search: flamenco guitar and dancers; Spanish and classical guitar; Van Halen doing a classical/flamenco riff.
  42. Real-life example.
  43. The Search for Relevancy. Updating search to reflect relevancy: big MapReduce jobs can use behavioral traces in logs to improve results and identify importance. The power of this virtuous loop depends on frictionless data access, high availability, and performance. [Diagram: Search, Analytics, Discovery]