SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr



"Integrating Hadoop and Solr" - Yann Yu, Lucidworks

Presentation Transcript

  • Who am I? Yann Yu, Systems Engineer @ Lucidworks
  • Lucidworks is Search. Technology, Retail, Financial Services, Industrial, Healthcare
  • Why would you integrate Hadoop and Solr? (and how would you do that?)
  • Hadoop:
    • Open-source
    • Enterprise support
    • Cheap, scalable storage
    • Distributed computation
    • Farm animals for extensibility
  • Solr:
    • Open-source, Lucene-based
    • Enterprise support
    • Real-time queries
    • Full-text search
    • NoSQL capabilities
    • Repeatedly proven in production environments at massive scale
  • I have Hadoop, why do I need Solr?
    • NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data
    • Empower users of all technical abilities to interact with, and derive value from, big data, all through a natural-language search interface (no MapReduce, Pig, SQL, etc.)
    • Preliminary data exploration and analysis
    • Near real-time indexing and querying
    • Thousands of simultaneous, parallel requests
    • Share machine-learning insights created on Hadoop with a broad audience through an interactive medium
    Hadoop excels at storing and working with large amounts of data, but has difficulty with frequent, random access to it.
  • I have Solr, why do I need Hadoop?
    • Least expensive storage solution on the market
    • Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr
    • Store Solr indexes and transaction logs within HDFS
    • Augment Solr data by storing additional information in Hadoop for last-second retrieval
    As Solr indexes grow in size, the size and number of the machines hosting Solr must also grow, increasing index time and complexity.
  • So what does this actually look like?
  • The enterprise storage situation today ⚒
  • Enterprise data deployment: standard document storage and search
    • Enterprise documents are stored in HDFS.
    • The Lucidworks HDFS connector processes documents and sends them to SolrCloud.
    • Users make ad-hoc, full-text queries across the full content of all documents in Solr.
    • Users retrieve source files directly from HDFS as necessary.
  • Sink documents into HDFS
    • Documents can be migrated from other file storage systems via Flume or other scripts
    • MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering, etc.)
  • Index document contents into Solr
    • The Lucidworks Hadoop connector parses content from files using many different tools (Tika, GrokIngest, CSV mapping, Pig, etc.)
    • Content and data are added to fields in a Solr document
    • The resulting document is sent to Solr for indexing
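
The CSV-mapping step above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Lucidworks connector's implementation: the column-to-field mapping, the sample rows, and the `_t`/`_s`/`_i` field suffixes (borrowed from the dynamic-field convention in Solr's example schemas) are all assumptions.

```python
import csv
import io
import json

# Hypothetical mapping from CSV column names to Solr field names.
# The suffixes follow the dynamic-field convention of Solr's example
# schemas; a real schema may use different names.
FIELD_MAP = {"id": "id", "title": "title_t", "author": "author_s", "pages": "pages_i"}


def csv_to_solr_docs(csv_text):
    """Turn each CSV row into a dict shaped like a Solr document."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {}
        for column, field in FIELD_MAP.items():
            value = row.get(column)
            if value is None or value == "":
                continue  # skip missing columns and empty cells
            # Integer fields are converted; everything else stays a string.
            doc[field] = int(value) if field.endswith("_i") else value
        docs.append(doc)
    return docs


SAMPLE = "id,title,author,pages\n1,Quarterly report,A. Smith,12\n2,Design notes,,7\n"
docs = csv_to_solr_docs(SAMPLE)
payload = json.dumps(docs)  # a JSON array of documents, as Solr's JSON update handler accepts
print(payload)
```

The resulting JSON array is the kind of body that could be POSTed to a collection's update handler; sending it is left out so the sketch runs without a Solr instance.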
  • Enable users to search and access content
    • Users are empowered with ad-hoc, full-text search in Solr
    • Provides standard search tools such as autocomplete, more-like-this, spellchecking, faceting, etc.
    • Users only access HDFS as needed
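
As a rough illustration of an ad-hoc, faceted search, the sketch below builds a request URL for Solr's select handler using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host, collection name, and field names are hypothetical, and nothing is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical host and collection name.
SOLR_BASE = "http://localhost:8983/solr/documents"


def build_search_url(text, facet_fields=(), rows=10):
    """Build a select URL for a full-text query with optional field facets."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{SOLR_BASE}/select?{urlencode(params)}"


url = build_search_url("quarterly report", facet_fields=["author_s", "filetype_s"])
query = parse_qs(urlsplit(url).query)  # decode again to inspect the parameters
print(url)
```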
  • Log record search: high-volume indexing of many small records
    • Machine-generated log records are sent to Flume.
    • Flume forwards the raw log record to Hadoop for archiving.
    • Flume simultaneously parses the data in each record into a Solr document and forwards the resulting document to Solr.
    • Lucidworks SiLK exposes real-time statistics and analytics to end users, as well as full-text search.
  • Flume archives data in HDFS
    • Flume performs minimal work on log files and sends them directly into HDFS for archival
    • Under optimal circumstances, the log files are sized to the block size of HDFS
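
To make the block-size point concrete, here is a small sketch of rolling records into files that stay within one block. It is not Flume's code: the `roll_files` helper is hypothetical, and the 128 MB figure is an assumed block size (the real value comes from the cluster's `dfs.blocksize` setting).

```python
# Assumed block size; a real cluster defines this via dfs.blocksize.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024


def roll_files(records, max_bytes=HDFS_BLOCK_SIZE):
    """Group byte-string records into files no larger than max_bytes.

    A single record larger than max_bytes still gets a file of its own.
    """
    files, current, size = [], [], 0
    for record in records:
        # Start a new file when the next record would overflow the block.
        if current and size + len(record) > max_bytes:
            files.append(current)
            current, size = [], 0
        current.append(record)
        size += len(record)
    if current:
        files.append(current)
    return files


# Tiny limit so the behaviour is easy to see.
demo = roll_files([b"a" * 40, b"b" * 40, b"c" * 40], max_bytes=100)
print([sum(len(r) for r in f) for f in demo])
```

Files that fill a block without spilling into the next one avoid leaving many small files or nearly empty trailing blocks in HDFS.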
  • Flume submits records to Solr
    • Flume processes records, extracting strings, ints, dates, times, and other information into Solr fields
    • Once the Solr document is created, it is submitted to Solr for indexing
    • This process happens in real time, allowing for near real-time search
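
The field extraction described above might look roughly like the following. This is a sketch in Python rather than an actual Flume configuration; the log layout, the regular expression, and the field names (again using dynamic-field suffixes from Solr's example schemas) are assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed log layout: "<ISO timestamp> <LEVEL> <host> <status> <message>"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<host>\S+) (?P<status>\d{3}) (?P<msg>.*)"
)
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # the UTC date format Solr date fields expect


def log_line_to_doc(line, line_id):
    """Parse one log line into a Solr-style document, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.strptime(match["ts"], TS_FORMAT)  # validates the timestamp
    except ValueError:
        return None
    return {
        "id": line_id,
        "timestamp_dt": ts.strftime(TS_FORMAT),
        "level_s": match["level"],
        "host_s": match["host"],
        "status_i": int(match["status"]),
        "message_t": match["msg"],
    }


doc = log_line_to_doc("2014-07-15T18:30:00Z ERROR web01 500 upstream timed out", "log-1")
print(doc)
```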
  • Real-time analytics dashboard
    • Lucidworks SiLK allows users to create simple dashboards through a GUI
    • The Banana dashboard will issue queries to Solr, rendering the received data in tables, graphs, and other plots
    • Users can also perform full-text search across the data, allowing for extremely fine granularity
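
A dashboard panel such as a time histogram boils down to a facet query against Solr. The sketch below shows the kind of request such a panel might issue, using Solr's standard range-facet parameters; it is not taken from Banana's source, and the field name and time window are assumptions.

```python
from urllib.parse import urlencode


def build_histogram_params(query="*:*", field="timestamp_dt",
                           start="NOW-1HOUR", end="NOW", gap="+1MINUTE"):
    """Parameters for counting matching records per time bucket."""
    return {
        "q": query,
        "rows": 0,  # only the facet counts are needed, not the documents
        "wt": "json",
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
    }


params = build_histogram_params("level_s:ERROR")
# urlencode escapes the "+" in the gap, which would otherwise decode as a space.
query_string = urlencode(params)
print(query_string)
```

Narrowing `query` to a full-text expression is what lets the same panel drill down to very fine-grained slices of the log data.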
  • Any questions? Find me at: yann.yu@lucidworks.com, @yawnyou