
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr

"Integrating Hadoop and Solr" - Yann Yu, Lucidworks

Published in: Technology

Transcript of "SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr"

  1. Who am I?
     Yann Yu, Systems Engineer @ Lucidworks
  2. Lucidworks is Search.
     Technology, Retail, Financial Services, Industrial, Healthcare
  3. Why would you integrate Hadoop and Solr? (and how would you do that?)
  4. Hadoop:
     • Open-source
     • Enterprise support
     • Cheap, scalable storage
     • Distributed computation
     • Farm animals for extensibility
     Solr:
     • Open-source, Lucene based
     • Enterprise support
     • Real-time queries
     • Full-text search
     • NoSQL capabilities
     • Repeatedly proven in production environments at massive scales
  5. I have Hadoop, why do I need Solr?
     Hadoop excels at storing and working with large amounts of data, but has difficulty with frequent, random access to it.
     • NoSQL front-end to Hadoop: enable fast, ad-hoc search across structured and unstructured big data
     • Empower users of all technical ability to interact with, and derive value from, big data, all through a natural language search interface (no MapReduce, Pig, SQL, etc.)
     • Preliminary data exploration and analysis
     • Near real-time indexing and querying
     • Thousands of simultaneous, parallel requests
     • Share machine-learning insights created on Hadoop with a broad audience through an interactive medium
  6. I have Solr, why do I need Hadoop?
     As Solr indexes grow in size, the size and number of the machines hosting Solr must also grow, increasing index time and complexity.
     • Least expensive storage solution on the market
     • Leverage Hadoop processing power (MapReduce) to build indexes or send document updates to Solr
     • Store Solr indexes and transaction logs within HDFS
     • Augment Solr data by storing additional information for last-second retrieval in Hadoop
  7. So what does this actually look like?
  8. The enterprise storage situation today ⚒
  9. Enterprise data deployment: standard document storage and search
     • The Lucidworks HDFS connector processes documents and sends them to SolrCloud
     • Enterprise documents are stored in HDFS
     • Users make ad-hoc, full-text queries across the full content of all documents in Solr
     • Users retrieve source files directly from HDFS as necessary
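
The "retrieve source files directly from HDFS" step can be sketched as follows. This is a minimal illustration, not the Lucidworks connector's behaviour: it assumes each Solr document stores the HDFS path of its source file in a field (the name `hdfs_path` is hypothetical) and that the cluster exposes the WebHDFS REST API. The host name and port are placeholders.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(namenode, hdfs_path, user=None):
    """Build a WebHDFS URL that reads the file stored at hdfs_path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute HDFS path")
    params = {"op": "OPEN"}
    if user:
        # Optional user name for clusters using simple authentication.
        params["user.name"] = user
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


# A search hit as it might come back from Solr (field names are assumptions).
solr_hit = {"id": "doc-42", "hdfs_path": "/data/docs/report 2014.pdf"}
url = webhdfs_open_url("namenode.example.com:50070", solr_hit["hdfs_path"])
print(url)
```

Keeping only the path in the index, rather than the file itself, matches the slide's point that users go to HDFS only when they need the original document.
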
  10. Sink documents into HDFS
      • Documents can be migrated from other file storage systems via Flume or other scripts
      • MapReduce allows for batch processing of documents (e.g. OCR, NER, clustering)
  11. Index document contents into Solr
      • The Lucidworks Hadoop connector parses content from files using many different tools
      • Tika, GrokIngest, CSV mapping, Pig, etc.
      • Content and data are added to fields in a Solr document
      • The resulting document is sent to Solr for indexing
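
The last two steps above (fill fields in a Solr document, then send it for indexing) might look roughly like this when done by hand against Solr's JSON update handler. This is a sketch, not the connector's implementation: the collection name `documents`, the field names, and the local Solr URL are all assumptions, and the request is built but not sent.

```python
import json
import urllib.request


def build_update_request(solr_base, collection, docs, commit=False):
    """Build (but do not send) a POST request for Solr's JSON update handler."""
    url = f"{solr_base}/{collection}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Parsed content mapped onto (hypothetical) schema fields.
doc = {
    "id": "/data/docs/report.pdf",
    "title": "Quarterly report",
    "content": "text extracted from the file",
}
req = build_update_request("http://localhost:8983/solr", "documents", [doc], commit=True)
# urllib.request.urlopen(req) would submit the document to a running Solr.
```
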
  12. Enable users to search and access content
      • Users are empowered with ad-hoc, full-text search in Solr
      • Solr provides standard search tools such as autocomplete, more-like-this, spellchecking, and faceting
      • Users only access HDFS as needed
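
As an illustration of the search side, the sketch below builds a full-text query with field faceting against Solr's standard `/select` handler. The collection name `documents` and the facet field `author` are assumptions made for the example.

```python
from urllib.parse import urlencode


def build_select_url(solr_base, collection, text, facet_fields=(), rows=10):
    """Build a Solr /select URL for a full-text query with optional field facets."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{solr_base}/{collection}/select?{urlencode(params)}"


search_url = build_select_url(
    "http://localhost:8983/solr", "documents", "quarterly report", facet_fields=["author"]
)
print(search_url)
```
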
  13. Log record search: high volume indexing of many small records
      • Machine generated log records are sent to Flume
      • Flume forwards the raw log records to Hadoop for archiving
      • Flume simultaneously parses the data in each record into a Solr document and forwards the resulting document to Solr
      • Lucidworks SiLK exposes real-time statistics and analytics to end-users, as well as full-text search
  14. Flume archives data in HDFS
      • Flume performs minimal work on log files and sends them directly into HDFS for archival
      • Under optimal circumstances, the log files are sized to match the HDFS block size
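
To make the block-size point concrete, here is a small sketch of rolling log records into files that each fit within one HDFS block. It is not Flume's code (Flume's HDFS sink has its own roll settings); the 128 MB default is an assumption that corresponds to a common `dfs.blocksize` value, so check the cluster's actual setting.

```python
def roll_into_files(records, block_size=128 * 1024 * 1024):
    """Group log records into batches whose encoded size fits in one HDFS block."""
    batches, current, current_bytes = [], [], 0
    for record in records:
        size = len(record.encode("utf-8")) + 1  # +1 for the trailing newline
        # Start a new file when this record would push the current one past a block.
        if current and current_bytes + size > block_size:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(record)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Tiny block size so the rollover is visible: each record takes 5 bytes.
files = roll_into_files(["aaaa", "bbbb", "cccc"], block_size=10)
print(files)  # [['aaaa', 'bbbb'], ['cccc']]
```

Files close to the block size avoid a large number of small files, which are costly for the HDFS NameNode to track.
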
  15. Flume submits records to Solr
      • Flume processes records, extracting strings, ints, dates, times, and other information into Solr fields
      • Once the Solr document is created, it is submitted to Solr for indexing
      • This process happens in real-time, allowing for near real-time search
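
The field extraction described above can be sketched in plain Python. This is not how Flume itself is configured; it only illustrates turning one log record into typed Solr fields. The access-log layout and the field names (using `_s`, `_i`, `_l`, `_dt` dynamic-field suffixes from Solr's example schemas) are assumptions.

```python
import re
from datetime import datetime, timezone

# Assumed record layout: an Apache-style access log line.
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def log_line_to_solr_doc(line):
    """Parse one log line into a dict of typed Solr fields, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)
    return {
        "client_ip_s": m["client_ip"],
        "method_s": m["method"],
        "path_s": m["path"],
        "status_i": int(m["status"]),
        "bytes_l": 0 if m["bytes"] == "-" else int(m["bytes"]),
        # Solr date fields expect UTC timestamps in ISO 8601 form.
        "timestamp_dt": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


line = '127.0.0.1 - - [15/Jul/2014:10:30:00 -0700] "GET /index.html HTTP/1.1" 200 2326'
parsed = log_line_to_solr_doc(line)
print(parsed)
```
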
  16. Real-time analytics dashboard
      • Lucidworks SiLK allows users to create simple dashboards through a GUI
      • The Banana dashboard will issue queries to Solr, rendering the received data in tables, graphs, and other plots
      • Users can also perform full-text search across the data, allowing for extremely fine granularity
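
A dashboard panel such as a time histogram ultimately needs counts per time bucket. The sketch below shows the kind of Solr range-facet request that could supply them; it is not taken from Banana's source, and the collection name `logs` and field name `timestamp_dt` are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_histogram_url(solr_base, collection, field, start="NOW-1HOUR",
                        end="NOW", gap="+1MINUTE", query="*:*"):
    """Build a Solr range-facet request that counts matching records per time bucket."""
    params = [
        ("q", query),
        ("rows", 0),  # only the facet counts are needed, not the documents
        ("wt", "json"),
        ("facet", "true"),
        ("facet.range", field),
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
    ]
    return f"{solr_base}/{collection}/select?{urlencode(params)}"


histogram_url = build_histogram_url("http://localhost:8983/solr", "logs", "timestamp_dt")
# Decode the query string again to show the parameters Solr would receive.
decoded = parse_qs(urlparse(histogram_url).query)
print(decoded["facet.range.gap"])
```

Passing a full-text expression as `query` narrows the same histogram to matching records, which is the fine-grained drill-down the last bullet describes.
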
  17. End
      Any questions?
      Find me at: yann.yu@lucidworks.com, @yawnyou