USE CASE DIAGNOSIS: WHEN IS SOLR REALLY THE BEST TOOL?

Michael Hausenblas
Twitter: @mhausenblas

Chief Data Engineer EMEA...
Agenda
• Solr in the Big Data ecosystem
• Polyglot Persistence
• Common (Big Data) use cases
• A checklist
• When no...
[Diagram: processing and storage layers; labels include Apache Pig and Apache ZooKeeper]
Polyglot Persistence
$ ls -al
$ tail -f some.log
$ nc localhost 80

awk 'BEGIN { FS = "," }
/2013-[[:digit:]]+-[[:digit:]]+/ { print $3 }' sam...
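
The awk one-liner on this slide sets the field separator to a comma and prints the third field of every line that contains a 2013 date. As a rough illustration of the same filter in a general-purpose language, here is a minimal Python sketch. The function name `third_fields` and the sample rows are assumptions made for illustration; they are not from the talk.

```python
import re

# Same pattern as the awk one-liner: "2013-", one or more digits, "-", one or more digits.
DATE_PATTERN = re.compile(r"2013-[0-9]+-[0-9]+")


def third_fields(lines):
    """Yield the third comma-separated field of every line containing a 2013 date.

    Like awk with FS = ",", this splits on plain commas (no CSV quoting rules)
    and yields an empty string when a matching line has fewer than three fields.
    """
    for line in lines:
        if DATE_PATTERN.search(line):
            fields = line.rstrip("\n").split(",")
            yield fields[2] if len(fields) > 2 else ""


# Hypothetical sample rows, standing in for the CSV file named on the slide.
sample_rows = [
    "a,2013-11-07,dublin\n",
    "b,2012-01-01,london\n",
    "2013-05-02,x\n",
]

for value in third_fields(sample_rows):
    print(value)
```

The point of the slide still holds: for a quick filter like this, the small command-line tool is usually the shorter route.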
Polyglot Persistence: Backdrop
• Michael Stonebraker and Ugur Çetintemel, 2005
  "One Size Fits All": An Idea Whose Time Has...
Polyglot Persistence: Key Points
• Use different datastores for different needs
• Can apply within an application or c...
Common (Big Data) use cases
Where are we coming from?
• Keyword search
• Spellcheck & autosuggest
• Ranking
• Faceted search
• Spatial search
Use case: search-based recommendation
Search-based recommendation (credit card issuer)
• Given
  –  customer purchase history
  –  merchant designations
  –  mercha...
Analyze with MapReduce
[Diagram labels: complete history; co-occurrence (Mahout); item meta-data; Solr...]
Deploy with search system
[Diagram labels: user history; web tier; item meta-data; Solr; index...]
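
The two diagrams describe an offline step (co-occurrence analysis over the complete purchase history) whose output is indexed alongside item meta-data, and an online step where a user's history serves as the query. The following is a minimal, in-memory Python sketch of that idea, not the talk's implementation. It uses raw co-occurrence counts rather than whatever Mahout computes, and it imitates the search step with a set overlap instead of calling Solr. All names and the sample histories are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_indicators(histories, top_n=1):
    """Offline step: for each item, keep the top_n items that most often
    appear in the same purchase history (raw counts; an assumption)."""
    counts = defaultdict(Counter)
    for items in histories.values():
        # sorted() keeps the iteration order deterministic across runs.
        for a, b in permutations(sorted(set(items)), 2):
            counts[a][b] += 1
    return {
        item: [other for other, _ in counter.most_common(top_n)]
        for item, counter in counts.items()
    }


def recommend(indicators, user_history, limit=3):
    """Online step: treat the user's history as a query and rank unseen items
    by how many of their indicator items the history contains."""
    seen = set(user_history)
    scores = Counter()
    for item, indicator_items in indicators.items():
        if item in seen:
            continue
        overlap = len(seen.intersection(indicator_items))
        if overlap:
            scores[item] = overlap
    return [item for item, _ in scores.most_common(limit)]


# Hypothetical purchase histories keyed by customer id.
histories = {
    "u1": ["coffee", "bakery", "books"],
    "u2": ["coffee", "bakery"],
    "u3": ["coffee", "bakery", "fuel"],
    "u4": ["books", "fuel"],
}

indicators = cooccurrence_indicators(histories)
print(recommend(indicators, ["coffee"]))
```

In a search-based deployment, the indicator lists would be stored as a field on each item document, so that the overlap scoring becomes an ordinary ranked query against the index.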
Use case: log analysis
Log analysis
• Given
  –  Receive 200,000+ log lines per second
• Goal
  –  Want to do multi-field search
  –  Want to sear...
Data Ingestion and Indexing
[Diagram labels: incoming data; Kafka; Solr; text analysis; indexer...]
Search
[Diagram labels: query; Solr search; web tier; Solr; indexer; search; raw...]
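
The ingestion diagram labels the indexes as time-sharded, with one live index shard and several older index shards. A minimal Python sketch of how such routing could work is shown below. The hourly shard width, the `logs_` naming scheme, and the function names are assumptions for illustration only, and no Solr API is called. The constant also works out how many lines an hourly shard would hold at the stated 200,000 lines per second.

```python
from datetime import datetime, timedelta, timezone

SHARD_WIDTH = timedelta(hours=1)   # assumption: one index shard per hour
LINES_PER_SECOND = 200_000         # ingest rate stated on the slide (a lower bound)
LINES_PER_SHARD = LINES_PER_SECOND * int(SHARD_WIDTH.total_seconds())


def shard_start(ts):
    """Truncate a timestamp to the start of its hourly shard."""
    return ts.replace(minute=0, second=0, microsecond=0)


def shard_name(ts):
    """Name of the shard a log line with this timestamp is indexed into."""
    return "logs_" + shard_start(ts).strftime("%Y%m%d%H")


def shards_for_range(start, end):
    """Names of all shards a query over [start, end] has to search."""
    names = []
    current = shard_start(start)
    while current <= end:
        names.append(shard_name(current))
        current += SHARD_WIDTH
    return names


# Hypothetical query window on the day of the talk.
query_start = datetime(2013, 11, 7, 9, 45, tzinfo=timezone.utc)
query_end = datetime(2013, 11, 7, 11, 5, tzinfo=timezone.utc)

print(shards_for_range(query_start, query_end))
print(LINES_PER_SHARD)
```

Only the shard covering the current hour receives writes, which keeps the frequently updated index small, while queries fan out to just the shards whose time range overlaps the request.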
A checklist
Questions you may want to ask …
• What is the volume of your data* (few GB? up to PB?)
• How are your query characteri...
Key qualifiers
• Want exploratory interface rather than aggregates in a dashboard
• Data are sparse symbol sets like ...
When not to use Solr …
Red Flags
• You need strong consistency?
• JOINS, anyone?
• ...
[Callout: remember: one size ... fit all ...]
Let’s stay in touch …
• Twitter: @mhausenblas, @MapR
[Map labels: MapR Nordics; MapR UK; MapR HQ, San Jose, ...]
2013 11-07 lsr-dublin_m_hausenblas_when solr is best
Presented by Michael Hausenblas, Chief Data Engineer, MapR Technologies

This session presents an overview of common big data use cases in the form of a set of questions that you can use to determine what kind of problem you really have. From the answers to these questions, you can quickly find out which technologies are likely to be most productive, useful, and easy to apply. This analysis also lets you discern cases where Solr is not a good fit, but where augmentation with other big data systems like HBase leads to feasible architectures. Conversely, you will see cases where Solr can be the hero by filling gaps where big data systems alone are destined to fail.

Published in: Technology

  1. USE CASE DIAGNOSIS: WHEN IS SOLR REALLY THE BEST TOOL?
     Michael Hausenblas, Twitter: @mhausenblas
     Chief Data Engineer EMEA, MapR Technologies
  2. Agenda
     • Solr in the Big Data ecosystem
     • Polyglot Persistence
     • Common (Big Data) use cases
     • A checklist
     • When not to use Solr …
  3. [Diagram: processing and storage layers; labels include Apache Pig and Apache ZooKeeper]
  4. Polyglot Persistence
  5. [Slide labels: tool box; one-size-fits-all]
     $ ls -al
     $ tail -f some.log
     $ nc localhost 80
     awk 'BEGIN { FS = "," } /2013-[[:digit:]]+-[[:digit:]]+/ { print $3 }' sample.csv
  6. Polyglot Persistence: Backdrop
     • Michael Stonebraker and Ugur Çetintemel, 2005: "One Size Fits All": An Idea Whose Time Has Come and Gone
     • Martin Fowler, 2011: Polyglot Persistence [1]
     • Eric Brewer, 2012: Ricon Keynote, Advancing Distributed Systems [2]
     [1] http://martinfowler.com/bliki/PolyglotPersistence.html
     [2] https://speakerdeck.com/eric_brewer/ricon-2012-keynote
  7. Polyglot Persistence: Key Points
     • Use different datastores for different needs
     • Can apply within an application or cross-enterprise
     • Encapsulating data access yields loosely coupled components
     • Find sweet spot between dev/op complexity and flexibility
  8. Common (Big Data) use cases
  9. Where are we coming from?
     • Keyword search
     • Spellcheck & autosuggest
     • Ranking
     • Faceted search
     • Spatial search
  10. Use case: search-based recommendation
  11. Search-based recommendation (credit card issuer)
      • Given
        –  customer purchase history
        –  merchant designations
        –  merchant special offers
      • Goal
        –  Improve existing recommender system
        –  Throughput important
  12. Analyze with MapReduce
      [Diagram labels: complete history; co-occurrence (Mahout); item meta-data; Solr indexers; indexing; index shards]
  13. Deploy with search system
      [Diagram labels: user history; web tier; item meta-data; Solr indexers; search; index shards]
  14. Use case: log analysis
  15. Log analysis
      • Given
        –  Receive 200,000+ log lines per second
      • Goal
        –  Want to do multi-field search
        –  Want log lines to be searchable with a delay of less than 30 seconds
  16. Data Ingestion and Indexing
      [Diagram labels: incoming data; Kafka; text analysis; Solr indexers; Solr indexer (real-time); raw documents; older index shards; live index shard; time-sharded Solr indexes]
  17. Search
      [Diagram labels: query; Solr search; web tier; Solr indexers; search; raw documents; older index shards; live index shard]
  18. A checklist
  19. Questions you may want to ask …
      • What is the volume of your data* (few GB? up to PB?)
      • What are your query characteristics?
        –  full scans
        –  look-ups
        –  multiple passes over large parts
        –  continuous queries
      • What’s (more) important: throughput or latency?
      *) Note: as long as Moore's law still holds, these figures obviously change on a yearly if not monthly basis.
  20. Key qualifiers
      • Want exploratory interface rather than aggregates in a dashboard
      • Data are sparse symbol sets like words or recommendation indicators
      • Small-ish return sets are OK, especially if facets are good enough
      • Near-real-time is good enough
  21. When not to use Solr …
  22. Red Flags
      • You need strong consistency?
      • JOINS, anyone?
      • Want (complex) transactions?
      • OLTP, streaming (but: near-real-time)
      • Graphs?
      Remember: one size does not fit all. Tool belt approach!
  23. Let’s stay in touch …
      • Twitter: @mhausenblas, @MapR
      • We’re hiring!
      [Map labels: MapR HQ (San Jose, US); MapR Nordics; MapR UK; MapR DACH; MapR SE & Benelux; MapR Japan; MapR Korea; MapR Hyderabad]