Solr for Analytics

Aggregating Metrics with Solr at Sematext

Published in: Technology, Education

  1. Hi :)
  2. "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..." - Dan Ariely, Duke University
  3. Solr for Analytics: Metrics Aggregations at Sematext. Otis Gospodnetić @otisg
  4. OG - Pro … since circa 2000 … since 2007
  5. Needle Addiction
  6. Watermelon Addiction
  7. Bread Love
  8. Nuts!
  9. Addiction Manifestation #1: Search Analytics (Gathering ✓, Search ✓, Analytics ✓)
  10. Addiction Manifestation #2: SPM (Gathering ✓, Search ✓, Analytics ✓)
  11. Addiction Manifestation #3: Logsene (Gathering ✓, Search ✓, Analytics ✓)
  12. The Surface
  13. Under the Hood
  14. ?
  15. Under the Hood
  16. Stripped
      <dimensionList name="solr_caches_metrics_cube" formula="system*(month+week+day+hour+five_minutes+minute)*solr_cache_name*host*solr_core_name"/>
  17. Schema
      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
      <field name="baseRowKey" type="binary" indexed="false" stored="true" required="true" multiValued="false"/>
      <field name="dimListId" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
      <field name="metricsData" type="binary" indexed="false" stored="true" required="true" multiValued="false"/>
      <dynamicField name="dim_str_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true"/>
      <dynamicField name="dim_num_*" type="tlong" indexed="true" stored="false" multiValued="true" docValues="true"/>
      <dimensionList name="solr_caches_metrics_cube" formula="system*(month+week+day+hour+five_minutes+minute)*solr_cache_name*host*solr_core_name"/>
  18. Append on Write; Compact Later?
  19. Compaction… maybe?
  20. github.com / sematext / HBaseHUT. Ideally
  21. Merge on Read
  22. Merge on Read
  23. Metric Retrieval
      params={
      shard.keys=2349_201!&
      collection=partialPreaggCollection&wt=javabin&qt=select&
      &start=0&
      q=dimListId:201&
      fq=dim_num_6:[1383570000000+TO+1383591600000]&
      fq=dim_str_0:MjM0OQ%3D%3D&
      fq=dim_str_18:(MDgwYTllNzgtNTZkMy00ODhlLWE2NGUtZmM2ODE5NGNkNGNk)&
      version=2&
      rows=99999}
      hits=74 status=0 QTime=41
  24. Write Friendly Present: SOLR-4816 - CloudSolrServer; SOLR-5232 - Streaming updates between nodes; SOLR-3249 - Javabin between nodes; SOLR-???? - Aggregating URP
  25. 4.4 vs. 4.5: 4.4 - no routing vs. 4.5 routing via CloudSolrServer. Faster AFS flushes. Faster indexing: 3.5 M vs. 4 M docs in 55 min. Lower heap: circa 10%
  26. DocValues
      <field name="id" type="string" indexed="true" stored="true" required="true" docValues="true"/>
      <dynamicField name="dim_str_*" type="string" indexed="true" stored="false" multiValued="true" docValues="true"/>
      <dynamicField name="dim_num_*" type="tlong" indexed="true" stored="false" multiValued="true" docValues="true"/>
      Forward index data structure … in memory and/or on disk … compressible … off-heap
      See https://cwiki.apache.org/confluence/display/solr/DocValues and http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
  27. !DocValues vs. DocValues. With DocValues: … smaller heap … more numerous, but smaller GCs … roughly equal query latency … faster faceting, sorting, and grouping (but we are not using that in SPM with Solr) … didn't see a slowdown in indexing (hmmm)
  28. !DocValues vs. DocValues
  29. Analytics Friendly Future: LUCENE-5189 - Numeric DocValues Updates; SOLR-2894 - Distributed Pivot Facets; SOLR-5302 - Analytics Component; SOLR-4465 - Configurable Collectors; SOLR-5045 - Pluggable Analytics; SOLR-5027 - Field Collapsing PostFilter
  30. FIN. Questions? sematext.com/about/jobs. 20% off: LR2013SPM20. /spm /logsene /search-analytics. otis@sematext.com @otisg @sematext
  31. Filter Collection
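
The dimension list formula on slide 16 is not explained in the deck. One plausible reading, sketched below in Python, is that `*` joins dimensions that are recorded together and the parenthesized `+` group contributes one time granularity per aggregate. That reading, and the `expand_formula` helper itself, are assumptions made for illustration, not Sematext's implementation.

```python
from itertools import product

# Formula copied from slide 16.
FORMULA = ("system*(month+week+day+hour+five_minutes+minute)"
           "*solr_cache_name*host*solr_core_name")


def expand_formula(formula):
    """Expand a dimension-list formula into concrete dimension tuples.

    Assumed semantics (not confirmed by the slides): '*' joins dimensions
    that appear together, and a parenthesized 'a+b+c' group contributes
    exactly one of its alternatives to each tuple.
    """
    factors = []
    # Splitting on '*' is enough here because no group contains a '*'.
    for factor in formula.split("*"):
        factor = factor.strip()
        if factor.startswith("(") and factor.endswith(")"):
            factors.append([alt.strip() for alt in factor[1:-1].split("+")])
        else:
            factors.append([factor])
    return [tuple(combo) for combo in product(*factors)]


if __name__ == "__main__":
    for combo in expand_formula(FORMULA):
        print(" x ".join(combo))
```

Under this reading the formula yields six aggregate shapes, one per time granularity, each keyed by system, cache name, host, and core name.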

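Slides 18 to 22 move from "append on write, compact later" to "merge on read". As a rough illustration of the idea only: each write appends a partial aggregate, and a read folds together all partials that share a key. The `Partial` record, its fields, and the key format below are hypothetical; the deck stores metrics in an opaque binary `metricsData` field whose layout is not shown.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Hypothetical partial aggregate for one key (dimensions + time bucket)."""
    key: str
    count: int
    total: float
    minimum: float
    maximum: float


def merge_on_read(partials):
    """Fold partial aggregates that share a key into one aggregate per key."""
    merged = {}
    for p in partials:
        current = merged.get(p.key)
        if current is None:
            # Copy so the caller's records are never mutated.
            merged[p.key] = Partial(p.key, p.count, p.total, p.minimum, p.maximum)
        else:
            current.count += p.count
            current.total += p.total
            current.minimum = min(current.minimum, p.minimum)
            current.maximum = max(current.maximum, p.maximum)
    return merged


if __name__ == "__main__":
    appended = [
        Partial("host1|minute|0", 2, 10.0, 3.0, 7.0),
        Partial("host1|minute|0", 1, 4.0, 4.0, 4.0),
        Partial("host2|minute|0", 1, 1.0, 1.0, 1.0),
    ]
    for key, agg in merge_on_read(appended).items():
        print(key, agg.count, agg.total, agg.minimum, agg.maximum)
```

The trade-off in such a design is that writes stay cheap (no read-modify-write), while every read pays for the merge until something compacts the partials.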
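
The filter values in the slide 23 query are easier to read once decoded. The Python sketch below assumes the `dim_str_*` values are URL-quoted Base64 text and that the `dim_num_6` bounds are epoch milliseconds; both assumptions are inferred from the values themselves, not stated in the deck. The helper names are made up for this example.

```python
import base64
from datetime import datetime, timezone
from urllib.parse import unquote


def decode_dim(value):
    """URL-unquote, then Base64-decode, a dim_str_* filter value into text."""
    return base64.b64decode(unquote(value)).decode("utf-8")


def millis_to_utc(ms):
    """Interpret a number as epoch milliseconds (an assumption) in UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # Values copied from the slide 23 log line.
    print(decode_dim("MjM0OQ%3D%3D"))
    print(decode_dim("MDgwYTllNzgtNTZkMy00ODhlLWE2NGUtZmM2ODE5NGNkNGNk"))
    start = millis_to_utc(1383570000000)
    end = millis_to_utc(1383591600000)
    print(start.isoformat(), "to", end.isoformat(), "span:", end - start)
```

Decoded this way, `dim_str_0` reads `2349`, which matches the prefix of `shard.keys=2349_201!`, and the `201` suffix matches `q=dimListId:201`. That suggests, though the slides do not say so, that documents are routed by a system id plus dimension list id, so a metric query touches a single shard.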