Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.

Presented at Lucene/Solr Revolution 2014

Published in: Software

  1. 1. Search Analytics Component Steven Bower ©2014 Bloomberg L.P.
  2. 2. Bloomberg • Largest provider of financial news and information • Our strength is delivering data, news and analytics quickly and accurately • Building high-performance, accurate information retrieval systems is core to that strength
  3. 3. Bloomberg Search Team • Search infrastructure • Develop and support search as a service platform • Support for other search applications within the company • Consultancy • Provide design consultancy/support to application teams • Promote search best practices/standardization throughout the company • Machine learning • Develop machine learning techniques to improve relevancy • Create natural language processors to answer questions • Unified search • Create information retrieval tools to organize and connect the vast and varied datasets provided to our clients
  4. 4. Our Challenge
  5. 5. Our Approach • Use Search/Solr as it provides flexible search/filtering over large, fast-moving result sets • Initially used StatsComponent, but quickly ran into limitations • Wanted to push the bounds of analytics capabilities in Solr/Lucene • Needed a pluggable framework to perform complex calculations/aggregations on numerical time-series data • DocValues provided high-performance columnar access to fields in the index (without un-inversion cost)
  6. 6. DocValues • DocValues provide high performance columnar access to fields in the index • No un-inversion cost • Increased storage footprint • Helps achieve NRT • Values live off-heap in memory map
  7. 7. Analytics Component • New component built from the ground up • Designed/implemented by the Bloomberg Search Team over the summer of 2013 • Initial implementation was built using the DocValues API directly, but moved to FieldCache • Refactored existing faceting implementation to support analytics • Created simple prefix notation for statistical expressions • Available as a Solr Contrib module in Solr 5.x or as patches for 4.8+ on SOLR-5302
  8. 8. Features • Flexible/extensible framework for adding additional statistics/faceting • Supports multiple analytics requests per query execution • Multiple statistic calculations per request • Multiple facets per request • Each request can facet statistics over different fields and ranges
  9. 9. Features - Faceting • Field Faceting • Support for int, long, float, double, date, string fields • Support for multi-value fields • Support for limit, offset and mincount • Support for sorting of stats-facets by any statistic (e.g. sort by mean) • Range faceting • Numeric types and dates • Dynamically calculate range/gap based on calculated statistics • Support for query faceting of stats • Use calculated statistics to generate facet queries
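The field-faceting options on this slide (mincount, offset, limit, and sorting facet buckets by a statistic) can be illustrated with a small in-memory sketch. This is plain Python, not the component's API: the `field_facet` helper, the `sector` and `price` field names, and the sample documents are all hypothetical, and sorting by mean in descending order is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents; "sector" and "price" are illustrative field names.
docs = [
    {"sector": "tech", "price": 10.0},
    {"sector": "tech", "price": 30.0},
    {"sector": "energy", "price": 50.0},
    {"sector": "finance", "price": 5.0},
    {"sector": "finance", "price": 7.0},
    {"sector": "finance", "price": 9.0},
]

def field_facet(docs, facet_field, value_field, mincount=1, offset=0, limit=None):
    """Group docs by facet_field, compute the mean of value_field per bucket,
    drop buckets with fewer than mincount docs, sort by mean (descending),
    then apply offset/limit paging."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[facet_field]].append(doc[value_field])
    rows = [
        {"value": key, "count": len(vals), "mean": mean(vals)}
        for key, vals in buckets.items()
        if len(vals) >= mincount
    ]
    rows.sort(key=lambda row: row["mean"], reverse=True)
    end = None if limit is None else offset + limit
    return rows[offset:end]

# Buckets with at least two docs, highest mean first, first bucket only.
top = field_facet(docs, "sector", "price", mincount=2, limit=1)
```

Here the `energy` bucket has the highest mean but is removed by `mincount=2` before sorting, so filtering happens ahead of paging in this sketch.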
  10. 10. Features – Map Operators • Basic Math • neg(<expr>) • add(<expr>,...) • mult(<expr>,...) • div(<expr>,<expr>) • pow(<expr>,<expr>) • log(<expr>,<expr>) • Constants • const_num(<number>) • const_date(<date>) • const_str(<string>) • Date Math • date_math(<date expr>,<date op>,...) • String operations • rev(<expr>) • concat(<expr>,...) • Field • <field> • Missing Values • miss(<expr>,<value>)
  11. 11. Features – Reduction Operators • Statistical • min(<expr>) • max(<expr>) • sum(<expr>) • count(<expr>) • miss(<expr>) • unique(<expr>) • Complex • sumofsquares(<expr>) • mean(<expr>) • stddev(<expr>) • median(<expr>) • percentile(<expr>)
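The split between the two operator families can be sketched as follows: a mapping operator yields one value per document, while a reduction operator collapses those per-document values into a single value for the result set. The code below is a minimal Python analogy under that reading, not the component's implementation; the helper names (`field`, `mult`, `reduce_sum`, `reduce_mean`) and the `price` and `volume` fields are hypothetical.

```python
import math
from statistics import mean

# Hypothetical documents; field names are illustrative only.
docs = [
    {"price": 10.0, "volume": 2.0},
    {"price": 20.0, "volume": 3.0},
    {"price": 40.0, "volume": 5.0},
]

# Mapping operators: each returns a function producing one value per document.
def field(name):
    return lambda doc: doc[name]

def const_num(value):
    return lambda doc: value

def mult(*exprs):
    return lambda doc: math.prod(e(doc) for e in exprs)

def neg(expr):
    return lambda doc: -expr(doc)

# Reduction operators: each consumes a mapping expression and the whole
# result set, and returns a single value.
def reduce_sum(expr, docs):
    return sum(expr(d) for d in docs)

def reduce_mean(expr, docs):
    return mean(expr(d) for d in docs)

# Analogue of sum(mult(price, volume)) in prefix notation.
notional = mult(field("price"), field("volume"))
total_notional = reduce_sum(notional, docs)
mean_price = reduce_mean(field("price"), docs)
```

Mapping expressions compose freely with each other, and a reduction is applied only once the per-document expression is complete.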
  12. 12. Examples • Weighted Average • Calculate the weighted average of field_a with field_b as the weight: div( sum( mult(field_a, field_b) ), sum(field_b) ) • Variance • Calculate the variance of field_a: pow( stddev(field_a), const_num(2) )
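As a cross-check on these two examples, the same quantities can be computed directly in Python. The sketch uses the standard definition of a weighted average (sum of value times weight, divided by the sum of the weights) and squares the standard deviation to get the variance. The sample values are made up, and using the population standard deviation is an assumption, since the slides do not say which form `stddev` computes.

```python
from statistics import pstdev

# Made-up sample values standing in for two numeric fields.
field_a = [10.0, 20.0, 40.0]   # values
field_b = [2.0, 3.0, 5.0]      # weights

# Weighted average: sum(a * b) / sum(b)
weighted_average = sum(a * b for a, b in zip(field_a, field_b)) / sum(field_b)

# Variance as the square of the standard deviation
# (population form assumed here).
variance = pstdev(field_a) ** 2
```

With these inputs the weighted sum is 280 and the weights total 10, so the weighted average is 28.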
  13. 13. Examples • T-Score • Calculate a t-score where ## is the value and all values in your sample are stored in field_a. div( add( const_num(##), neg( mean(field_a) ) ), div( stddev(field_a), pow( count(field_a), const_num(.5) ) ) )
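The t-score expression reads as (value - mean) / (stddev / sqrt(count)). A minimal Python sketch of the same arithmetic follows; the `t_score` function name and the sample data are hypothetical, and the sample (n - 1) standard deviation is an assumption, since the slides do not state which form `stddev` uses.

```python
import math
from statistics import mean, stdev

def t_score(value, sample):
    """(value - mean) / (stddev / sqrt(count)), using the sample
    standard deviation (an assumption for this sketch)."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    return (value - mean(sample)) / standard_error

# Made-up sample standing in for the values stored in field_a.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
score = t_score(6.0, sample)
```

A value equal to the sample mean gives a score of zero, and the score grows as the sample gets larger for the same deviation from the mean.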
  14. 14. How We Use It • Segment, aggregate and analyze financial data quickly • Aggregate time series data across multiple fields to render charts • Created flexible diagnostic tools/visualizations to analyze Solr performance
  15. 15. Future Plans • Multi-shard support • Pivot Facet Support • Statistics on Multi-value fields • To support unique() • Filter result set based upon calculated statistics • Generalize facet implementation
  16. 16. Links and Questions? • Analytics Component: https://issues.apache.org/jira/browse/SOLR-5302 • More About Bloomberg: http://www.bloomberglabs.com/
