Search + Big Data: It's (still) All About the User - Grant Ingersoll


Apache Hadoop has rapidly become the framework of choice for enterprises that need to store, process and manage large data sets. It helps companies derive more value from existing data and collect new data, including unstructured data from server logs, social media channels, call center systems and other sources that present new opportunities for analysis. This keynote will provide insight into how Apache Hadoop is being used today and how it is evolving to become a key component of tomorrow's enterprise data architecture. It will also provide a view into the important intersection between Apache Hadoop and search.

Published in: Technology

  1. Search + Big Data: It’s (still) All About the User. Grant Ingersoll, Chief Scientist, Lucid Imagination. October 19, 2011
  2. Promise and Reality
     • “Data is increasingly digital air: the oxygen we breathe and the carbon dioxide that we exhale. It can be a source of both sustenance and pollution.” (Six Provocations for Big Data, by Danah Boyd and Kate Crawford)
     • “The truth is, I spend most of my time trying to reduce the size of my data so it can be analyzed.” (Hilary Mason, Chief Scientist, Bitly @ Strata)
  3. Pragmatism
  4. Evolution
     • Documents: Models; Feature Selection
     • Content Relationships: Page Rank, etc.; Organization
     • User Interaction: Clicks; Ratings/Reviews; Learning to Rank; Social Graph
     • Queries: Phrases; NLP
  5. Minding the Intersection: Search, Analytics, Discovery
  6. Benefits
     • End users: better relevance/conversion; serendipity; better/faster insight
     • Business: ROI; awareness across organization; enablement; agility
  7. Needs
     • Fast, efficient, scalable search
     • Large scale, cost effective storage
     • Processing power: large scale distributed for whole data consumption; streaming/in memory for real time needs; ability to learn
     • Willingness to ask questions
  8. The Good News
  9. Search
     • Good, scalable search is a given (Talks: Chitouras, Sturlese, Binns, Miller)
     • Custom relevancy via function queries, boosts
     • Explore other relevance models (Talks: Muir, Pugh); Lucene/Solr trunk has pluggable scoring (BM25, etc.)
     • NRT for timeliness (Talks: Busch)
  10. Discovery
     • Facets (Talks: Yonik): Classification, Taxonomy
     • Clustering (Talk: Frank S.)
     • Suggestions: Auto-suggest, Spelling, More Like This, Related Searches, search trails
     • Visualization
  11. Analytics
  12. Analytics for End Users
     • Offline: Popularity/Click; Link Analysis; Social/Personal; Spellchecking weights; Collocations
     • Online: Trends/Stats; Search Trails; Recommendations; Location (Storm)
  13. Analytics for Internal Users (Offline and Online)
     • Top X
     • Trends
     • Zero results
     • MRR, MAP
     • Operational alerts (QPS, DPS, etc.)
     • User segmentation
     • Location, conversions
     • Ad hoc analysis
     • Giraph
  14. What’s Missing?
     • The glue is up to you (us?): Lucene Index -> Pig/Others; Mahout -> Pig/Others; Mahout -> Lucene/Solr; Logs -> Pig/Others
     • Nice to have: more in-index functionality (that performs), such as aggregations, arbitrary stats, complex joins
  15. What’s Next?
     • “I can have all the data I want to have – but I still have to communicate it to our players. It has to get into their minds. And they have to utilize it.” (Brad Stevens, Head Basketball Coach, Butler University, in Oct. ’11 McKinsey Quarterly)
  16. Thanks! @gsingers
  17. Lucene Ecosystem: Spark, Storm, Giraph
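
The slide on analytics for internal users names MRR and MAP as relevance metrics. As a rough illustration only, the sketch below computes both from ranked result lists and relevance judgments. It is not taken from the talk: the function names, the query strings and the document IDs are all hypothetical, and the judgments are assumed to be simple binary sets of relevant document IDs per query.

```python
from typing import Callable, Dict, List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average the precision measured at each rank where a relevant result appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    # Dividing by the number of relevant documents penalizes missed ones.
    return precision_sum / len(relevant)


def mean_over_queries(
    metric: Callable[[List[str], Set[str]], float],
    runs: Dict[str, List[str]],
    judgments: Dict[str, Set[str]],
) -> float:
    """Average a per-query metric over all queries (gives MRR or MAP)."""
    if not runs:
        return 0.0
    total = sum(metric(ranked, judgments.get(query, set()))
                for query, ranked in runs.items())
    return total / len(runs)


# Hypothetical data: ranked results per query and the relevant documents.
runs = {
    "hadoop search": ["d3", "d1", "d7"],
    "solr facets": ["d2", "d9", "d4"],
}
judgments = {
    "hadoop search": {"d1", "d7"},
    "solr facets": {"d2"},
}

mrr = mean_over_queries(reciprocal_rank, runs, judgments)
map_score = mean_over_queries(average_precision, runs, judgments)
print(f"MRR={mrr:.3f} MAP={map_score:.3f}")
```

In practice the ranked lists would come from query logs or a test harness against the search engine, and the judgments from click data or manual ratings; both sources are assumptions here rather than anything the slides specify.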