
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16


Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).

Building all of these pieces typically requires a big mix of batch workloads, from log processing to training machine-learned models that are used during realtime querying. These systems are highly domain-specific, but many of the techniques are fairly universal: we will discuss how Spark can interface with a SolrCloud cluster to efficiently perform many of the pieces of this puzzle in one relatively self-contained package (no HDFS/S3; all data stored in Solr!), and introduce “spark-solr”, an open-source JVM library that facilitates this.



  1. A Practical Data Science Workbench: spark-solr. Jake Mannix (@pbrane), Lead Data Engineer, Lucidworks
  2. $ whoami
     Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
     Previously:
     • Allen Institute for AI: semantic search on academic research publications
     • Twitter: account search, user interest modeling, content recommendations
     • LinkedIn: profile search, generic entity-to-entity recommender systems
     Prehistory:
     • other software companies, algebraic topology, particle cosmology
  3. Cold Start: Imagine you jumped into a new Data Lake…
  4. Cold Start
     • What is the “Minimum Viable Big Data Science Toolkit”?
     • DB? Distributed FS? NoSQL store?
     • ML libraries / frameworks (scripting? notebook? REPL?)
     • text analysis or graph libraries?
     • dataviz package?
     • hosting layer (for models and/or POC apps)?
  5. Overview
     • Spark and Solr for data engineering: Why Solr? Why Spark?
     • Example rapid-turnaround workflow: Searchhub
       • data exploration
       • clustering: unsupervised ML
       • classification: supervised ML
       • recommenders: collaborative filtering + content-based + “mixed-mode”
  6. Practical Data Science with Spark and Solr: Why does Solr need Spark? Why does Spark need Solr?
  7. Why does Spark need Solr?
     A typical Hadoop / Spark data-engineering task starts with some data on HDFS:
       $ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
       …
       -rw-r--r-- 1 jake staff 63043884 Feb 4 18:22 part-00001.lzo
       -rw-r--r-- 1 jake staff 79770856 Feb 4 18:22 part-00002.lzo
       -rw-r--r-- 1 jake staff 72108179 Feb 4 18:22 part-00003.lzo
       -rw-r--r-- 1 jake staff 12150481 Feb 4 18:22 part-00004.lzo
     Now what? What’s in these files?
  8. Solr gives you:
     • a random-access data store
     • full-text search
     • fast aggregate statistics
     • just starting out? no HDFS / S3 necessary!
     • world-class multilingual text analytics: no more tokens = str.toLowerCase().split("\\s+")
     • relevancy / ranking
     • a realtime REST service layer / web console
  9. Solr Key Features
     • Apache Lucene
     • Full-text search (information retrieval)
     • Grouping and joins
     • Facets / guided navigation galore!
     • Streaming parallel SQL
     • Lots of data types
     • Stats, expressions, transformations, and more
     • Spelling, auto-complete, highlighting
     • Language detection
     • Cursors
     • More Like This
     • De-duplication
     • Extensible
     • Massive scale / fault tolerance
  10. Why Spark for Solr?
      • spark-shell: a Big Data REPL with all your fave JVM libs!
      • Build the index in parallel very, very quickly
      • Aggregations: boosts, stats, iterative global computations
      • Offline compute to update the index with additional info (e.g. PageRank, popularity)
      • Whole-corpus analytics and ML: clustering, classification, collaborative filtering, rankers
      • General-purpose distributed computation
      • Joins with other storage (Cassandra, HDFS, DB, HBase)
  11. Why do data engineering with Solr and Spark?
      Solr:
      • data exploration and visualization
      • easy ingestion and feature selection
      • powerful ranking features
      • quick-and-dirty classification and clustering
      • simple operation and scaling
      • stats and math built in
      Spark:
      • general-purpose batch/streaming compute engine
      • whole-collection analysis!
      • fast, large-scale iterative algorithms
      • advanced machine learning: MLlib, Mahout, DeepLearning4j
      • lots of integrations with other big-data systems
      And together: http://github.com/lucidworks/spark-solr
  12. Example workflow: Searchhub
      • Free data! ASF mailing-list archives + GitHub + JIRA
      • https://github.com/lucidworks/searchhub
      • Index it into Solr
      • Explore a bit deeper: unsupervised Spark ML
      • Exploit labels: predictive analytics
      • Build a recommender; mix & match with search
  13. Searchhub: Initial Exploration
      • Initial exploration of the ASF mailing-list archives
      • Index into Solr: just turn your records into JSON
      • Facet on:
        • fields with low cardinality or with sensible ranges
        • document-size histogram
        • projects, authors, dates
      • Find: broken fields, automated content, expected data missing, errors
      • Now: load into a Spark RDD via SolrRDD:
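That load step can be sketched from a spark-shell session with spark-solr's DataFrame reader; the ZooKeeper address, collection name, and the `project_s` field below are placeholders invented for illustration, not values from the talk:

```scala
// spark-shell sketch: read a SolrCloud collection as a DataFrame.
// "zkhost" points at the ZooKeeper ensemble coordinating SolrCloud;
// both values are placeholders for your own cluster.
val options = Map(
  "zkhost"     -> "localhost:9983",
  "collection" -> "searchhub",
  "query"      -> "*:*"
)

// spark-solr registers a "solr" data source; each Solr document becomes a Row.
val mail = sqlContext.read.format("solr").options(options).load()

mail.printSchema()                        // see which fields actually came back
mail.groupBy("project_s").count().show()  // quick sanity check on a facet-like field
```

From here the same data is available for any RDD- or DataFrame-level exploration, with Solr doing the storage and retrieval.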
  14. Smarter Text Analysis in Spark
      • Try other text analyzers (no more str.split("\\W+")!)
      ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
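A sketch of what that looks like, following the referenced blog post; the `LuceneTextAnalyzer` class, its JSON analyzer-schema format, and the `analyze` signature are taken from that post and should be treated as assumptions to check against the current spark-solr API:

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer

// Analyzer schema: Lucene's standard tokenizer plus a lowercase filter,
// applied to every field (the ".+" regex matches all field names).
val analyzerSchema =
  """{ "analyzers": [{ "name": "StdTokLower",
    |                  "tokenizer": { "type": "standard" },
    |                  "filters": [{ "type": "lowercase" }] }],
    |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }""".stripMargin

val analyzer = new LuceneTextAnalyzer(analyzerSchema)

// Unicode-aware tokenization instead of a naive regex split:
val tokens = analyzer.analyze("body", "Solr + Spark: smarter search, ça marche!")
println(tokens)
```

Because the analyzer is serializable and cheap to construct, it can be used inside a Spark map function over the whole corpus.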
  15. Searchhub: Exploratory Data Science
      • Unsupervised machine learning with MLlib or Mahout:
        • cluster documents with KMeans
        • extract topics with Latent Dirichlet Allocation
        • learn word vectors with Word2Vec
      • Write the results back to Solr:
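One hedged way to run the clustering step with Spark ML: the toy in-memory documents below stand in for the corpus loaded from Solr, and the write-back is shown only as a comment because it needs a live SolrCloud cluster:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").appName("cluster-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for the mailing-list corpus loaded from Solr.
val docs = Seq(
  "solr facet query relevance ranking",
  "spark rdd shuffle partition executor",
  "solr schema analyzer tokenizer field",
  "spark mllib kmeans clustering model"
).toDF("body")

// body -> tokens -> term-frequency vectors -> KMeans
val words   = new Tokenizer().setInputCol("body").setOutputCol("words").transform(docs)
val vectors = new HashingTF().setInputCol("words").setOutputCol("features")
  .setNumFeatures(1 << 10).transform(words)
val model   = new KMeans().setK(2).setSeed(42L).fit(vectors)

val clustered = model.transform(vectors)  // adds a "prediction" (cluster id) column
clustered.select("body", "prediction").show(truncate = false)

// Hedged write-back sketch via spark-solr (requires a live collection):
// clustered.write.format("solr")
//   .options(Map("zkhost" -> "localhost:9983", "collection" -> "searchhub")).save()
spark.stop()
```

LDA and Word2Vec follow the same shape: the same token/vector columns feed `org.apache.spark.ml.clustering.LDA` or `org.apache.spark.ml.feature.Word2Vec` in place of KMeans.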
  16. Searchhub Classification: “Many Newsgroups”
      • Can also do something more like real Data Science:
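One hedged way to set up that supervised step as a Spark ML pipeline, predicting which project's list a message came from; the tiny labeled sample is fabricated for illustration (in the real archives the labels come free from the directory structure):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").appName("classify-sketch").getOrCreate()
import spark.implicits._

// (label, text) pairs: which project's mailing list did a message come from?
val mail = Seq(
  ("lucene", "analyzer tokenfilter index segment merge policy"),
  ("lucene", "query parser scoring similarity boost"),
  ("spark",  "rdd executor shuffle stage task scheduler"),
  ("spark",  "dataframe catalyst sql optimizer plan")
).toDF("project", "body")

val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("body").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1 << 10),
  new StringIndexer().setInputCol("project").setOutputCol("label"),
  new NaiveBayes()  // reads the "features" and "label" columns by default
))

val model     = pipeline.fit(mail)
val predicted = model.transform(mail)
predicted.select("project", "label", "prediction").show()
spark.stop()
```

Held-out evaluation and a stronger classifier (e.g. logistic regression) slot into the same pipeline without changing the surrounding code.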
  17. Recommender Systems with Spark and Solr
  18. Spark+Solr RecSys
      • Content-based:
        • mail thread as “item”; head messages grouped by replier as the “user” profile
        • search query of users against items to recommend
      • Collaborative filtering:
        • users replying to a head message “rate” it positively
        • train a Spark ML ALS recommender model
      • Both can generate item-item similarity models
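The ALS half of that can be sketched as follows, treating reply counts as implicit feedback; the user/thread ids and weights here are invented for the example:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").appName("als-sketch").getOrCreate()
import spark.implicits._

// Implicit "ratings": user u replied to head message i, with weight = reply count.
val replies = Seq(
  (1, 101, 3.0f), (1, 102, 1.0f),
  (2, 101, 2.0f), (2, 103, 4.0f),
  (3, 102, 5.0f), (3, 103, 1.0f)
).toDF("user", "item", "rating")

val model = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating")
  .setImplicitPrefs(true)  // reply counts are implicit feedback, not explicit ratings
  .setRank(4).setMaxIter(5).setSeed(42L)
  .fit(replies)

// Latent item vectors; nearest neighbors in this space give item-item similarity.
model.itemFactors.show(truncate = false)
spark.stop()
```

Nearest neighbors over `itemFactors` yield the CF side of the item-item similarity model; the content side comes from Solr's More Like This or term-vector similarity.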
  19. Experimenting with mixed-mode Recommenders
      • With the top-K closest items by both CF and content: store them back into a Solr collection!
      • Fetch your (or a generic user’s) recent items
      • Query with them:
        q=(cf:123^1.1 cf:39^2.3 cf:93^0.7)^alpha (ct:912^2.9 ct:123^1.8 ct:99^2.2)^(1-alpha)
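That boosted query can be assembled from the two neighbor lists with a small helper (plain Scala; the `cf`/`ct` field names and weights mirror the slide, and alpha = 0.6 is just an example blend):

```scala
// Build a mixed-mode query: CF neighbors weighted by alpha,
// content-based neighbors weighted by (1 - alpha).
def mixedModeQuery(cf: Seq[(Int, Double)], ct: Seq[(Int, Double)], alpha: Double): String = {
  def clause(field: String, items: Seq[(Int, Double)]): String =
    items.map { case (id, weight) => s"$field:$id^$weight" }.mkString(" ")
  s"(${clause("cf", cf)})^$alpha (${clause("ct", ct)})^${1 - alpha}"
}

val q = mixedModeQuery(
  cf    = Seq(123 -> 1.1, 39 -> 2.3, 93 -> 0.7),
  ct    = Seq(912 -> 2.9, 123 -> 1.8, 99 -> 2.2),
  alpha = 0.6)
println(q)
// (cf:123^1.1 cf:39^2.3 cf:93^0.7)^0.6 (ct:912^2.9 ct:123^1.8 ct:99^2.2)^0.4
```

Sweeping alpha between 0 and 1 is then a one-parameter experiment over how much the recommendations lean on collaborative versus content signals.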
  20. Resources
      • spark-solr: http://github.com/lucidworks/spark-solr
      • searchhub: http://github.com/lucidworks/searchhub
      • Company: http://www.lucidworks.com
      • Our blog: http://www.lucidworks.com/blog
      • Fusion: http://www.lucidworks.com/products/fusion
      • Twitter: @pbrane
