Practical Machine Learning for Smarter Search with Solr and Spark

Given at a #cloudnativenerds MeetUp in Mainz, Germany on Apr 21, 2016

  1. Practical Machine Learning for Smarter Search with Solr and Spark. Jake Mannix (@pbrane), Lead Data Engineer, Lucidworks
  2. $ whoami
     Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
     Previously:
     • Allen Institute for AI: semantic search on academic research publications
     • Twitter: account search, user interest modeling, content recommendations
     • LinkedIn: profile search, generic entity-to-entity recommender systems
     Prehistory:
     • other software companies, algebraic topology, particle cosmology
  3. Overview
     • Why Spark and Solr for data engineering?
     • Quick intro to Solr
     • Quick intro to Spark
     • Example: Many NewsGroups
       • data exploration
       • clustering: unsupervised ML
       • classification: supervised ML
       • recommender: collaborative filtering + content-based
       • search ranking
  4. Practical Data Science with Spark and Solr: Why does Solr need Spark? Why does Spark need Solr?
  5. Why do data engineering with Solr and Spark?
     Solr:
     • Data exploration and visualization
     • Easy ingestion and feature selection
     • Powerful ranking features
     • Quick and dirty classification and clustering
     • Simple operation and scaling
     • Stats and math built in
     Spark:
     • General-purpose batch/streaming compute engine
     • Whole-collection analysis!
     • Fast, large-scale iterative algorithms
     • Advanced machine learning: MLlib, Mahout, Deeplearning4j
     • Lots of integrations with other big data systems
  6. Why does Spark need Solr?
     Typical Hadoop / Spark data-engineering task: start with some data on HDFS.
         $ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
         …
         -rw-r--r-- 1 jake staff 6304388 Feb 4 18:22 part-00001.lzo
         -rw-r--r-- 1 jake staff 7977085 Feb 4 18:22 part-00002.lzo
         -rw-r--r-- 1 jake staff 7210817 Feb 4 18:22 part-00003.lzo
         -rw-r--r-- 1 jake staff 1215048 Feb 4 18:22 part-00004.lzo
     Now what? What's in these files?
  7. Solr gives you:
     • random-access data store
     • full-text search
     • fast aggregate statistics
     • just starting out: no HDFS / S3 necessary!
     • world-class multilingual text analytics:
       • no more: tokens = str.toLowerCase().split("\\s+")
     • relevancy / ranking
     • realtime HTTP service layer
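To make the tokenization point concrete, here is a minimal sketch in plain Python (standard library only, not Solr's analyzers). It compares the whitespace-split one-liner with a slightly more careful tokenizer. The function names and sample text are hypothetical, and a real Lucene/Solr analyzer chain does considerably more than this.

```python
import re
import unicodedata

def naive_tokens(text):
    # The approach the slide warns against: lowercase, then split on whitespace.
    # Punctuation stays glued to the words.
    return text.lower().split()

def better_tokens(text):
    # Still a toy: strip accents, lowercase, keep runs of word characters.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", no_accents.lower())

sample = "Café search: fast, multilingual!"
print(naive_tokens(sample))   # punctuation and accents survive
print(better_tokens(sample))  # cleaner terms for indexing and matching
```

With the naive version, a query for "search" would not match the token "search:", which is the kind of problem a proper analyzer avoids.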
  8. Solr Key Features
     • Full-text search (information retrieval)
     • Facets / guided navigation galore!
     • Lots of data types
     • Spelling, auto-complete, highlighting
     • Cursors
     • More Like This
     • De-duplication
     • Apache Lucene
     • Grouping and joins
     • Stats, expressions, transformations and more
     • Language detection
     • Extensible
     • Massive scale / fault tolerance
  9. Why Spark for Solr?
     • Spark-shell: a Big Data REPL with all your fave JVM libs!
     • Build the index in parallel very, very quickly!
     • Aggregations
     • Boosts, stats, iterative global computations
     • Offline compute to update the index with additional info (e.g. PageRank, popularity)
     • Whole-corpus analytics and ML: clustering, classification, CF, rankers
     • General-purpose distributed computation
     • Joins with other storage (Cassandra, HDFS, DB, HBase)
  10. Spark Key Features
     • General-purpose, high-powered cluster computing system
     • Modern, faster alternative to MapReduce
       • 3x faster with 10x less hardware on Terasort
     • Great for iterative algorithms
     • APIs for Java, Scala, Python and R
     • Rich set of add-on libraries for machine learning, graph processing, and integrations with SQL and other systems
     • Deploys: Standalone, Hadoop YARN, Mesos, AWS, Docker, …
  11. Example: Many NewsGroups
     • Initial exploration of ASF mailing-list archives
     • Index it into Solr
     • Explore a bit deeper: unsupervised Spark ML
     • Exploit labels: predictive analytics
  12. Many NewsGroups: Initial Exploration
     • Initial exploration of ASF mailing-list archives
     • index into Solr: just need to turn your records into JSON
     • facet:
       • fields with low cardinality or with sensible ranges
       • document size histogram
       • projects, authors, dates
     • find: broken fields, automated content, expected data missing, errors
     • now: load into a Spark RDD via SolrRDD
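The faceting step above can be sketched as a plain HTTP query against Solr's select handler. This is a minimal illustration using only the Python standard library to build the URL. The parameters `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters, while the host, the collection name `mail`, and the field names `project` and `author` are assumptions made for the example.

```python
from urllib.parse import urlencode

def facet_query_url(base_url, collection, fields, query="*:*"):
    # rows=0: we only want the facet counts, not the matching documents.
    params = [("q", query), ("rows", 0), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field to count values for.
    params += [("facet.field", field) for field in fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical host, collection and field names.
url = facet_query_url("http://localhost:8983/solr", "mail", ["project", "author"])
print(url)
```

Fetching that URL returns per-value counts for each field, which is usually enough to spot broken fields, automated senders, or missing data before any Spark job is written.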
  13. Many NewsGroups: Initial Exploration
     • cleanup/filtering via Spark DataFrame operations
     • create thread groups
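The code from this slide is not part of the transcript. As a rough idea of what "create thread groups" involves, here is a sketch in plain Python rather than Spark DataFrames: each message is assigned to the thread whose root it reaches by following reply links. The field names `id` and `in_reply_to` are assumptions, not the actual schema used in the talk.

```python
from collections import defaultdict

def thread_root(msg_id, parent_of):
    # Follow in-reply-to links until a message with no known parent is reached.
    # The seen set guards against malformed data that contains reply cycles.
    seen = set()
    while parent_of.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)
        msg_id = parent_of[msg_id]
    return msg_id

def group_threads(messages):
    parent_of = {m["id"]: m.get("in_reply_to") for m in messages}
    threads = defaultdict(list)
    for m in messages:
        threads[thread_root(m["id"], parent_of)].append(m["id"])
    return dict(threads)

messages = [
    {"id": "a", "in_reply_to": None},
    {"id": "b", "in_reply_to": "a"},
    {"id": "c", "in_reply_to": "b"},
    {"id": "d", "in_reply_to": None},
]
print(group_threads(messages))
```

If a reply points at a message missing from the archive, the missing id simply becomes the thread key, so orphaned replies to the same message still end up grouped together.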
  14. Many NewsGroups: Initial Exploration
     • try other text analyzers (no more str.split("\\W+")!)
     ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
  15. Many NewsGroups: Exploratory Data Science
     • Unsupervised machine learning:
       • clustering documents with KMeans
       • extract topics with Latent Dirichlet Allocation
       • learn word vectors with Word2Vec
  16. Many NewsGroups: Exploratory Data Science
     • Vectorize and run KMeans
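The Spark code shown on this slide is not in the transcript. The sketch below is a plain-Python stand-in for the same two steps: turn each document into a term-count vector, then run Lloyd's k-means iterations. It is not Spark ML's KMeans; the starting centroids are fixed by hand for reproducibility, and the four toy documents are invented for illustration.

```python
from collections import Counter

def vectorize(doc, vocab):
    # Term-count vector over a fixed, ordered vocabulary.
    counts = Counter(doc.lower().split())
    return [float(counts[term]) for term in vocab]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, centroids, iterations=10):
    assignments = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid unchanged
        centroids = updated
    return assignments, centroids

docs = ["solr index query", "solr query facet", "spark rdd job", "spark job cluster"]
vocab = sorted({term for d in docs for term in d.split()})
vectors = [vectorize(d, vocab) for d in docs]
assignments, _ = kmeans(vectors, centroids=[vectors[0], vectors[2]])
print(assignments)
```

On real mail archives the vectors would normally be TF-IDF weighted and sparse, which is where a distributed implementation becomes worthwhile.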
  17. Many NewsGroups: Exploratory Data Science
     • Build topic models with LDA
  18. Many NewsGroups: Exploratory Data Science
     • Build word vector representations with Word2Vec
  19. Many NewsGroups: Supervised Learning
     • Now for some real Data Science
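The transcript does not say which classifier the talk used. As one possible illustration of supervised learning on labeled mail, here is a small multinomial Naive Bayes classifier in plain Python. The choice of Naive Bayes, the use of the mailing-list name as the label, and the toy training messages are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    class_counts = Counter()            # number of documents per label
    term_counts = defaultdict(Counter)  # term frequencies per label
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict(model, text):
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in text.lower().split():
            if tok in vocab:  # terms never seen in training are ignored
                # Laplace (add-one) smoothed log likelihood.
                score += math.log((term_counts[label][tok] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: message text paired with its mailing list.
training = [
    ("solr facet query slow", "lucene-solr-user"),
    ("solr index replication", "lucene-solr-user"),
    ("spark rdd shuffle slow", "spark-user"),
    ("spark executor memory", "spark-user"),
]
model = train_naive_bayes(training)
print(predict(model, "facet query on solr"))
```

In practice the labels would be held out for a train/test split so that accuracy can be measured before trusting the model.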
  20. Many NewsGroups: Next steps?
     • What else could you do?
       • Try other classification algorithms, cross-validate to pick!
       • Recommender systems
         • content-based:
           • mail thread as "item", head messages grouped by replier as "user" profile
           • search query of users against items to recommend
         • collaborative filtering:
           • users replying to a head message "rate" it positively
           • train a Spark ML ALS RecSys model
       • Train search rankers on click logs
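The collaborative-filtering idea above treats a reply as an implicit positive rating. The following plain-Python sketch shows only the data preparation step: turning reply events into (user, item, rating) triples with integer ids, since Spark ML's ALS expects numeric user and item ids. The names, the sample data, and the fixed rating of 1.0 are assumptions for illustration, and no ALS model is trained here.

```python
def index_ids(values):
    # Map arbitrary string ids to dense integer ids in a stable order.
    return {value: i for i, value in enumerate(sorted(set(values)))}

def implicit_ratings(replies):
    # replies: iterable of (replier, head_message_id) pairs.
    # Repeated replies to the same thread collapse into one positive rating.
    pairs = set(replies)
    user_ids = index_ids(user for user, _ in pairs)
    item_ids = index_ids(item for _, item in pairs)
    triples = sorted((user_ids[u], item_ids[i], 1.0) for u, i in pairs)
    return triples, user_ids, item_ids

# Hypothetical reply log.
replies = [("alice", "t1"), ("bob", "t1"), ("alice", "t1"), ("alice", "t2")]
triples, user_ids, item_ids = implicit_ratings(replies)
print(triples)
```

A variation would be to keep the reply count as the rating strength instead of collapsing it to 1.0, which suits ALS variants that model implicit feedback.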
  21. Resources
     • spark-solr: https://github.com/Lucidworks/spark-solr
     • Company: http://www.lucidworks.com
     • Our blog: http://www.lucidworks.com/blog
     • Apache Solr: http://lucene.apache.org/solr
     • Apache Spark: http://spark.apache.org
     • Fusion: http://www.lucidworks.com/products/fusion
     • Twitter: @pbrane
