
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017

Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.

Abstract summary

Apache Mahout: Distributed Matrix Math for Machine Learning:
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.

We will cover some history of Mahout; introduce the R-like Scala DSL; give an overview of how Mahout operates on matrices distributed across multiple computers, and how it takes advantage of the GPUs on each computer in a cluster to create a hybrid distributed/GPU-accelerated environment; then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration that allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and finally explain how to develop algorithms and submit them to the Mahout project for other users to use.

  1. 1. Apache Mahout Distributed Matrix Math for Machine Learning
  2. 2. About Me • Senior Director of Data Science at Lucidworks (Apache Solr/Lucene, Fusion search tools) • Formerly Chief Data Scientist, Technical Lead of Data Science Practice at Accenture • Committer and PMC Member, Apache Mahout • On Twitter @akm • Email at akm@apache.org, andrew.musselman@lucidworks.com • Adversarial Learning podcast with @joelgrus at http://adversariallearning.com
  3. 3. Apache Mahout Recent Trends in 0.12/0.13 • Simplify and improve performance of distributed matrix-math programming • Provide flexible computation options for software and hardware • Enable easier and quicker new algorithm development • Allow polyglot programming and plotting in notebooks via Apache Zeppelin
  4. 4. Introduction to Apache Mahout Apache Mahout is an environment for creating scalable, performant, machine-learning applications Apache Mahout provides: • Mathematically expressive Scala DSL • A collection of pre-canned math and statistics algorithms • Interchangeable distributed engines • Interchangeable native solvers (JVM, CPU, GPU, CUDA, or custom)
  5. 5. Feature Highlights in Recent Releases • v 0.13.1, Soon — CUDA Solvers, Apache Spark 2.1/Scala 2.11 support • New web site platform, May 2017 — Moved from the ASF CMS system to Markdown and Jekyll; allows documentation pull requests to be merged and published automatically • v 0.13.0, Apr 2017 — GPU/CPU Solvers, algorithm framework • v 0.12.2, Nov 2016 — Apache Zeppelin integration for notebooks and visualization • v 0.12.0, Apr 2016 — Apache Flink backend support • New Mahout book, Feb 2016 — ‘Apache Mahout: Beyond MapReduce’ by Dmitriy Lyubimov and Andrew Palumbo • v 0.10.0, Apr 2015 — Mahout-Samsara vector-math DSL, MapReduce jobs soft-deprecated, Spark backend support
  6. 6. Topic Overview • Mahout-Samsara: Declarative, R-like, domain-specific language (DSL) for matrix math • Backend-agnostic programming • Apache Zeppelin notebooks • Algorithm development framework (modeled after sk-learn) • Solve on available CPU cores, single or multiple GPUs, or in the JVM • Next steps, and how to get involved
  7. 7. Mahout-Samsara
  8. 8. Mahout-Samsara MapReduce is dead; long live the little clip-art blue man!
  9. 9. Mahout-Samsara • Mahout-Samsara is an easy-to-use domain-specific language (DSL) for large-scale machine learning on distributed systems like Apache Spark and Flink • Uses Scala as programming/scripting environment • Algebraic expression optimizer for distributed linear algebra • Provides a translation layer to distributed engines • Support for Spark and Flink DataSets, RDDs • System-agnostic, R-like DSL; actual formula from (d)spca: val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
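As a minimal sketch of the R-like feel of the DSL on in-core matrices (package and method names follow the 0.13-era Scala bindings; treat the details as illustrative rather than a verbatim Mahout example):

```scala
// Minimal sketch of the Samsara R-like DSL on in-core matrices.
// Imports follow the 0.13-era scalabindings packages; adjust to your Mahout version.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

object SamsaraDslSketch extends App {
  // Build a 3 x 2 dense matrix row by row, much like R's rbind()
  val a = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

  // Transpose-times-self written the way a statistician would: A'A
  val c = a.t %*% a

  // Scalar and element-wise operations compose in the same algebraic style
  val d = (c + 1.0) * 0.5

  println(d)
}
```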
  10. 10. Mahout-Samsara Example of an algebraic optimization • Computation of A’A: val C = A.t %*% A • Naïve execution: 1st pass transposes A (requires repartitioning of A); 2nd pass multiplies the result with A (expensive, potentially requires repartitioning again) • Logical optimization: the optimizer rewrites the plan to use a logical Transpose-Times-Self operator • Single pass: multiply partitioned rows by themselves as transposed columns • Result: Mahout-Samsara computes C = A’A via a row-outer-product formulation, executing in a single pass over row-partitioned A
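The identity behind the Transpose-Times-Self rewrite is the standard row-outer-product expansion (plain linear algebra, stated here for clarity rather than taken from the slides):

```latex
% A'A as a sum of row outer products: each row partition contributes a partial
% sum, so the product needs one pass over A and no transpose/repartition shuffle.
A^{\top}A = \sum_{i=1}^{m} a_i\, a_i^{\top},
\qquad a_i \in \mathbb{R}^{n} \ \text{the $i$-th row of $A$, written as a column vector}
```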
  11. 11.–14. Mahout-Samsara (diagram build-up): C = A’A computed via the row-outer-product formulation, executing in a single pass over row-partitioned A (example of an algebraic optimization)
  15. 15.–16. Backend-Agnostic Programming
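A hedged sketch of what backend-agnostic Samsara code can look like when the Spark bindings are plugged in; the `mahoutSparkContext` and `drmParallelize` names follow the 0.13-era API, and swapping in the Flink bindings would change only the context setup, not the algebra:

```scala
// Sketch: the same Samsara expression on a distributed row matrix (DRM),
// with the engine chosen purely through the implicit distributed context.
// Names follow the 0.13-era Spark bindings; treat them as illustrative.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object BackendAgnosticSketch extends App {
  // Engine selection: a Spark-backed distributed context (local mode here)
  implicit val ctx = mahoutSparkContext(masterUrl = "local[*]", appName = "samsara-demo")

  // Distribute an in-core matrix as a DRM
  val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)), numPartitions = 2)

  // Identical algebra to the in-core version; the optimizer plans the
  // single-pass transpose-times-self shown on the previous slides
  val drmC = drmA.t %*% drmA

  // Materialize the (small) result back on the driver
  println(drmC.collect)
}
```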
  17. 17. Apache Zeppelin Notebooks
  18. 18. Apache Zeppelin Notebooks • Notebooks for polyglot programming with all types of data • Plotting with R and Python off of computed data from other tools in the same notebook • Share variables between interpreters • For more: https://zeppelin.apache.org • Mahout interpreter for Zeppelin released June 2016 • Post by Trevor Grant on how to use it at https://rawkintrevo.org/2016/05/19/visualizing-apache-mahout-in-r-via-apache-zeppelin-incubating • https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/mahout-in-zeppelin/
  19. 19. Apache Zeppelin Notebooks Add the Mahout Interpreter
  20. 20. Apache Zeppelin Notebooks Add the Mahout Interpreter, click “Create”
  21. 21. Apache Zeppelin Notebooks Example usage
  22. 22. Apache Zeppelin Notebooks Example usage
  23. 23. Apache Zeppelin Notebooks Hand results to R for plotting
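A rough sketch of the notebook flow: the `%sparkMahout` interpreter name and the implicit distributed context come from the Mahout-in-Zeppelin tutorial linked above, and the `z.put`/`z.get` calls assume the standard ZeppelinContext is available in that paragraph; check the tutorial for the exact hand-off details.

```scala
// Zeppelin paragraph sketch (would sit under a %sparkMahout paragraph header).
// The Mahout interpreter setup is assumed to provide the implicit distributed
// context, so drmParallelize works without further wiring. Names are illustrative.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)))
val cooc = (drmA.t %*% drmA).collect

// Hand the in-core result to other interpreters through Zeppelin's resource pool;
// a following %spark.r (or %r) paragraph can z.get("cooc") and plot it.
z.put("cooc", cooc)
```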
  24. 24. Algorithm Development Framework
  25. 25. Algorithm Development Framework • Patterned after R and Python (sk-learn) APIs • Fitter populates a Model, which contains the parameter estimates, fit statistics, a summary, and has a predict() method • https://rawkintrevo.org/2017/05/02/introducing-pre-canned-algorithms-apache-mahout • https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/contributing-algos
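A short sketch of the fitter/model pattern using the ordinary-least-squares regressor from the 0.13 algorithm framework; class and method names follow the blog post and docs linked above, and the toy data and hyperparameter-free call are illustrative assumptions:

```scala
// Sketch of the fitter -> model pattern in the Mahout algorithms framework.
// Names follow the 0.13-era org.apache.mahout.math.algorithms packages.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
import org.apache.mahout.sparkbindings._

object AlgoFrameworkSketch extends App {
  implicit val ctx = mahoutSparkContext(masterUrl = "local[*]", appName = "ols-sketch")

  // Toy data: a features DRM and a single-column target DRM
  val drmX = drmParallelize(dense((1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)))
  val drmY = drmParallelize(dense((7.0, 6.0, 15.0, 14.0)).t)

  // The fitter populates a Model with parameter estimates and fit statistics
  val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)
  println(model.summary)

  // The model's predict() scores a DRM of features
  val drmYHat = model.predict(drmX)
  println(drmYHat.collect)
}
```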
  26. 26. Solve on CPU, GPU, or JVM
  27. 27. Solve on CPU, GPU, or JVM Current architecture with native CPU and GPU support and unreleased jCUDA bindings
  28. 28. Solve on CPU, GPU, or JVM Initial benchmarking on latest release
  29. 29. Solve on CPU, GPU, or JVM Initial benchmarking on latest release • Sparse MMul at geometry 1000 x 1000 %*% 1000 x 1000, density = 0.2, 5 runs: Mahout JVM sparse multiplication time 1501 ms, Mahout jCUDA sparse multiplication time 49 ms (~30x speedup) • Same geometry, density = 0.02, 5 runs: Mahout JVM 34 ms, Mahout jCUDA 4 ms (~8.5x speedup) • Same geometry, density = 0.002, 5 runs: Mahout JVM 1 ms, Mahout jCUDA 1 ms (no speedup)
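For context on what the JVM-side numbers measure, a rough, illustrative harness for a comparable in-core sparse multiply is sketched below; this is not the benchmark code behind the slide's figures, and it exercises only the JVM path, not the jCUDA solver:

```scala
// Rough sketch: time an in-core sparse matrix multiply on the JVM path only.
// The geometry and density mirror the slide; the harness itself is illustrative.
import scala.util.Random
import org.apache.mahout.math.{Matrix, SparseRowMatrix}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

object SparseMMulTiming extends App {
  def randomSparse(rows: Int, cols: Int, density: Double, seed: Long): Matrix = {
    val m = new SparseRowMatrix(rows, cols)
    val rnd = new Random(seed)
    for (i <- 0 until rows; j <- 0 until cols if rnd.nextDouble() < density)
      m.setQuick(i, j, rnd.nextDouble())
    m
  }

  val a = randomSparse(1000, 1000, density = 0.2, seed = 1)
  val b = randomSparse(1000, 1000, density = 0.2, seed = 2)

  val t0 = System.currentTimeMillis()
  val c = a %*% b // in-core multiply; solver selection happens inside Mahout
  println(s"JVM sparse multiply took ${System.currentTimeMillis() - t0} ms")
}
```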
  30. 30. Solve on CPU, GPU, or JVM Next steps • jCUDA work is still in a branch and will be merged to master in the next couple of months • Current modes of compute are JVM, CPU (using all available cores), and single GPU • Multi-GPU support is the next priority • Currently multiplication takes place in different solvers based on matrix shape (banding, triangularity, etc.) • Directing where data lives and where compute happens based on shape and density is another priority • Watch this space for other speedups
  31. 31. How to Use Mahout and Get Involved
  32. 32. How to Use Mahout and Get Involved Web: https://mahout.apache.org Source code, PRs welcome: https://github.com/apache/mahout Mailing lists: https://mahout.apache.org/community/mailing-lists.html Download, install, embed: https://mahout.apache.org/downloads.html
  33. 33. Thank You Q&A https://mahout.apache.org https://github.com/apache/mahout
