
Building machine learning applications locally with Spark — Joel Pinho Lucas (Tailtarget) @PAPIs Connect — São Paulo 2017


At a time when huge amounts of heterogeneous data are available, processing them and extracting knowledge demand ever greater effort in building complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk briefly introduces a powerful machine learning library (MLlib) along with a general overview of the Spark framework, describing how to launch applications within a cluster. A demo then shows how to simulate a Spark cluster on a local machine using images available in a public Docker Hub repository. Finally, another demo shows how to save time by using unit tests to validate jobs before running them on a cluster.



  1. Building Machine Learning Applications Locally with Spark (Joel Pinho Lucas, 21/06/2017)
  2. Agenda • Problems and Motivation • Spark and MLlib overview • Launching applications in a Spark cluster • Simulating a Spark cluster using Docker • Demo: deploying a Spark cluster on a local machine • Unit tests for Spark jobs
  3. • How to set up a Spark cluster (infrastructure + configuration)? • How to test and/or debug a Spark job? • The whole team should share the same environment
  4. Run Spark Locally with Docker • Lightweight cluster • One machine • Same environment for the whole team • Easily deployed on any platform
  5. MLlib (http://spark.apache.org/mllib/) • Easy to develop (APIs in Java, Scala, Python, and R) • High-quality algorithms • Fast to run • Lazy evaluation • In-memory storage
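Lazy evaluation means Spark transformations only describe a computation; nothing runs until an action requests a result. A rough stdlib-Python analogy using generators (this is not the Spark API, just an illustration of the same idea):

```python
# Like Spark transformations, generator expressions describe work
# without performing it.
data = range(1, 6)

# "Transformations": nothing is computed yet.
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# "Action": iterating finally triggers the whole pipeline.
result = list(evens)
print(result)  # [4, 16]
```

In Spark, the equivalent chain of `map`/`filter` calls is likewise only executed when an action such as `collect()` or `count()` is invoked.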
  6. Spark Execution Model: http://spark.apache.org/docs/2.1.0/cluster-overview.html
  7. Cluster Types • Standalone • Apache Mesos • Hadoop YARN
  8. Starting a Cluster and Submitting an Application Manually
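For a standalone cluster, a manual launch looks roughly like the following (a sketch based on the Spark 2.1-era scripts; the host name, application class, and JAR path are placeholders):

```shell
# Start a standalone master; it prints the spark://HOST:7077 URL it listens on.
./sbin/start-master.sh

# Start a worker and point it at the master (host name is a placeholder).
./sbin/start-slave.sh spark://master-host:7077

# Submit an application JAR to the cluster (class and path are placeholders).
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  target/my-app.jar
```

The master also serves a web UI (port 8080 by default) where submitted applications and registered workers can be inspected.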
  9. Choose your Docker image (or build your own and share it)
  10. Some available Spark Docker images: • https://github.com/big-data-europe/docker-spark • https://hub.docker.com/r/internavenue/centos-spark/ • https://github.com/sequenceiq/docker-spark • https://github.com/epahomov/docker-spark • https://www.anchormen.nl/spark-docker/ • https://github.com/gettyimages/docker-spark • https://hub.docker.com/r/bigdatauniversity/spark/
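With any of these images, simulating a small cluster on one machine follows the same pattern: start a master container, then workers pointing at it over a shared network. A hedged sketch (image names, tags, and the master environment variable vary by image; check the chosen repository's README for the exact ones):

```shell
# Create a network so the containers can reach each other by name.
docker network create spark-net

# Start a master container (image name/tag illustrative).
docker run -d --name spark-master --network spark-net \
  -p 8080:8080 -p 7077:7077 \
  bde2020/spark-master:2.1.0-hadoop2.8-hive-java8

# Start a worker pointing at the master (variable name depends on the image).
docker run -d --name spark-worker --network spark-net \
  -e "SPARK_MASTER=spark://spark-master:7077" \
  bde2020/spark-worker:2.1.0-hadoop2.8-hive-java8
```

The master's web UI (published on port 8080 here) then shows whether the worker registered successfully.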
  11. http://github.com/joelplucas/docker-spark
  12. Example to Run • MLlib's FP-Growth algorithm • Data from the digital publishing domain • Problem: find frequent patterns in navigation profiles • Write results to MongoDB • http://github.com/joelplucas/fpgrowth-spark-example
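What FP-Growth produces is the set of itemsets whose support (fraction of transactions containing them) meets a threshold. A toy stdlib-Python sketch of that output, using naive counting rather than the FP-tree algorithm, on made-up navigation profiles (not the talk's dataset):

```python
from collections import Counter
from itertools import combinations

# Toy "navigation profiles": each transaction is a set of visited sections.
transactions = [
    {"sports", "news", "tech"},
    {"sports", "news"},
    {"sports", "tech"},
    {"news", "tech"},
]

def frequent_itemsets(transactions, min_support=0.5):
    """Naively count every sub-itemset and keep those meeting min_support.
    (MLlib's FPGrowth computes the same result far more efficiently.)"""
    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

result = frequent_itemsets(transactions, min_support=0.5)
# Every single section is frequent (support 0.75), as is every pair (0.5);
# the full triple appears only once (0.25) and is filtered out.
```

In the real job, each resulting itemset and its support would then be written to MongoDB.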
  13. The Dataset
  14. Unit Testing with Spark Testing Base • Introduced at Strata NYC 2015 by Holden Karau (and maintained by the community) • Supports unit tests in Java, Scala, and Python
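spark-testing-base provides base test classes that manage a local Spark context for you. Even without it, much of a job can be validated cheaply by factoring the per-record logic into pure functions and unit-testing those with the standard library; a minimal sketch (the `parse_event` helper is hypothetical, not from the talk's repository):

```python
import unittest

def parse_event(line):
    """Per-record logic a Spark job would apply inside rdd.map():
    'user_id,section' -> (user_id, section), or None for bad rows."""
    parts = line.strip().split(",")
    if len(parts) != 2 or not all(parts):
        return None
    user, section = parts
    return (user, section)

class ParseEventTest(unittest.TestCase):
    def test_valid_line(self):
        self.assertEqual(parse_event("u1,sports"), ("u1", "sports"))

    def test_malformed_line(self):
        self.assertIsNone(parse_event("u1"))
        self.assertIsNone(parse_event("u1,"))

if __name__ == "__main__":
    unittest.main(exit=False)
```

Bugs in this kind of logic surface in seconds locally, instead of minutes into a cluster run.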
  15. Q&A / Contact ‣ LinkedIn: http://br.linkedin.com/in/joelplucas/ ‣ Email: joelpl@gmail.com
