Spark Under the Hood - Meetup @ Data Science London

  1. Under the Hood Meetup @ Data Science London Aug 27, 2015
  2. Who are we? • Sameer Farooqui: Trainer @ Databricks; 150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc • Doug Bateman: Dir of Training @ NewCircle; Spark Trainer for Databricks; 800+ trainings on Java, Python, Android, Hibernate, Spring, etc • Jon Bates: Data Scientist; Consultant for Databricks; EdX assistant instructor on Scalable ML w/ Spark
  3. Agenda: Talks • Sameer Farooqui (15 mins): Intro & Spark Overview • Doug Bateman (25 mins): Power Plant Demo, ETL + Linear Regression • Jon Bates (25 mins): Iris Flower Demo, Model Parallel w/ scikit-learn
  4. Agenda: Q & A (30 mins), plus: • Lars Francke: Consulting Architect for Cloudera; Cluster setup, Security/Kerberos, Hive, Impala, HBase, Spark; Based in Germany • Stephane Rion: Senior Data Scientist @ Big Data Partnership + Spark Trainer for DB; R, scikit-learn, Spark, Mahout, HBase, Hive, Pig; Based in London
  5. Who are you? 1) I have used Spark hands on before… 2) I have more than 1 year hands on experience with ML…
  6. Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
  7. Spark Core, DataFrames, ML Pipelines, Spark Streaming, Spark SQL, MLlib, GraphX
  8. {JSON} Data Sources, Spark Core, DataFrames, ML Pipelines, Spark Streaming, Spark SQL, MLlib, GraphX
  9. {JSON} Data Sources, Spark Core, DataFrames, ML Pipelines, Spark Streaming, Spark SQL, MLlib, GraphX
  10. Goal: unified engine across data sources, workloads and environments
  11. Spark – 100% open source and mature. Used in production by over 500 organizations, from Fortune 100 companies to small innovators
  12. Apache Spark: Large user community [Chart: commits in the past year for MapReduce, YARN, HDFS, Storm and Spark; axis from 0 to 4000]
  13. Most active project in big data [Chart: contributors per month to Spark, 2011 to 2015; axis from 0 to 140]
  14. Large-Scale Usage • Largest cluster: 8000 nodes • Largest single job: 1 petabyte • Top streaming intake: 1 TB/hour • 2014 on-disk 100 TB sort record
  15. On-Disk Sort Record: Time to sort 100 TB • 2013 Record: Hadoop, 2100 machines, 72 minutes • 2014 Record: Spark, 207 machines, 23 minutes • Source: Daytona GraySort benchmark, sortbenchmark.org
  16. 2014: an Amazing Year for Spark • Total contributors: 150 => 500 • Lines of code: 190K => 370K • 500+ active production deployments
  17. The Databricks team contributed more than 75% of the code added to Spark in the past year
  18. Overview of ML Algorithms • Feature Transformation: Tokenizer, HashingTF, IDF, Word2Vec, Normalizer, StandardScaler • Prediction (Regression, Classification): LinearRegression, DecisionTree, SVM, LogisticRegression, NaiveBayes, DecisionTree • Recommendation: ALS • Clustering: KMeans, GaussianMixtureEM, LDA
  19. Overview of ML Algorithms, Other: • Statistics: Correlation, ChiSqTest, Statistics, MultivariateOnlineSummarizer • Linear Algebra: RowMatrix, EigenValueDecomposition, Matrix, Vector • Optimization: GradientDescent, LBFGS
  20. Spark Physical Cluster [Diagram: one Spark Driver and four Executors, each Executor running two Tasks]
  21. Spark Data Model • logLinesRDD: RDD / DataFrame with 4 partitions • Partition 1: (Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1) • Partition 2: (Info, ts, msg8), (Warn, ts, msg2), (Info, ts, msg8) • Partition 3: (Error, ts, msg3), (Info, ts, msg5), (Info, ts, msg5) • Partition 4: (Error, ts, msg4), (Warn, ts, msg9), (Error, ts, msg1)
  22. Spark Data Model • Items: item-1, item-2, item-3, item-4, item-5, item-6, item-7, item-8, item-9, item-10 • [Diagram: three executors (Ex) holding the RDD's partitions] • more partitions = more parallelism
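The two Spark Data Model slides can be illustrated with a minimal plain-Python sketch. This is not Spark code: the `partition` and `count_errors` helpers, the thread pool standing in for executors, and the numeric timestamps are all assumptions made for illustration. Only the row layout (level, ts, msg) and the four-partition split come from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log records shaped like the slide's (level, ts, msg) rows.
# The timestamps are made up; the slide only shows "ts".
log_lines = [
    ("Error", 1, "msg1"), ("Warn", 2, "msg2"), ("Error", 3, "msg1"),
    ("Info", 4, "msg8"), ("Warn", 5, "msg2"), ("Info", 6, "msg8"),
    ("Error", 7, "msg3"), ("Info", 8, "msg5"), ("Info", 9, "msg5"),
    ("Error", 10, "msg4"), ("Warn", 11, "msg9"), ("Error", 12, "msg1"),
]


def partition(records, num_partitions):
    """Split records into at most num_partitions contiguous chunks."""
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_errors(part):
    """Work done independently on one partition."""
    return sum(1 for level, _, _ in part if level == "Error")


partitions = partition(log_lines, 4)

# One worker per partition: with more partitions, more tasks can run at once.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    per_partition = list(pool.map(count_errors, partitions))

# Combining the small per-partition results gives the overall answer.
total_errors = sum(per_partition)
print(per_partition, total_errors)  # [2, 0, 1, 2] 5
```

Each worker only touches its own chunk, so the final step has just four small integers to combine rather than the full data set.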
  23. Power Plant Demo
  24. Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant. Schema Definition: • AT = Atmospheric Temperature in C • V = Exhaust Vacuum Speed • AP = Atmospheric Pressure • RH = Relative Humidity • PE = Power Output (value we are trying to predict)
  25. Steps: 1. ETL 2. Explore + Visualize Data 3. Apply Machine Learning
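The three steps above can be sketched in plain Python on a toy scale. This is an illustration under stated assumptions, not the demo's code: the five data rows are synthetic, the `column` and `fit_line` helpers are hypothetical, and the model uses a single feature (AT) fitted by ordinary least squares, whereas the demo itself is not shown in the slides.

```python
import csv
import io
import statistics

# SYNTHETIC sample rows (not the demo's data), using the slide's column names.
raw = """AT,V,AP,RH,PE
10.0,40.0,1010.0,80.0,480.0
15.0,45.0,1012.0,75.0,470.0
20.0,50.0,1014.0,70.0,460.0
25.0,55.0,1016.0,65.0,450.0
30.0,60.0,1018.0,60.0,440.0
"""

# 1. ETL: parse the text into typed records.
rows = [{k: float(v) for k, v in r.items()}
        for r in csv.DictReader(io.StringIO(raw))]


def column(name):
    """Return one column of the parsed table as a list of floats."""
    return [r[name] for r in rows]


# 2. Explore: (min, mean, max) per column as a quick sanity check.
summary = {name: (min(column(name)), statistics.mean(column(name)),
                  max(column(name)))
           for name in rows[0]}


# 3. Machine learning: ordinary least squares for PE as a function of AT.
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(column("AT"), column("PE"))
predicted_pe = intercept + slope * 22.0  # prediction for AT = 22 C
print(summary["AT"], slope, intercept, predicted_pe)
```

On this synthetic data the fit is exact because PE was constructed as a linear function of AT; real sensor readings would leave a residual error.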
  26. Iris Flower Demo
  27. Use Case: Link legacy code to Spark
  28. Different ways to parallelize ML •  Model Parallelism •  Divide & Conquer •  Data Parallelism
  29. Model Parallelism •  Model stored across workers •  Communicate data to all workers •  Examples: •  Grid search •  Cross validation •  Ensemble
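A minimal sketch of the grid search example, assuming a thread pool as a stand-in for cluster workers. The data, the `evaluate` helper, and the choice of a one-parameter ridge regression are all hypothetical. The point it shows is the communication pattern from the slide: every worker sees the same full data set, and each one fits a different candidate model.

```python
from concurrent.futures import ThreadPoolExecutor

# SYNTHETIC data, roughly y = 2x with noisy training labels.
train = [(1.0, 2.2), (2.0, 4.1), (3.0, 6.4), (4.0, 8.35)]
valid = [(5.0, 10.0), (6.0, 12.0)]


def evaluate(lam):
    """Fit y = w * x with ridge penalty lam on the full training set,
    then score the fitted w on the validation set."""
    w = sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)
    mse = sum((y - w * x) ** 2 for x, y in valid) / len(valid)
    return lam, w, mse


grid = [0.0, 1.0, 10.0, 100.0]

# Each worker gets the same data but a different hyperparameter value.
with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    results = list(pool.map(evaluate, grid))

# Pick the candidate with the lowest validation error.
best_lam, best_w, best_mse = min(results, key=lambda r: r[2])
print(best_lam, best_w, best_mse)
```

Because the candidates are independent, the only coordination needed is shipping the data out once and collecting one score per candidate.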
  30. Divide & Conquer •  Minimizes communication •  Leads to approximate solutions
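The trade-off on this slide can be sketched as follows, assuming a simple "fit locally, then average" scheme. The data and the `fit_slope` helper are hypothetical, and averaging is only one of several possible ways to combine local models. Each worker sends back a single number, and the combined result is close to the exact answer without being equal to it.

```python
# SYNTHETIC (x, y) pairs, already split across three hypothetical workers.
partitions = [
    [(1.0, 2.0), (2.0, 4.5)],
    [(3.0, 5.5), (4.0, 8.5)],
    [(5.0, 9.0), (6.0, 13.0)],
]


def fit_slope(points):
    """Least-squares slope for y = w * x (a line through the origin)."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)


# Divide: every worker fits a model on its own partition only.
local_slopes = [fit_slope(p) for p in partitions]

# Conquer: combine by averaging, so only one number per worker is communicated.
combined = sum(local_slopes) / len(local_slopes)

# Reference: the exact fit over all the data at once.
exact = fit_slope([pt for p in partitions for pt in p])

gap = abs(combined - exact)
print(local_slopes, combined, exact, gap)
```

The nonzero `gap` is the approximation error the slide refers to: the averaged slope weights every partition equally, while the exact fit weights points by their squared x values.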
  31. Data Parallelism •  Data stored across workers •  Communicate model to all workers •  Examples: •  MLlib Linear models •  Matrix outer products
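A minimal sketch of this pattern, assuming gradient descent on a one-parameter linear model. It is not MLlib code: the data, the `partial_gradient` and `train` helpers, and the step size are hypothetical. The data stays in its partitions; on every step the current model is sent to all partitions, and each returns only its partial gradient.

```python
# SYNTHETIC (x, y) pairs, stored across three hypothetical workers.
partitions = [
    [(1.0, 2.0), (2.0, 4.5)],
    [(3.0, 5.5), (4.0, 8.5)],
    [(5.0, 9.0), (6.0, 13.0)],
]


def partial_gradient(w, points):
    """Gradient of the summed squared error of y = w * x over one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in points)


def train(parts, steps=200, lr=0.005):
    """Gradient descent where only the model and gradients are communicated."""
    n = sum(len(p) for p in parts)
    w = 0.0
    for _ in range(steps):
        # "Broadcast" w; on a cluster each call would run next to its data.
        grads = [partial_gradient(w, p) for p in parts]
        # The sum of partial gradients equals the gradient over all the data.
        w -= lr * sum(grads) / n
    return w


w = train(partitions)
print(w)
```

Unlike the averaging scheme, this converges to the same solution as a single-machine fit, at the cost of one round of communication per step.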
  32. Scalability Rules • 1st Rule of thumb: Computation & Storage should be linear (in n, d) • 2nd Rule of thumb: Perform parallel and in-memory computation • 3rd Rule of thumb: Minimize Network Communication
  33. Agenda: Q & A (30 mins): Stephane Rion, Lars Francke, Sameer Farooqui, Doug Bateman, Jon Bates