Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Intro @ analytics big data summit

A quick intro to Apache Spark

  • Be the first to comment

Spark Intro @ analytics big data summit

  1. 1. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Apache Spark : Fast and Easy Data Processing Sujee Maniyam Founder / Principal Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com
  2. 2. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Outline r  Background r  Spark Architecture r  Demo
  3. 3. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark r  Fast & Expressive Cluster computing engine r  Compatible with Hadoop r  Came out of Berkeley AMP Lab r  Now Apache project r  Version 1.2 just released (Dec 2014)
  4. 4. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Timeline Hadoop & Spark
  5. 5. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hypo-meter J
  6. 6. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Job Trends
  7. 7. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Comparison With Hadoop Hadoop Spark Distributed Storage + Distributed Compute Distributed Compute Only MapReduce framework Generalized computation Usually data on disk (HDFS) On disk / in memory Not ideal for iterative work Great at Iterative workloads (machine learning ..etc) Batch process - Upto 2x -10x faster for data on disk - Upto 100x faster for data in memory Compact code Java, Python, Scala supported Shell for ad-hoc exploration
  8. 8. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hadoop + Yarn : Universal OS for Cluster Computing HDFS YARN Batch (mapreduce) Streaming (storm, S4) In-memory (spark) Storage Cluster Management Applications
  9. 9. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Vs Hadoop r  Spark is ‘easier’ than Hadoop r  ‘friendlier’ for data scientists / analysts r  Interactive shell r fast development cycles r adhoc exploration r  API supports multiple languages r Java, Scala, Python r  Great for small (Gigs) to medium (100s of Gigs) data
  10. 10. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Vs. Hadoop Fast!
  11. 11. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Is Spark Replacing Hadoop? r  Right now, Spark runs on Hadoop / YARN r Complimentary r  Can be seen as generic MapReduce r  Spark is really great if data fits in memory (few hundred gigs), r  Spark can be used as compute platform with various storage types (see next slide)
  12. 12. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark & Pluggable Storage Spark (compute engine) HDFS Amazon S3 Cassandra ???
  13. 13. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hadoop & Spark Future ???
  14. 14. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Outline r  Background r  à Spark Architecture r  Demo
  15. 15. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Eco-System Spark Core Spark SQL Spark Streaming ML lib Schema / sql Real Time Machine Learning
  16. 16. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Core r  Distributed compute engine r  Sort / shuffle algorithms r  Handles node failures -> re-computes missing pieces
  17. 17. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark SQL r  Adhoc / interactive analytics r  ETL workflow r  Reporting r  Natively understands r JSON : {“name”: “mark”, “age”: 40} r Parquet : on disk columnar format, heavily compressed
  18. 18. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark SQL Quick Example {"name":"mark", "age":50, "sex":"M"}, {"name":"mary", "age":45, "sex":"F"}, {"name":"brian", "age":15, "sex":"M"}, {"name":"nancy", "age":17, "sex":"F"} // … setup … val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") è  [brian, nancy]
  19. 19. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Streaming r  Process data streams in real time, in micro- batches r  Low latency, high throughput (1000s events / sec) r  Stock ticks / sensor data (connected devices / IoT – Internet of Things) Streaming Sources Storage
  20. 20. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Machine Learning (ML Lib) r  Machine learning at scale r  Out of the box ML capabilities ! r  Java / Scala / Python language support r  Lots of common algorithms are supported r  Classification / Regressions r  Linear models (linear R, logistic regression, SVM) r  Decision trees r  Collaborative filtering (recommendations) r  K-Means clustering r  More to come
  21. 21. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Cluster Deployment 1)  Standalone 2)  Yarn 3)  Mesos Client Node
  22. 22. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Cluster Architecture r  Multiple ‘applications’ can run at the same time r  Driver (or ‘main’) launches an application r  Each application gets its own ‘executor’ r  Isolated (runs in different JVMs) r  Also means data can not be shared across applications r  Cluster Managers: r  multiple cluster managers are supported r  1) Standalone : simple to setup r  2) YARN : on top of Hadoop r  3) Mesos : General cluster manager (AMP lab)
  23. 23. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Data Model : RDD r  Resilient Distributed Dataset (RDD) r  Can live in r Memory (best case scenario) r Or on disk (FS, HDFS, S3 …etc) r  Each RDD is split into multiple partitions r  Partitions may live on different nodes r  Partitions can be computed in parallel on different nodes
  24. 24. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Partitions Explained 1G file 64 M 64 M 64 M 64 M Task Task Task Task Result parallelism
  25. 25. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD : Loading r  Use Spark context to load RDDs from disk / external storage val sc = new SparkContext(…) val f = sc.textFile(“/data/input1.txt”) // single file sc.textFile(“/data/”) // load all files under dir sc.textFile(“/data/*.log”) // wild card matching
  26. 26. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Operations r  Two kinds of operations on RDDs r 1) Transformations r Create a new RDD from existing ones (e.g. Map) r 2) Actions r E.g. Returns the results to clients (e.g. Reduce) r  Transformations are lazy.. Actions force transformations
  27. 27. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Transformations / Actions RDD 1 RDD 2 Client RDD 3 Transformation 1 (map) Action1 (collect)
  28. 28. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Transformations Transformation Description Example filter Filters through each record (aka grep) f.filter( line => line.contains(“ERROR”)) union Merges two RDDs rdd1.union(rdd2) …see docs …
  29. 29. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Actions Action Description Example count() Counts all records in an rdd f.count() first() Extract the first record f.first () take(n) Take first N lines f.take(10) collect() Gathers all records for RDD. All data has to fit in memory of ONE machine (don’t use for big data sets) f.collect() …. See documentation ..
  30. 30. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD : Saving r  saveAsTextFile () and saveAsSequenceFile() f.saveAsTextFile(“/output/directory”) // a directory r  Output usually is a directory r RDDs will be saved as multiple files in the dir r Each partition à one output file
  31. 31. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Caching of RDDs r  RDDs can be loaded from disk and computed r Hadoop mapreduce model r  Also RDDs can be cached in memory r  Subsequent operations are much faster f.persist() // on disk or memory f.cache() // memory only r  In memory RDDs are great for iterative workloads r Machine learning algorithms
  32. 32. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo Time !
  33. 33. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo : Spark-shell r  Invoke spark shell r  Load a data set r  Do basic operations (count / filter)
  34. 34. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo: RDD Caching r  From Spark shell r  Load an RDD r  Demonstrate the difference between cached and non-cached
  35. 35. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo : Map Reduce r  Quick Word count val input = sc.textFile(“…”) val counts = input.flatMap( line => line.split(“ “)). map(word => (word, 1)). reduceByKey(_+_)
  36. 36. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Thanks ! Sujee Maniyam sujee@elephantscale.com http://elephantscale.com Expert consulting & training in Big Data
  37. 37. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Credits r  http://spark.apache.org/ r  http://www.strategictechplanning.com r  Tuningpp.com r  Kidzworld.com

×