
Apache Spark Introduction @ University College London


Spark presentation at University College London.



  1. Intro to Apache Spark Training – University College London, September 8th, 2014. Suhas Gogate: Architect, Pivotal Hadoop Engineering. Shivram Mani: Lead Engineer, Pivotal Hadoop Engineering.
  2. About Me: Suhas (https://www.linkedin.com/in/vgogate) • Since 2008, active in Hadoop infrastructure and ecosystem components – Worked with leading Hadoop technology companies – Yahoo, Netflix, Hortonworks, EMC-Greenplum/Pivotal • Founder and PMC member/committer of the Apache Ambari project • Contributed Apache "Hadoop Vaidya" – performance diagnostics for M/R • Prior to Hadoop, – IBM Almaden Research (2000–2008), CS software & storage systems – In the early days (1993) of my career, worked with the team that built the first Indian supercomputer, PARAM (a Transputer-based MPP system), at the Centre for Development of Advanced Computing (C-DAC, Pune)
  3. About Me: Shivram (https://www.linkedin.com/in/shivrammani) • Since 2009, active user of Hadoop – Yahoo, EMC-Greenplum/Pivotal • Built the Pivotal Command Center (cluster configuration/management) • Lead developer for – Pivotal Extension Framework – Unified Storage System • Prior to Hadoop, – Yahoo Web Search Federation – Yahoo Vertical Search Relevance
  4. Abstract: Apache Spark is one of the most exciting and talked-about ASF projects today, but how should enterprise architects view it, and what type of impact might it have on our platforms? This talk introduces Spark and its core concepts, the ecosystem of services on top of it, the types of problems it can solve, similarities to and differences from Hadoop, deployment topologies, and possible uses in the enterprise. Concepts are illustrated with a variety of demos covering: the programming model, the development experience, "realistic" infrastructure simulation with local virtual deployments, and Spark cluster monitoring tools.
  5. Day 1 (Sept 8th, 2014) – Agenda • What is Spark? – What does it have to do with Big Data/Hadoop? • Spark Programming Model • Spark Internals – Execution, Shuffles, Tasks, Stages • Spark Deployment Models • Demo & Hands-on Exercise • Q/A
  6. What Is Spark?
  7. What is Spark? • A distributed compute engine for analysis of large data sets, like Hadoop M/R – Inspired by deficiencies in Hadoop M/R batch processing ▪ data replication, serialization, disk I/O, etc. – Effectively uses distributed cluster memory for faster computations • A common framework primarily designed for the following types of workloads: – Iterative graph-processing algorithms (Google Pregel) – Iterative machine-learning algorithms like PageRank, K-means clustering, logistic regression, etc. (HaLoop) – Interactive data mining – run multiple ad-hoc queries on the same data set – Along with "batch" workloads like Hadoop M/R, on data in memory • An implementation of the Resilient Distributed Dataset (RDD) in Scala • Similar scalability and fault tolerance to Hadoop Map/Reduce – Although it uses a different fault-tolerance model: lineage to reconstitute data rather than replication • Programmatic interface via API or interactive shell – Scala, Java 7/8, Python
  8. Spark is also … • Came out of the AMPLab project at UC Berkeley • An ASF top-level project – https://spark.apache.org/ – https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel • An active community of ~100-200 contributors across 25-35 companies – More active than Hadoop MapReduce – 1,000 people (the maximum) attended Spark Summit – http://spark-summit.org • Hadoop compatible
  9. Spark is not … • An OLTP data store • A "permanent" data store • Or an application cache • It's also not as mature as Hadoop – this is a good thing: lots of room to grow.
  10. Spark is not Hadoop, but is compatible • Often better than Hadoop – M/R is fine for "data parallel" work, but awkward for some workloads – low-latency dispatch, iterative, streaming • Natively accesses Hadoop data – data locality • Spark is just another YARN job – utilizes current investments in Hadoop – brings Spark to the data • It's not OR … it's AND!
  11. Improvements over Map/Reduce • Efficiency – General execution graphs (not just map->reduce->store) – In memory • Usability – Rich APIs in Scala, Java, Python – Interactive • Can Spark be the R for Big Data?
  12. Short History • 2009: Started as a research project at UC Berkeley • 2010: Open sourced • January 2011: AMPLab created • October 2012: 0.6 – Java, standalone cluster, Maven • June 21, 2013: Spark accepted into the ASF Incubator • Feb 27, 2014: Spark becomes a top-level ASF project • May 30, 2014: Spark 1.0 • August 5, 2014: Spark 1.0.2
  13. Spark Programming Model – RDDs in Detail
  14. Spark Program Model (Scala) • A driver program runs the user's main function and executes a set of parallel operations (transformations and actions) on a collection of elements called a Resilient Distributed Dataset (RDD):
      val conf = new SparkConf().setAppName(appName).setMaster(master)
      val sc = new SparkContext(conf)
      val file = sc.textFile("hdfs://.../logfile")         // RDD
      val err_lines = file.filter(_.contains("ERROR"))     // transformation -> RDD
      val err_msgs = err_lines.map(MyFunc.extractMessage)  // transformation -> RDD
      err_msgs.cache()
      err_msgs.saveAsTextFile("hdfs://.../errlogfile")     // action
      err_msgs.count()                                     // action
  15. Resilient Distributed Dataset (RDD) • A new data type provided by the Spark framework as an extension to – Scala, Java, Python • A read-only collection of records partitioned across cluster nodes – Does not support insert/update/delete of records in an RDD • Created by – Reading file(s) from HDFS – Parallelizing existing collections (lists, arrays, maps, etc.) – Executing transformations on existing RDDs • RDDs can be persisted with the following options – In memory – serialized or non-serialized objects (optionally replicated) – On disk – serialized (optionally replicated) – In an in-memory file system like Tachyon • RDDs store lineage information – Supports coarse-grained recovery of a whole partition upon node failure
  16. Two Categories of Operations on RDDs • Transformations – Create an RDD from stable storage (HDFS, Tachyon, etc.) – Generate an RDD from another RDD (map, filter, groupBy) – Lazy operations that build a DAG of tasks – Once Spark knows your transformations it can build a plan • Actions – Return a result or write to storage (count, collect, save, etc.) – Actions cause the DAG to execute (like Apache Pig); a minimal sketch follows
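     A minimal Scala sketch of this laziness (the path and predicates are illustrative, not from the slides): the transformations only build the DAG, and nothing runs until the action.
       // Transformations: only describe the computation, no data is read yet
       val lines  = sc.textFile("hdfs://.../logfile")
       val errors = lines.filter(_.contains("ERROR"))
       // Action: triggers planning and execution of the whole DAG
       val total  = errors.count()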
  17. Transformations and Actions • Transformations – map – filter – flatMap – sample – groupByKey – reduceByKey – union – join – sort • Actions – count – collect – reduce – lookup – save
  18. Spark Shared Variables (between Tasks & Driver) • Broadcast variables – A read-only variable cached once on each node (not shipped with each task) – Multiple tasks running on that node can refer to it – The broadcast variable should be used in the program after it is created – The original variable should not be modified after broadcast
      val broadcastVar = sc.broadcast(Array(1, 2, 3))
    • Accumulators – Accumulator variables are like counters in M/R (i.e. tasks can add to them) – Tasks cannot read the value of an accumulator; only the driver program can read it
      scala> val accum = sc.accumulator(0)
      scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
      scala> accum.value
    • Local variables of the transformation function are shipped along with the function to each node – They cannot be accessed by the driver program
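     A small end-to-end sketch of both shared-variable types (names and data are illustrative): a broadcast value is read inside tasks via .value, while the accumulator is only added to by tasks and only read by the driver.
       val lookup     = sc.broadcast(Map("a" -> 1, "b" -> 2))   // cached once per node
       val badRecords = sc.accumulator(0)                        // tasks may only add to it
       val resolved = sc.parallelize(Seq("a", "b", "c")).map { k =>
         if (!lookup.value.contains(k)) badRecords += 1          // task-side add
         lookup.value.getOrElse(k, -1)
       }
       resolved.count()                                          // action runs the tasks
       println(badRecords.value)                                 // only the driver reads the total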
  19. RDDs from External & Internal Data Sets • Parallelize an existing collection to create an RDD
      val data = Array(1, 2, 3, 4, 5)
      val distData = sc.parallelize(data, num_partitions)
    • Create an RDD from an external data source – Supports local FS, HDFS, Cassandra, HBase, Amazon S3 – Supports text files, SequenceFiles, and Hadoop InputFormats – A local FS path must be present at the same location on all nodes – Reading directories is supported – textFile() by default makes one partition for each HDFS file block
      scala> val distFile = sc.textFile("data.txt", num_partitions)
  20. RDD Persistence • Persisting the result RDD after a bunch of transformations is recommended – This makes further actions on the result RDD much faster (no recomputation) – RDD.cache() – in-memory persistence • RDD persistence options – MEMORY_ONLY ▪ Store as deserialized Java objects; if the RDD does not fit in memory, some partitions will be recomputed – MEMORY_AND_DISK ▪ Store as deserialized Java objects; if the RDD does not fit in memory, store the additional partitions on disk – MEMORY_ONLY_SER ▪ Store as serialized Java objects (one byte array per partition); more space-efficient – MEMORY_AND_DISK_SER ▪ Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk – DISK_ONLY ▪ Store the RDD partitions only on disk – MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. ▪ Same as the levels above, but replicate each partition on two cluster nodes – OFF_HEAP (experimental) ▪ Store the RDD in serialized format in Tachyon; reduces GC overhead; shares RDDs across apps
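     A sketch of requesting an explicit storage level instead of plain cache() (the data set is illustrative):
       import org.apache.spark.storage.StorageLevel

       val parsed = sc.textFile("hdfs://.../data").map(_.split(","))
       parsed.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spill to disk
       parsed.count()                                     // first action materializes the cache
       parsed.first()                                     // later actions reuse the cached partitions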
  21. How to Choose an RDD Persistence Level • Persistence levels trade off memory usage against CPU efficiency • If your RDDs fit comfortably in memory, use the default storage level (MEMORY_ONLY) • If not, try MEMORY_ONLY_SER and select a fast serialization library – Spilling to disk is costly – Spark uses Java serialization by default – Python uses the Pickle library by default – Scala/Java can use the Kryo library, which is faster than default Java serialization • Use the replicated storage levels if you want fast fault recovery • In environments with large amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages: – It allows multiple executors to share the same pool of memory in Tachyon – It significantly reduces garbage-collection costs – Cached data is not lost if individual executors crash
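     If a serialized level is used, Kryo can be selected through configuration, roughly as in this sketch (the registrator class name is a hypothetical example):
       import org.apache.spark.{SparkConf, SparkContext}

       val conf = new SparkConf()
         .setAppName("kryo-example")
         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // hypothetical registrator
       val sc = new SparkContext(conf)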
  22. Spark Configuration • Spark properties – Control most application parameters; can be set using a SparkConf object or through Java system properties – Precedence: ▪ SparkConf in the driver program, ▪ spark-submit options, ▪ default conf file in the Spark conf directory – http://spark.apache.org/docs/latest/configuration.html • Environment variables – Can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node • Logging – Can be configured through log4j.properties
  23. Spark Application Monitoring • Web interface – Includes ▪ a list of scheduler stages and tasks ▪ a summary of RDD sizes and memory usage ▪ environment information ▪ information about the running executors – Provides a web UI for the running application at ▪ http://<driver-node>:4040 – Provides the ability to view application information after it has finished ▪ Set spark.eventLog.enabled = true ▪ Set spark.eventLog.dir = file:///tmp/spark-events ▪ Run the History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
  24. How Spark Runs: DAGs, shuffles, tasks, stages, etc.
  25. Sample
  26. What Happens • Create RDDs • Pipeline operations as much as possible – When a result doesn't depend on other results, we can pipeline – But when data needs to be reorganized, we can no longer pipeline • A stage is a merged (pipelined) set of operations • Each stage gets a set of tasks • A task is data plus computation; a minimal sketch follows
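     A sketch of how a job splits into stages (names illustrative): flatMap/filter/map pipeline into one stage, and reduceByKey introduces a shuffle boundary that starts a new stage.
       val words  = sc.textFile("hdfs://.../docs")
                      .flatMap(_.split(" "))      // pipelined with the read ...
                      .filter(_.nonEmpty)          // ... and with each other: stage 1
       val counts = words.map(w => (w, 1))
                         .reduceByKey(_ + _)       // shuffle boundary: stage 2
       counts.count()                              // action: both stages are scheduled as tasks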
  27. RDDs and Stages
  28. Tasks
  29. Stages Running • The number of partitions matters for concurrency • Rule of thumb: at least 2x the number of cores
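     A sketch of controlling the partition count (numbers illustrative, assuming roughly 8 cores and the 2x rule of thumb):
       val raw = sc.textFile("hdfs://.../data", 16)  // ask for at least 16 input partitions
       println(raw.partitions.size)                   // see how many partitions we actually got
       val repacked = raw.repartition(16)             // shuffle into exactly 16 partitions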
  30. The Shuffle • Redistributes data among partitions – Hashes keys into buckets – Pull, not push – Writes intermediate files to disk – Becoming pluggable • Optimizations: – Avoided when possible, if data is already properly partitioned – Partial aggregation reduces data movement
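     The partial-aggregation point, sketched with illustrative data: reduceByKey combines values per partition before the shuffle, while groupByKey ships every value across the network first.
       val pairs = sc.textFile("hdfs://.../logs").map(line => (line.take(4), 1))
       val light = pairs.reduceByKey(_ + _)             // map-side combine, less shuffle traffic
       val heavy = pairs.groupByKey().mapValues(_.size) // all values cross the shuffle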
  31. Other Thoughts on Memory • By default Spark owns 90% of the memory • Partitions don't have to fit in memory, but some things do – e.g., the values for a key in a groupBy must fit in memory • Shuffle memory is 30% – If it goes over that, it'll spill the data to disk – The shuffle always writes to disk • Turn on compression to keep objects serialized – Saves space, but takes compute to serialize/deserialize
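     The compression advice maps onto standard Spark properties, roughly as in this sketch (app name is illustrative):
       import org.apache.spark.{SparkConf, SparkContext}

       val conf = new SparkConf()
         .setAppName("compressed-app")
         .set("spark.rdd.compress", "true")            // compress serialized cached partitions
         .set("spark.shuffle.spill.compress", "true")  // compress data spilled during shuffles
       val sc = new SparkContext(conf)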
  32. Spark Deployment Modes
  33. Spark Topology/Deployment Modes • Local – great for dev • Spark cluster (master/slaves) – improving rapidly • Cluster resource managers – YARN – Mesos
  34. Spark Application Architecture
  35. YARN Cluster Mode
  36. YARN Client Mode
  37. Comparison of Deployment Modes
  38. Spark Hands-on
  39. Spark Hands-on • How to build Spark • Spark on a dev environment (Mac) • Spark on AWB (Analytics Workbench) • Pyspark & lambda functions • Spark examples using Pyspark • Web UI / debugging
  40. Spark Code/Build • Spark Git repository – https://github.com/apache/spark • Download Spark – git clone https://github.com/apache/spark • Build Spark – Maven – shell script
  41. Build Spark with Maven • Set up Maven's memory usage:
      export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
    • Specify the Hadoop version:
      # Apache Hadoop 2.2.X
      mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
      # Apache Hadoop 2.3.X
      mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
      # Apache Hadoop 2.4.X
      mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
  42. Build Spark Using the Script • Use the make-distribution.sh script • An abstraction over Maven:
      ./make-distribution.sh --skip-java-test -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0
    – Creates a dist/ folder containing the Spark artifacts – --tgz creates a Spark distribution tarball
  43. Spark on AWB • Access – ssh manis2@acs04.analyticsworkbench.com -p 45326 • Topology – https://portal.analyticsworkbench.com/projects/awbhome/wiki/Cluster_Topology – Spark admin UI: http://access3.ic.analyticsworkbench.com:4040 • Directory – /usr/share/spark • Spark runs on the YARN cluster
  44. Using Spark Submit • Suitable for yarn-cluster mode
      export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
      export HADOOP_CONF_DIR=/etc/gphd/hadoop/conf
      ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 2g --executor-cores 1 lib/spark-examples*.jar 10
  45. Using the Spark Shell • Suitable for yarn-client mode • Interactive shell, suitable for debugging
      export SPARK_SUBMIT_CLASSPATH=/usr/share/spark/jars/spark-assembly-1.0.0-hadoop2.2.0-gphd-3.0.1.0.jar
      bin/pyspark --master yarn-client --num-executors 2
  46. Spark Logging • Spark logs contain the following information – number of partitions – size of tasks – which nodes tasks are running on – progress of tasks • YARN container logs – available locally – aggregated on HDFS
  47. Spark Web Interface • Web UI: http://<driver-node>:4040 • One port per application (aka SparkContext) • Shows the following information – list of scheduler stages & tasks – summary of RDD sizes & memory usage – environment information – information about running executors • Provides the ability to view application information after it has finished – Set spark.eventLog.enabled = true – Set spark.eventLog.dir = file:///tmp/spark-events – Run the History Server: ./sbin/start-history-server.sh file:///tmp/spark-events
  48. Spark Examples • Line with the most words • Lines with a particular word • Word count • Sorting • PageRank. A word-count sketch follows
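     For instance, word count as a sketch in Scala (the pyspark version from the hands-on has the same shape, with lambdas; paths are illustrative):
       val counts = sc.textFile("hdfs://.../input")
         .flatMap(_.split("\\s+"))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
       counts.saveAsTextFile("hdfs://.../wordcounts")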
  49. Berkeley Data Analytics Stack – Related Projects: Things that use Spark Core
  50. Berkeley Data Analytics Stack (BDAS) • Supports – batch – streaming – interactive • Makes it easy to compose them • https://amplab.cs.berkeley.edu/software/
  51. Spark SQL • A library on Spark Core that models RDDs as relations – SchemaRDD • Replaces Shark – a lighter-weight version with no code from Hive • Import/export in different storage formats – Parquet; can learn the schema from an existing Hive warehouse • Takes columnar storage from Shark
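     A minimal sketch against the Spark 1.0-era SQL API (the Person class and data path are illustrative):
       import org.apache.spark.sql.SQLContext

       case class Person(name: String, age: Int)

       val sqlContext = new SQLContext(sc)
       import sqlContext.createSchemaRDD              // implicit RDD -> SchemaRDD conversion

       val people = sc.textFile("hdfs://.../people.txt")
         .map(_.split(","))
         .map(p => Person(p(0), p(1).trim.toInt))
       people.registerAsTable("people")               // Spark 1.0 API (later renamed registerTempTable)

       val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
       teens.map(t => "Name: " + t(0)).collect().foreach(println)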
  52. Spark Streaming • Extends Spark to do large-scale stream processing – 100s of nodes with second-scale end-to-end latency • Simple, batch-like API with RDDs • The same semantics for both real-time and high-latency (batch) processing • Other features – window-based transformations – arbitrary joins of streams
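     A sketch of a streaming word count (the socket source and 1-second batch interval are illustrative; conf is a SparkConf as on slide 14):
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

       val ssc   = new StreamingContext(conf, Seconds(1))      // each 1s batch becomes an RDD
       val lines = ssc.socketTextStream("localhost", 9999)
       val counts = lines.flatMap(_.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
       counts.print()
       ssc.start()
       ssc.awaitTermination()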
  53. Streaming (cont.) • Input is broken up into batches that become RDDs • RDDs are composed into DAGs to generate output • Raw data is replicated in memory for fault tolerance
  54. GraphX (Alpha) • Graph-processing library – replaces Spark Bagel • Graph-parallel, not data-parallel – reason in the context of neighbors – GraphLab API • Graph creation => algorithm => post-processing – Existing systems mainly deal with the algorithm step and are not interactive – Unifies the collection and graph models
  55. MLbase • Machine-learning toolset – library and higher-level abstractions • The general tool in this space is MATLAB – difficult for end users to learn, debug, and scale solutions • Starting with MLlib – a low-level distributed machine-learning library • Many different algorithms – classification, regression, collaborative filtering, etc.
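     A sketch of one MLlib algorithm, k-means clustering (the file format, k, and iteration count are illustrative):
       import org.apache.spark.mllib.clustering.KMeans
       import org.apache.spark.mllib.linalg.Vectors

       val points = sc.textFile("hdfs://.../points.txt")
         .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
         .cache()
       val model = KMeans.train(points, 3, 20)        // k = 3 clusters, 20 iterations
       println("Cost: " + model.computeCost(points))  // within-set sum of squared errors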
  56. Thanks!
  57. (Architecture diagram: Data Science Platform – components include an IMDG, cluster managers (YARN, Mesos), RDD and M/R, an application platform, a stream server, MPP SQL, a data lake (HDFS / Isilon / virtual storage), an app data platform (SQL, objects, JSON, GemFire XD, etc.), Spark SQL, Streaming, and MLbase; users shown: data scientists/analysts, app dev/ops, legacy systems, data sources, end users.)
  58. Backup Slides
  59. PHD General Solution Pipeline (diagram) – streaming ingest; machine-data stream message source; RabbitMQ transport; message transformer; HDFS sink; GemFire (in-memory DB) tap; analytics taps – counters and gauges; SQL and REST API; HDFS; dashboard
  60. PHD – Where's Spark? (Same pipeline diagram as the previous slide, asking where Spark fits: streaming ingest, message transformer, HDFS sink, GemFire tap, analytics taps, SQL/REST API, dashboard.)
  61. Slides by Shivram • How to download/build and run the Spark examples – build with a specific version of Hadoop (PHD?) ▪ http://spark.apache.org/docs/latest/building-with-maven.html • Spark cluster deployment modes – YARN & single node (EC2, Mesos, standalone) – http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ – http://spark.apache.org/docs/latest/running-on-yarn.html – submitting apps vs. the interactive shell • Explain a simple Spark example and run it ▪ run-example, spark-shell/pyspark, spark-submit ▪ http://spark.apache.org/docs/latest/quick-start.html
  62. Deployment Modes (+YARN slide) • Local – great for dev • Spark cluster (master/slaves) – improving rapidly • Cluster resource managers – YARN – Mesos
  63. Intro to Spark-based Projects – Spark SQL – Spark Streaming – MLbase – GraphX
