Simplifying Big Data Analytics with Apache Spark

  1. Simplifying Big Data Analysis with Apache Spark (Matei Zaharia, April 27, 2015)
  2. What is Apache Spark? A fast and general cluster computing engine, interoperable with Apache Hadoop. It improves efficiency through in-memory data sharing and general computation graphs, and improves usability through rich APIs in Java, Scala, and Python plus an interactive shell. Up to 100× faster; 2-5× less code.
  3. A General Engine: Spark Core plus libraries built on top of it: Spark Streaming (real-time), Spark SQL (structured data), MLlib (machine learning), and GraphX (graph processing).
  4. A Large Community. [Bar charts comparing MapReduce, YARN, HDFS, Storm, and Spark by commits in the past year and by lines of code changed in the past year.]
  5. About Databricks: founded by the creators of Spark and remains the largest contributor. Offers a hosted service, Databricks Cloud: Spark on EC2 with notebooks, dashboards, and scheduled jobs.
  6. This Talk: Introduction to Spark; built-in libraries; new APIs in 2015 (DataFrames, data sources, ML Pipelines).
  7. Why a New Programming Model? MapReduce simplified big data analysis, but users quickly wanted more: more complex, multi-pass analytics (e.g. ML, graph); more interactive ad-hoc queries; more real-time stream processing. All three need faster data sharing in parallel apps.
  8. Data Sharing in MapReduce: [Diagram: iterative jobs and ad-hoc queries read input from HDFS and write results back to HDFS between every step.] Slow due to data replication and disk I/O.
  9. What We'd Like: [Diagram: the same iterative jobs and queries share data through distributed memory, with one-time processing of the input.] Memory is 10-100× faster than network and disk.
  10. Spark Model: write programs in terms of transformations on datasets. Resilient Distributed Datasets (RDDs) are collections of objects that can be stored in memory or on disk across a cluster, are built via parallel transformations (map, filter, ...), and are automatically rebuilt on failure.
  11. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.
      lines = spark.textFile("hdfs://...")
      errors = lines.filter(lambda s: s.startswith("ERROR"))
      messages = errors.map(lambda s: s.split("\t")[2])
      messages.cache()
      messages.filter(lambda s: "foo" in s).count()
      messages.filter(lambda s: "bar" in s).count()
      [Diagram: the driver sends tasks to workers, each holding a block of the file and a partition of the cached RDD; results flow back to the driver. Labels: base RDD, transformed RDD, action.]
      Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
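      A minimal runnable PySpark sketch of this example. The slide's `spark` handle is a SparkContext (named sc here); the HDFS path is elided as on the slide, and the tab-delimited log format is an assumption:
        from pyspark import SparkContext

        sc = SparkContext("local[*]", "LogMining")

        # Base RDD from the log file (the HDFS path is elided on the slide).
        lines = sc.textFile("hdfs://...")
        errors = lines.filter(lambda s: s.startswith("ERROR"))
        # Assumed tab-delimited log lines, with the message text in field 2.
        messages = errors.map(lambda s: s.split("\t")[2])
        messages.cache()   # keep the filtered data in memory

        # Actions trigger computation; repeated queries are served from the cache.
        print(messages.filter(lambda s: "foo" in s).count())
        print(messages.filter(lambda s: "bar" in s).count())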
  12. Fault Tolerance: RDDs track the transformations used to build them (their lineage) in order to recompute lost data.
      messages = textFile(...).filter(lambda s: "ERROR" in s)
                              .map(lambda s: s.split("\t")[2])
      [Lineage diagram: HadoopRDD (path = hdfs://...) -> FilteredRDD (func = lambda s: ...) -> MappedRDD (func = lambda s: ...).]
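      One way to see this lineage from Python is RDD.toDebugString(); a tiny sketch, assuming the messages RDD from the log-mining example above:
        # Each transformation records its parent RDD, so lost partitions can be
        # recomputed from the original file rather than restored from replicas.
        print(messages.toDebugString())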
  13. Example: Logistic Regression. [Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark.] Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for later iterations.
  14. On-Disk Performance: time to sort 100 TB (source: Daytona GraySort benchmark, sortbenchmark.org). 2013 record: Hadoop, 2100 machines, 72 minutes. 2014 record: Spark, 207 machines, 23 minutes.
  15. Supported Operators: map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
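      A small illustrative sketch combining a few of these operators on made-up toy data (assumes an existing SparkContext named sc):
        pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
        other = sc.parallelize([("a", "x"), ("c", "y")])

        counts = pairs.reduceByKey(lambda a, b: a + b)   # [("a", 3), ("b", 1)]
        joined = counts.join(other)                      # inner join on key
        left = counts.leftOuterJoin(other)               # keeps "b", paired with None
        print(counts.collect(), joined.collect(), left.collect())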
  16. Spark in Scala and Java:
      // Scala:
      val lines = sc.textFile(...)
      lines.filter(s => s.contains("ERROR")).count()
      // Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(s -> s.contains("ERROR")).count();
  17. User Community: over 500 production users; clusters up to 8000 nodes, processing 1 PB/day; single jobs over 1 PB.
  18. This Talk: Introduction to Spark; built-in libraries; new APIs in 2015 (DataFrames, data sources, ML Pipelines).
  19. Built-in Libraries: Spark Core plus Spark Streaming (real-time), Spark SQL (structured data), MLlib (machine learning), and GraphX (graph processing).
  20. Key Idea: instead of having separate execution engines for each task, all libraries work directly on RDDs. Caching plus the DAG model is enough to run them efficiently, and combining libraries into one program is much faster.
  21. Spark SQL: represents tables as RDDs; tables = schema + data.
      From Hive:
      c = HiveContext(sc)
      rows = c.sql("select text, year from hivetable")
      rows.filter(lambda r: r.year > 2013).collect()
      From JSON (tweets.json contains records like {"text": "hi", "user": {"name": "matei", "id": 123}}):
      c.jsonFile("tweets.json").registerTempTable("tweets")
      c.sql("select text, user.name from tweets")
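      A runnable sketch of the JSON path shown on the slide, using the Spark 1.3-era Python API. It assumes a Spark build with Hive support (a plain SQLContext works the same way for the JSON part) and a tweets.json file with the fields from the slide:
        from pyspark.sql import HiveContext

        c = HiveContext(sc)

        # Infer a schema from the JSON file and expose it as a temporary table.
        tweets = c.jsonFile("tweets.json")
        tweets.registerTempTable("tweets")

        # Nested fields such as user.name can be queried directly.
        rows = c.sql("select text, user.name from tweets")
        for r in rows.collect():
            print(r.text, r.name)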
  22. Spark Streaming: [Diagram: a continuous input stream divided up over time.]
  23. Spark Streaming represents streams as a series of RDDs over time.
      sc.twitterStream(...)
        .map(lambda t: (t.username, 1))
        .reduceByWindow("30s", lambda a, b: a + b)
        .print()
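      The slide's twitterStream call is illustrative pseudocode; a comparable runnable sketch with the standard DStream API, counting words over a 30-second sliding window from a socket source (host and port are assumptions, e.g. fed by `nc -lk 9999`):
        from pyspark.streaming import StreamingContext

        ssc = StreamingContext(sc, 1)                    # 1-second micro-batches

        lines = ssc.socketTextStream("localhost", 9999)
        counts = (lines.flatMap(lambda l: l.split(" "))
                       .map(lambda w: (w, 1))
                       .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 1))
        counts.pprint()                                  # print each window's counts

        ssc.start()
        ssc.awaitTermination()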
  24. MLlib: vectors, matrices = RDD[Vector]; iterative computation.
      points = sc.textFile("data.txt").map(parsePoint)
      model = KMeans.train(points, 10)
      model.predict(newPoint)
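      A sketch of the slide's k-means flow with the MLlib API. parsePoint is the slide's placeholder for your own parser; here it is assumed to read whitespace-separated numbers from data.txt:
        from pyspark.mllib.clustering import KMeans

        def parsePoint(line):
            # Assumed format: whitespace-separated feature values per line.
            return [float(x) for x in line.split()]

        points = sc.textFile("data.txt").map(parsePoint)
        model = KMeans.train(points, k=10, maxIterations=20)

        newPoint = [0.0] * len(points.first())
        print(model.predict(newPoint))    # index of the nearest cluster center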
  25. GraphX: represents graphs as RDDs of vertices and edges.
  26. GraphX: represents graphs as RDDs of vertices and edges.
  27. Performance vs. Specialized Engines: [Charts: SQL response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem); ML response time (min) for Mahout, GraphLab, Spark; streaming throughput (MB/s/node) for Storm and Spark.]
  28. Combining Processing Types:
      // Load data using SQL
      points = ctx.sql("select latitude, longitude from hive_tweets")
      // Train a machine learning model
      model = KMeans.train(points, 10)
      // Apply it to a stream
      sc.twitterStream(...)
        .map(lambda t: (model.predict(t.location), 1))
        .reduceByWindow("5s", lambda a, b: a + b)
  29. Combining Processing Types: [Diagram: with separate engines, each stage (prepare, train, apply) reads from and writes back to HDFS; with Spark, prepare, train, and apply run in one program over HDFS, alongside interactive analysis.]
  30. This Talk: Introduction to Spark; built-in libraries; new APIs in 2015 (DataFrames, data sources, ML Pipelines).
  31. Main Directions in 2015: Data Science (making it easier for a wider class of users) and Platform Interfaces (scaling the ecosystem).
  32. From MapReduce to Spark. Hadoop MapReduce word count (Java):
      public static class WordCountMapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }
      public static class WordCountReduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
      The same word count in Spark (Scala):
      val file = spark.textFile("hdfs://...")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://...")
  33. Beyond MapReduce Experts: early adopters are users who understand MapReduce and functional APIs; the next wave includes data scientists, statisticians, R users, PyData users, ...
  34. Data Frames: the de facto data processing abstraction for data science (R and Python). [Google Trends chart for "dataframe".]
  35. From RDDs to DataFrames
  36. Spark DataFrames: collections of structured data similar to R and pandas data frames, automatically optimized via Spark SQL (columnar storage, code-generated execution).
      df = jsonFile("tweets.json")
      df[df["user"] == "matei"]
        .groupBy("date")
        .sum("retweets")
      [Chart: running time of the same computation in Python, Scala, and the DataFrame API.]
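      A runnable sketch of the slide's snippet with the Spark 1.3-era Python API, following the slide in treating user as a simple string column (tweets.json and its fields are assumptions):
        from pyspark.sql import SQLContext

        sqlContext = SQLContext(sc)
        df = sqlContext.jsonFile("tweets.json")

        # Filter, group, and aggregate; Spark SQL optimizes the whole expression.
        result = (df[df["user"] == "matei"]
                    .groupBy("date")
                    .sum("retweets"))
        result.show()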
  37. Optimization via Spark SQL: DataFrame expressions are relational queries, letting Spark inspect them and automatically perform expression optimization, join algorithm selection, columnar storage, and compilation to Java bytecode.
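      One way to see this at work is to print a DataFrame's query plan before executing it; a small sketch, assuming the df from the DataFrame example above:
        query = df[df["user"] == "matei"].groupBy("date").sum("retweets")

        # Shows the logical plan and the physical plan Spark SQL selected,
        # including pushed-down filters and the chosen aggregation strategy.
        query.explain(True)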
  38. Machine Learning Pipelines: a high-level API similar to scikit-learn that operates on DataFrames, with grid search to tune parameters across a whole pipeline.
      tokenizer = Tokenizer()
      tf = HashingTF(numFeatures=1000)
      lr = LogisticRegression()
      pipe = Pipeline([tokenizer, tf, lr])
      model = pipe.fit(df)
      [Diagram: DataFrame -> tokenizer -> TF -> LR -> model.]
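      A fuller runnable sketch of the same pipeline with the pyspark.ml API; the column names and the tiny training DataFrame are made up for illustration:
        from pyspark.ml import Pipeline
        from pyspark.ml.feature import Tokenizer, HashingTF
        from pyspark.ml.classification import LogisticRegression
        from pyspark.sql import SQLContext

        sqlContext = SQLContext(sc)
        train = sqlContext.createDataFrame(
            [("spark is great", 1.0), ("hadoop map reduce", 0.0)],
            ["text", "label"])

        tokenizer = Tokenizer(inputCol="text", outputCol="words")
        tf = HashingTF(numFeatures=1000, inputCol="words", outputCol="features")
        lr = LogisticRegression(maxIter=10)

        pipe = Pipeline(stages=[tokenizer, tf, lr])
        model = pipe.fit(train)    # fits every stage in order
        model.transform(train).select("text", "prediction").show()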
  39. Spark R Interface: exposes DataFrames and ML pipelines in R; parallelizes calls to R code. Target: Spark 1.4 (June).
      df = jsonFile("tweets.json")
      summarize(
        group_by(
          df[df$user == "matei", ],
          "date"),
        sum("retweets"))
  40. Main Directions in 2015: Data Science (making it easier for a wider class of users) and Platform Interfaces (scaling the ecosystem).
  41. Data Sources API: allows plugging smart data sources into Spark; returns DataFrames usable in Spark apps or SQL; pushes logic into the sources. [Diagram: Spark connected to external data sources such as JSON.]
  42. Data Sources API (continued): allows plugging smart data sources into Spark; returns DataFrames usable in Spark apps or SQL; pushes logic into the sources. For example, the query
      SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = "en"
      pushes the filter
      SELECT * FROM users WHERE lang = "en"
      down to the MySQL source.
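      A sketch of how such a query can be set up from Python, assuming Spark 1.4+, a MySQL JDBC driver on the classpath, and a Hive metastore containing hive_logs; the URL, table names, and the join key u.id = h.user_id (which the slide omits) are all placeholders:
        from pyspark.sql import HiveContext

        sqlContext = HiveContext(sc)   # so hive_logs resolves against the Hive metastore

        # Register a MySQL table through the JDBC data source; filters on it
        # can be pushed down to the database rather than scanned in Spark.
        users = (sqlContext.read.format("jdbc")
                 .option("url", "jdbc:mysql://dbhost/mydb")
                 .option("dbtable", "users")
                 .load())
        users.registerTempTable("mysql_users")

        en = sqlContext.sql(
            "SELECT * FROM mysql_users u JOIN hive_logs h "
            "ON u.id = h.user_id WHERE u.lang = 'en'")
        en.show()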
  43. Current Data Sources. Built-in: Hive, JSON, Parquet, JDBC. Community: CSV, Avro, ElasticSearch, Redshift, Cloudant, Mongo, Cassandra, SequoiaDB. List at spark-packages.org.
  44. Goal: a unified engine across data sources, workloads, and environments.
  45. To Learn More. Downloads & docs: spark.apache.org. Try Spark in Databricks Cloud: databricks.com. Spark Summit: spark-summit.org.
