Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Composable Parallel Processing in Apache Spark and Weld


Published on

The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).

Speaker: Matei Zaharia

Published in: Software

Composable Parallel Processing in Apache Spark and Weld

  1. 1. Composable Parallel Processing in Apache Spark and Weld Matei Zaharia @matei_zaharia
  2. 2. The main way developers are productive is by composing existing libraries
  3. 3. Early Big Data APIs Did not support efficient composition! • Specialized systems for each workload (SQL, ML, etc) • Slow data sharing (HDFS files)
  4. 4. Spark Goals Unified engine and API for big data processing • General engine: supports batch, interactive & streaming apps • Composable APIs: functional programming in Scala, Java, Python • ML, graph algorithms, etc are just functions on RDDs
  5. 5. This Talk Composability Original Spark API Structured APIs in Spark 2.0 Weld runtime at Stanford
  6. 6. Original Spark API Resilient Distributed Datasets (RDDs) • Distributed collections with functional operations lines = spark.textFile(“hdfs://...”) // RDD[String] points = => parsePoint(line)) // RDD[Point] points.filter(p => p.x > 100).count() Efficient composition: • Scheduler pipelines evaluation across operators • In-memory data sharing via Java objects
  7. 7. How Well Did It Work? Users really appreciate unification Functional API caused some challenges, which we’re tackling
  8. 8. Libraries Built on Spark SQL Streaming MLlib Spark Core (RDDs) GraphX
  9. 9. Which Libraries Do People Use? 58% 58% 62% 69% MLlib + GraphX Spark Streaming DataFrames Spark SQL 75% of users use more than one component
  10. 10. Community Packages
  11. 11. Combining Libraries // Load data using SQL val points = ctx.sql(“select latitude, longitude from tweets”) // Train a machine learning model val model = KMeans.train(points, 10) // Apply it to a stream ctx.twitterStream(...) .map(t => (model.predict(t.location), 1)) .reduceByWindow(“5s”, (a, b) => a+b)
  12. 12. Combining Libraries Separate frameworks: … HDFS read HDFS write clean HDFS read HDFS write train HDFS read HDFS write query HDFS HDFS read clean train query Spark: Interactive analysis
  13. 13. Main Challenge: Functional API Looks high-level, but hides many semantics of computation • Functions are arbitrary blocks of Java bytecode • Data stored is arbitrary Java objects Users can mix APIs in suboptimal ways
  14. 14. Example Problem pairs = => (word, 1)) groups = pairs.groupByKey(), vs) => (k, vs.sum)) Materializes all groups as Seq[Int] objects Then promptly aggregates them
  15. 15. Challenge: Data Representation Java objects often many times larger than underlying fields class User(name: String, friends: Array[Int]) new User(“Bobby”, Array(1, 2)) User 0x… 0x… String 3 0 1 2 Bobby 5 0x… int[] char[] 5
  16. 16. This Talk Composability Original Spark API Structured APIs in Spark 2.0 Weld runtime at Stanford
  17. 17. Structured APIs New APIs for structured data (limited table-like data model) • Spark SQL (analysts), DataFrames and Datasets (programmers) Support similar optimizations to databases while retaining Spark’s programmability SIGMOD 2015
  18. 18. Structured API Execution Logical Plan Physical Plan Catalog Optimizer RDDs … Data Source API SQL Data Frames Code Generator Datasets
  19. 19. Example: DataFrames DataFrames hold rows with a known schema and offer relational operations on them through a DSL users = spark.sql(“select * from users”) ca_users = users[users[“state”] == “CA”] ca_users.count() ca_users.groupBy(“name”).avg(“age”) u: Expression AST
  20. 20. Why DataFrames? Based on the popular data frame API in R & Python • Spark is the first to make this a declarative API Much higher programmability than SQL (run in a “real” PL) Google trends for “data frame”
  21. 21. What Structured APIs Enable 1. Compact binary representation • Columnar, compressed format for caching; rows for processing 2. Optimization across operators (join ordering, pushdown, etc) 3. Runtime code generation
  22. 22. Performance 24 0 2 4 6 8 10 RDD Scala RDD Python DataFrame Scala DataFrame Python DataFrame R DataFrame SQL Time for aggregation benchmark (s)
  23. 23. Optimization Example events =“/logs”) stats = events.join(users) .groupBy(“loc”,“status”) .avg(“duration”) errors = stats.where( stats.status == “ERR”) DataFrame API Optimized Plan Specialized Code SCAN logs SCAN users JOIN AGG FILTER while(logs.hasNext) { e = if(e.status == “ERR”) { u = users.get(e.uid) key = (u.loc, e.status) sum(key) += e.duration count(key) += 1 } } ...
  24. 24. Example: Datasets case class User(name: String, id: Int) case class Message(user: User, text: String) dataframe =“log.json”) // DataFrame messages =[Message] // Dataset[Message] users = messages.filter(m => m.text.contains(“Spark”)) // Dataset[Message] .map(m => m.user) // Dataset[User] counts = messages.groupBy(“”) .count() // Dataset[(String, Int)] Enable static typing of data frame contents
  25. 25. Uptake Structured APIs were released in 2015, but already see high use: 89% of users use DataFrames in our 2016 survey 88% of users use SQL SQL & Python are the top languages on Databricks
  26. 26. New APIs on Structured Spark Data Sources ML Pipelines GraphFrames Structured Streaming
  27. 27. Data Sources Common way for Datasets and DataFrames to access storage • Apps can migrate across Hive, Cassandra, JSON, Avro, … • Structured semantics allows query federation into data sources, something not possible with original Spark Spark SQL users(users(“age”) > 20) select * from users
  28. 28. Examples JSON: JDBC: Together: select, text from tweets { “text”: “hi”, “user”: { “name”: “bob”, “id”: 15 } } tweets.json select age from users where lang = “en” select t.text, u.age from tweets t, users u where = and u.lang = “en” Spark SQL {JSON} select id, age from users where lang=“en”
  29. 29. Structured Streaming High-level streaming API based on DataFrames / Datasets • Event time, windowing, stateful operations Supports end-to-end continuous apps • Atomic interactions with storage • Batch & ad-hoc queries on same data • Query evolution at runtime Batch Job Ad-hoc Queries Input Stream Atomic Output Continuous Application Static Data Batch Jobs >_
  30. 30. Structured Streaming API Incrementalize an existing DataFrame/Dataset/SQL query logs =“json”).open(“hdfs://logs”) logs.groupBy(“userid”, “hour”).avg(“latency”) .write.format(”parquet”) .save(“s3://...”) Example batch job:
  31. 31. Structured Streaming API Incrementalize an existing DataFrame/Dataset/SQL query logs = ctx.readStream.format(“json”).load(“hdfs://logs”) logs.groupBy(“userid”, “hour”).avg(“latency”) .writeStream.format(”parquet") .start(“s3://...”) Example as streaming:
  32. 32. Query Planning Scan Files Aggregate Write to MySQL Scan New Files Stateful Aggregate Update MySQL Batch Plan Incremental Plan Catalyst transformation
  33. 33. Early Experience Running in our analytics pipeline since second half of 2016 Powering real-time metrics for MTV and Nickelodeon Monitoring 1000s of WiFi access points
  34. 34. This Talk Composability Original Spark API Structured APIs in Spark 2.0 Weld runtime at Stanford
  35. 35. Weld Motivation With continued changes in hardware, your machine is now a distributed system, and memory is the new HDFS The traditional interface for composing libraries in single- machine apps is increasingly inefficient!
  36. 36. Traditional Library Composition Functions that exchange data through memory buffers (e.g. C calls) data = pandas.parse_csv(string) filtered = pandas.dropna(data) avg = numpy.mean(filtered) parse_csv dropna mean 5-30x slowdowns in NumPy, Pandas, TensorFlow, etc
  37. 37. Our Solution machine learning SQL graph algorithms CPU GPU … … Common IR Runtime API Optimizer Weld runtime
  38. 38. Weld Runtime API Lazy evaluation to collect work across functions Works across libraries, languages, etc data = lib1.f1(), el => lib3.f2(el) ) User Application Weld Runtime Combined IR program Optimized machine code 11011 10011 10101 IR fragment for each function Weld API f1 map f2 Data in memory
  39. 39. Weld IR Small, powerful design inspired by “monad comprehensions” Parallel loops: iterate over a dataset Builders: declarative objects for producing results • E.g. append items to a list, compute a sum • Can be implemented differently on different hardware Captures relational algebra, linear algebra, functional APIs, and composition thereof
  40. 40. Examples Implement functional operators using builders def map(data, f): builder = new vecbuilder[int] for x in data: merge(builder, f(x)) result(builder) def reduce(data, zero, func): builder = new merger[zero, func] for x in data: merge(builder, x) result(builder)
  41. 41. Example Optimization: Fusion squares = map(data, x => x * x) sum = reduce(data, 0, +) bld1 = new vecbuilder[int] bld2 = new merger[0, +] for x in data: merge(bld1, x * x) merge(bld2, x) Loops can be merged into one pass over data
  42. 42. Results: Existing Frameworks TPC-H Logistic RegressionVector Sum 0 5 10 15 20 25 30 35 40 45 TPC-H Q1 TPC-H Q6 Runtime[secs] Workload SparkSQL Weld 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 Runtime[secs] NP NExpr Weld 0.1 1 10 100 1000 LR (1T) LR (12T) Runtime[secs;log10] Workload TF Hand-opt Weld 1 Core 12 Cores
  43. 43. Results: Cross-Library Optimization 0.01 0.1 1 10 100 Current Weld, no CLO Weld, CLO Weld, 12 core Running Time [sec; log10] Pandas + NumPy Workflow CLO = cross-library optimizationOpen source:
  44. 44. Conclusion Developers are productive by composing libraries, but hardware trends mean we must rethink the way we do this • Data movement dominates, from clusters down to 1 node Apache Spark and Weld are two examples of new composition interfaces that retain high programmability
  45. 45. ORGANIZED BY SPARK SUMMIT 2017 JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO Save 15% with promo code “dataeng”