
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data

A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.



  1. Spark DataFrames: Simple and Fast Analytics on Structured Data. Michael Armbrust, Spark Summit Amsterdam 2015, October 28th
  2. About Me and Spark SQL • Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Graduated from Alpha in 1.3 [Charts: # of commits per month; # of contributors]
  3. About Me and Spark SQL • Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments (improved multi-version support in 1.4). Example: SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
  4. About Me and Spark SQL • Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments • Connect existing BI tools to Spark through JDBC
  5. About Me and Spark SQL • Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments • Connect existing BI tools to Spark through JDBC • Bindings in Python, Scala, Java, and R
  6. About Me and Spark SQL • Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments • Connect existing BI tools to Spark through JDBC • Bindings in Python, Scala, Java, and R • @michaelarmbrust • Creator of Spark SQL @databricks
  7. The not-so-secret truth... Spark SQL is about more than SQL.
  8. Create and run Spark programs faster with Spark SQL: • Write less code • Read less data • Let the optimizer do the hard work
  9. DataFrame noun – [dey-tuh-freym] 1. A distributed collection of rows organized into named columns. 2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
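    The definition maps directly onto the API. A minimal sketch, assuming a SQLContext named sqlContext and a hypothetical people.json file of JSON records:

        # Create a DataFrame from a JSON source; the schema is inferred.
        df = sqlContext.read.json("people.json")

        # Select, filter, and aggregate by column name.
        df.select("name", "age").where(df.age > 21).groupBy("name").count().show()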
  10. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats: df = sqlContext.read.format("json").option("samplingRatio", "0.1").load("/home/michael/data.json"); df.write.format("parquet").mode("append").partitionBy("year").saveAsTable("fasterData")
  11. Write Less Code: Input & Output. The read and write functions create new builders for doing I/O (same example as above).
  12. Write Less Code: Input & Output. Builder methods specify: • Format • Partitioning • Handling of existing data (same example as above).
  13. Write Less Code: Input & Output. load(…), save(…) or saveAsTable(…) finish the I/O specification (same example as above).
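    The same builder pattern applies to any supported format; a hedged sketch, with the /data/events path as an assumption:

        # Read Parquet, write JSON; only the format, options, and path change.
        df = sqlContext.read.format("parquet").load("/data/events")
        df.write.format("json").mode("overwrite").save("/data/events_json")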
  14. Read Less Data: Efficient Formats. Parquet is an efficient columnar storage format (ORC is similar): • Compact binary encoding with intelligent compression (delta, RLE, etc.) • Each column stored separately with an index that allows skipping of unread columns • Support for partitioning (/data/year=2015) • Data skipping using statistics (column min/max, etc.)
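    A minimal sketch of the partitioning and skipping described above; df and its year column are assumptions:

        # Writes one directory per year: /data/events/year=2014/, /data/events/year=2015/, ...
        df.write.partitionBy("year").parquet("/data/events")

        # A filter on the partition column lets Spark skip whole directories,
        # and per-column statistics can skip row groups inside each file.
        sqlContext.read.parquet("/data/events").where("year = 2015")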
  15. Write Less Code: Data Source API. Spark SQL's Data Source API can read and write DataFrames using a variety of formats. Built-in: JSON, JDBC, ORC, plain text*, and more. External: find more sources at http://spark-packages.org/
  16. ETL Using Custom Data Sources. Load data from JIRA (Spark's bug tracker) using a custom data source, then write the converted data out to a parquet table registered in the Hive metastore:

    sqlContext.read
      .format("com.databricks.spark.jira")
      .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
      .option("user", "marmbrus")
      .option("password", "*******")
      .option("query", """
        |project = SPARK AND
        |component = SQL AND
        |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
      .load()
      .repartition(1)
      .write
      .format("parquet")
      .saveAsTable("sparkSqlJira")
  17. Write Less Code: High-Level Operations. Solve common problems concisely using DataFrame functions: • Selecting columns and filtering • Joining different data sources • Aggregation (count, sum, average, etc.) • Plotting results with Pandas
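    A short sketch combining these operations, assuming hypothetical users and orders tables and a local Pandas/matplotlib setup:

        from pyspark.sql.functions import avg

        users = sqlContext.table("users")
        orders = sqlContext.table("orders")

        stats = (users.join(orders, users.user_id == orders.user_id)
            .groupBy(users.country)
            .agg(avg(orders.amount).alias("avg_amount")))

        # Bring the small aggregated result to the driver and plot it with Pandas.
        stats.toPandas().plot(x="country", y="avg_amount", kind="bar")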
  18. Write Less Code: Compute an Average. Hadoop MapReduce (Java):

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(LongWritable key, Text value, Context context) {
      String[] fields = value.toString().split("\t");
      output.set(Integer.parseInt(fields[1]));
      context.write(one, output);
    }

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
      int sum = 0;
      int count = 0;
      for (IntWritable value : values) {
        sum += value.get();
        count++;
      }
      average.set(sum / (double) count);
      context.write(key, average);
    }

    Spark (Python):

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()
  19. Write Less Code: Compute an Average. Using RDDs:

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

    Using DataFrames:

    sqlCtx.table("people") \
        .groupBy("name") \
        .agg(avg("age")) \
        .collect()

    Using SQL:

    SELECT name, avg(age) FROM people GROUP BY name

    Full API Docs: Python, Scala, Java, R
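    To run the SQL version above, the DataFrame only needs to be registered as a table; a minimal sketch, assuming df holds the people data:

        # Register df under a name, then query it with ordinary SQL.
        df.registerTempTable("people")
        averages = sqlCtx.sql("SELECT name, avg(age) FROM people GROUP BY name")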
  20. Not Just Less Code, Faster Too! [Bar chart: time to aggregate 10 million int pairs (secs), for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, DataFrame SQL]
  21. Plan Optimization & Execution. DataFrames and SQL share the same optimization/execution pipeline: SQL AST or DataFrame → Unresolved Logical Plan → [Analysis, using the Catalog] → Logical Plan → [Logical Optimization] → Optimized Logical Plan → [Physical Planning] → Physical Plans → [Cost Model] → Selected Physical Plan → [Code Generation] → RDDs
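    Each stage of this pipeline can be inspected from user code. A minimal sketch; df and its age column are assumptions:

        # Prints the parsed and analyzed logical plans, the optimized logical
        # plan, and the selected physical plan for the query.
        df.where(df.age > 21).explain(True)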
  22. Seamlessly Integrated. Intermix DataFrame operations with custom Python, Java, R, or Scala code:

    zipToCity = udf(lambda zipCode: <custom logic here>)

    def add_demographics(events):
        u = sqlCtx.table("users")
        return (events
            .join(u, events.user_id == u.user_id)
            .withColumn("city", zipToCity(events.zip)))

    add_demographics augments any DataFrame that contains user_id.
  23. Optimize Full Pipelines. Optimization happens as late as possible, therefore Spark SQL can optimize even across functions:

    events = add_demographics(sqlCtx.load("/data/events", "json"))
    training_data = events \
        .where(events.city == "Amsterdam") \
        .select(events.timestamp) \
        .collect()
  24. The full pipeline with comments, and the plans Catalyst produces:

    def add_demographics(events):
        u = sqlCtx.table("users")                        # Load Hive table
        return (events
            .join(u, events.user_id == u.user_id)        # Join on user_id
            .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

    events = add_demographics(sqlCtx.load("/data/events", "json"))
    training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

    Logical Plan: filter by city over a join of the events file with the users table. The join is expensive; ideally we only join the relevant users. Physical Plan: join of scan (events) with filter by city over scan (users).
  25. With a partitioned Hive table and Parquet input, the same program gets a better plan:

    def add_demographics(events):
        u = sqlCtx.table("users")                        # Load partitioned Hive table
        return (events
            .join(u, events.user_id == u.user_id)        # Join on user_id
            .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

    events = add_demographics(sqlCtx.load("/data/events", "parquet"))
    training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

    Logical Plan: filter by city over a join of the events file with the users table. Physical Plan: join of scan (events) with filter by city over scan (users). Optimized Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users).
  26. Machine Learning Pipelines

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    df = sqlCtx.load("/path/to/data")
    model = pipeline.fit(df)

    [Diagram: Pipeline fits a Pipeline Model; ds0 → tokenizer → ds1 → hashingTF → ds2 → lr.model → ds3]
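    Once fit, the PipelineModel runs the same stages on new data. A sketch, assuming a hypothetical held-out dataset with the same text column:

        test_df = sqlCtx.load("/path/to/test_data")
        predictions = model.transform(test_df)  # adds words, features, prediction columns
        predictions.select("text", "prediction").show()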
  27. Rich Function Library (added in Spark 1.5). 100+ native functions with optimized codegen implementations: • String manipulation: concat, format_string, lower, lpad • Date/Time: current_timestamp, date_format, date_add, … • Math: sqrt, randn, … • Other: monotonicallyIncreasingId, sparkPartitionId, …

    Python:
    from pyspark.sql.functions import *
    yesterday = date_sub(current_date(), 1)
    df2 = df.filter(df.created_at > yesterday)

    Scala:
    import org.apache.spark.sql.functions._
    val yesterday = date_sub(current_date(), 1)
    val df2 = df.filter(df("created_at") > yesterday)
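    These functions compose like any other column expression. A hedged sketch; the first_name, last_name, and zip columns are assumptions:

        from pyspark.sql.functions import concat, lpad

        df3 = df.select(
            concat(df.first_name, df.last_name).alias("full_name"),  # hypothetical columns
            lpad(df.zip, 5, "0").alias("zip5"))  # left-pad a string column to 5 chars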
  28. Optimized Execution with Project Tungsten: compact encoding, cache-aware algorithms, runtime code generation
  29. The overheads of JVM objects. The string "abcd": • Native: 4 bytes with UTF-8 encoding • Java: 48 bytes (12-byte object header, 8-byte hashcode, 20 bytes of data plus overhead)

    java.lang.String object internals:
    OFFSET  SIZE  TYPE     DESCRIPTION      VALUE
         0     4           (object header)  ...
         4     4           (object header)  ...
         8     4           (object header)  ...
        12     4  char[]   String.value     []
        16     4  int      String.hash      0
        20     4  int      String.hash32    0
    Instance size: 24 bytes (reported by Instrumentation API)
  30. Tungsten's Compact Encoding. The row (123, "data", "bricks") is encoded as: null bitmap 0x0 | 123 | 32L | 48L | 4 "data" | 6 "bricks". The 32L and 48L slots are offsets to the variable-length data, which is stored with its field lengths (4 and 6). "abcd" with Tungsten encoding: ~5-6 bytes
  31. Runtime Bytecode Generation

    DataFrame code / SQL:
    df.where(df("year") > 2015)

    Catalyst expressions:
    GreaterThan(year#234, Literal(2015))

    Low-level bytecode:
    boolean filter(Object baseObject) {
      long offset = baseOffset + bitSetWidthInBytes + 3*8L;
      int value = Platform.getInt(baseObject, offset);
      return value > 2015;
    }

    Platform.getInt(baseObject, offset) is a JVM intrinsic, JIT-ed down to pointer arithmetic.
  32. Coming soon: Datasets (preview in Spark 1.6) • Type-safe: operate on domain objects with compiled lambda functions • Fast: code-generated encoders for fast serialization • Interoperable: easily convert DataFrame ↔ Dataset without boilerplate

    // Convert data to domain objects.
    val df = ctx.read.json("people.json")
    case class Person(name: String, age: Int)
    val ds: Dataset[Person] = df.as[Person]
    ds.filter(_.age > 30)

    // Compute histogram of age by name.
    val hist = ds.groupBy(_.name).mapGroups {
      case (name, people: Iterator[Person]) =>
        val buckets = new Array[Int](10)
        people.map(_.age).foreach { a => buckets(a / 10) += 1 }
        (name, buckets)
    }
  33. Create and run your Spark programs faster with Spark SQL: • Write less code • Read less data • Let the optimizer do the hard work. Questions? Committer Office Hours: Michael, Weds 4:00-5:00 pm; Reynold, Thurs 10:30-11:30 am; Andrew, Thurs 2:00-3:00 pm
