New directions for Apache Spark in 2015

1. New Directions for Spark in 2015
    Matei Zaharia, February 20, 2015

2. What is Apache Spark?
    A fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics. The most active open source project in big data.

3. About Databricks
    Founded by the creators of Spark in 2013. The largest organization contributing to Spark: 3/4 of the code in 2014. Offers an end-to-end hosted service, Databricks Cloud.

4. 2014: an Amazing Year for Spark
    Total contributors grew from 150 to 500; lines of code from 190K to 370K; 500 active production deployments.

5. Contributors per Month to Spark
    [Chart: monthly contributor counts, 2011 through 2015]

6. Contributors per Month to Spark
    [Same chart, annotated: most active project at Apache]

7. On-Disk Sort Record: Time to Sort 100 TB
    2013 record (Hadoop): 72 minutes on 2100 machines
    2014 record (Spark): 23 minutes on 207 machines
    Source: Daytona GraySort benchmark, sortbenchmark.org

8. Distributors and Applications
    [Logos of Spark distributors and applications built on Spark]

9. New Directions in 2015
    Data Science: high-level interfaces similar to single-machine tools.
    Platform Interfaces: plug in data sources and algorithms.

10. DataFrames
    Similar API to data frames in R and Pandas; automatically optimized via Spark SQL. Coming in Spark 1.3.

        df = jsonFile("tweets.json")
        df[df["user"] == "matei"]
          .groupBy("date")
          .sum("retweets")

    [Chart: running time of the same aggregation in Python, Scala, and the DataFrame API]

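    Note: a minimal runnable sketch expanding the slide's snippet, written against the SparkSession entry point from Spark 2.x (the slide's jsonFile was a SQLContext method in the 1.3 era). The tweets.json file and its user/date/retweets fields are assumptions for illustration.

        # Sketch: filter-and-aggregate with the DataFrame API (PySpark).
        # Assumes a local tweets.json with fields user, date, and retweets.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("dataframes-demo").getOrCreate()

        df = spark.read.json("tweets.json")      # modern equivalent of jsonFile()
        result = (df[df["user"] == "matei"]      # keep only matei's tweets
                  .groupBy("date")               # one group per day
                  .sum("retweets"))              # total retweets per day
        result.show()
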
11. R Interface (SparkR)
    Arrives in Spark 1.4 (June). Exposes DataFrames, RDDs, and the ML library in R.

        df <- jsonFile("tweets.json")
        summarize(
          group_by(
            df[df$user == "matei", ],
            "date"),
          sum("retweets"))

12. Machine Learning Pipelines
    High-level API inspired by scikit-learn: featurization, evaluation, model tuning.

        tokenizer = Tokenizer()
        tf = HashingTF(numFeatures=1000)
        lr = LogisticRegression()
        pipe = Pipeline([tokenizer, tf, lr])
        model = pipe.fit(df)

    [Diagram: DataFrame -> tokenizer -> TF -> LR -> model]

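    Note: a runnable sketch of the same pipeline. It follows the released pyspark.ml API rather than the slide's shorthand, which means explicit input/output column names and Pipeline(stages=...); the two-row training DataFrame is invented toy data.

        # Sketch: a three-stage ML pipeline (tokenize -> hash -> logistic regression).
        from pyspark.sql import SparkSession
        from pyspark.ml import Pipeline
        from pyspark.ml.feature import Tokenizer, HashingTF
        from pyspark.ml.classification import LogisticRegression

        spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

        # Hypothetical training data: free text plus a binary label.
        train = spark.createDataFrame(
            [("spark is fast", 1.0), ("hadoop wrote to disk", 0.0)],
            ["text", "label"])

        tokenizer = Tokenizer(inputCol="text", outputCol="words")   # text -> tokens
        tf = HashingTF(inputCol="words", outputCol="features",
                       numFeatures=1000)                            # tokens -> TF vector
        lr = LogisticRegression(maxIter=10)          # reads "features" and "label"
        pipe = Pipeline(stages=[tokenizer, tf, lr])  # chain the three stages

        model = pipe.fit(train)                      # fits every stage in order
        model.transform(train).select("text", "prediction").show()
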
13. External Data Sources
    A platform API to plug smart data sources into Spark. Sources return DataFrames usable in Spark apps or SQL, and Spark pushes logic into the sources.

    [Diagram: Spark connected to JSON, MySQL, and other sources]

14. External Data Sources (continued)
    A cross-source query such as

        SELECT * FROM mysql_users u JOIN hive_logs h WHERE u.lang = 'en'

    lets Spark push the filter into the MySQL source, which receives only

        SELECT * FROM users WHERE lang = 'en'

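    Note: a sketch of consuming one such source from PySpark, using the built-in JDBC source as the example. The spark.read.format spelling is the DataFrameReader API from Spark 1.4+, and the MySQL URL, table, and view names are hypothetical; a matching JDBC driver must be on the classpath.

        # Sketch: reading an external JDBC source through the data sources API.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("datasources-demo").getOrCreate()

        # Hypothetical MySQL connection; the source returns a DataFrame.
        users = (spark.read.format("jdbc")
                 .option("url", "jdbc:mysql://db.example.com/prod")
                 .option("dbtable", "users")
                 .load())

        users.createOrReplaceTempView("mysql_users")

        # The filter on the view can be pushed down into MySQL by the source.
        english = spark.sql("SELECT * FROM mysql_users WHERE lang = 'en'")
        english.show()
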
15. Goal: one engine for all data sources, workloads, and environments

16. To Learn More
    Two free massive open online courses on Spark: databricks.com/moocs
    Try Databricks Cloud: databricks.com
