
Spark Community Update - Spark Summit San Francisco 2015

Spark Community Update by Matei Zaharia and Patrick Wendell


  1. Spark Community Update. Matei Zaharia & Patrick Wendell, June 15, 2015.
  2. A Great Year for Spark: the most active open source project in data processing; a new language, R; many new features & community projects.
  3. Community Growth, June 2014 → June 2015: total contributors, 255 → 730; contributors/month, 75 → 135; lines of code, 175,000 → 400,000.
  4. Community Growth, June 2014 → June 2015 (continued): total contributors, 255 → 730; contributors/month, 75 → 135; lines of code, 175,000 → 400,000, mostly in libraries.
  5. Users: 1000+ companies … Distributors + apps: 50+ companies …
  6. Large-Scale Usage: largest cluster, 8000 nodes; largest single job, 1 petabyte; top streaming intake, 1 TB/hour; holder of the 2014 on-disk sort record.
  7. Open Source Ecosystem: applications, environments, data sources.
  8. Current Spark Components: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX. A unified engine across diverse workloads & environments (see the first sketch after the slide list).
  9. Major Directions in 2015: data science, with interfaces similar to single-node tools; platform APIs, growing the ecosystem.
  10. Data Science: DataFrames, a popular API for data transformation (1.3); Machine Learning Pipelines, inspired by scikit-learn (1.4); the R language (1.4). (A DataFrame sketch follows the slide list.)
  11. Platform APIs. Data sources: a uniform interface to diverse sources (DataFrames + SQL). Spark Packages: a community site with 70+ libraries, spark-packages.org. (A data sources sketch follows the slide list.)
  12. Ongoing Engine Improvements. Project Tungsten: code generation, binary processing, off-heap memory. DAG visualization & debugging tools.
  13. Spark’s 1.4 Release
  14. R Language Support: an R API based on Spark’s DataFrames.
  15. An R Runtime for Big Data. Spark’s scale: thousands of machines and cores. Spark’s performance: runtime optimizer, code generation, memory management.
  16. Access to Spark’s I/O packages: CSV, ElasticSearch, Avro, Cassandra, Parquet, MongoDB, SequoiaDB, HBase. (SparkR; assumes a SparkR sqlContext is in scope.)

      # DataFrame from JSON
      > people <- read.df("people.json", "json")

      # ... from MySQL
      > people <- read.df("jdbc:mysql://sql01", "jdbc")

      # ... from Hive
      > people <- read.table("orders")
  17. ML Pipelines. [Diagram: DataFrame → tokenizer → hashingTF → lr, composed as a Pipeline; fitting on a DataFrame yields a Model.]

      # create pipeline
      from pyspark.ml import Pipeline
      from pyspark.ml.feature import Tokenizer, HashingTF
      from pyspark.ml.classification import LogisticRegression

      tok = Tokenizer(inputCol="text", outputCol="words")
      tf = HashingTF(inputCol="words", outputCol="features")
      lr = LogisticRegression(maxIter=10, regParam=0.01)
      pipeline = Pipeline(stages=[tok, tf, lr])

      # train pipeline
      df = sqlCtx.table("training")
      model = pipeline.fit(df)

      # make predictions
      df = sqlCtx.read.json("/path/to/test")
      model.transform(df).select("id", "text", "prediction")
  18. ML Pipelines: a stable API with hooks for third-party pipeline components. Feature transformers: VectorAssembler, String/VectorIndexer, OneHotEncoder, PolynomialExpansion, … New algorithms: GLM with elastic-net, tree classifiers, tree regressors, OneVsRest, … (A transformer sketch follows the slide list.)
  19. Performance and Project Tungsten: managed memory for aggregations; memory-efficient shuffles. [Chart: customer pipeline latency, Spark 1.3 vs. 1.4.]
  20. What’s Coming in Spark 1.5+? Project Tungsten: code generation, improved sort + aggregation. Spark Streaming: flow control, optimized state management. ML: single-machine solvers, scalability to many features. SparkR: integration with Spark’s machine learning APIs.
  21. Join Us Today at Office Hours! At the Databricks booth (A1); more tomorrow…
      1:00-1:45   Spark Core, YARN; Spark Streaming
      1:45-2:30   Spark SQL
      3:00-3:40   Spark Ops
      3:40-4:15   Spark SQL
      4:30-5:15   Spark Core, PySpark; Spark MLlib
      5:15-6:00   Spark MLlib
  22. Thanks!
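
Sketch for slide 8: a minimal, hypothetical PySpark program showing the "unified engine" idea, with one SparkContext driving both core RDD processing and Spark SQL. The file path, table name, and column names are illustrative, not from the deck.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="unified-demo")
    sqlCtx = SQLContext(sc)

    # Spark Core: a plain RDD of (word, count) pairs
    counts = (sc.textFile("hdfs:///logs/events.txt")  # hypothetical path
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    # Spark SQL: the same data as a DataFrame, queried declaratively
    df = sqlCtx.createDataFrame(counts, ["word", "n"])
    df.registerTempTable("word_counts")
    sqlCtx.sql("SELECT word, n FROM word_counts ORDER BY n DESC LIMIT 10").show()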
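
Sketch for slide 10: a hypothetical DataFrame transformation in the single-node-tool style the slide alludes to; the input file and column names are made up.

    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sqlCtx = SQLContext(sc)                   # assumes an existing SparkContext `sc`
    people = sqlCtx.read.json("people.json")  # hypothetical input

    # select, filter, and aggregate through a declarative, optimizable API
    (people.filter(people.age > 21)
           .groupBy("dept")
           .agg(F.avg("age").alias("avg_age"))
           .show())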
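
Sketch for slide 11: the uniform read/write interface of the data sources API, reusing the sqlCtx above, with illustrative paths. Built-in formats such as JSON and Parquet ship with Spark; third-party formats from spark-packages.org plug into this same API (typically pulled in with spark-submit --packages).

    # read two different formats through one interface
    events = sqlCtx.read.format("json").load("events.json")
    orders = sqlCtx.read.format("parquet").load("orders.parquet")

    # write any DataFrame back out the same way
    events.write.format("parquet").save("events-copy.parquet")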
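
Sketch for slide 18: three of the listed 1.4 feature transformers composed into a Pipeline, as in slide 17; the column names and the input DataFrame df are hypothetical.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    indexer = StringIndexer(inputCol="dept", outputCol="dept_index")
    encoder = OneHotEncoder(inputCol="dept_index", outputCol="dept_vec")
    assembler = VectorAssembler(inputCols=["age", "dept_vec"], outputCol="features")

    # transformers are pipeline stages, so they chain with estimators
    pipeline = Pipeline(stages=[indexer, encoder, assembler])
    features = pipeline.fit(df).transform(df)  # df has "dept" and "age" columns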
