Spark Summit EU 2015: Matei Zaharia keynote

2015 was a year of continued growth for Spark, with numerous additions to the core project and very fast growth of use cases across the industry. In this talk, I’ll look back at how the Spark community has grown and changed in 2015, based on a large Apache Spark user survey conducted by Databricks. We see some interesting trends in the diversity of runtime environments (which are increasingly not just Hadoop); the types of applications run on Spark; and the types of users, now that features like R support and DataFrames are available in Spark. I’ll also cover the ongoing work in the upcoming releases of Spark to support new use cases.

Published in: Software

  1. How Spark Usage is Evolving in 2015. Matei Zaharia, October 28, 2015
  2. A Great Year for Spark: Most active open source project in big data. New language: R. Widespread industry support & adoption.
  3. Community Growth (2014 to 2015): Summit attendees 1100 to 3900; meetup members 12K to 42K; developers contributing 350 to 600.
  4. Meetup Groups: January 2015 (source: meetup.com)
  5. Meetup Groups: October 2015 (source: meetup.com)
  6. What Spark Provides: General engine with libraries for many data analysis tasks. Access to diverse data sources. Simple, unified API. SQL, Streaming, ML, Graph, … Major focus in past 2 years. Data source API added 2015.
  7. What Changed in 2015?
  8. Databricks Survey: 1400 respondents from 840 companies. Three trends: 1) Diverse applications 2) More runtime environments 3) More types of users
  9. Industries Using Spark: Other; Software (SaaS, Web, Mobile); Consulting (IT); Retail, e-Commerce; Advertising, Marketing, PR; Banking, Finance; Health, Medical, Pharmacy, Biotech; Carriers, Telecommunications; Education; Computers, Hardware. Chart values: 29.4%, 17.7%, 14.0%, 9.6%, 6.7%, 6.5%, 4.4%, 4.4%, 3.9%, 3.5%
  10. Top Applications: Fraud Detection / Security 29%; User-Facing Services 36%; Log Processing 40%; Recommendation 44%; Data Warehousing 52%; Business Intelligence 68%
  11. Spark Components Used: MLlib + GraphX 58%; Spark Streaming 58%; DataFrames 62%; Spark SQL 69%. 75% of users use more than one component.
  12. Diverse Runtime Environments. Hadoop: combined compute + storage (HDFS, MapReduce). Spark: independent of storage layer (HDFS; SQL, e.g. Oracle; NoSQL, e.g. Cassandra).
  13. Diverse Runtime Environments: 2014 2015 Hadoop Use a little Use a lot Hadoop 61% 31% NoSQL Proprietary SQL 46% 34% 43% 36% 37% 21%
  14. Diverse Runtime Environments. How respondents are running Spark: 51% on a public cloud. Most common Spark deployment environments (cluster managers): Standalone mode 48%; YARN 40%; Mesos 11%.
  15. Diversity of Users: 84%, 38%, 38%, 71%, 31%, 58%, 18%. Languages Used: 2014; Languages Used: 2015
  16. Fastest Growing Components: +280% increase in Windows users; +56% production use of Streaming; +380% production use of SQL
  17. Are We Done? No! Development is faster than ever. Biggest technical change in 2015 was DataFrames • Moves many computations onto the relational Spark SQL optimizer. Enables both new APIs and more optimization, which is now happening through Project Tungsten.
  18. Traditional Spark DataFrames RDDs DataFrames Opaque Java objects User code Storage DataFrame API SQL Schema-aware cache Structured data sources Java functions Expressions Optimizer Query pushdown
  19. Coming in Spark 1.6: Dataset API: typed interface over DataFrames / Tungsten • Common ask from developers who saw DataFrames. case class Person(name: String, age: Int) val dataframe = read.json("people.json") val ds: Dataset[Person] = dataframe.as[Person] ds.filter(p => p.name.startsWith("M")) .groupBy("name") .avg("age")
  20. Other Upcoming Features: DataFrame integration with GraphX and Streaming. More Tungsten features: faster in-memory cache, SSD storage, better code generation. Data sources for Streaming. See Reynold’s talk tomorrow for details.
  21. Dank je! Enjoy Spark Summit
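
The Dataset snippet on slide 19 is flattened in this transcript, so here is a minimal runnable sketch of the same idea. It is not the speaker's code: it assumes the SparkSession entry point from Spark 2.x and later (the slide previews Spark 1.6, where the API differed), a local master, and a small hypothetical in-memory sample standing in for people.json. The names DatasetSketch, Mia, Max and Noor are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Declared at top level so Spark can derive an Encoder for it.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ with spark-sql on the classpath, run locally.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows used in place of people.json.
    val dataframe: DataFrame =
      Seq(("Mia", 31), ("Max", 45), ("Noor", 28)).toDF("name", "age")

    // Typed view over the same data: each row becomes a Person.
    val ds: Dataset[Person] = dataframe.as[Person]

    // The filter is an ordinary typed Scala lambda; groupBy/avg are
    // relational operators that go through the Spark SQL optimizer.
    val result: DataFrame = ds
      .filter(p => p.name.startsWith("M"))
      .groupBy("name")
      .avg("age")

    result.show()
    spark.stop()
  }
}
```

The sketch mixes the two styles the slides contrast: the lambda in filter is compile-time checked against Person, while the grouping and aggregation stay as column expressions that the optimizer can see.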
