BDM26: Spark Summit 2014 Debriefing


Highlights of the most interesting topics discussed at the Spark Summit 2014 in San Francisco, California

Published in: Software

  1. 1. Spark Summit 2014 Debriefing ● David Lauzon ● Presented at Big Data Montreal #26 on July 8th, 2014
  2. 2. Plan ● Spark Summit 2014 summary ● Tachyon ● BlinkDB ● Databricks Cloud
  3. 3. Disclaimer ● I haven’t used Spark yet ● I haven’t validated all the info gathered in this presentation ● Try it out for yourself :-)
  4. 4. Spark’s Role in the Big Data Ecosystem Matei Zaharia (CTO, Databricks)
  5. 5. “Spark is now the most active project in the Hadoop ecosystem”
  6. 6. “The goal of Spark is to be a unified platform and standard library for big data apps”
  7. 7. native driver
  8. 8. What’s Next for BDAS? Mike Franklin (Director, UC Berkeley AMPLab)
  9. 9. LAYERS Application Data Processing Resource Management Data Management
  10. 10. BDAS Summary (1/2) ● Spark Core: general-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata ● Shark: replaces Hive’s MapReduce execution engine with Spark ● Spark Streaming: competitor to Storm. Inputs from Kafka, Flume, Twitter, TCP sockets ● MLlib: low-level machine learning library running on Spark ● MLbase (in dev): competitor to Mahout, runs on top of MLlib ● GraphX (in dev): enables users to interactively build, transform, and reason about graph-structured data at scale
  11. 11. BDAS Summary (2/2) ● BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data ● SparkR (alpha): run R on top of Spark ● Tachyon: a reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, LocalFS, etc. ● Mesos: cluster resource manager, multi-tenancy
  12. 12. Spark and the future of big data applications Eric Baldeschwieler (Tech Advisor)
  13. 13. Big Data Application Model
  14. 14. Spark’s current (v1.0) challenges ● Better job scheduling tools ● Increase focus on ETL ● R bindings ● Extend SparkSQL to run on more data stores ● Add more machine learning algorithms ● Basics: stability, profiling & debugging, error reporting, logging, etc.
  15. 15. Spark’s current (v1.0) challenges ● Better stability ● Profiling & debugging ● Error reporting ● Logging
  16. 16. The Future of Spark Patrick Wendell (Databricks)
  17. 17. Timeline and: ● join optimisations ● MLlib: from 15 to 30 algorithms ● Core internal API for pluggable implementations
  18. 18. The Emergence of the Enterprise Data Hub Mike Olson (Chief Strategy Officer, Cloudera)
  19. 19. (a vision of the future)
  20. 20. This means that sooner or later ... Hadoop MapReduce
  21. 21. Spark meets Genomics: Helping Fight the Big C with the Big D David Patterson (AMP Lab, UC Berkeley)
  22. 22. SNAP: Scalable Nucleotide Alignment Program => A new genome aligner based on Spark that is 10-100X faster and simultaneously more accurate than existing tools based on MapReduce or other algorithms [1] [1]
  23. 23. SNAP helps save a life [1] ● A teenager was hospitalized for 5 weeks without a successful diagnosis ● He developed brain seizures and was placed in a medically induced coma ● Using a sample of his spinal fluid and SNAP, a rare infectious bacterium was found ● The boy was treated and discharged 4 weeks later [1]
  24. 24. Databricks Update and Announcing Databricks Cloud Ion Stoica (CEO, Databricks)
  25. 25. even RedHat Fedora
  26. 26. New: Databricks Cloud Platform
  27. 27. Databricks Platform
  28. 28. Databricks Workspace: Notebooks
  29. 29. Databricks Workspace: Dashboards
  30. 30. Databricks Cloud Demo The following video extract integrates: ● Databricks Workspace ● Databricks Platform ● Spark Streaming ● Spark SQL ● Spark MLlib
  31. 31. Databricks Cloud Demo 14min extract: Full video:
  32. 32. Databricks Cloud Great tool for data scientists
  33. 33. Conclusion
  34. 34. Conclusion Most interesting Spark-related projects: ● SparkSQL ● BlinkDB ● Tachyon ● Databricks Cloud
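
Slide 11 describes BlinkDB as "SQL queries with bounded errors and bounded response times on very large data". The general idea behind that kind of trade-off can be sketched in plain Python: answer an aggregate from a random sample and report an error bound along with the estimate. This is only an illustration under stated assumptions, not BlinkDB's API or its actual algorithm. The function name `approximate_mean`, its parameters, and the synthetic data are all hypothetical, and the error bound uses a simple normal approximation.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper for illustration only (not a BlinkDB function).
    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval under a normal approximation.
    """
    rng = random.Random(seed)
    # Sample without replacement; only these rows are scanned.
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic "large" column: 100,000 values drawn uniformly from [0, 100].
data_rng = random.Random(42)
data = [data_rng.uniform(0, 100) for _ in range(100_000)]

# Answer the query from 1% of the rows, with an error bound attached.
estimate, margin = approximate_mean(data, sample_size=1_000)
exact = statistics.fmean(data)
print(f"approx = {estimate:.2f} +/- {margin:.2f}, exact = {exact:.2f}")
```

A larger `sample_size` shrinks the margin at the cost of scanning more rows, which is the error-versus-response-time trade-off the slide alludes to.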