Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Unified Big Data Processing with Apache Spark

0 views

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.

Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.

Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.

Published in: Technology
  • Fibroids Miracle! Cure Uterine Fibroids , end your PCOS related symptoms and regain your natural inner balance ... Guaranteed! -- Discover how Amanda Leto has taught thousands of women worldwide to achieve Uterine Fibroids freedom faster than they ever thought possible... Even if you've never succeeded at curing your Uterine Fibroids before... Right here you've found the Uterine Fibroids freedom success system you've been looking for! ◆◆◆ http://ishbv.com/fibroids7/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Unified Big Data Processing with Apache Spark

  1. 1. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /apache-spark-big-data InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  2. 2. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  3. 3. Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia
  4. 4. What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing Most active open source project in big data
  5. 5. About Databricks Founded by the creators of Spark in 2013 Continues to drive open source Spark development, and offers a cloud service (Databricks Cloud) Partners to support Spark with Cloudera, MapR, Hortonworks, Datastax
  6. 6. Spark Community MapReduce YARN HDFS Storm Spark 2000 1500 1000 500 0 MapReduce YARN HDFS Storm Spark 350000 300000 250000 200000 150000 100000 50000 0 Commits Lines of Code Changed Activity in past 6 months
  7. 7. Community Growth 100 75 50 25 0 Contributors per Month to Spark 2010 2011 2012 2013 2014 2-3x more activity than Hadoop, Storm, MongoDB, NumPy, D3, Julia, …
  8. 8. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  9. 9. History: Cluster Programming Models 2004
  10. 10. MapReduce A general engine for batch processing
  11. 11. Beyond MapReduce MapReduce was great for batch processing, but users quickly needed to do more: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing Result: many specialized systems for these workloads
  12. 12. Big Data Systems Today MapReduce Pregel Giraph Presto Storm Dremel Drill Impala S4 . . . Specialized systems for new workloads General batch processing
  13. 13. Problems with Specialized Systems More systems to manage, tune, deploy Can’t combine processing types in one application > Even though many pipelines need to do this! > E.g. load data with SQL, then run machine learning In many pipelines, data exchange between engines is the dominant cost!
  14. 14. MapReduce Pregel Giraph Presto Storm Dremel Drill Impala S4 Specialized systems for new workloads General batch processing Unified engine Big Data Systems Today ? . . .
  15. 15. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  16. 16. Background Recall 3 workloads were issues for MapReduce: > More complex, multi-pass algorithms > More interactive ad-hoc queries > More real-time stream processing While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing
  17. 17. Data Sharing in MapReduce iter. 1 iter. 2 . . . Input HDFS read HDFS write HDFS read HDFS write Input query 1 query 2 query 3 result 1 result 2 result 3 . . . HDFS read Slow due to data replication and disk I/O
  18. 18. What We’d Like iter. 1 iter. 2 . . . Input Distributed memory Input query 1 query 2 query 3 . . . one-time processing 10-100× faster than network and disk
  19. 19. Spark Model Resilient Distributed Datasets (RDDs) > Collections of objects that can be stored in memory or disk across a cluster > Built via parallel transformations (map, filter, …) > Fault-tolerant without replication
  20. 20. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Base RTDraDn sformed RDD lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(‘t’)[2]) messages.cache() Block 1 Block 2 Block 3 Worker Worker Worker Driver messages.filter(lambda s: “foo” in s).count() messages.filter(lambda s: “bar” in s).count() . . . results tasks Cache 1 Cache 2 Cache 3 Action Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
  21. 21. Fault Tolerance RDDs track lineage info to rebuild lost data file.map(lambda rec: (rec.type, 1)) .reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10) map reduce filter Input file
  22. 22. Fault Tolerance RDDs track lineage info to rebuild lost data file.map(lambda rec: (rec.type, 1)) map reduce filter Input file .reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10)
  23. 23. Example: Logistic Regression 4000 3500 Running Time (s) Number of Iterations 3000 2500 2000 1500 1000 500 0 1 5 10 20 30 110 s / iteration Hadoop Spark first iteration 80 s later iterations 1 s
  24. 24. Spark in Scala and Java // Scala: val lines = sc.textFile(...) lines.filter(s => s.contains(“ERROR”)).count() // Java: JavaRDD<String> lines = sc.textFile(...); lines.filter(s -> s.contains(“ERROR”)).count();
  25. 25. How General Is It?
  26. 26. Spark Streaming real-time Spark Core Spark SQL relational MLlib machine learning GraphX graph Libraries Built on Spark
  27. 27. Spark SQL Represents tables as RDDs Tables = Schema + Data
  28. 28. Spark SQL Represents tables as RDDs Tables = Schema + Data = SchemaRDD From Hive: c = HiveContext(sc) rows = c.sql(“select text, year from hivetable”) rows.filter(lambda r: r.year > 2013).collect() {“text”: “hi”, “user”: { “name”: “matei”, “id”: 123 }} From JSON: c.jsonFile(“tweets.json”).registerTempTable(“tweets”) c.sql(“select text, user.name from tweets”) tweets.json
  29. 29. Spark Streaming Time Input
  30. 30. Spark Streaming Time RDD RDD RDD RDD RDD RDD Represents streams as a series of RDDs over time val spammers = sc.sequenceFile(“hdfs://spammers.seq”) sc.twitterStream(...) .filter(t => t.text.contains(“QCon”)) .transform(tweets => tweets.map(t => (t.user, t)).join(spammers)) .print()
  31. 31. MLlib Vectors, Matrices
  32. 32. MLlib Vectors, Matrices = RDD[Vector] Iterative computation points = sc.textFile(“data.txt”).map(parsePoint) model = KMeans.train(points, 10) model.predict(newPoint)
  33. 33. GraphX Represents graphs as RDDs of edges and vertices
  34. 34. GraphX Represents graphs as RDDs of edges and vertices
  35. 35. GraphX Represents graphs as RDDs of edges and vertices
  36. 36. Combining Processing Types // Load data using SQL val points = ctx.sql( “select latitude, longitude from historic_tweets”) // Train a machine learning model val model = KMeans.train(points, 10) // Apply it to a stream sc.twitterStream(...) .map(t => (model.closestCenter(t.location), 1)) .reduceByWindow(“5s”, _ + _)
  37. 37. Composing Workloads Separate systems: . . . HDFS read HDFS write ETL HDFS read HDFS write train HDFS read HDFS write query HDFS write HDFS read ETL train query Spark:
  38. 38. Hive Impala (disk) Impala (mem) Spark (disk) Spark (mem) 0 10 20 30 40 50 Response Time (sec) SQL Mahout GraphLab Spark 0 10 20 30 40 50 60 Response Time (min) ML Performance vs Specialized Systems Storm Spark 0 5 10 15 20 25 30 35 Throughput (MB/s/node) Streaming
  39. 39. On-Disk Performance: Petabyte Sort Spark beat last year’s Sort Benchmark winner, Hadoop, by 3× using 10× fewer machines 2013 Record (Hadoop) Spark 100 TB Spark 1 PB Data Size 102.5 TB 100 TB 1000 TB Time 72 min 23 min 234 min Nodes 2100 206 190 Cores 50400 6592 6080 Rate/Node 0.67 GB/min 20.7 GB/min 22.5 GB/min tinyurl.com/spark-sort
  40. 40. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  41. 41. Why was Spark so General? In a world of growing data complexity, understanding this can help us design new tools / pipelines Two perspectives: > Expressiveness perspective > Systems perspective
  42. 42. 1. Expressiveness Perspective Spark ≈ MapReduce + fast data sharing
  43. 43. 1. Expressiveness Perspective MapReduce can emulate any distributed system! How to share data! quickly across steps? Local computation All-to-all communication One MR step … Spark: RDDs How low is this latency? Spark: ~100 ms
  44. 44. 2. Systems Perspective Main bottlenecks in clusters are network and I/O Any system that lets apps control these resources can match speed of specialized ones In Spark: > Users control data partitioning & caching > We implement the data structures and algorithms of specialized systems within Spark records
  45. 45. Examples Spark SQL > A SchemaRDD holds records for each chunk of data (multiple rows), with columnar compression GraphX > GraphX represents graphs as an RDD of HashMaps so that it can join quickly against each partition
  46. 46. Result Spark can leverage most of the latest innovations in databases, graph processing, machine learning, … Users get a single API that composes very efficiently More info: tinyurl.com/matei-thesis
  47. 47. Overview Why a unified engine? Spark execution model Why was Spark so general? What’s next
  48. 48. What’s Next for Spark While Spark has been around since 2009, many pieces are just beginning 300 contributors, 2 whole libraries new this year Big features in the works
  49. 49. Spark 1.2 (Coming in Dec) New machine learning pipelines API > Featurization & parameter search, similar to SciKit-Learn Python API for Spark Streaming Spark SQL pluggable data sources > Hive, JSON, Parquet, Cassandra, ORC, … Scala 2.11 support
  50. 50. Beyond Hadoop Batch Interactive Streaming Hadoop Cassandra Mesos … … Public Clouds Your application here Unified API across workloads, storage systems and environments
  51. 51. Learn More Downloads and tutorials: spark.apache.org Training: databricks.com/training (free videos) Databricks Cloud: databricks.com/cloud
  52. 52. www.spark-summit.org
  53. 53. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/apache-spark- big-data

×