Introduction to Spark (Intern Event Presentation)

An introduction to Apache Spark from its creator, Matei Zaharia, for the intern event hosted by Databricks.

  1. Introduction to Spark
     Matei Zaharia, Databricks
     Intern Event, August 2015
  2. What is Apache Spark?
     Fast and general computing engine for clusters that makes it easy and fast to process large datasets
     • APIs in Java, Scala, Python, R
     • Libraries for SQL, streaming, machine learning, …
     • 100x faster than Hadoop MapReduce for some apps
  3. About Databricks
     Founded by the creators of Spark in 2013
     Offers a hosted cloud service built on Spark
     • Interactive workspace with notebooks, dashboards, and jobs
  4. Community Growth
     [Chart: contributors per month to Spark, 2010-2015]
     Most active open source project in big data
  5. Spark Programming Model
     Write programs in terms of transformations on distributed datasets
     Resilient Distributed Datasets (RDDs):
     • Collections of objects stored in memory or on disk across a cluster
     • Built via parallel transformations (map, filter, …)
     • Automatically rebuilt on failure
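A minimal runnable sketch of this model in PySpark, complementing the log-mining example on the next slide. The SparkContext setup and the toy numbers are assumptions for illustration; parallelize, map, filter, and count are the standard RDD API:

     from pyspark import SparkContext

     sc = SparkContext(appName="RDDSketch")

     # Base RDD built from a local collection (could equally come from a file)
     nums = sc.parallelize(range(1, 1001))

     # Parallel transformations: these only define new RDDs, lazily
     squares = nums.map(lambda x: x * x)
     evens = squares.filter(lambda x: x % 2 == 0)

     # An action triggers the actual distributed computation
     print(evens.count())  # 500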
  6. Example: Log Mining
     Load error messages from a log into memory, then interactively search for various patterns

     lines = spark.textFile("hdfs://...")                    # base RDD
     errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
     messages = errors.map(lambda s: s.split('\t')[2])
     messages.cache()

     messages.filter(lambda s: "MySQL" in s).count()         # action
     messages.filter(lambda s: "Redis" in s).count()
     . . .

     [Diagram: the driver ships tasks to three workers; each worker reads one block of the file, keeps its partition of the messages RDD in cache, and returns results to the driver]

     Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)
  7. Example: Logistic Regression
     Iterative algorithm used in machine learning
     [Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark]
     Hadoop: 110 s / iteration
     Spark: 80 s for the first iteration, 1 s for further iterations
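A sketch of the kind of iterative job behind these numbers: logistic regression by gradient descent over a cached RDD, so every iteration after the first reads from cluster memory instead of disk. The file path, feature count, and iteration count are assumptions; each input line is taken to hold a +/-1 label followed by the feature values:

     import numpy as np
     from pyspark import SparkContext

     sc = SparkContext(appName="LogisticRegressionSketch")
     D = 10  # number of features (assumed)

     def parse(line):
         v = np.array([float(x) for x in line.split()])
         return v[0], v[1:]  # (label, features)

     # cache() keeps the parsed points in memory across iterations
     points = sc.textFile("hdfs://.../points.txt").map(parse).cache()

     w = np.zeros(D)
     for i in range(20):
         # One distributed pass over the cached data per iteration
         grad = points.map(
             lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
         ).reduce(lambda a, b: a + b)
         w -= grad
     print(w)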
  8. On-Disk Performance
     Time to sort 100 TB (source: Daytona GraySort benchmark, sortbenchmark.org)
     2013 record: Hadoop, 2100 machines, 72 minutes
     2014 record: Spark, 207 machines, 23 minutes
  9. Higher-Level Libraries
     Built on the Spark core:
     • Spark Streaming: real-time processing
     • Spark SQL: structured data
     • MLlib: machine learning
     • GraphX: graph processing
  10. Higher-Level Libraries

      # Load data using SQL
      points = ctx.sql("select latitude, longitude from tweets")

      # Train a machine learning model
      model = KMeans.train(points, 10)

      # Apply it to a stream
      sc.twitterStream(...)
        .map(lambda t: (model.predict(t.location), 1))
        .reduceByWindow("5s", lambda a, b: a + b)
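The slide's code is illustrative pseudocode for combining the libraries (sc.twitterStream and reduceByWindow in that form are not real PySpark calls). Below is a runnable approximation of just the SQL + MLlib part using actual Spark 1.x-era APIs; the tweets JSON path and its schema are assumptions:

      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.mllib.clustering import KMeans

      sc = SparkContext(appName="CombinedLibsSketch")
      ctx = SQLContext(sc)

      # Register a "tweets" table from a JSON file (assumed to contain
      # latitude and longitude fields)
      ctx.read.json("hdfs://.../tweets.json").registerTempTable("tweets")

      # Load data using SQL, then turn rows into plain feature vectors
      points = (ctx.sql("select latitude, longitude from tweets")
                .rdd.map(lambda row: [row.latitude, row.longitude]))

      # Train a machine learning model
      model = KMeans.train(points, 10)

      # Score a new location against the trained clusters
      print(model.predict([37.77, -122.42]))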
  11. Demo
  12. Spark Community
      Over 1000 production users, clusters up to 8000 nodes
      Many talks online at spark-summit.org
  13. Ongoing Work
      • Speeding up Spark through code generation and binary processing (Project Tungsten)
      • R interface to Spark (SparkR)
      • Real-time machine learning library
      • Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)
  14. Thank you. We’re hiring!
