Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More

Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289


  1. Spark Camp @ Strata CA
 An Intro to Apache Spark with Hands-on Tutorials
 Wed Feb 18, 2015, 9:00am–5:00pm
 strataconf.com/big-data-conference-ca-2015/
  2. Spark Camp @ Strata + Hadoop World
 A day-long hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more…
 • an overview of use cases, with demonstrations of writing simple Spark applications
 • coverage of each of the main components of the Spark stack
 • a series of technical talks targeted at developers who are new to Spark
 • hands-on lab work intermixed with the talks
  3. Spark Camp @ Strata + Hadoop World
 Strata NY @ NYC, 2014-10-15: ~450 people
 Strata EU @ Barcelona, 2014-11-19: ~250 people
  4. Spark Camp: Ask Us Anything
 Fri Feb 20, 2:20pm–3:00pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40701
 Join the Spark team for an informal question and answer session. Several of the Spark committers, trainers, etc., from Databricks will be on hand to field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking!
  5. Apache Spark Advanced Training
 Feb 17–19, 9:00am–5:00pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39399
 Sameer Farooqui leads this new 3-day training program offered by Databricks and O’Reilly Media at Strata + Hadoop World events worldwide. Participants will also receive limited free-tier accounts on Databricks Cloud. Note: this sold out early, so if you want to attend it at Strata EU, sign up quickly!
  6. Spark Developer Certification
 Fri Feb 20, 2015, 10:40am–12:40pm
 • http://oreilly.com/go/sparkcert
 • defined by Spark experts @ Databricks
 • assessed by O’Reilly Media
 • establishes the bar for Spark expertise
  7. Developer Certification: Overview
 • 40 multiple-choice questions, 90 minutes
 • mostly structured as choices among code blocks
 • expect some Python, Java, Scala, SQL
 • understand theory of operation
 • identify best practices
 • recognize code that is more parallel, less memory constrained
 Overall, you need to have written Spark apps in practice.
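As a hedged illustration of the “more parallel, less memory constrained” criterion (not an actual exam question): in Spark, `reduceByKey` combines values map-side before the shuffle, while `groupByKey` ships every value across the network and materializes all of a key’s values in memory. The sketch below shows the contrast on a hypothetical RDD, then checks the equivalent logic on a local Scala collection so it can run without a cluster.

```scala
// Exam-style contrast on a hypothetical pair RDD `rdd: RDD[(String, Int)]`:
//   rdd.groupByKey().mapValues(_.sum)   // gathers all values per key first: memory-hungry
//   rdd.reduceByKey(_ + _)              // combines map-side before the shuffle: more parallel
//
// Both produce the same result; the two styles are checked here with
// local analogues (groupBy ~ groupByKey, foldLeft ~ reduceByKey):

def sumByKeyGrouped(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

def sumByKeyReduced(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
    acc + (k -> (acc.getOrElse(k, 0) + v))
  }
```

The results agree, but the second style never holds more than one running total per key, which is the property the certification blurb is pointing at.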
  8. Even More Apache Spark!
 Feb 17–20, 2015
  9. Keynote: New Directions for Spark in 2015
 Fri Feb 20, 9:15am–9:25am
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39547
 As the Apache Spark user base grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability, and standard libraries. In 2015, we want to make Spark accessible to a wider set of users through new high-level APIs for data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and external packages. Like all work on Spark, these APIs are designed to plug seamlessly into Spark applications, giving users a unified platform for streaming, batch, and interactive data processing.
 Matei Zaharia – started the Spark project at UC Berkeley; currently CTO of Databricks, VP of Apache Spark, and an assistant professor at MIT
  10. Databricks Spark Talks @ Strata + Hadoop World
 Thu Feb 19, 10:40am–11:20am
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
 Lessons from Running Large-Scale Spark Workloads – Reynold Xin, Matei Zaharia
 Thu Feb 19, 4:00pm–4:40pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38518
 Spark Streaming – The State of the Union, and Beyond – Tathagata Das
  11. Databricks Spark Talks @ Strata + Hadoop World
 Fri Feb 20, 11:30am–12:10pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
 Tuning and Debugging in Apache Spark – Patrick Wendell
 Fri Feb 20, 4:00pm–4:40pm
 strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38391
 Everyday I’m Shuffling – Tips for Writing Better Spark Programs – Vida Ha, Holden Karau
  12. A Brief History
  13. A Brief History: Functional Programming for Big Data
 circa late 1990s: the explosive growth of e-commerce and machine data meant that workloads could no longer fit on a single computer… notable firms led the shift to horizontal scale-out on clusters of commodity hardware, especially for machine learning use cases at scale
  14. A Brief History: Functional Programming for Big Data
 2002: MapReduce @ Google
 2004: MapReduce paper
 2006: Hadoop @ Yahoo!
 2008: Hadoop Summit
 2010: Spark paper
 2014: Apache Spark becomes a top-level project
  15. A Brief History: MapReduce
 circa 2002: mitigate the risk of large distributed workloads being lost to disk failures on commodity hardware…
 The Google File System – Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung – research.google.com/archive/gfs.html
 MapReduce: Simplified Data Processing on Large Clusters – Jeffrey Dean, Sanjay Ghemawat – research.google.com/archive/mapreduce.html
  16. A Brief History: MapReduce
 MR doesn’t compose well for large applications, so specialized systems emerged as workarounds to its general batch-processing model – for iterative, interactive, streaming, graph, and other workloads: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
  17. A Brief History: Spark
 Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations. spark.apache.org
 “Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”
 Gartner, Advanced Analytics and Data Science (2014)
  18. A Brief History: Spark
  19. TL;DR: Sustained Exponential Growth
 Spark is one of the most active Apache projects: ohloh.net/orgs/apache
  20. TL;DR: Spark Survey 2015 by Databricks + Typesafe
 databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html
  21. TL;DR: Smashing the Previous Petabyte Sort Record
 databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  22. TL;DR: Spark Expertise Tops Median Salaries within Big Data
 oreilly.com/data/free/2014-data-science-salary-survey.csp
  23. Unifying the Pieces
  24. Simple Spark Apps: WordCount
 WordCount in 3 lines of Spark vs. WordCount in 50+ lines of Java MR
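The advertised three-liner can be sketched as follows; the SparkContext `sc` and the input path are assumptions here. Since RDD transformations mirror the Scala collections API, the same chain is also written against a plain `Seq` (with `groupBy` + `sum` standing in for `reduceByKey`) so the logic can be checked without a cluster:

```scala
// The claimed 3-liner, assuming a SparkContext `sc` and a hypothetical input path:
//   val counts = sc.textFile("hdfs:///input.txt")
//                  .flatMap(_.split(" "))
//                  .map(word => (word, 1))
//                  .reduceByKey(_ + _)

// The same transformations on a local Scala collection, cluster-free:
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines.flatMap(_.split(" "))          // split each line into words
       .map(word => (word, 1))          // pair each word with a count of 1
       .groupBy(_._1)                   // local stand-in for the shuffle step
       .map { case (w, ones) => (w, ones.map(_._2).sum) }
```

For example, `wordCount(Seq("to be or", "not to be"))` yields `Map("to" -> 2, "be" -> 2, "or" -> 1, "not" -> 1)`. The Java MR equivalent needs separate Mapper and Reducer classes plus job-configuration boilerplate, which is the 3-vs-50-line contrast the slide is drawing.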
  25. Data Workflows: Spark SQL

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._

 // Define the schema using a case class.
 case class Person(name: String, age: Int)

 // Create an RDD of Person objects and register it as a table.
 val people = sc.textFile("examples/src/main/resources/people.txt")
   .map(_.split(","))
   .map(p => Person(p(0), p(1).trim.toInt))
 people.registerTempTable("people")

 // SQL statements can be run using the sql method provided by sqlContext.
 val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

 // The results of SQL queries are SchemaRDDs and support all the normal
 // RDD operations. The columns of a row can be accessed by ordinal.
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  26. Data Workflows: Spark Streaming

 // http://spark.apache.org/docs/latest/streaming-programming-guide.html
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.StreamingContext._

 // create a StreamingContext with a SparkConf configuration
 val ssc = new StreamingContext(sparkConf, Seconds(10))

 // create a DStream that will connect to serverIP:serverPort
 val lines = ssc.socketTextStream(serverIP, serverPort)

 // split each line into words
 val words = lines.flatMap(_.split(" "))

 // count each word in each batch
 val pairs = words.map(word => (word, 1))
 val wordCounts = pairs.reduceByKey(_ + _)

 // print a few of the counts to the console
 wordCounts.print()

 ssc.start()             // start the computation
 ssc.awaitTermination()  // wait for the computation to terminate
  27. Data Workflows: MLlib
 spark.apache.org/docs/latest/mllib-guide.html
 Key points:
 • framework vs. library
 • scale, parallelism, sparsity
 • building blocks for a long-term approach
 MLI: An API for Distributed Machine Learning – Evan Sparks, Ameet Talwalkar, et al. – International Conference on Data Mining (2013) – http://arxiv.org/abs/1310.5426
  28. Community Resources
  29. community: spark.apache.org/community.html
 events worldwide: goo.gl/2YqJZK
 video + presentation archives: spark-summit.org
 resources: databricks.com/spark-training-resources
 workshops: databricks.com/spark-training
  30. Books:
 Fast Data Processing with Spark – Holden Karau – Packt (2013)
 shop.oreilly.com/product/9781782167068.do
 Spark in Action – Chris Fregly – Manning (2015*)
 sparkinaction.com/
 Learning Spark – Holden Karau, Andy Konwinski, Matei Zaharia – O’Reilly (2015*)
 shop.oreilly.com/product/0636920028512.do
  31. Spark Summit: http://spark-summit.org/
 20% discount code: SSEDBFRIEND20
