Spark For Plain Old Java Geeks (June2014 Meetup)

An overview of the Apache Spark project from the perspective of a Java programmer. Topics: What is Spark, Spark Programming Model, Spark eco-system, 1.0 release and why it's a huge milestone.

Published in: Software, Technology, Education

  1. Intro to Apache Spark: A primer for POJGs (Plain Old Java Geeks)
     Scott Deeg, Sr. Field Engineer, sdeeg@gopivotal.com
     © Copyright 2013 Pivotal. All rights reserved.
  2. Agenda
     • Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
     • What is Spark, and what does it have to do with BigData/Hadoop?
       – Ecosystem (Shark, Streaming, MLlib, GraphX)
     • Spark Programming Model
       – Demo: interactive shell
     • Related Projects
     • Spark 1.0
     • More Tech: WordCount, TicTacToe (dev experience), Java8
     • Deployment Topologies
       – Simple Cluster Demo
  3. Who Am I? Just a Plain Old Java Guy
     • Java since 1996, Symantec Visual Café 1.0
     • Random consulting around Si Valley
     • Hacker on Java-based BPM product for 10 years
     • Joined VMW 2009 when they acquired SpringSource
     • Rolled into Pivotal April 1 2013
  4. What is Pivotal?
     • Cloud, Big Data, Fast Data, Modern Apps
     • Technology Bets
       – HDFS will be the way we talk to Enterprise data repositories
         ▪ Consolidate Silos in “Data Lake”
         ▪ Eco-system of services will arise to utilize HDFS data
       – PaaS will manage the Application Life Cycle
       – OSS will be the basis for solutions
       – Cloud Architecture
         ▪ Distributed / Parallel
         ▪ CPU, Memory, Network … storage is a distributed service
  5. [Diagram: Pivotal Platform. Labels: Data Sources, Application Platform, Stream Server, IMDG, ASF Services, MPP SQL, HDFS, SQL, Objects, JSON, GemFireXD, ...ETC, End Users, Developers, AppOps]
  6. What Is Spark? Hint: It’s all about the RDD
  7. ?
     • Is it “Big Data”?
     • Is it “Hadoop”?
     • It’s one of those “in memory” things, right?
     • JVM, Java, Scala
     • Is it real, or just another shiny technology with a long, but ultimately small, tail?
  8. Spark is …
     • Distributed/Cluster Compute Execution Engine
       – Came out of AMPLab project at UCB, now ASF top level project
     • Designed to work with data in memory
     • Similar scalability and fault tolerance to Hadoop Map/Reduce
       – Utilizes Lineage to reconstitute data instead of replication
     • Generalization of Map/Reduce
       – Implementation of Resilient Distributed Dataset (RDD)
     • Programmatic or Interactive
     • Written in Scala
  9. Spark is also …
     • An ASF Top Level project
     • Has ~100 contributors across 25 companies
       – More active than Hadoop MapReduce
     • An eco-system of domain specific tools
       – Different models, but mostly interoperable
     • Hadoop Compatible
  10. Berkeley Data Analytics Stack (BDAS)
      Support:
      • Batch
      • Streaming
      • Interactive
      Make it easy to compose them
  11. Short History
      • 2009: Started as research project at UCB
      • 2010: Open Sourced
      • January 2011: AMPLab Created
      • October 2012: 0.6
        – Java, Stand alone cluster, maven
      • June 21 2013: Spark accepted into ASF Incubator
      • Feb 27 2014: Spark becomes top level ASF project
      • May 30 2014: Spark 1.0
  12. Spark Philosophy
      • Make life easy and productive for Data Scientists
      • Provide well documented and expressive APIs
      • Powerful Domain Specific Libraries
      • Easy integration with storage systems
      • Caching to avoid data movement (performance)
      • Well defined releases, stable API
  13. Spark is not Hadoop, but is compatible
      • Often better than Hadoop (Eric Baldeschwieler)
        – M/R fine for “Data Parallel”, but awkward for some workloads
        – Low latency dispatch, Iterative, Streaming
      • Natively accesses Hadoop data
      • Spark just another YARN job
        – Maintains huge investment in data collection
        – Brings Spark to the Data
      • It’s not OR … it’s AND!
  14. Improvements over Map/Reduce
      • Efficiency
        – General Execution Graphs (not just map->reduce->store)
        – In memory
      • Usability
        – Rich APIs in Scala, Java, Python
        – Interactive
      • Can Spark be the R for Big Data?
  15. Spark Programming Model: RDDs in Detail
  16. Core Concept
      Think of a program as a set of transformations on a Distributed Dataset
      Model: Resilient Distributed Dataset (RDD)
        – Read Only Collection of Objects spread across a cluster
        – RDDs are built through parallel transformations (map, filter, etc.)
        – Automatically rebuilt on failure using lineage
        – Controllable persistence (RAM, HDFS, etc.)
  17. Operations
      • Create
        – From stable storage (HDFS)
      • Transform
        – Generate RDD from other RDD (map, filter, groupBy)
        – Lazy Operations that build a DAG
        – Once Spark knows your transformations it can build an efficient plan
      • Action
        – Return a result or write to storage (count, collect, reduce, save)
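The split between lazy transformations and eager actions can be sketched with plain `java.util.stream`, which has the same shape on a single machine. This is an analogy only, not Spark's API: the class name `LazyPipelineSketch`, the tab-separated log format, and the call counter are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream is NOT Spark, but it shares the
// "lazy transformations, eager actions" shape described on the slide.
class LazyPipelineSketch {

    // Counts how many times the filter predicate has actually run.
    static final AtomicInteger predicateCalls = new AtomicInteger();

    // "Create" + "Transform": describes the work but does not run it yet.
    static Stream<String> errorMessages(List<String> lines) {
        return lines.stream()
                .filter(line -> {
                    predicateCalls.incrementAndGet();
                    return line.contains("error");
                })
                .map(line -> line.split("\t")[2]); // keep the third tab-separated field
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "10:01\terror\tdisk full",
                "10:02\tinfo\tall good",
                "10:03\terror\ttimeout");

        Stream<String> plan = errorMessages(lines);
        System.out.println("predicate calls before the action: " + predicateCalls.get());

        // "Action": collecting forces the whole pipeline to execute.
        List<String> result = plan.collect(Collectors.toList());
        System.out.println("result: " + result);
        System.out.println("predicate calls after the action: " + predicateCalls.get());
    }
}
```

Nothing runs until the collecting step, which is what lets an engine see the whole chain of transformations before it decides how to execute them.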
  18. Demo: Log Mining
      • Scala shell
      • Load file from HDFS
      • Search for patterns
  19. Transformations and Actions
      • Transformations: map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, sort
      • Actions: count, collect, reduce, lookup, save
  20. RDD Fault Tolerance
      • RDDs maintain lineage information that can be used to reconstruct lost partitions
      cachedMsgs = textFile(...).filter(_.contains("error"))
                                .map(_.split('\t')(2))
                                .cache()
      Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
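The lineage idea can be modelled on one machine in a few lines of Java. This is a toy sketch under stated assumptions, not Spark's implementation: the class `LineageSketch` and its methods (`fromSource`, `loseCache`, and so on) are hypothetical names, every node caches its result, and "losing a partition" is simulated by dropping one node's cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Toy, single-machine model of lineage. All names here are hypothetical
// and are not part of Spark's API.
class LineageSketch<T> {

    private final String lineage;             // human-readable chain of steps
    private final Supplier<List<T>> recipe;   // how to rebuild this dataset from its parent
    private List<T> cached;                   // null when not materialized, or after a "loss"
    int computations = 0;                     // how many times this node was (re)built

    private LineageSketch(String lineage, Supplier<List<T>> recipe) {
        this.lineage = lineage;
        this.recipe = recipe;
    }

    // Root of the lineage: data that lives in stable storage.
    static <T> LineageSketch<T> fromSource(String name, List<T> stableData) {
        return new LineageSketch<>(name, () -> new ArrayList<>(stableData));
    }

    LineageSketch<T> filter(String label, Predicate<T> keep) {
        return new LineageSketch<>(lineage + " -> filter(" + label + ")", () -> {
            List<T> out = new ArrayList<>();
            for (T item : collect()) {        // collect() here is the parent's collect()
                if (keep.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    <R> LineageSketch<R> map(String label, Function<T, R> f) {
        return new LineageSketch<>(lineage + " -> map(" + label + ")", () -> {
            List<R> out = new ArrayList<>();
            for (T item : collect()) {
                out.add(f.apply(item));
            }
            return out;
        });
    }

    // Returns the cached data, rebuilding it from the recipe if it is missing.
    List<T> collect() {
        if (cached == null) {
            cached = recipe.get();
            computations++;
        }
        return cached;
    }

    // Simulates losing the in-memory copy.
    void loseCache() {
        cached = null;
    }

    String lineage() {
        return lineage;
    }

    public static void main(String[] args) {
        LineageSketch<String> msgs = LineageSketch
                .fromSource("source", Arrays.asList("a\tb\terror one", "a\tb\tok", "a\tb\terror two"))
                .filter("contains error", s -> s.contains("error"))
                .map("split field 2", s -> s.split("\t")[2]);

        System.out.println(msgs.lineage());
        System.out.println(msgs.collect());
        msgs.loseCache();                     // pretend the cached copy was lost
        System.out.println(msgs.collect());   // rebuilt from the recorded steps
    }
}
```

The design point is that only the recipe is kept, not a second copy of the data, which is the trade-off the slide contrasts with replication.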
  21. RDDs are Foundational
      • General purpose enough to use to implement other programming models
        – SQL
        – Graph
        – ML
        – MR
  22. Related Projects: Things that run on Spark
  23. Related Projects
      • Shark
      • Spark SQL
      • Spark Streaming
      • GraphX
      • MLbase
      • Others
  24. Shark
      • Hive on Spark
        – HiveQL, UDFs, etc.
      • Turn SQL into RDD
        – Part of the lineage
      • Based on Hive, but takes advantage of Spark for
        – Fast Scheduling
        – Queries are DAGs of jobs, not chained M/R
        – Fast broadcast variables
      © Apache Software Foundation
  25. Shark (cont)
      • Optimized Columnar Storage format
      • Fast/Efficient Compression
        – From Yahoo!
        – Able to hold 3-20x more data in same cluster
      • Various other optimizations using partitioning
      • Will ultimately run on Spark SQL
        – No Hive dependencies except for accessing the Hive datastore
        – Long running process with management tools
  26. Spark SQL
      • Lib in Spark Core to treat RDDs as relations
        – SchemaRDD
      • Lighter weight version of Shark
        – No code from Hive
      • Import/Export in different Storage formats
        – Parquet, learn schema from existing Hive warehouse
      • Takes columnar storage from Shark
  27. Spark SQL Code
      • Go take a look
  28. Spark Streaming
      • Extend Spark to do large scale stream processing
        – 100s of nodes and second-scale end-to-end latency
      • Stateful Processing
        – Hard to make fault tolerant (FT)
        – Storm: requires idempotent updates
      • Simple, batch-like API with RDDs
      • Single semantics for both real time and high latency
  29. Streaming (cont)
      • Input is broken up into Batches that become RDDs
      • RDDs are composed into DAGs to generate output
      • Raw data is replicated in-memory for FT
  30. Streaming (cont)
      • Other features
        – Window-based Transformations
        – Arbitrary join of streams
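The micro-batch and window ideas from the streaming slides can be sketched in plain Java. This is a single-machine illustration under assumptions, not Spark Streaming's API: the class `MicroBatchSketch` and its method names are hypothetical, and batches are cut by element count here purely to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-machine sketch of micro-batching. Names are hypothetical and are
// not Spark Streaming's API; batches are cut by element count for simplicity.
class MicroBatchSketch {

    // Break the input into fixed-size batches.
    static <T> List<List<T>> toBatches(List<T> events, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < events.size(); i += batchSize) {
            int end = Math.min(i + batchSize, events.size());
            batches.add(new ArrayList<>(events.subList(i, end)));
        }
        return batches;
    }

    // Per-batch computation: the same code one would run on a static collection.
    static long countErrors(List<String> batch) {
        return batch.stream().filter(e -> e.contains("error")).count();
    }

    // Window-based transformation: for each batch, the sum of the per-batch
    // counts over the most recent `window` batches (including the current one).
    static List<Long> windowedErrorCounts(List<List<String>> batches, int window) {
        List<Long> perBatch = new ArrayList<>();
        for (List<String> batch : batches) {
            perBatch.add(countErrors(batch));
        }
        List<Long> windowed = new ArrayList<>();
        for (int i = 0; i < perBatch.size(); i++) {
            long sum = 0;
            for (int j = Math.max(0, i - window + 1); j <= i; j++) {
                sum += perBatch.get(j);
            }
            windowed.add(sum);
        }
        return windowed;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList(
                "error a", "ok", "error b", "error c", "ok", "ok");
        List<List<String>> batches = toBatches(events, 2);
        System.out.println("batches: " + batches);
        System.out.println("errors over a 2-batch window: " + windowedErrorCounts(batches, 2));
    }
}
```

Because each batch is an ordinary collection, the per-batch function is the same code a batch job would use, which is the "single semantics" point made on the earlier streaming slide.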
  31. GraphX (Alpha)
      • Graph processing
        – Replaces Spark Bagel
      • Graph Parallel not Data Parallel
        – Reason in the context of neighbors
        – GraphLab API
  32. GraphX (cont)
      • Predicting things about people (e.g. political bias)
        – Look at posts, apply classifier, try to predict attribute
        – Local signal is difficult alone
        – Look at context of social network to improve prediction
      • Triangle processing
        – More triangles reveal greater community
      • Collaborative Filtering
        – Bipartite graph processing
        – What I like, who rated those things, what they like => what I may like
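Triangle counting, mentioned above as a community signal, can be sketched on a small in-memory graph. This is a brute-force, single-machine illustration and not GraphX's implementation: the class `TriangleCountSketch` and its methods are hypothetical, and the graph is assumed to be undirected with no self-loops.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Brute-force triangle counting on a small undirected graph.
// Hypothetical names; assumes no self-loops. Not GraphX's implementation.
class TriangleCountSketch {

    // Build adjacency sets from an edge list of {from, to} pairs.
    static Map<Integer, Set<Integer>> adjacency(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    // For each vertex, count pairs of its neighbours that are also connected
    // to each other; each such pair closes one triangle through the vertex.
    static Map<Integer, Integer> trianglesPerVertex(int[][] edges) {
        Map<Integer, Set<Integer>> adj = adjacency(edges);
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : adj.entrySet()) {
            List<Integer> neighbours = new ArrayList<>(entry.getValue());
            int count = 0;
            for (int i = 0; i < neighbours.size(); i++) {
                for (int j = i + 1; j < neighbours.size(); j++) {
                    if (adj.get(neighbours.get(i)).contains(neighbours.get(j))) {
                        count++;
                    }
                }
            }
            result.put(entry.getKey(), count);
        }
        return result;
    }

    public static void main(String[] args) {
        // Vertices 1, 2, 3 form a triangle; vertex 4 hangs off vertex 3.
        int[][] edges = { {1, 2}, {2, 3}, {1, 3}, {3, 4} };
        System.out.println(trianglesPerVertex(edges));
    }
}
```

A vertex that sits in many triangles has neighbours who also know each other, which is the sense in which more triangles suggest a tighter community.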
  33. GraphX (cont)
      • Graph Creation => Algorithm => Post Processing
        – Existing systems mainly deal with the Algorithm and not interactive
        – Unify collection and graph models
      • Graphs have
        – Vertices, edges
        – Transformation: reverse, filter, map
        – Joins: graphs and tables
        – Aggregate Neighbors
  34. MLbase
      • Machine Learning toolset
        – Library and higher level abstractions
      • General tool is MATLAB
        – Difficult for end users to learn, debug, scale solutions
      • Starting with MLlib
        – Low level Distributed Machine Learning Library
      • Many different Algorithms
        – Classification, Regression, Collaborative Filtering, etc.
  35. Others
      • Mesos
        – Enable multiple frameworks to share same cluster resources
        – Twitter is largest user: Over 6,000 servers
      • Tachyon
        – In-memory, fault tolerant file system that exposes HDFS
      • Catalyst
        – SQL Query Optimizer
  36. Spark 1.0
  37. Release cycle
      • 1.0 came out at end of May
      • 1.X expected to be current for several years
      • Quarterly release cycle
        – 2 mo dev / 1 mo QA
        – Actual release is based on vote
      • 1.1 due end of August
  38. 1.0
      • API Stability in 1.X for all non-Alpha projects
        – Can recompile jobs, but hoping for binary compatibility
        – Internal APIs are marked @DeveloperApi or @Experimental
      • Focus: Core Engine, Streaming, MLlib, Spark SQL
      • History Server for Spark UI
        – Driving development of instrumentation
      • Job Submission Tool
        – Don’t configure Context in code (e.g. master)
  39. 1.0
      • Java8 Lambdas
        – No more writing closures as Classes
        – Functions are interfaces
        – Return type sensitive functions
          ▪ mapToPair
      • Python improvements
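The Java 8 points above can be illustrated without Spark on the classpath. In this sketch the interfaces `Fn` and `PairFn` and the helpers `map` and `mapToPair` are hypothetical stand-ins defined locally, chosen only to mirror the shape of the idea: functions are single-method interfaces, so a lambda can replace an anonymous class, and a separately named method is used when the function returns a key/value pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-ins, not Spark's classes: they only mirror the
// idea that functions are single-method interfaces.
class LambdaSketch {

    interface Fn<T, R> {
        R call(T t);
    }

    interface PairFn<T, K, V> {
        Map.Entry<K, V> call(T t);
    }

    // Plain element-to-element mapping.
    static <T, R> List<R> map(List<T> in, Fn<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T t : in) {
            out.add(f.call(t));
        }
        return out;
    }

    // Separately named method for functions that return key/value pairs,
    // so the compiler never has to choose an overload from a lambda's return type.
    static <T, K, V> List<Map.Entry<K, V>> mapToPair(List<T> in, PairFn<T, K, V> f) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (T t : in) {
            out.add(f.call(t));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java");

        // Pre-Java-8 style: the closure is written out as an anonymous class.
        List<Integer> oldStyle = map(words, new Fn<String, Integer>() {
            @Override
            public Integer call(String s) {
                return s.length();
            }
        });

        // Java 8 style: the same function as a lambda.
        List<Integer> newStyle = map(words, s -> s.length());

        // Pair-returning function goes through the distinctly named method.
        List<Map.Entry<String, Integer>> pairs = mapToPair(words, s -> new SimpleEntry<>(s, 1));

        System.out.println(oldStyle + " " + newStyle + " " + pairs);
    }
}
```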
  40. 1.0
      • Hadoop security
        – Kerberos, ACL for UI
      • Job cancel from UI
      • Distributed GC as things go out of scope
        – Good for long-lived services
      • Spark SQL
  41. More Code and Demos: WordCount, TicTacToe, Java8
  42. Code Review: WordCount
      • Java API
      • Java Code
      • More usage of RDDs
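The code reviewed in the talk is not part of this transcript. As a rough stand-in, here is a minimal single-machine word count written with `java.util.stream`; it is not Spark's Java API and not the presenter's code. The class name `WordCountSketch`, the lower-casing, and the whitespace splitting are assumptions; the comments mark which step plays the role of which transformation named earlier in the deck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-machine word count with java.util.stream. A stand-in for the
// shape of the job only; this is not Spark's Java API.
class WordCountSketch {

    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // one line becomes many words (the role flatMap plays)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // leading whitespace can produce an empty token; drop it
                .filter(word -> !word.isEmpty())
                // group equal words and count them (the role of pairing each
                // word with 1 and then reducing by key)
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(wordCount(lines));
    }
}
```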
  43. TicTacToe: a developer’s experience
      • IDE
      • Spring
      • Building/Logging
      • Debugging
  44. Demo: Java 8 Lambda Lambda Lambda
  45. Deployment Topologies
  46. Topologies
      • Local
      • Spark Cluster (master/slaves)
      • Cluster Resource Managers
        – YARN
        – Mesos
      • (PaaS?)
  47. Demo:
      • Start master and slaves
      • Show the UI
      • Run a Job
      • Talk about the History Server
  48. This And That
  49. How Real is Spark?
      • There is some criticism
        – As expected
        – New project!
      • There are many indicators that Spark is heading to success
        – Solid technology
        – Good buzz
        – Significant community
  50. Next Steps
      • Spark website: http://spark.apache.org
        – Lots’O’Goodstuff
      • Spark Summit June 30/July 01
        – http://spark-summit.org
  51. A NEW PLATFORM FOR A NEW ERA
