Spark For Plain Old Java Geeks (June2014 Meetup)

An overview of the Apache Spark project from the perspective of a Java programmer. Topics: What is Spark, Spark Programming Model, Spark eco-system, 1.0 release and why it's a huge milestone.

Spark For Plain Old Java Geeks (June2014 Meetup) Presentation Transcript

  • 1. Intro to Apache Spark: A primer for POJGs (Plain Old Java Geeks)
      Scott Deeg: Sr. Field Engineer, sdeeg@gopivotal.com
      © Copyright 2013 Pivotal. All rights reserved.
  • 2. Agenda
      • Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
      • What is Spark, and what does it have to do with BigData/Hadoop?
          – Ecosystem (Shark, Streaming, MLlib, GraphX)
      • Spark Programming Model
          – Demo: interactive shell
      • Related Projects
      • Spark 1.0
      • More Tech: WordCount, TicTacToe (dev experience), Java 8
      • Deployment Topologies
          – Simple Cluster Demo
  • 3. Who Am I? Just a Plain Old Java Guy
      • Java since 1996, Symantec Visual Café 1.0
      • Random consulting around Si Valley
      • Hacker on Java based BPM product for 10 years
      • Joined VMW 2009 when they acquired SpringSource
      • Rolled into Pivotal April 1 2013
  • 4. What is Pivotal?
      • Cloud, Big Data, Fast Data, Modern Apps
      • Technology Bets
          – HDFS will be the way we talk to Enterprise data repositories
              ▪ Consolidate Silos in “Data Lake”
              ▪ Eco-system of services will arise to utilize HDFS data
          – PaaS will manage the Application Life Cycle
          – OSS will be the basis for solutions
          – Cloud Architecture
              ▪ Distributed / Parallel
              ▪ CPU, Memory, Network … storage is a distributed service
  • 5. [Diagram: Pivotal Platform. Labels: Data Sources; Application Platform; Stream Server; IMDG; ASF Services; MPP SQL; HDFS; SQL, Objects, JSON; GemFireXD; ...ETC; End Users; Developers; AppOps]
  • 6. What Is Spark? Hint: It’s all about the RDD
  • 7. ?
      • Is it “Big Data”?
      • Is it “Hadoop”?
      • It’s one of those “in memory” things, right?
      • JVM, Java, Scala
      • Is it Real or just another shiny technology with a long, but ultimately small tail?
  • 8. Spark is …
      • Distributed/Cluster Compute Execution Engine
          – Came out of AMPLab project at UCB, now ASF top level project
      • Designed to work with data in memory
      • Similar scalability and fault tolerance as Hadoop Map/Reduce
          – Utilizes Lineage to reconstitute data instead of replication
      • Generalization of Map/Reduce
          – Implementation of Resilient Distributed Dataset (RDD)
      • Programmatic or Interactive
      • Written in Scala
  • 9. Spark is also …
      • An ASF Top Level project
      • Has ~100 contributors across 25 companies
          – More active than Hadoop MapReduce
      • An eco-system of domain specific tools
          – Different models, but mostly interoperable
      • Hadoop Compatible
  • 10. Berkeley Data Analytics Stack (BDAS)
      Support:
      • Batch
      • Streaming
      • Interactive
      Make it easy to compose them
  • 11. Short History
      • 2009: Started as research project at UCB
      • 2010: Open Sourced
      • January 2011: AMPLab Created
      • October 2012: 0.6
          – Java, Stand alone cluster, maven
      • June 21 2013: Spark accepted into ASF Incubator
      • Feb 27 2014: Spark becomes top level ASF project
      • May 30 2014: Spark 1.0
  • 12. Spark Philosophy
      • Make life easy and productive for Data Scientists
      • Provide well documented and expressive APIs
      • Powerful Domain Specific Libraries
      • Easy integration with storage systems
      • Caching to avoid data movement (performance)
      • Well defined releases, stable API
  • 13. Spark is not Hadoop, but is compatible
      • Often better than Hadoop (Eric Baldeschwieler)
          – M/R fine for “Data Parallel”, but awkward for some workloads
          – Low latency dispatch, Iterative, Streaming
      • Natively accesses Hadoop data
      • Spark just another YARN job
          – Maintains huge investment in data collection
          – Brings Spark to the Data
      • It’s not OR … it’s AND!
  • 14. Improvements over Map/Reduce
      • Efficiency
          – General Execution Graphs (not just map->reduce->store)
          – In memory
      • Usability
          – Rich APIs in Scala, Java, Python
          – Interactive
      • Can Spark be the R for Big Data?
  • 15. Spark Programming Model: RDDs in Detail
  • 16. Core Concept
      Think of a program as a set of transformations on a Distributed Dataset
      Model: Resilient Distributed Dataset (RDD)
          – Read Only Collection of Objects spread across a cluster
          – RDDs are built through parallel transformations (map, filter, etc.)
          – Automatically rebuilt on failure using lineage
          – Controllable persistence (RAM, HDFS, etc.)
  • 17. Operations
      • Create
          – From stable storage (HDFS)
      • Transform
          – Generate RDD from other RDD (map, filter, groupBy)
          – Lazy Operations that build a DAG
          – Once Spark knows your transformations it can build an efficient plan
      • Action
          – Return a result or write to storage (count, collect, reduce, save)
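
The create / transform / action split on this slide can be illustrated without a cluster. The sketch below is a plain-Java analogy only: it uses `java.util.stream` in place of an RDD, and the class name `LazyPipelineSketch`, the sample log lines, and the call counter are assumptions made for illustration, not part of Spark or of the talk. It shows that describing a filter does no work until a terminal operation (the analogue of an action) runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Plain-Java analogy only: java.util.stream stands in for an RDD pipeline.
class LazyPipelineSketch {

    // Counts how many times the filter function actually runs.
    static final AtomicInteger filterCalls = new AtomicInteger();

    // "Create": a small in-memory list stands in for a file in stable storage.
    static List<String> logLines() {
        return Arrays.asList(
            "INFO boot ok",
            "ERROR disk full",
            "WARN slow response",
            "ERROR timeout");
    }

    // "Transform": describes the work but does not run it yet.
    static Stream<String> errorsOnly(List<String> lines) {
        return lines.stream().filter(line -> {
            filterCalls.incrementAndGet();
            return line.startsWith("ERROR");
        });
    }

    public static void main(String[] args) {
        Stream<String> errors = errorsOnly(logLines());
        System.out.println("filter calls before action: " + filterCalls.get());

        // "Action": count() forces the pipeline to execute.
        long errorCount = errors.count();
        System.out.println("error lines: " + errorCount);
        System.out.println("filter calls after action: " + filterCalls.get());
    }
}
```

The analogy stops at laziness: a Java stream is single-use and local to one JVM, whereas an RDD is distributed and can be reused across several actions.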
  • 18. Demo: Log Mining
      • Scala shell
      • Load file from HDFS
      • Search for patterns
  • 19. Transformations and Actions
      • Transformations
          – map
          – filter
          – flatMap
          – sample
          – groupByKey
          – reduceByKey
          – union
          – join
          – sort
      • Actions
          – count
          – collect
          – reduce
          – lookup
          – save
  • 20. RDD Fault Tolerance
      • RDDs maintain lineage information that can be used to reconstruct lost partitions

        cachedMsgs = textFile(...).filter(_.contains("error"))
                                  .map(_.split('\t')(2))
                                  .cache()

        Lineage: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
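
To make the lineage idea concrete for Java readers, here is a toy model in plain Java. Everything in it (`LineageSketch`, its nested `Dataset` class, `dropCachedData`, the sample lines) is hypothetical and written for illustration; it is not how Spark implements RDDs. Each dataset keeps a function that can rebuild it from its parent, so a lost cached copy is recomputed rather than restored from a replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical toy model of lineage; these are not Spark's actual classes.
class LineageSketch {

    // A dataset that remembers how to rebuild itself from its parent.
    static final class Dataset<T> {
        private final String description;          // human-readable lineage
        private final Supplier<List<T>> recompute; // replays the lineage
        private boolean cacheRequested;
        private List<T> cached;                    // null when absent or lost

        Dataset(String description, Supplier<List<T>> recompute) {
            this.description = description;
            this.recompute = recompute;
        }

        static <T> Dataset<T> fromSource(String name, List<T> data) {
            return new Dataset<T>(name, () -> new ArrayList<T>(data));
        }

        Dataset<T> filter(Predicate<T> keep) {
            Dataset<T> parent = this;
            return new Dataset<T>(description + " -> filter", () -> {
                List<T> out = new ArrayList<T>();
                for (T t : parent.collect()) if (keep.test(t)) out.add(t);
                return out;
            });
        }

        <R> Dataset<R> map(Function<T, R> f) {
            Dataset<T> parent = this;
            return new Dataset<R>(description + " -> map", () -> {
                List<R> out = new ArrayList<R>();
                for (T t : parent.collect()) out.add(f.apply(t));
                return out;
            });
        }

        // Marks the dataset for caching; data is stored on first collect().
        Dataset<T> cache() { cacheRequested = true; return this; }

        // Simulates losing the cached copy, for example when a node fails.
        void dropCachedData() { cached = null; }

        // Returns cached data if present, otherwise rebuilds it from lineage.
        List<T> collect() {
            if (cached != null) return cached;
            List<T> rebuilt = recompute.get();
            if (cacheRequested) cached = rebuilt;
            return rebuilt;
        }

        String lineage() { return description; }
    }

    // Mirrors the slide: keep "error" lines, take the third tab-separated field.
    static Dataset<String> cachedMsgs(List<String> lines) {
        return Dataset.fromSource("source", lines)
            .filter(line -> line.contains("error"))
            .map(line -> line.split("\t")[2])
            .cache();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2014-06-01\terror\tdisk full",
            "2014-06-01\tinfo\tboot ok",
            "2014-06-02\terror\ttimeout");
        Dataset<String> msgs = cachedMsgs(lines);
        System.out.println(msgs.lineage());
        System.out.println(msgs.collect());
        msgs.dropCachedData();              // cached copy is gone
        System.out.println(msgs.collect()); // rebuilt by replaying the lineage
    }
}
```

The design point is that only the recipe has to survive a failure, not the data, which is the trade-off slide 8 describes as lineage instead of replication.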
  • 21. RDDs are Foundational
      • General purpose enough to use to implement other programming models
          – SQL
          – Graph
          – ML
          – MR
  • 22. Related Projects: Things that run on Spark
  • 23. Related Projects
      • Shark
      • Spark SQL
      • Spark Streaming
      • GraphX
      • MLbase
      • Others
  • 24. Shark
      • Hive on Spark
          – HiveQL, UDFs, etc.
      • Turn SQL into RDD
          – Part of the lineage
      • Based on Hive, but takes advantage of Spark for
          – Fast Scheduling
          – Queries are DAGs of jobs, not chained M/R
          – Fast broadcast variables
      (Image © Apache Software Foundation)
  • 25. Shark (cont)
      • Optimized Columnar Storage format
      • Fast/Efficient Compression
          – From Yahoo!
          – Able to hold 3-20x more data in same cluster
      • Various other optimizations using partitioning
      • Will ultimately run on Spark SQL
          – No Hive dependencies except for accessing the Hive datastore
          – Long running process with management tools
  • 26. Spark SQL
      • Lib in Spark Core to treat RDDs as relations
          – SchemaRDD
      • Lighter weight version of Shark
          – No code from Hive
      • Import/Export in different Storage formats
          – Parquet, learn schema from existing Hive warehouse
      • Takes columnar storage from Shark
  • 27. Spark SQL Code
      • Go take a look
  • 28. Spark Streaming
      • Extend Spark to do large scale stream processing
          – 100s of nodes and second-scale end-to-end latency
      • Stateful Processing
          – Hard to make fault tolerant (FT)
          – Storm: requires idempotent updates
      • Simple, batch like API with RDDs
      • Single semantics for both real time and high latency
  • 29. Streaming (cont)
      • Input is broken up into Batches that become RDDs
      • RDDs are composed into DAGs to generate output
      • Raw data is replicated in-memory for FT
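
The micro-batch idea above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Spark Streaming code: the `Event` type, the `toBatches` helper, and the one-second batch width are all hypothetical. It groups timestamped events into fixed-width time buckets, each of which would then be processed like a small batch dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of micro-batching; not the Spark Streaming API.
class MicroBatchSketch {

    // A timestamped input record (illustrative type only).
    static final class Event {
        final long timestampMillis;
        final String payload;

        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    // Groups events into fixed-width time buckets keyed by batch start time.
    // Assumes non-negative timestamps.
    static Map<Long, List<String>> toBatches(List<Event> events, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<Long, List<String>>();
        for (Event e : events) {
            long batchStart = (e.timestampMillis / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<String>())
                   .add(e.payload);
        }
        return batches;
    }

    static List<Event> sampleEvents() {
        return Arrays.asList(
            new Event(100L, "a"),
            new Event(900L, "b"),
            new Event(1200L, "c"),
            new Event(2500L, "d"));
    }

    public static void main(String[] args) {
        // One-second batches: each map entry is one small batch to process.
        System.out.println(toBatches(sampleEvents(), 1000L));
    }
}
```

Because each bucket is an ordinary finite collection, the same batch-style operations can run on it, which is the sense in which the slides describe a single API for batch and streaming work.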
  • 30. Streaming (cont)
      • Other features
          – Window-based Transformations
          – Arbitrary join of streams
  • 31. GraphX (Alpha)
      • Graph processing
          – Replaces Spark Bagel
      • Graph Parallel not Data Parallel
          – Reason in the context of neighbors
          – GraphLab API
  • 32. GraphX (cont)
      • Predicting things about people (e.g. political bias)
          – Look at posts, apply classifier, try to predict attribute
          – Local signal is difficult alone
          – Look at context of social network to improve prediction
      • Triangle processing
          – More triangles reveal greater community
      • Collaborative Filtering
          – Bipartite graph processing
          – What I like, who rated those things, what they like => what I may like
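
For readers unfamiliar with triangle processing, the following is a minimal single-machine sketch in plain Java. It is not GraphX code; the class `TriangleCountSketch`, its methods, and the sample edge list are assumptions made for illustration. It counts sets of three mutually connected vertices in an undirected graph, the quantity the slide uses as a signal of community.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical single-machine sketch; GraphX distributes this kind of work.
class TriangleCountSketch {

    // Builds an undirected adjacency map from {from, to} edge pairs.
    static Map<Integer, Set<Integer>> adjacency(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<Integer, Set<Integer>>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<Integer>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<Integer>()).add(e[0]);
        }
        return adj;
    }

    // Counts each triangle exactly once by only accepting a < b < c.
    static int countTriangles(int[][] edges) {
        Map<Integer, Set<Integer>> adj = adjacency(edges);
        int triangles = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : adj.entrySet()) {
            int a = entry.getKey();
            Set<Integer> neighborsOfA = entry.getValue();
            for (int b : neighborsOfA) {
                if (b <= a) continue;
                for (int c : adj.get(b)) {
                    // c closes the triangle if it is also a neighbor of a.
                    if (c > b && neighborsOfA.contains(c)) triangles++;
                }
            }
        }
        return triangles;
    }

    static int[][] sampleEdges() {
        // Two triangles, (1,2,3) and (3,4,5), sharing vertex 3.
        return new int[][] { {1, 2}, {2, 3}, {1, 3}, {3, 4}, {4, 5}, {3, 5} };
    }

    public static void main(String[] args) {
        System.out.println("triangles: " + countTriangles(sampleEdges()));
    }
}
```

The inner check only needs a vertex and its neighbors, which is what the previous slide means by reasoning in the context of neighbors.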
  • 33. GraphX (cont)
      • Graph Creation => Algorithm => Post Processing
          – Existing systems mainly deal with the Algorithm and are not interactive
          – Unify collection and graph models
      • Graphs have
          – Vertices, edges
          – Transformation: reverse, filter, map
          – Joins: graphs and tables
          – Aggregate Neighbors
  • 34. MLbase
      • Machine Learning toolset
          – Library and higher level abstractions
      • General tool is MATLAB
          – Difficult for end users to learn, debug, scale solutions
      • Starting with MLlib
          – Low level Distributed Machine Learning Library
      • Many different Algorithms
          – Classification, Regression, Collaborative Filtering, etc.
  • 35. Others
      • Mesos
          – Enable multiple frameworks to share same cluster resources
          – Twitter is largest user: Over 6,000 servers
      • Tachyon
          – In-memory, fault tolerant file system that exposes HDFS
      • Catalyst
          – SQL Query Optimizer
  • 36. Spark 1.0
  • 37. Release cycle
      • 1.0 came out at end of May
      • 1.X expected to be current for several years
      • Quarterly release cycle
          – 2 mo dev / 1 mo QA
          – Actual release is based on vote
      • 1.1 due end of August
  • 38. 1.0
      • API Stability in 1.X for all non-Alpha projects
          – Can recompile jobs, but hoping for binary compatibility
          – Internal APIs are marked @DeveloperApi or @Experimental
      • Focus: Core Engine, Streaming, MLlib, Spark SQL
      • History Server for Spark UI
          – Driving development of instrumentation
      • Job Submission Tool
          – Don’t configure Context in code (e.g. master)
  • 39. 1.0
      • Java 8 Lambdas
          – No more writing closures as Classes
          – Functions are interfaces
          – Return type sensitive functions
              ▪ mapToPair
      • Python improvements
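
The "no more writing closures as Classes" point is easiest to see side by side. The sketch below is plain Java with no Spark dependency: `MapFunction` is a hypothetical stand-in for a single-method function interface, and `mapAll` is an illustrative helper, neither taken from the talk. It shows the same function written first as an anonymous class and then as a Java 8 lambda.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; MapFunction stands in for a single-method interface.
class LambdaVsClassSketch {

    // Because functions are interfaces with one method, a lambda can
    // implement them directly.
    @FunctionalInterface
    interface MapFunction<T, R> {
        R call(T value);
    }

    // Tiny local "map": applies f to every element of the input list.
    static <T, R> List<R> mapAll(List<T> input, MapFunction<T, R> f) {
        List<R> out = new ArrayList<R>();
        for (T t : input) out.add(f.call(t));
        return out;
    }

    // Pre-Java 8 style: the closure is written out as an anonymous class.
    static List<Integer> lengthsWithClass(List<String> words) {
        return mapAll(words, new MapFunction<String, Integer>() {
            @Override
            public Integer call(String value) {
                return value.length();
            }
        });
    }

    // Java 8 style: the same function as a one-line lambda.
    static List<Integer> lengthsWithLambda(List<String> words) {
        return mapAll(words, word -> word.length());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java");
        System.out.println(lengthsWithClass(words));
        System.out.println(lengthsWithLambda(words));
    }
}
```

Both methods behave identically; the lambda form simply drops the class boilerplate, which matters when a job chains many small functions.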
  • 40. 1.0
      • Hadoop security
          – Kerberos, ACL for UI
      • Job cancel from UI
      • Distributed GC as things go out of scope
          – Good for long-lived services
      • Spark SQL
  • 41. More Code and Demos: WordCount, TicTacToe, Java 8
  • 42. Code Review: WordCount
      • Java API
      • Java Code
      • More usage of RDDs
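
The transcript does not include the WordCount code that was reviewed. As a rough stand-in, here is the same computation in plain Java using `java.util.stream`; it is an analogy, not the Spark Java API, and the class name `WordCountSketch` and the sample lines are assumptions. The shape follows the usual word count pattern: split lines into words, then count occurrences per word.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy of word count; not the code shown in the talk.
class WordCountSketch {

    // Splits each line on whitespace and counts occurrences of each word.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // One line becomes many words (the flatMap step).
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // Drop empty tokens produced by leading whitespace.
            .filter(word -> !word.isEmpty())
            // Group equal words and count them (the reduce-by-key step).
            .collect(Collectors.groupingBy(
                word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to spark");
        System.out.println(wordCount(lines));
    }
}
```

A `TreeMap` is used only so the printed result has a stable, sorted order; a distributed version would leave the counts partitioned across the cluster instead of collecting them into one map.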
  • 43. TicTacToe: a developer’s experience
      • IDE
      • Spring
      • Building/Logging
      • Debugging
  • 44. Demo: Java 8
      Lambda Lambda Lambda
  • 45. Deployment Topologies
  • 46. Topologies
      • Local
      • Spark Cluster (master/slaves)
      • Cluster Resource Managers
          – YARN
          – Mesos
      • (PaaS?)
  • 47. Demo:
      • Start master and slaves
      • Show the UI
      • Run a Job
      • Talk about the History Server
  • 48. This And That
  • 49. How Real is Spark?
      • There is some criticism
          – As expected
          – New project!
      • There are many indicators that Spark is heading to success
          – Solid technology
          – Good buzz
          – Significant community
  • 50. Next Steps
      • Spark website: http://spark.apache.org
          – Lots’O’Goodstuff
      • Spark Summit June 30/July 01
          – http://spark-summit.org
  • 51. A NEW PLATFORM FOR A NEW ERA