Intro to Apache Spark

This presentation is a comprehensive introduction to Apache Spark, from an explanation of its rapid ascent to its performance and developer-productivity advantages over MapReduce. We also explore its built-in functionality for streaming, machine learning, and Extract, Transform, Load (ETL) applications.


  1. Introduction to Apache Spark
  2. The Leader in Big Data Consulting ● BI/Data Strategy ○ Development of a business intelligence / data architecture strategy. ● Installation ○ Installation of Hadoop or the relevant technology. ● Data Consolidation ○ Loading of data from diverse sources into a single scalable repository. ● Streaming ○ Mammoth will write ingestion and/or analytics that operate on the data as it comes in, and will design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions. ● Visualization Tools ○ Mammoth will set up visualization tools (e.g. Tableau, Pentaho), create initial reports, and provide training to the employees who will analyze the data. Mammoth Data is based in downtown Durham (right above Toast).
  3. Me! ● Lead Consultant on all things DevOps and Spark ● @carsondial
  4. What Is Apache Spark?! ● Apache Spark™ is a fast and general engine for large-scale data processing
  5. What Is Apache Spark?! No, But Really… ● Framework for massively parallel (cluster) computing ● Harnesses the power of cheap memory ● Directed Acyclic Graph (DAG) computing engine ● It goes very fast! ● Apache project (spark.apache.org)
  6. History ● Began at UC Berkeley in 2009 ● Apache project since 2013 ● Top-level Apache project since 2014 ● Creators formed databricks.com
  7. Why Spark? ● Performance ● Developer productivity
  8. Performance! ● Graysort benchmark (100TB) ● Hadoop: 72 minutes / 2,100 nodes / datacentre ● Spark: 23 minutes / 206 nodes / AWS ● HDFS versus memory
  9. Developers! ● First-class support for Scala, Java, Python, and R ● Data-science friendly
  10. Word Count: Hadoop
  11. Word Count: Spark

      from pyspark import SparkContext

      logFile = "hdfs:///input"
      sc = SparkContext("spark://spark-m:7077", "WordCount")
      textFile = sc.textFile(logFile)
      wordCounts = (textFile.flatMap(lambda line: line.split())
                            .map(lambda word: (word, 1))
                            .reduceByKey(lambda a, b: a + b))
      wordCounts.saveAsTextFile("hdfs:///output")
  12. Spark: Batteries Included ● Spark Streaming ● GraphX (graph algorithms) ● MLlib (machine learning) ● Dataframes (data access)
  13. Applications ● Analytics (batch / streaming) ● Machine Learning ● ETL (Extract, Transform, Load) ● …and many more!
  14. RDDs – The Building Block ● RDD = Resilient Distributed Dataset ● Immutable, fault-tolerant ● Operated on in parallel ● Can be created manually or from external sources
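      Both creation paths, as a minimal PySpark sketch (the local master, app name, and HDFS path are illustrative placeholders, not from the slides):

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "RDDExamples")  # hypothetical local master

      # Created manually, by distributing an in-memory collection
      numbers = sc.parallelize([1, 2, 3, 4, 5])

      # Created from an external source, e.g. a text file on HDFS
      lines = sc.textFile("hdfs:///input")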
  15. RDDs – The Building Block ● Transformations ● Actions ● Transformations are lazy ● Actions evaluate the pipelined transformations as well as performing the action itself
  16. RDDs – Example Transformations ● map() ● filter() ● pipe() ● sample() ● …and more!
  17. RDDs – Example Actions ● reduce() ● count() ● take() ● saveAsTextFile() ● …and yes, more
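      A minimal sketch tying slides 15–17 together; the dataset is invented for illustration:

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "LazyDemo")
      numbers = sc.parallelize(range(1, 101))

      # Transformations are lazy: these calls only record lineage,
      # nothing is computed yet.
      evens = numbers.filter(lambda n: n % 2 == 0)
      squares = evens.map(lambda n: n * n)

      # Actions evaluate the whole pipeline.
      print(squares.count())   # 50
      print(squares.take(3))   # [4, 16, 36]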
  18. Word Count: Spark

      from pyspark import SparkContext

      logFile = "hdfs:///input"
      sc = SparkContext("spark://spark-m:7077", "WordCount")
      textFile = sc.textFile(logFile)
      wordCounts = (textFile.flatMap(lambda line: line.split())
                            .map(lambda word: (word, 1))
                            .reduceByKey(lambda a, b: a + b))
      wordCounts.saveAsTextFile("hdfs:///output")
  19. RDDs – cache() ● cache() / persist() ● When an action is performed for the first time, keep the result in memory ● Different levels of persistence available
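      A short sketch of the idea (the input path is a placeholder; the speed-up on the second action is the point):

      from pyspark import SparkContext

      sc = SparkContext("local[*]", "CacheDemo")
      words = sc.textFile("hdfs:///input").flatMap(lambda line: line.split())

      words.cache()    # mark the RDD for in-memory persistence

      words.count()    # first action: computed from source, result cached
      words.count()    # later actions: served from memory, much faster

      # persist() takes other storage levels, e.g. memory spilling to disk:
      # from pyspark import StorageLevel
      # words.persist(StorageLevel.MEMORY_AND_DISK)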
  20. Streaming ● Micro-batches (DStreams of RDDs) ● Access to other parts of Spark (MLlib, GraphX, Dataframes) ● Fault-tolerant ● Connectors for Kafka, Flume, Kinesis, ZeroMQ ● (we’ll come back to this)
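      A minimal DStream word count, assuming text arriving on a local socket (host and port are made up; any of the connectors above could stand in for the source):

      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      # At least two threads: one for the receiver, one for processing
      sc = SparkContext("local[2]", "StreamingWordCount")
      ssc = StreamingContext(sc, 1)   # 1-second micro-batches

      lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source
      counts = (lines.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
      counts.pprint()   # print a sample of each micro-batch

      ssc.start()
      ssc.awaitTermination()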
  21. Dataframes ● Spark SQL ● Support for JSON, Cassandra, SQL databases, etc. ● Easier syntax than RDDs ● Dataframes ‘borrowed’ from Python/R ● Catalyst query planner
  22. Dataframes: Example

      val sc = new SparkContext()
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      val df = sqlContext.read.json("people.json")
      df.show()
      df.filter(df("age") >= 35).show()
      df.groupBy("age").count().show()
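      The same example in PySpark, for readers following the Python thread of the talk (a sketch; people.json is the slide's example file):

      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext("local[*]", "DataframesExample")
      sqlContext = SQLContext(sc)

      df = sqlContext.read.json("people.json")
      df.show()
      df.filter(df["age"] >= 35).show()
      df.groupBy("age").count().show()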
  23. Dataframes: Catalyst ● Optimizing query planning for Spark ● Takes Dataframe operations and ‘compiles’ them down to RDD operations ● Often faster than writing RDD code manually ● Use Dataframes whenever possible (v1.4+)
  24. Dataframes: Catalyst
  25. MLlib ● Machine learning ● Includes algorithm implementations for Bayes, k-means clustering, ALS, word2vec, random forests, etc. ● Matrix operations (dense / sparse), dimensionality reduction, etc. ● And basic stats too!
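      A small taste of the RDD-based MLlib API: k-means on a toy 2-D dataset (the points and parameters are invented):

      from pyspark import SparkContext
      from pyspark.mllib.clustering import KMeans
      from pyspark.mllib.linalg import Vectors

      sc = SparkContext("local[*]", "KMeansDemo")

      # Toy dataset: two obvious clusters in the plane
      points = sc.parallelize([
          Vectors.dense([0.0, 0.0]), Vectors.dense([1.0, 1.0]),
          Vectors.dense([9.0, 8.0]), Vectors.dense([8.0, 9.0]),
      ])

      model = KMeans.train(points, k=2, maxIterations=10)
      print(model.clusterCenters)                       # roughly (0.5, 0.5) and (8.5, 8.5)
      print(model.predict(Vectors.dense([0.2, 0.3])))   # cluster id for a new point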
  26. MLlib – Pipelines ● Common interface between different ML solutions ● (still in progress, but production-ready as of 1.5) ● Pipelines are to MLlib what Dataframes are to RDDs
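      A sketch of the Pipeline idea, chaining a tokenizer, a feature hasher, and a classifier over a dataframe (the two-row training set is invented):

      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from pyspark.ml import Pipeline
      from pyspark.ml.feature import Tokenizer, HashingTF
      from pyspark.ml.classification import LogisticRegression

      sc = SparkContext("local[*]", "PipelineDemo")
      sqlContext = SQLContext(sc)

      # Toy labeled data: text documents with a binary label
      training = sqlContext.createDataFrame([
          ("spark is fast", 1.0),
          ("hadoop mapreduce", 0.0),
      ], ["text", "label"])

      # Each stage reads columns from a dataframe and appends new ones
      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      hashingTF = HashingTF(inputCol="words", outputCol="features")
      lr = LogisticRegression(maxIter=10)

      model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)
      test = sqlContext.createDataFrame([("spark is great",)], ["text"])
      model.transform(test).select("text", "prediction").show()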
  27. GraphX ● Graph-processing algorithms ● Operations on vertices and edges ● Includes the PageRank algorithm ● Can be combined with Streaming/Dataframes/MLlib
  28. Deploying Spark ● Standalone ● YARN (Hadoop ecosystem) ● Mesos (hipster ecosystem)
  29. Using Spark ● Traditional (write code, submit to cluster) ● REPL / shell (write code interactively, backed by cluster) ● Interactive notebooks (iPython/Zeppelin)
  30. Interactive Notebooks ● Log / diary approach to data science ● Type code into a web page ● Visualizations built in
  31. Interactive Notebooks ● iPython / Jupyter - most popular ● Zeppelin - built for Spark
  32. Interactive Notebooks
  33. Links ● spark.apache.org ● databricks.com ● zeppelin.incubator.apache.org ● mammothdata.com/white-papers/spark-a-modern-tool-for-big-data-applications
  34. Questions?
