5 Things One Must Know About Spark!
www.edureka.co/apache-spark-scala-training
Slide 2
Agenda
By the end of this webinar, you will know about:
• #1: Low Latency
• #2: Streaming Support
• #3: Machine Learning and Graph
• #4: DataFrame API Introduction
• #5: Spark Integration with Hadoop
Slide 3
Spark Architecture
• MLlib: machine learning library
• GraphX: graph programming
• Spark SQL: Spark interface for RDBMS users
• Spark Streaming: utility for continuous ingestion of data
Slide 4
Low Latency
Slide 5
Spark keeps data in the memory of its distributed workers, allowing for significantly faster, lower-latency computations, whereas MapReduce keeps shuffling data in and out of disk.
Spark Cuts Down Read/Write I/O to Disk
Spark works well for data that fits in memory, and can spill to disk when it does not.
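To make the difference concrete, here is a minimal Scala sketch (not from the deck; it assumes a live SparkContext sc, as in the Spark shell, and a placeholder input path) showing how cache() keeps an RDD in worker memory so that repeated actions avoid re-reading from disk:

    // Hypothetical input path -- substitute a real dataset.
    val logs = sc.textFile("hdfs:///data/access.log")

    // cache() marks the filtered RDD for in-memory storage on the workers.
    val errors = logs.filter(_.contains("ERROR")).cache()

    println(errors.count())  // first action: reads from HDFS and populates the cache
    println(errors.filter(_.contains("timeout")).count())  // second action: served from memory

In MapReduce, each of those two counts would be a separate job re-reading the input from disk; here the second pass runs against the cached partitions.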
Slide 6
How Fast Can a System Sort 100 TB of Data?
• The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2,100 nodes.
• Using Spark on 206 EC2 nodes, the benchmark was completed in 23 minutes.
• Spark sorted the same data 3x faster using 10x fewer machines.
• All the sorting took place on disk (HDFS), without using Spark's in-memory cache.
Slide 7
Spark's Sort Benchmark Entry
• 2014 record: 4.27 TB/min (100 TB in 1,406 seconds)
• 207 Amazon EC2 i2.8xlarge nodes (32 vCores, 2.5 GHz Intel Xeon E5-2670 v2, 244 GB memory, 8 x 800 GB SSD)
• Team: Reynold Xin, Parviz Deyhim, Xiangrui Meng, Ali Ghodsi, Matei Zaharia
• Courtesy: sortbenchmark.org/
Slide 8
Streaming Support
Slide 9
Event Processing
• Spark Streaming is used for processing real-time streaming data.
• It uses the DStream, a series of RDDs, to process real-time data and supports streaming analytics reasonably well.
• The Spark Streaming API closely matches that of Spark Core, as the sketch below shows.
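As an illustration, here is the canonical network word count in Scala, a sketch assuming a live SparkContext sc and a text source on localhost:9999 (host, port, and batch interval are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))  // 5-second micro-batches

    // Each batch of lines arrives as an RDD inside the DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()

Note how flatMap, map, and reduceByKey mirror the Spark Core RDD API, which is exactly the point above.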
Slide 10
Machine Learning and Graph Implementation with DAG
Slide 11
Machine Learning
• MLlib, a machine learning library, provides classification, regression, clustering, collaborative filtering, and so on.
• Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering.
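For example, a minimal MLlib k-means sketch in Scala (assuming a live SparkContext sc and a placeholder input file of space-separated numeric points, one per line):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse each line into a dense vector and keep the parsed data in memory.
    val points = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    val model = KMeans.train(points, 3, 20)  // 3 clusters, 20 iterations
    model.clusterCenters.foreach(println)
    println(s"Within-set sum of squared errors: ${model.computeCost(points)}")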
Slide 12
Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
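You can inspect the DAG Spark builds before anything executes: transformations are lazy, and toDebugString prints the lineage. A small Scala sketch (assuming a live SparkContext sc; the input path is a placeholder):

    // This chain only builds the DAG -- no data is read yet.
    val wordCounts = sc.textFile("data/input.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(wordCounts.toDebugString)  // prints the lineage and stage boundaries
    wordCounts.count()                 // the action triggers execution of the optimized DAG

The narrow transformations (flatMap, map) are pipelined into a single stage; the shuffle required by reduceByKey starts a new one.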
Slide 13
GraphX
• Component for graphs and graph-parallel computation
• Extends the Spark RDD by introducing a new Graph abstraction
• Graph algorithms: PageRank, Connected Components, Triangle Counting
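A brief GraphX sketch in Scala, assuming a live SparkContext sc and an edge-list file with one "srcId dstId" pair per line (the path is a placeholder):

    import org.apache.spark.graphx.GraphLoader

    // Load a graph from the (assumed) edge-list file.
    val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")

    // Run PageRank until convergence within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    ranks.top(5)(Ordering.by(_._2)).foreach(println)  // five highest-ranked vertices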
Slide 14
Support for Data Frames
Slide 15
DataFrame
• As Spark continues to grow, it aims to enable wider audiences beyond "big data" engineers to leverage the power of distributed processing.
• Inspired by data frames in R and Python (pandas).
• The DataFrames API is designed to make big data processing on tabular data easier.
• A DataFrame is a distributed collection of data organized into named columns.
• Provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
• Can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
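A short Scala sketch of the DataFrame API in the Spark 1.x style (assuming a live SparkContext sc and a placeholder JSON file of records with "name" and "age" fields):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("data/people.json")  // assumed input file

    df.printSchema()
    df.filter(df("age") > 21).groupBy("age").count().show()

    // The same DataFrame can also be queried with Spark SQL:
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()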
Slide 16
DataFrame Features
• Ability to scale from kilobytes to petabytes
• Support for a wide array of data formats and storage systems
• State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via SparkR)
Slide 17
Spark can use HDFS
Spark can use YARN
Slide 18
Spark Execution Platforms
• Spark can leverage the resource negotiator of the Hadoop framework, i.e., YARN.
• Spark workloads can make use of Symphony scheduling policies and execute via YARN.
• Spark execution modes: Standalone, Mesos, and YARN.
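For instance, submitting a Spark application to a YARN cluster looks roughly like this (a sketch: the class name, jar, and resource sizes are placeholders; Spark 1.x used --master yarn-cluster, while later versions use --master yarn --deploy-mode cluster):

    spark-submit \
      --class com.example.MyApp \
      --master yarn-cluster \
      --num-executors 10 \
      --executor-memory 4G \
      my-app.jar

YARN then negotiates containers for the Spark executors alongside other Hadoop workloads.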
Slide 19
Spark Features/Modules In Demand
Source: Typesafe
Slide 20
New Features in 2015
Data Frames
• Similar API to data frames in R and pandas
• Automatically optimized via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs, and the ML library in R
Machine Learning Pipelines (see the sketch below)
• High-level API
• Featurization
• Evaluation
• Model tuning
External Data Sources
• Platform API to plug data sources into Spark
• Pushes logic into sources
Source: Databricks
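As a taste of the Machine Learning Pipelines API, here is a hedged Scala sketch of a text-classification pipeline (the training DataFrame, with "text" and "label" columns, is an assumption):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // Chain featurization and model fitting into a single pipeline.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)  // 'training' is an assumed DataFrame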
Slide 21
Spark Overview
Slide 22
Questions

Editor's Notes

• #16: A possible hands-on demo with DataFrames: load data into Spark DataFrames, then explore the data with Spark SQL. Reference: https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data#.VdxJofmqqko
• #21: http://www.information-management.com/gallery/Big-Data-Hadoop-2015-Predictions-Forrester-10026357-1.html and https://www.forrester.com/Predictions+2015+Hadoop+Will+Become+A+Cornerstone+Of+Your+Business+Technology+Agenda/fulltext/-/E-RES117705