Unit II Introducing Real-Time Processing Tool
Agenda
• Apache Spark, Why Apache Spark, Evolution of Apache Spark,
• Architecture of Apache Spark, Features of Apache Spark,
• Spark Deployment, Standalone, Hadoop YARN, Spark in MapReduce (SIMR),
• Components of Apache Spark, Spark core, Spark SQL, Spark Streaming, Spark
Machine Learning, Spark GraphX, Spark Shell,
• Resilient Distributed Dataset (RDD) Basic, Spark Context, RDD Transformations,
Creating RDDs, RDD Operations, Programming with RDD,
• Transformations, Actions, Lazy Evaluation, Converting between RDD Types
Apache Spark
• Apache Spark is a lightning-fast cluster computing framework
designed for real-time processing.
• Spark is an open-source project from the Apache Software Foundation.
• Spark overcomes the limitations of Hadoop MapReduce, and it
extends the MapReduce model to be efficiently used for data
processing.
• Spark is a market leader for big data processing.
• It is widely used across organizations in many ways.
• It can outperform Hadoop MapReduce by running up to 100 times faster
in memory and 10 times faster on disk.
Why Apache Spark
• Most of the technology-based companies across the globe have moved toward Apache Spark.
• They were quick to understand the real value Spark offers, such as Machine Learning
and interactive querying.
• Industry leaders such as Amazon, Huawei, and IBM have already adopted Apache Spark.
• The firms that were initially based on Hadoop, such as Hortonworks, Cloudera, and MapR, have also
moved to Apache Spark.
• Big Data Hadoop professionals surely need to learn Apache Spark since it is the next most important
technology in Hadoop data processing.
• ETL professionals, SQL professionals, and Project Managers can gain immensely if they master
Apache Spark.
• Data Scientists also need to gain in-depth knowledge of Spark to excel in their careers.
• Spark can be extensively deployed in Machine Learning scenarios.
Evolution of Apache Spark
• Before Spark, MapReduce was the dominant processing framework.
• Spark started as a research project at UC Berkeley's AMPLab in 2009.
• It was later open-sourced in 2010.
• After its release, Spark grew and moved to the Apache Software Foundation in 2013.
• Most organizations across the world have incorporated Apache Spark for empowering their Big Data
applications.
Architecture of Apache Spark
Features of Apache Spark
• Apache Spark has many features:
• Fault tolerance- designed to handle worker-node failures using the DAG and RDD lineage.
• Dynamic in nature- offers over 80 high-level operators for building parallel apps.
• Lazy evaluation- transformations are evaluated lazily: they are only added to the DAG, and results are computed once an action is called (see the sketch below).
• Real-time stream processing- a language-integrated API for stream processing.
• Speed- runs up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce, by minimizing disk read/write operations for intermediate results.
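• A minimal sketch of lazy evaluation in Scala (assumes the SparkContext sc provided by spark-shell; the data is illustrative):
val numbers = sc.parallelize(1 to 1000000)
// map and filter are transformations: they return immediately and only
// extend the DAG; no work is done on the cluster yet.
val squares = numbers.map(n => n.toLong * n)
val big = squares.filter(_ > 1000L)
// count is an action: only now does Spark schedule a job and compute the result.
println(big.count())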
Features of Apache Spark
• Reusability- the same Spark code can be used for batch processing, for joining streaming data with historical data, and for running ad-hoc queries on streaming state.
• Advanced analytics- a de facto standard for big data processing and data science across multiple industries, with machine learning and graph processing libraries.
• In-memory computing- processes tasks in memory and does not need to write intermediate results back to disk; it can cache intermediate results so they can be reused in the next iteration, and a common dataset can be shared across multiple tasks.
• Support for multiple languages- APIs are available in Java, Scala, Python, and R; advanced data-analytics features are available through the R API and Spark SQL.
• Integrated with Hadoop- integrates very well with the Hadoop file system (HDFS) and supports multiple file formats such as Parquet, JSON, CSV, ORC, and Avro (see the sketch below).
• Cost efficient- open-source software, so it does not have any licensing fee associated with it.
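• A minimal sketch of reading some of these file formats with the DataFrame reader (assumes Spark with the SQL module; the paths are hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("formats-demo").master("local[*]").getOrCreate()

val jsonDf = spark.read.json("hdfs:///data/events.json")
val csvDf = spark.read.option("header", "true").csv("hdfs:///data/events.csv")
val parquetDf = spark.read.parquet("hdfs:///data/events.parquet")
val orcDf = spark.read.orc("hdfs:///data/events.orc")

jsonDf.printSchema()   // Spark infers the schema from the JSON data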
Spark Deployment
• Apache Spark can be used together with Hadoop and Hadoop YARN.
• It can be deployed on Hadoop in three ways:
• Standalone- Spark allocates all resources, or a subset of resources, in a Hadoop cluster and runs in parallel with Hadoop MapReduce.
• YARN- with the Hadoop configuration files in place, Spark can easily read from and write to HDFS and talk to the YARN Resource Manager; it runs on YARN without any pre-installation.
• SIMR (Spark in MapReduce)- lets users start experimenting with Spark by launching it from within MapReduce, making it easy to explore further.
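• A sketch of how the deployment mode surfaces in application code via the master URL (in practice the master is usually supplied with spark-submit --master rather than hard-coded; the host name below is hypothetical):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deployment-demo")
  // .setMaster("local[*]")             // local mode, handy for testing
  // .setMaster("spark://master:7077")  // standalone cluster manager
  .setMaster("yarn")                    // Hadoop YARN
val sc = new SparkContext(conf)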
Components of Spark
• The Spark stack is made up of the following components:
Apache Spark Core-
the general execution engine for the Spark platform, on top of which all other functionality is built.
provides in-memory computing and can reference datasets stored in external storage systems.
a rich set of operators lets you write code quickly, and the same logic typically takes fewer lines in Spark's Scala API.
Spark SQL-
introduces a data abstraction called SchemaRDD (the predecessor of today's DataFrame).
SchemaRDD provides support for both structured and semi-structured data.
Spark Streaming-
enables scalable, fault-tolerant processing of live data streams on top of Spark Core.
MLlib (Machine Learning Library)-
contains a wide array of machine learning algorithms: classification, clustering, collaborative filtering, etc.
GraphX-
a library to manipulate graphs and perform graph-parallel computation.
extends the Spark RDD API with a directed (property) graph and offers numerous operators to manipulate graphs, along with common graph algorithms.
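• A minimal Spark SQL sketch using the DataFrame API that SchemaRDD evolved into (the JSON path and the name/age columns are hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

val people = spark.read.json("hdfs:///data/people.json")  // schema is inferred from the JSON
people.createOrReplaceTempView("people")

// Structured and semi-structured data can be queried with plain SQL
spark.sql("SELECT name, age FROM people WHERE age > 30").show()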
Resilient Distributed Dataset (RDD) Basic
RDDs are the main logical data units in Spark.
They are a distributed collection of objects, which are stored in memory or on disks of different machines of a cluster.
A single RDD can be divided into multiple logical partitions so that these partitions can be stored and processed on different
machines of a cluster.
RDDs are immutable (read-only) in nature.
You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, like
transformations, on an existing RDD.
An RDD in Spark can be cached and used again for future transformations, which is a huge benefit for users.
RDDs are said to be lazily evaluated, i.e., they delay the evaluation until it is really needed.
This saves a lot of time and improves efficiency.
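A minimal RDD sketch covering creation, transformation, and caching (assumes the SparkContext sc from spark-shell; the data is illustrative):
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn"))
// Coarse-grained transformations: `words` itself is never modified;
// each step derives a new RDD from the previous one.
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)
counts.cache()                        // keep the result in memory for reuse
counts.collect().foreach(println)     // action: triggers the computation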
Features of an RDD in Spark
• Here are some features of RDD in Spark:
• Resilience: RDDs track data lineage information to recover lost data automatically on failure; this is also
called fault tolerance.
• Distributed: Data present in an RDD resides on multiple nodes. It is distributed across different nodes of
a cluster.
• Lazy evaluation: Data does not get loaded into an RDD even when you define it. Transformations are actually
computed when you call an action, such as count or collect, or when you save the output to a file system.
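• A small sketch of the lineage Spark tracks for resilience (assumes the SparkContext sc; toDebugString prints the chain of parent RDDs that would be replayed to rebuild a lost partition):
val data = sc.parallelize(1 to 100)
val evens = data.filter(_ % 2 == 0)
val scaled = evens.map(_ * 10)
println(scaled.toDebugString)   // prints the chain of parent RDDs (the lineage)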
Features of an RDD in Spark
• Here are some features of RDD in Spark:
• Immutability: Data stored in an RDD is read-only; you cannot edit
the data which is present in the RDD. But you can create new RDDs by
performing transformations on the existing RDDs.
• In-memory computation: An RDD keeps intermediate data in memory
(RAM) rather than on disk, which provides faster access.
• Partitioning: Any existing RDD can be split into logical partitions. A different
partitioning is obtained by applying transformations (such as repartition) that
produce new RDDs from the existing one.
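• A brief partitioning sketch (assumes the SparkContext sc; the partition counts are illustrative):
val rdd = sc.parallelize(1 to 1000, numSlices = 8)   // request 8 partitions
println(rdd.getNumPartitions)                        // 8
// "Changing" the partitioning actually builds new RDDs; the original is untouched.
val fewer = rdd.coalesce(2)
val more = rdd.repartition(16)
println(s"${fewer.getNumPartitions} and ${more.getNumPartitions}")   // 2 and 16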
RDD abstraction
• Resilient Distributed Datasets
• partitioned collection of records
• spread across the cluster
• read-only
• caching dataset in memory
– different storage levels available
– fallback to disk possible
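• A caching sketch with an explicit storage level (assumes the SparkContext sc; the path is hypothetical):
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app.log")
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk,
// which is the fallback-to-disk behaviour mentioned above.
logs.persist(StorageLevel.MEMORY_AND_DISK)
logs.count()   // first action materializes and caches the data
logs.count()   // subsequent actions read from the cache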
RDD operations
• transformations to build RDDs through
deterministic operations on other RDDs
– transformations include map, filter, join
– lazy operation
• actions to return value or export data
– actions include count, collect, save
– triggers execution
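• A sketch combining the transformations and actions listed above (assumes the SparkContext sc; the data and the output path are hypothetical):
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (2, "home")))

// transformations: lazily build new RDDs
val joined = users.join(visits)                                   // (id, (name, page))
val homeOnly = joined.filter { case (_, (_, page)) => page == "home" }
val names = homeOnly.map { case (_, (name, _)) => name }

// actions: trigger execution and return or export data
println(names.count())
names.collect().foreach(println)
names.saveAsTextFile("hdfs:///out/home-visitors")                 // hypothetical output path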
Job example
val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))      // transformation on the loaded log
errors.cache()                                    // keep the filtered RDD in memory
errors.filter(_.contains("I/O")).count()          // action: triggers the job
errors.filter(_.contains("timeout")).count()      // action: reuses the cached RDD
(Diagram: the driver coordinates three workers; each worker reads its HDFS block of the log file (Block1–Block3), and once the first action runs, each worker keeps its filtered partition in its local cache.)
RDD partition-level view
(Diagram – dataset-level view: log is a HadoopRDD with path = hdfs://...; errors is a FilteredRDD with func = _.contains(…) and shouldCache = true. Partition-level view: each partition of the RDD is processed by its own task (Task 1, Task 2, ...).)
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Job scheduling
rdd1.join(rdd2)
.groupBy(…)
.filter(…)
• RDD Objects: build the operator DAG
• DAGScheduler: splits the DAG into stages of tasks and submits each stage as it becomes ready
• TaskScheduler: launches the tasks of each TaskSet via the cluster manager and retries failed or straggling tasks
• Worker: executes tasks in threads; a Block manager stores and serves blocks
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Available APIs
• You can write in Java, Scala or Python
• interactive interpreter: Scala & Python only
• standalone applications: any
• performance: Java & Scala are faster thanks to
static typing
Hands-on - interpreter
• script: http://cern.ch/kacper/spark.txt
• run the Scala Spark interpreter:
$ spark-shell
• or the Python interpreter:
$ pyspark
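• Once the shell is up, a one-liner like the following (typed at the scala> prompt, where sc is the pre-created SparkContext) confirms everything works; the data is illustrative:
val squares = sc.parallelize(1 to 10).map(n => n * n)
println(squares.collect().mkString(", "))   // 1, 4, 9, ..., 100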
Hands-on – build and submission
• download and unpack the source code:
wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz
• build definition in:
GvaWeather/gvaweather.sbt
• source code:
GvaWeather/src/main/scala/GvaWeather.scala
• building:
cd GvaWeather
sbt package
• job submission:
spark-submit --master local --class GvaWeather target/scala-2.10/gva-weather_2.10-1.0.jar
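• The slides do not show the contents of GvaWeather.scala; a hypothetical minimal skeleton matching this build layout (Scala 2.10 / Spark 1.x-era, as the artifact name suggests) might look like:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical skeleton only: the real GvaWeather source is not shown in the slides.
object GvaWeather {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GvaWeather")   // the master is supplied by spark-submit
    val sc = new SparkContext(conf)

    // ... load and process the weather data here (not shown in the slides) ...

    sc.stop()
  }
}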
Summary
• the concept is not limited to single-pass map-reduce
• avoids storing intermediate results on disk or HDFS
• speeds up computations when datasets are reused
