Apache Spark Components
Spark Streaming | Spark SQL | MLlib | GraphX
Girish Khanzode
Contents
• Spark Streaming
– Micro batch
– Stateful Stream Processing
– DStream
– Socket Stream
– File Stream
• Spark SQL
– DataFrame
– DataFrame API
– Supported Data Formats and Sources
– Plan Optimization & Execution
– Rules Based Optimization
• MLlib
– Algorithms
– Key Features
– Pipeline
• GraphX
– Property Graph
– Graph Views
– Triplet View
– Subgraph
– Distributed Graph Representation
• Graph Algorithms
• References
Spark Streaming
A framework for large scale stream processing
Spark Streaming
• Extends Spark for big data stream processing
• Efficient, fault-tolerant, stateful stream processing of live stream data
• Integrates with Spark’s batch and interactive processing
• Scales to hundreds of nodes
• Can achieve latencies on the scale of seconds
Spark Streaming
• Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
• Simple batch-like API to implement complex algorithms
• Integrates with other Spark extensions
• Started in 2012; alpha released with Spark 0.7 in 2013; released with Spark 0.9 in 2014
Need for Spark Streaming
• Existing frameworks can either
– Stream process 100s of MBs with low latency
– Batch process TBs of data with high latency
• Painful to maintain two different stacks
– Different programming models
– Doubles implementation effort
Need for Spark Streaming
• Many applications must process large streams of live data and provide
results in near-real-time
– Social network trends
– Website statistics
– Intrusion detection systems
• Many environments require processing same data in live streaming as
well as batch post-processing
Micro batch
• Spark Streaming is a fast batch processing system
• Spark Streaming collects stream data into small batches and runs batch processing on them
• A batch can be as small as 1 second or as large as multiple hours
• Spark job creation and execution overhead is so low that it can do all this in under a second
• These batches are called DStreams
Stateful Stream Processing
• Traditional streaming systems have an event-driven record-at-a-time
processing model
– Each node has mutable state
– For each record, update state & send new records
• State is lost if node dies
• Making stateful stream processing fault-tolerant is a challenge
Streaming System - Storm
• Replays record if not processed by a node
• Processes each record at least once
• May update mutable state twice
• Mutable state can be lost due to failure
Streaming System - Trident
• Uses transactions to update state
• Processes each record exactly once
• Per-state transaction updates are slow
Spark Streaming
• Runs a streaming computation as a series of very small deterministic
batch jobs
• Splits the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD
operations
• Processed results of RDD operations are returned in batches
High Level View
Spark Streaming
• Runs as a series of small (~1 s) batch jobs, keeping state in memory as
fault-tolerant RDDs
• Batch sizes as low as 0.5 second, latency ~ 1 second
• Potential for combining batch processing and streaming processing in the
same system
• Result: can process 42 million records/second (4 GB/s) on 100 nodes at
sub-second latency
Spark Streaming
    tweetStream
      .flatMap(_.toLower.split(" "))
      .map(word => (word, 1))
      .reduceByWindow("5s", _ + _)

[Figure: DStream batches at T=1, T=2, … flowing through map and reduceByWindow]
Streaming
• Creates RDDs from stream source on a defined interval
• Same operations as on normal RDDs
• Supports a variety of sources
• Exactly-once message guarantee
Discretized Stream - DStream
• Basic abstraction provided by Spark Streaming
• The input stream is divided into multiple discrete batches
• Represents a stream of data
• Implemented as a sequence of RDDs
• Each batch of a DStream is represented as an RDD underneath
Discretized Stream - DStream
• These RDDs are replicated in the cluster for fault tolerance
• Every DStream operation results in an RDD transformation
• APIs are provided to access these RDDs directly
• Can combine stream and batch processing
• Configurable intervals - 1 second, 5 seconds, 5 minutes, etc.
Discretized Stream - DStream
DStream transformation
    val ssc = new StreamingContext(args(0), "wordcount", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 50050)
    val words = lines.flatMap(_.split(" "))
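Completed into a runnable program, this becomes a minimal streaming word count (a sketch; the master URL, host and port are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // 5-second micro batches; "local[2]" keeps one core free for the receiver
    val ssc = new StreamingContext("local[2]", "wordcount", Seconds(5))
    val lines = ssc.socketTextStream("localhost", 50050)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()     // emit the counts computed for each batch
    ssc.start()            // nothing runs until start() is called
    ssc.awaitTermination()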
Socket Stream
• Ability to listen to any socket on remote machines
• Need to configure host and port
• Both raw and text representations of the socket are available
• Built-in retry mechanism
File Stream
• Allows tracking new files in a given directory on HDFS
• Whenever a new file appears, Spark Streaming picks it up
• Only works for new files; modifications to existing files are not considered
• Files are tracked using their creation time (see the sketch below)
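A minimal sketch (the directory path is a placeholder):

    // each batch's RDD holds the lines of files that appeared in that interval
    val logLines = ssc.textFileStream("hdfs://namenode:8020/logs/incoming")
    logLines.count().print()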
Receiver Architecture
Stateful Operations
• Ability to maintain arbitrary state across multiple batches
• Fault tolerant
• Exactly-once semantics
• WAL (Write-Ahead Log) protects against receiver crashes
How Stateful Operations Work?
• Generally, updating state is a mutable operation
• But in functional programming, state is represented with a state machine going from one state to another
• fn(oldState, newInfo) => newState
• In Spark, state is represented using RDDs
• A change in state is represented as a transformation of RDDs
• The fault tolerance of RDDs provides the fault tolerance of state (see the sketch below)
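As a sketch, fn(oldState, newInfo) => newState maps directly onto updateStateByKey, here keeping a running word count across batches (checkpointing must be enabled; the path is a placeholder):

    ssc.checkpoint("hdfs://namenode:8020/checkpoints")

    // newValues: counts seen in this batch; oldState: the running total so far
    val updateCount = (newValues: Seq[Int], oldState: Option[Int]) =>
      Some(oldState.getOrElse(0) + newValues.sum)

    val runningCounts = words.map(word => (word, 1)).updateStateByKey(updateCount)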
Transform API
• In stream processing, ability to combine stream data with batch data is
extremely important
• Both batch API and stream API share RDD as abstraction
• The transform API of a DStream allows us to access the underlying RDDs directly
• Example - Combine customer sales data with customer information (see the sketch below)
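A sketch of that example (the stream, its schema and the customer data are assumptions):

    import org.apache.spark.rdd.RDD

    // static batch data: customerId -> name
    val customers: RDD[(String, String)] =
      sc.parallelize(Seq(("c1", "Alice"), ("c2", "Bob")))

    // salesStream: DStream[(customerId, amount)]; transform exposes each
    // micro batch as an RDD so it can be joined with the static RDD
    val salesWithNames = salesStream.transform(salesRdd => salesRdd.join(customers))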
Window Based Operations
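A sketch of a windowed count: word counts over the last 30 seconds, recomputed every 10 seconds (both durations must be multiples of the batch interval):

    val windowedCounts = words.map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))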
DStream Inputs
• A DStream is created from
– streaming input sources
– applying transformations on existing DStreams
• Basic input sources
– Built-in - file system, socket
– Non-built-in - Avro, CSV, …
– Unreliable
[Figure: Ingest → Transform → Output]
DStream Inputs
• Advanced input sources
– Twitter, Kafka, Flume, Kinesis, MQTT, ….
– Need external library
– Reliable or unreliable
• Custom input DStream - implement two classes
– Input DStream
– Receiver
• Reliable Receiver
• Unreliable Receiver
DStream Creation via Transformation
• Data is collected, buffered and replicated by a receiver (one per DStream) and then pushed to a stream as small RDDs
• Transformations modify data from one DStream to another
• Classifications
– Standard RDD operations – map, countByValue, reduceByKey, join,…
– Stateful operations – window, updateStateByKey, transform,
countByValueAndWindow, …
DStream Creation via Transformation
Comparison with Storm
• Higher throughput than Storm
– Spark Streaming - 670k records/sec/node
– Storm - 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep - Spark vs. Storm]
SPARK SQL
Apache Spark Data Access
Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Integrated with the Spark stack
• Supports querying data either via SQL or via the Hive Query Language
• Grew out of Shark, the port of Apache Hive to run on top of Spark (in place of MapReduce)
• Can weave SQL queries with code transformations
Spark SQL
• Capability to expose Spark datasets over the JDBC API and run SQL-like queries on Spark data using traditional BI and visualization tools
• Allows users to ETL their data from different formats like JSON, Parquet or a database, transform it, and expose it for ad-hoc querying
• Bindings in Python, Scala, and Java
SQL Access to Structured Data
• Existing RDDs
• Hive warehouses (uses existing metastore, SerDes and UDFs)
• JDBC/ODBC - use existing BI tools to query large datasets
DataFrame
• A distributed collection of data rows organized into named columns
• An abstraction for selecting, filtering, aggregating and plotting structured data
• Conceptually equivalent to a table in a relational database or a data frame
in R/Python, but with richer optimizations under the hood
• Constructed from sources
– Structured data files
– Hive tables
– External databases
– Existing RDDs
DataFrame Internals
• Internally represented as a logical plan
• Lazy execution - computation only happens when an action (display
result, save output) is required
– Allows executions to be optimized by applying techniques such as predicate
push-downs and bytecode generation
• All DataFrame operations are also automatically parallelized and
distributed on clusters
DataFrame Construction - Python code
    # Construct a DataFrame from the users table in Hive
    users = context.table("users")

    # Load from JSON files in S3
    logs = context.load("s3n://path/to/data.json", "json")
• DataFrames provide a domain-specific language for distributed data
manipulation
Using DataFrames
    # Create a new DataFrame that contains "young users" only
    young = users.filter(users.age < 21)

    # Alternatively, using Pandas-like syntax
    young = users[users.age < 21]

    # Increment everybody's age by 1
    young.select(young.name, young.age + 1)
Using DataFrames
    # Count the number of young users by gender
    young.groupBy("gender").count()

    # Join young users with another DataFrame called logs
    young.join(logs, logs.userId == users.userId, "left_outer")

    # SQL using Spark SQL - count the users in the young DataFrame
    young.registerTempTable("young")
    context.sql("SELECT count(*) FROM young")
Spark and Pandas - Conversion
    # Convert a Spark DataFrame to Pandas
    pandas_df = young.toPandas()

    # Create a Spark DataFrame from Pandas
    spark_df = context.createDataFrame(pandas_df)
DataFrame API
• Common operations can be expressed as calls to the DataFrame API
– Selecting required columns
– Joining different data sources
– Aggregation (count, sum, average, etc)
– Filtering
Supported Data Formats and Sources
1. JSON files
2. Parquet files
3. Hive tables
4. Local file systems
5. Distributed file systems (HDFS)
6. Cloud storage (S3)
7. External RDBMS via JDBC
8. Extend DataFrames through Spark SQL's external data sources API to support any third-party data formats or sources
9. Existing third-party extensions - Avro, CSV, ElasticSearch, and Cassandra
Combine Multiple Sources
• Join a site’s textual traffic log stored in S3 with a PostgreSQL database to
count the number of times each user has visited the site
    users = context.jdbc("jdbc:postgresql:production", "users")
    logs = context.load("/path/to/traffic.log")
    logs.join(users, logs.userId == users.userId, "left_outer") \
        .groupBy("userId").agg({"*": "count"})
Automatic Mechanisms to Read Less Data
• Converting to more efficient formats
• Using columnar formats (Parquet)
• Using partitioning (/year=2014/month=02/…)
• Skipping data using statistics (min, max...)
• Pushing predicates into storage systems (JDBC)
Intelligent Optimization and Code Generation
• DataFrames in Spark have their execution automatically optimized by a
query optimizer
• Before any computation on a DataFrame starts, the Catalyst optimizer
compiles the operations that were used to build the DataFrame into a
physical plan for execution
• Because the optimizer understands the semantics of operations and
structure of the data, it can make intelligent decisions to speed up
computation
Intelligent Optimization and Code Generation
• At a high level, there are two types of optimizations - logical and physical
• Catalyst applies logical optimizations such as predicate pushdown
• The optimizer can push filter predicates down into the data source,
enabling the physical execution to skip irrelevant data
• In the case of Parquet files, entire blocks can be skipped and comparisons
on strings can be turned into cheaper integer comparisons via dictionary
encoding
Intelligent Optimization and Code Generation
• In the case of relational databases, predicates are pushed down into the
external databases to reduce the amount of data traffic
• Catalyst compiles operations into physical plans for execution and
generates JVM bytecode for those plans that is often more optimized
than hand-written code
• It can choose intelligently between broadcast joins and shuffle joins to
reduce network traffic
Intelligent Optimization and Code Generation
• It can also perform lower level optimizations such as eliminating
expensive object allocations and reducing virtual function calls
• Existing Spark programs see performance improvements when they migrate to DataFrames
• Since the optimizer generates JVM bytecode for execution, Python users
experience the same high performance as Scala and Java users
Plan Optimization & Execution
DataFrames and SQL share the same
optimization/execution pipeline
SQL Execution Plans
• Logical and Physical query plans
– Both are trees representing query evaluation
– Internal nodes are operators over the data
– Logical plan is higher-level and algebraic
– Physical plan is lower-level and operational
• Logical plan operators
– Correspond to query language constructs
– Conceptually describe what operation needs to be performed
• Physical plan operators
– Correspond to implemented access methods
– Physically implement the operation described by logical operators
Binding & Analyzing
Unresolved Logical
Plan
Logical Plan
SQLText
Optimized Logical
Plan
Physical Plan
Parsing
Optimizing
Query Planning
Query Example
SELECT name
FROM (
SELECT id, name
FROM people ) p
WHERE p.id = 1
Naive Query Planning
[Figure: naive query plan for the query above]
Optimized Execution
• Writing imperative code to optimize
all possible patterns is hard
• Instead opt for simpler rules
– Each rule makes single change
– Run multiple rules together to
fixed points
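A minimal sketch of the idea (not Catalyst's actual API): each rule makes one small change, and the rule set is reapplied until the plan stops changing.

    trait Rule[Plan] { def apply(plan: Plan): Plan }

    @annotation.tailrec
    def toFixedPoint[Plan](rules: Seq[Rule[Plan]], plan: Plan): Plan = {
      // apply every rule once, in order
      val next = rules.foldLeft(plan)((p, rule) => rule(p))
      // stop when a full pass changes nothing - a fixed point
      if (next == plan) plan else toFixedPoint(rules, next)
    }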
Rules Based Optimization
Performance Comparison
MLLIB
Apache Spark Machine Learning Library
MLlib
Algorithms
Key Features
• Low level library in Spark
• Built-in data analysis workflow
• Free performance gains
• Scalable
• Python, Scala, Java APIs
• Broad coverage of applications & algorithms
• Rapid improvements in speed & robustness
• Easy to use
• Integrated workflow
Functionality
• Classes for common operations
• Scaling, normalization, statistical summary, correlation …
• Numeric RDD operations, sampling …
• Random generators
• Word extraction (TF-IDF)
– generating feature vectors from text documents / web pages
Speed Improvements
Sample Code
Linear Regression Example
• Method run() trains the model
• Parameters are set with the setters setNumIterations and setIntercept
• The Stochastic Gradient Descent (SGD) algorithm is used to minimize the cost function (see the sketch below)
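The slide's sample code is not preserved; a minimal sketch consistent with the description (assuming sc is a SparkContext and each input line is "label,feature1,feature2,..."):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val trainingData = sc.textFile("data.csv").map { line =>
      val values = line.split(',').map(_.toDouble)
      LabeledPoint(values.head, Vectors.dense(values.tail))
    }.cache()

    val lr = new LinearRegressionWithSGD()
    lr.setIntercept(true)                  // fit an intercept term
    lr.optimizer.setNumIterations(100)     // SGD iterations
    val model = lr.run(trainingData)       // run() trains the model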
Typical ML Workflow
Pipeline API
• Pipeline is a series of algorithms (feature transformation, model fitting, ...)
• Easy workflow construction
• Distribution of parameters into each stage
• MLlib is easier to use
• Uses a uniform dataset representation - SchemaRDD from Spark SQL
– multiple named columns (similar to SQL table)
Pipeline
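A minimal spark.ml pipeline sketch (assuming training is a dataset with "text" and "label" columns):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // three stages: two feature transformations followed by model fitting
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)   // fits the whole workflow at once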
GRAPHX
Apache Spark GraphAPI
Graphs are Everywhere
• Social Networks
• Web Graphs
• User-Item Graphs
GraphX
• New API that blurs the distinction between graphs and tables
• Unifies data-parallel and graph-parallel systems
• Spark API for graphs
– Web-Graphs and Social Networks
– graph-parallel computation like PageRank and Collaborative Filtering
GraphX
• Extends Spark RDD abstraction using Resilient Distributed Property
Graph - a directed multi-graph with properties attached to each vertex
and edge
• Exposes fundamental operators like subgraph, joinVertices, and
mapReduceTriplets for graph computation
• Includes graph algorithms and builders for graph analytics tasks
Enabling Cross-World Manipulation
Unifying Data-Parallel and Graph-Parallel Analytics
• Tables and Graphs are composable views of the same physical data
• Each view has its own operators that exploit the semantics of the view to
achieve efficient execution
Property Graph
• A directed graph with potentially multiple parallel edges sharing the
same source and destination vertex with properties attached to each
vertex and edge
• Each vertex is keyed by a unique 64-bit long identifier (VertexID)
• Edges have corresponding source and destination vertex identifiers
• Properties are stored as Scala/Java objects with each edge and vertex in
the graph
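As a sketch, such a graph can be built from two RDDs (assuming sc is a SparkContext; the data is illustrative):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // vertices keyed by 64-bit ids, carrying a user name as the property
    val users: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

    // directed edges carrying a relationship label as the property
    val relationships: RDD[Edge[String]] = sc.parallelize(
      Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "likes")))

    val graph: Graph[String, String] = Graph(users, relationships)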
Property Graph
• Vertex Property
– User Profile
– Current PageRank Value
• Edge Property
– Weights
– Relationships
– Timestamps
Property Graph
• Constructed from raw files, RDDs and synthetic generators
• Immutable, distributed, and fault-tolerant
• Changes to the values or structure of the graph are accomplished by producing a
new graph with the desired changes
• Parts of the original graph (unaffected structure, attributes, and indices) are
reused in the new graph
• Each partition of the graph can be recreated on a different machine in the event
of a failure
• Represented using two Spark RDDs
– Vertex collection: VertexRDD
– Edge collection: EdgeRDD
Graph Views
• Graph class contains members graph.vertices and graph.edges to access
the vertices and edges of the graph
• These members extend RDD[(VertexId,V)] and RDD[Edge[E]]
• Are backed by optimized representations that leverage the internal
GraphX representation of graph data
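Continuing the sketch above, the views behave like ordinary RDDs:

    val numUsers = graph.vertices.count()
    val numFollows = graph.edges.filter(e => e.attr == "follows").count()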
Triplet View
• Triplets operator joins vertices and edges
• Logically joins the vertex and edge properties yielding an RDD[EdgeTriplet[VD,
ED]] containing instances of the EdgeTriplet class
• This join can be read as edges joined with their source and destination vertices (see the sketch below)
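Continuing the sketch, each EdgeTriplet exposes srcAttr, attr and dstAttr after the join:

    graph.triplets
      .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")   // e.g. "alice follows bob"
      .collect()
      .foreach(println)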
Triplet View
Subgraph
• Operator that takes vertex and edge predicates and returns the graph
containing only the vertices that satisfy the vertex predicate (evaluate to
true) and edges that satisfy the edge predicate and connect vertices that
satisfy the vertex predicate
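Continuing the sketch, keep only "follows" edges among vertices other than "bob"; edges touching a filtered-out vertex are dropped automatically:

    val sub = graph.subgraph(
      epred = triplet => triplet.attr == "follows",
      vpred = (id, name) => name != "bob")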
Distributed Graph Representation
• Graphs are represented using two RDDs
– edge collection
– vertex collection
• Vertex-cut partitioning
Distributed Graph Representation
• Each vertex partition contains a bitmask and routing table
• Routing table - a logical map from a vertex id to the set of edge partitions
that contains adjacent edges
• Bitmask - enables set intersection and filtering
– Vertex bitmasks are updated after each operation (e.g. mapReduceTriplets)
– Vertices hidden by the bitmask do not participate in graph operations
Graph Algorithms
• Collaborative Filtering
– Alternating Least Squares
– Stochastic Gradient Descent
– Tensor Factorization
• Structured Prediction
– Loopy Belief Propagation
– Max-Product Linear Programs
– Gibbs Sampling
• Semi-supervised ML
– Graph SSL
– CoEM
• Community Detection
– Triangle Counting
– K-core Decomposition
– K-Truss
• Graph Analytics
– PageRank
– Personalized PageRank
– Shortest Path
– Graph Coloring
• Classification
– Neural Networks
References
1. http://spark.apache.org/graphx
2. http://spark.apache.org/streaming/
3. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf
4. http://web.stanford.edu/class/cs346/qpnotes.html
5. https://github.com/apache/spark/tree/master/sql
6. M. Zaharia and M. Chowdhury. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Technical Report UCB/EECS-2011-82, July 2011
7. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013
8. K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013
9. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013
10. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011
11. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011
Thank You
Visit My LinkedIn Profile
https://in.linkedin.com/in/girishkhanzode
More Related Content

What's hot

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Apache spark
Apache sparkApache spark
Apache spark
shima jafari
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Sqoop
SqoopSqoop
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
Edureka!
 

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Apache spark
Apache sparkApache spark
Apache spark
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Sqoop
SqoopSqoop
Sqoop
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 

Viewers also liked

Adventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and WindowsAdventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and Windows
Aljoscha Krettek
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Alessandro Menabò
 
Ayasdi strata
Ayasdi strataAyasdi strata
Ayasdi strata
Alpine Data
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
Wes McKinney
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
Wes McKinney
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Gabriele Modena
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
Kristian Alexander
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
Jongwook Woo
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
Alpine Data
 
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksBig Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Data Con LA
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Chris Fregly
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
Data Con LA
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
T212
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Fluentd and Kafka
Fluentd and KafkaFluentd and Kafka
Fluentd and Kafka
N Masahiro
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
Knoldus Inc.
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Girish Khanzode
 

Viewers also liked (20)

Adventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and WindowsAdventures in Timespace - How Apache Flink Handles Time and Windows
Adventures in Timespace - How Apache Flink Handles Time and Windows
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Ayasdi strata
Ayasdi strataAyasdi strata
Ayasdi strata
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
 
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ DatabricksBig Data Day LA 2016 Keynote - Reynold Xin/ Databricks
Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Fluentd and Kafka
Fluentd and KafkaFluentd and Kafka
Fluentd and Kafka
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

Similar to Apache Spark Components

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
Oh Chan Kwon
 
Spark cep
Spark cepSpark cep
Spark cep
Byungjin Kim
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Guido Schmutz
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
Hari Shreedharan
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
DataWorks Summit
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
Thisara Pramuditha
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
Hadoop
HadoopHadoop
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
Avi Levi
 

Similar to Apache Spark Components (20)

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Spark cep
Spark cepSpark cep
Spark cep
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Streamsets and spark
Streamsets and sparkStreamsets and spark
Streamsets and spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Hadoop
HadoopHadoop
Hadoop
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Spark streaming high level overview
Spark streaming high level overviewSpark streaming high level overview
Spark streaming high level overview
 

More from Girish Khanzode

Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
Girish Khanzode
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
Girish Khanzode
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Girish Khanzode
 
Language R
Language RLanguage R
Language R
Girish Khanzode
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
Girish Khanzode
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
Girish Khanzode
 

More from Girish Khanzode (10)

Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
IR
IRIR
IR
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
 

Recently uploaded

May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 

Recently uploaded (20)

May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 

Apache Spark Components

  • 1. Apache Spark Components Spark Streaming | Spark SQL | MLlib | GraphX Girish Khanzode
  • 2. Contents • Spark Streaming – Micro batch – Stateful Stream Processing – DStream – Socket Stream – File Stream • Spark SQL – DataFrame – DataFrameAPI – Supported Data Formats and Sources – Plan Optimization & Execution – Rules Based Optimization • MLlib – Algorithms – Key Features – Pipeline • GraphX – PropertyGraph – GraphViews – TripletView – Subgraph – DistributedGraph Representation • Graph Algorithms • References
  • 3. Spark Streaming A framework for large scale stream processing
  • 5. Spark Streaming • Extends Spark for big data stream processing • Efficient, fault-tolerant, stateful stream processing of live stream data • Integrates with Spark’s batch and interactive processing • Scales to hundreds of nodes • Can achieve latencies on scale of seconds
  • 6. Spark Streaming • Can absorb live data streams from Kafka, Flume, ZeroMQ etc • Simple Batch likeAPI to implement complex algorithms • Integrates with other Spark extensions • Started in 2012, alpha released with Spark 0.7 in 2013, released with Spark 0.9 in 2014
  • 7. Need for Spark Streaming • Existing frameworks can either – Stream process 100s of MBs with low latency – Batch processTBs of data with high latency • Painful to maintain two different stacks – Different programming models – Doubles implementation effort
  • 8. Need for Spark Streaming • Many applications must process large streams of live data and provide results in near-real-time – Social network trends – Website statistics – Intrusion detection systems • Many environments require processing same data in live streaming as well as batch post-processing
  • 9. Micro batch • Spark streaming is a fast batch processing system • Spark streaming collects stream data into small batch and runs batch processing on it • Batch can be as small as 1 second to as big as multiple hours • Spark job creation and execution overhead is so low it can do all that under a second • These batches are called as DStreams
  • 10. Stateful Stream Processing • Traditional streaming systems have a event-driven record-at-a-time processing model – Each node has mutable state – For each record, update state & send new records • State is lost if node dies • Making stateful stream processing fault-tolerant is a challenge
  • 12. Streaming System - Storm • Replays record if not processed by a node • Processes each record at least once • May update mutable state twice • Mutable state can be lost due to failure
  • 13. Streaming System -Trident • Uses transactions to update state • Processes each record exactly once • Per state transaction updates slow
  • 14. Spark Streaming • Runs a streaming computation as a series of very small deterministic batch jobs • Splits the live stream into batches of X seconds • Spark treats each batch of data as RDDs and processes them using RDD operations • Processed results of RDD operations are returned in batches
  • 16. Spark Streaming • Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs • Batch sizes as low as 0.5 second, latency ~ 1 second • Potential for combining batch processing and streaming processing in the same system • Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
  • 17. Spark Streaming • tweetStream • .flatMap(_.toLower.split) • .map(word => (word, 1)) .reduceByWindow(“5s”, _ + _) T=1 T=2 … map reduceByWindow
  • 18. Streaming • Creates RDDs from stream source on a defined interval • Same operation as normal RDDs • Supports a variety of sources • Exactly once message guarantee
  • 19. Discretized Stream - DStream • Basic abstraction provided by Spark Streaming • Input stream is divided into multiple discrete batches • Represents a stream of data • Implemented as a sequence of RDDs • Each batch of DStream is represented as RDD underneath
  • 20. Discretized Stream - DStream • These RDD are replicated in cluster for fault tolerance • Every DStream operation results in RDD transformation • APIs provided to access these RDD is directly • Can combine stream and batch processing • Configurable intervals - 1 second, 5 second, 5 minutes etc.
  • 22. DStream transformation

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // args(0) is the Spark master URL; batch interval is 5 seconds
  val ssc = new StreamingContext(args(0), "wordcount", Seconds(5))
  val lines = ssc.socketTextStream("localhost", 50050)
  val words = lines.flatMap(_.split(" "))
  • 23. Socket Stream • Ability to listen to any socket on remote machines • Need to configure host and port • Both raw and text representations of the socket stream are available • Built-in retry mechanism
  • 24. File Stream • Allows tracking of new files in a given directory on HDFS • Whenever a new file appears, Spark Streaming picks it up • Only works for new files; modifications to existing files are not considered • Files are tracked using their creation time
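A minimal sketch of a file stream, reusing the ssc context from the earlier example (the HDFS path is an assumption):

  // each file that appears in the watched directory becomes part of the stream
  val fileLines = ssc.textFileStream("hdfs://namenode:8020/logs/incoming")
  fileLines.count().print()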
  • 26. Stateful Operations • Ability to maintain arbitrary state across multiple batches • Fault-tolerant • Exactly-once semantics • WAL (Write Ahead Log) protects against receiver crashes
  • 27. How Stateful Operations Work? • Generally, state is mutable and updated in place • But in functional programming, state is represented as a state machine moving from one state to the next • fn(oldState, newInfo) => newState • In Spark, state is represented using RDDs • A change in state is represented as a transformation of RDDs • The fault tolerance of RDDs carries over to the fault tolerance of state
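As a sketch, the running word count below keeps per-key state with updateStateByKey, reusing the words stream from the earlier example (the checkpoint directory is an assumption):

  ssc.checkpoint("hdfs://namenode:8020/checkpoints")  // required for stateful operations

  // fn(oldState, newInfo) => newState, applied per key on every batch
  val runningCounts = words.map((_, 1)).updateStateByKey[Int] {
    (newCounts: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + newCounts.sum)
  }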
  • 28. Transform API • In stream processing, the ability to combine stream data with batch data is extremely important • Both the batch API and the stream API share RDD as the underlying abstraction • The transform API of DStream allows direct access to the underlying RDDs, as sketched below • Example - Combine customer sales data with customer information
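A hedged sketch of the slide's example; salesStream, the file path, and the record layouts are assumptions:

  // static batch data: (customerId, customerName)
  val customers = ssc.sparkContext.textFile("/data/customers.csv")
    .map(_.split(",")).map(f => (f(0), f(1)))

  // salesStream: DStream[(customerId, saleAmount)]; transform exposes each batch's RDD
  val enrichedSales = salesStream.transform(rdd => rdd.join(customers))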
  • 30. DStream Inputs • A DStream is created from – streaming input sources – applying transformations on existing DStreams • Basic input sources – Built-in: file system, socket – Non-built-in: Avro, CSV, … – Unreliable [diagram: Ingest → Transform → Output]
  • 31. DStream Inputs • Advanced input sources – Twitter, Kafka, Flume, Kinesis, MQTT, … – Need an external library – Reliable or unreliable • Custom input DStream - implement two classes – Input DStream – Receiver • Reliable Receiver • Unreliable Receiver
  • 32. DStream Creation via Transformation • Data is collected, buffered and replicated by a receiver (one per DStream) and then pushed to a stream as small RDDs • Transformations modify data from one DStream to another • Classifications – Standard RDD operations – map, countByValue, reduceByKey, join, … – Stateful operations – window, updateStateByKey, transform, countByValueAndWindow, …
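For instance, a sketch of a windowed stateful operation over the words stream (the window and slide durations are illustrative):

  // word counts over a 30-second window, recomputed every 10 seconds
  val windowedCounts = words.map((_, 1))
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))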
  • 34. Comparison with Storm • Higher throughput than Storm – Spark Streaming - 670k records/sec/node – Storm - 115k records/sec/node – Commercial systems: 100-500k records/sec/node [charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep, Spark vs. Storm]
  • 35. SPARK SQL Apache Spark Data Access
  • 36. Spark SQL • Part of the core distribution since Spark 1.0 (April 2014) • Integrated with the Spark stack • Supports querying data either via SQL or via the Hive Query Language • Originated as the Apache Hive port to run on top of Spark (in place of MapReduce) • Can weave SQL queries with code transformations
  • 37. Spark SQL • Capability to expose Spark datasets over the JDBC API and allow running SQL-like queries on Spark data using traditional BI and visualization tools • Allows users to ETL their data from different formats like JSON, Parquet or a database, transform it, and expose it for ad-hoc querying • Bindings in Python, Scala, and Java
  • 40. SQL Access to Structured Data • Existing RDDs • Hive warehouses (uses existing metastore, SerDes and UDFs) • JDBC/ODBC - use existing BI tools to query large datasets
  • 41. DataFrame • A distributed collection of data rows organized into named columns • An abstraction for selecting, filtering, aggregating and plotting structured data • Conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood • Constructed from sources – Structured data files – Hive tables – External databases – Existing RDDs
  • 42. DataFrame Internals • Internally represented as a logical plan • Lazy execution - computation only happens when an action (display result, save output) is required – Allows executions to be optimized by applying techniques such as predicate push-downs and bytecode generation • All DataFrame operations are also automatically parallelized and distributed on clusters
  • 43. DataFrame Construction - Python code • # Construct a DataFrame from the users table in Hive – users = context.table("users") • # from JSON files in S3 – logs = context.load("s3n://path/to/data.json", "json") • DataFrames provide a domain-specific language for distributed data manipulation
  • 44. Using DataFrames • # Create a new DataFrame that contains “young users” only – young = users.filter(users.age < 21) • # Alternatively, using Pandas-like syntax – young = users[users.age < 21] • # Increment everybody’s age by 1 – young.select(young.name, young.age + 1)
  • 45. Using DataFrames • # Count the number of young users by gender – young.groupBy("gender").count() • # Join young users with another DataFrame called logs – young.join(logs, logs.userId == users.userId, "left_outer") • #SQL using Spark SQL - Count number of users in the young DataFrame – young.registerTempTable("young") – context.sql("SELECT count(*) FROM young")
  • 46. Spark and Pandas - Conversion • # Convert Spark DataFrame to Pandas – pandas_df = young.toPandas() • # Create a Spark DataFrame from Pandas – spark_df = context.createDataFrame(pandas_df)
  • 47. DataFrame API • Common operations can be expressed as calls to the DataFrame API – Selecting required columns – Joining different data sources – Aggregation (count, sum, average, etc.) – Filtering
  • 48. Supported Data Formats and Sources 1. JSON files 2. Parquet files 3. Hive tables 4. Local file systems 5. Distributed file systems (HDFS) 6. Cloud storage (S3) 7. External RDBMS via JDBC 8. Extend DataFrames through Spark SQL’s external data sources API to support any third-party data formats or sources 9. Existing third-party extensions - Avro, CSV, ElasticSearch, and Cassandra
  • 49. Combine Multiple Sources • Join a site’s textual traffic log stored in S3 with a PostgreSQL database to count the number of times each user has visited the site – users = context.jdbc("jdbc:postgresql:production", "users") – logs = context.load("/path/to/traffic.log") – logs.join(users, logs.userId == users.userId, "left_outer") .groupBy("userId").agg({"*": "count"})
  • 50. Automatic Mechanisms to Read Less Data • Converting to more efficient formats • Using columnar formats (parquet) • Using partitioning (/year=2014/month=02/…) • Skipping data using statistics (min, max...) • Pushing predicates into storage systems (JDBC)
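A sketch of partition-based skipping in Scala (the paths, the events DataFrame, and the column names are assumptions):

  // write Parquet partitioned by year/month so queries can prune directories
  events.write.partitionBy("year", "month").parquet("/data/events")

  // only the /year=2014/month=2/ directory is scanned; the filter is pushed down
  val feb = sqlContext.read.parquet("/data/events").filter("year = 2014 AND month = 2")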
  • 51. Intelligent Optimization and Code Generation • DataFrames in Spark have their execution automatically optimized by a query optimizer • Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution • Because the optimizer understands the semantics of operations and structure of the data, it can make intelligent decisions to speed up computation
  • 52. Intelligent Optimization and Code Generation • At a high level, there are two types of optimizations • Catalyst applies logical optimizations such as predicate pushdown • The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data • In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding
  • 53. Intelligent Optimization and Code Generation • In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic • Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code • It can choose intelligently between broadcast joins and shuffle joins to reduce network traffic
  • 54. Intelligent Optimization and Code Generation • It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls • Performance improvements for existing Spark programs when they migrate to DataFrames • Since the optimizer generates JVM bytecode for execution, Python users experience the same high performance as Scala and Java users
  • 55. Plan Optimization & Execution DataFrames and SQL share the same optimization/execution pipeline
  • 56. SQL Execution Plans • Logical and Physical query plans – Both are trees representing query evaluation – Internal nodes are operators over the data – Logical plan is higher-level and algebraic – Physical plan is lower-level and operational • Logical plan operators – Correspond to query language constructs – Conceptually describe what operation needs to be performed • Physical plan operators – Correspond to implemented access methods – Physically implement the operation described by logical operators [diagram: SQL text → (parsing) → unresolved logical plan → (binding & analyzing) → logical plan → (optimizing) → optimized logical plan → (query planning) → physical plan]
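These plans can be inspected with an extended explain; a sketch assuming a registered people table:

  // prints the parsed, analyzed and optimized logical plans plus the physical plan
  sqlContext.sql("SELECT name FROM people WHERE id = 1").explain(true)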
  • 57. Query Example SELECT name FROM ( SELECT id, name FROM people ) p WHERE p.id = 1
  • 58. Naive Query Planning SELECT name FROM ( SELECT id, name FROM people ) p WHERE p.id = 1
  • 59. Optimized Execution • Writing imperative code to optimize all possible patterns is hard • Instead opt for simpler rules – Each rule makes a single change – Run multiple rules together to a fixed point
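A toy illustration of this idea (not Catalyst's actual API): each rule rewrites one pattern, and the rules are reapplied until the plan stops changing:

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan
  case class Project(cols: Seq[String], child: Plan) extends Plan

  // single rule, single change: push a filter beneath a projection
  def pushFilter(p: Plan): Plan = p match {
    case Filter(c, Project(cols, child)) => Project(cols, Filter(c, child))
    case Project(cols, child)            => Project(cols, pushFilter(child))
    case Filter(c, child)                => Filter(c, pushFilter(child))
    case other                           => other
  }

  // run the rule set to a fixed point
  def optimize(p: Plan): Plan = {
    val next = pushFilter(p)
    if (next == p) p else optimize(next)
  }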
  • 62. MLLIB Apache Spark Machine Learning Library
  • 63. MLlib
  • 65. Key Features • Low-level library in Spark • Built-in data analysis workflow • Free performance gains • Scalable • Python, Scala, Java APIs • Broad coverage of applications & algorithms • Rapid improvements in speed & robustness • Easy to use • Integrated workflow
  • 66. Functionality • Classes for common operations • Scaling, normalization, statistical summary, correlation, … • Numeric RDD operations, sampling, … • Random generators • Word extraction (TF-IDF) – generating feature vectors from text documents / web pages
  • 69. Linear Regression Example • Method run() trains the model • Parameters are set with the setters setNumIterations and setIntercept • The Stochastic Gradient Descent (SGD) algorithm is used to minimize the loss function – see the sketch below
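A hedged sketch of this example in Scala MLlib (the input path and the label,feature-list layout are assumptions):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  // each line: label,feature1 feature2 ...
  val training = sc.textFile("data/regression.txt").map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  }.cache()

  val algo = new LinearRegressionWithSGD()
  algo.setIntercept(true)               // fit an intercept term
  algo.optimizer.setNumIterations(100)  // SGD iterations
  val model = algo.run(training)        // run() trains the model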
  • 71. Pipeline API • A pipeline is a series of algorithms (feature transformation, model fitting, ...) • Easy workflow construction • Parameters can be supplied to each stage • Makes MLlib easier to use • Uses a uniform dataset representation - SchemaRDD from Spark SQL – multiple named columns (similar to a SQL table)
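A minimal pipeline sketch for text classification (the column names and the training DataFrame are assumptions):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  // three stages: two feature transformations, then model fitting
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)

  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  val model = pipeline.fit(training)  // training: DataFrame with "text" and "label" columns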
  • 74. Graphs are Everywhere • Social Networks • Web Graphs • User-Item Graphs
  • 75. GraphX • New API that blurs the distinction between graphs and tables • Unifies data-parallel and graph-parallel systems • Spark API for graphs – Web graphs and social networks – Graph-parallel computation like PageRank and collaborative filtering
  • 76. GraphX • Extends Spark RDD abstraction using Resilient Distributed Property Graph - a directed multi-graph with properties attached to each vertex and edge • Exposes fundamental operators like subgraph, joinVertices, and mapReduceTriplets for graph computation • Includes graph algorithms and builders for graph analytics tasks
  • 78. Unifying Data-Parallel and Graph-Parallel Analytics • Tables and Graphs are composable views of the same physical data • Each view has its own operators that exploit the semantics of the view to achieve efficient execution
  • 79. Property Graph • A directed graph with potentially multiple parallel edges sharing the same source and destination vertex with properties attached to each vertex and edge • Each vertex is keyed by a unique 64-bit long identifier (VertexID) • Edges have corresponding source and destination vertex identifiers • Properties are stored as Scala/Java objects with each edge and vertex in the graph
  • 80. Property Graph • Vertex Property – User Profile – Current PageRank Value • Edge Property – Weights – Relationships – Timestamps
  • 81. Property Graph • Constructed from raw files, RDDs and synthetic generators • Immutable, distributed, and fault-tolerant • Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes • Parts of the original graph (unaffected structure, attributes, and indices) are reused in the new graph • Each partition of the graph can be recreated on a different machine in the event of a failure • Represented using two Spark RDDs – Vertex collection: VertexRDD – Edge collection: EdgeRDD
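A small sketch of building a property graph from RDDs (the vertex and edge data are illustrative):

  import org.apache.spark.graphx.{Edge, Graph}

  // vertices keyed by 64-bit ids, carrying (name, role) properties
  val users = sc.parallelize(Seq(
    (1L, ("alice", "student")), (2L, ("bob", "postdoc")), (3L, ("carol", "professor"))))

  // directed edges carrying a relationship property
  val relationships = sc.parallelize(Seq(
    Edge(1L, 2L, "collaborator"), Edge(2L, 3L, "advisor")))

  val graph = Graph(users, relationships)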
  • 82. Graph Views • The Graph class contains members graph.vertices and graph.edges to access the vertices and edges of the graph • These members extend RDD[(VertexId, VD)] and RDD[Edge[ED]] • They are backed by optimized representations that leverage the internal GraphX representation of graph data
  • 83. Triplet View • The triplets operator joins vertices and edges • Logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class • The join combines each edge with its source and destination vertices, as in the sketch below
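Continuing the property-graph sketch above:

  // each EdgeTriplet exposes srcAttr, attr (the edge property) and dstAttr
  val facts = graph.triplets.map(t =>
    s"${t.srcAttr._1} is a ${t.attr} of ${t.dstAttr._1}")
  facts.collect().foreach(println)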
  • 85. Subgraph • Operator that takes vertex and edge predicates and returns a graph containing only the vertices that satisfy the vertex predicate (evaluate to true), together with the edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate
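Continuing the same sketch; both predicates are illustrative:

  // keep only non-student vertices and "advisor" edges between retained vertices
  val advisors = graph.subgraph(
    epred = t => t.attr == "advisor",
    vpred = (id, attr) => attr._2 != "student")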
  • 86. Distributed Graph Representation • Representing graphs using two RDDs – Edge collection – Vertex collection • Vertex-cut partitioning
  • 88. Distributed Graph Representation • Each vertex partition contains a bitmask and a routing table • Routing table - a logical map from a vertex id to the set of edge partitions that contain adjacent edges • Bitmask - enables set intersection and filtering – Vertex bitmasks are updated after each operation (e.g. mapReduceTriplets) – Vertices hidden by the bitmask do not participate in graph operations
  • 89. Graph Algorithms • Collaborative Filtering – Alternating Least Squares – Stochastic Gradient Descent – Tensor Factorization • Structured Prediction – Loopy Belief Propagation – Max-Product Linear Programs – Gibbs Sampling • Semi-supervised ML – Graph SSL – CoEM • Community Detection – Triangle Counting – K-core Decomposition – K-Truss • Graph Analytics – PageRank – Personalized PageRank – Shortest Path – Graph Coloring • Classification – Neural Networks
  • 90. References 1. http://spark.apache.org/graphx 2. http://spark.apache.org/streaming/ 3. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf 4. http://web.stanford.edu/class/cs346/qpnotes.html 5. https://github.com/apache/spark/tree/master/sql 6. M. Zaharia, M. Chowdhury. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Technical Report UCB/EECS-2011-82, July 2011 7. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013 8. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013 9. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013 10. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011 11. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011
  • 91. Thank You Visit My LinkedIn Profile https://in.linkedin.com/in/girishkhanzode

Editor's Notes

  1. Why is this faster? Windows & micro-batching… Do you really need sub-half-second streaming?