SlideShare a Scribd company logo
Evolution of Apache Spark
Journey of Spark in 1.x series
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Spark 1.0
● State of Big data
● Change in ecosystem
● Dawn of structured data
● Working with structured sources
● Dawn of custom memory management
● Evolution of Libraries
Spark 1.0
● Release on May 2014 [1]
● First production ready, backward compatible release
● Contains
○ Spark batch
○ Spark streaming
○ Shark
○ MLLib and Graphx
● Developed over 4 years
● Better hadoop
State of Big data Industry
● Map/Reduce was the way to do big data processing
● HDFS was primary source of the data
● Tools like Sqoop developed for moving data to hdfs and
hdfs acted like single point of source
● Every data by default assumed to be unstructured and
structure was laid on top of it
● Hive and Pig were popular ways to do structured and
semi structured data processing on top of Map/Reduce
Spark 1.0 Ideas
● RDD abstraction was supported to do Map/Reduce style
programming
● Primary source supported was HDFS and memory as
the speedup layer
● Spark-streaming viewed as faster batch processing
rather than as streaming
● To support Hive, Shark was created to generate RDD
code rather than Map/Reduce
Changes from 2014
● Big data industry has gone through many radical
changes in thinking in last two years
● Some of those changes started in spark and some other
are influenced by other frameworks
● These changes are important to understand why Spark
2.0 abstractions are radically different than Spark 1.0
● Many of these are already discussed in earlier meetups,
links to the videos are in reference
Dawn of Structured Data
Usage of Big data in 2014.
● Most of the people were using higher level tools like
Hive and Pig to process data rather using Map/Reduce
● Most of the data was residing in the RDBMS databases
and user ETL data from mysql to hive to query
● So lot of use cases were analysing structured data
rather than basic assumption of unstructured in big data
world
● Huge time is consumed for ETL and non optimized
workflows from Hive
Spark with Structured Data in 1.2
● Spark recognised need of structured data in the market
and started to evolve the platform to support that use
case
● First attempt was to have a specialised RDD called
SchemaRDD in Spark 1.2 which represented that
schema
● But this approach was not clean
● Also even though there was InputFormat to read from
structured data, there was no direct API to read from
Spark
DataSource API in Spark 1.3
● First API to provide an unified API to read from
structured and semi structured sources
● Can read from RDBMS, NoSql databases like
Mongodb,Cassandra etc
● Advanced API like InputFormat which gives lot of
control to source to optimize locality of data
● So in Spark 1.3, spark addressed the need of structured
data being first class in Big data ecosystem
● For more info refer to, Anatomy of DataSource API talk[2]
DataFrame abstraction in Spark
● Spark understood modifying the RDD abstraction is not
good enough
● Many frameworks like Hive, Pig tried and failed mapping
querying efficiently on Map/Reduce
● So Spark came up with Dataframe abstraction which
goes through a complete different pipeline that of RDD
which is highly optimized
● For more info refer to, Anatomy of DataFrame API talk [3]
Evolution of InMemory processing
In memory in Spark 1.0
● Spark was the first open source big data framework to
embrace in memory computing
● With cheaper hardware and abstractions like RDD
allowed spark to exploit memory in efficient way than all
other hadoop ecosystem projects
● The first implementation of in memory computing
followed typical cache approach of keeping serialized
java bytes
● This proved to be challenging in future
Challenges of in memory in Java
● As more and more big data frameworks started to
exploit memory, soon they realised few limitation of
Java memory model
● Java memory is tuned for short lived objects and
complete control of memory is given to JVM
● But big data system started using JVM for long term
storage, JVM memory model started feel inadequate
● Also as java heap grew, to cache more data, GC
pauses started to kill performance
Custom memory management
● Apache Flink is first big data system to implement
custom memory management in java
● Flink follows Dataframe like API with custom memory
model
● The custom memory model with non GC based
approach proved to be highly successful
● By observing trends in community, optly Spark also
adopted same in Spark 1.4
Tungsten in Spark 1.4
● Spark release first version of custom memory
management in 1.4 version
● It was only supported DF as they need custom memory
model
● Custom memory management greatly improved use of
spark in higher vm size and fewer GC paused
● Solved OOM issues which plagued earlier versions of
spark
● For more info refer to, Anatomy of In memory
management in Spark talk [4]
DSL’s for data processing
RDD and Map/Reduce API API
● RDD API of spark follows functional programming
paradigm which is similar to Map/Reduce
● RDD API passes around opaque function objects which
is great for programming but bad for system based
optimization
● Map/Reduce API of Java also follows same patterns but
less elegant than scala ones
● Hard to optimise compared to Pig/Hive
● So we saw a steady increase in custom DSL’s in
hadoop world
Need of DSL’s in Hadoop
● DSL’s like Pig or Hive are much more easier to
understand compare to Java API
● Less error prone and helps to be very specific
● Can be easily optimised, as DSL only focuses on what
to do not how to do
● As Java Map/Reduce mixes what with how, it’s hard to
optimize compare to Hive and Pig
● So more and more people prefered these DSL over
platform level API’s
Challenges of DSL in Hadoop
● Hive and Pig DSL do not integrate well with
Map/Reduce API’s
● DSL often lack the flexibility of complete programming
language
● Hive/Pig DSL don’t define single abstraction to share so
you will be not able mix
● DSL are powerful for optimization but soon become
limited in terms of functionality
Scala as language to host DSL
● Scala is one of the first language to embrace DSL as
the first class citizens
● Scala features like implicits, higher order functions,
structured types etc allow easily build DSL’s and
integrate with language
● This allows any library on scala to integrate DSL and
harness full power of language
● Many libraries define their own DSL outside big data. Ex
: Slick, Akka-http, Sbt
DF DSL and Spark SQL DSL
● To harness power of custom memory management and
hive like optimizes spark encourages to write DF and
spark sql DSL over spark RDD code
● Whenever we write this DSL, all the features of scala
language and its libraries are available,which makes it
more powerful that Pig/ Hive
● Other frameworks like Flink, Beam follow same ideas on
scala, Java 8 etc
● You can easily mix and match DSL with RDD API
Dataset DSL in Spark 1.6
● Dataframe DSL introduced in 1.4 and stabilised in 1.5
● As spark observed the user and performance benefits of
DSL based programming, it wanted to make as import
pillar of Spark
● So in Spark 1.6, Spark released Dataset DSL which is
poised to complete RDD API from user land
● This indicates a big shift in thinking as we are more and
more moving away from 1.0 Map/Reduce and
unstructured mindset.
Evolution of Libraries
Evolution of libraries vs frameworks
● Spark is one of the first big data framework to build
platform rather than collection of frameworks
● Single abstraction results in multiple libraries not
multiple frameworks
● All these libraries get benefits from the improvements in
run time
● This made spark to build lot of ecosystem in very less
time
● To understand the meaning of platform, refer to
Introduction to Flink talk [5]
Data exchange between Libraries
● As more and more libraries are added to spark, having
common way to exchange data became important
● Initially libraries started using RDD as data exchange
format, but soon discovered some limitations
● Limitations of RDD as data exchange format is
○ No defined schema. Need to come up with domain
object for each library
○ Too low level
○ Custom serialization is hard to integrate
DataFrame as data exchange format
● From last few release, spark is making Dataframe as
new data exchange format of Spark
● Dataframe has schema and can be easily passed
around between libraries
● Dataframe is higher level abstraction compared RDD
● As Dataframe are serialized using platform specific
code generation, all libraries will be following same
serialization
● Dataset will follow the same advantages
Learnings from Spark 1.x
● Structured/Semi structured data is the first class of Big
data processing system
● Custom memory management and code generated
serialization gives best performance on JVM
● DataFrame/ Dataset are the new abstraction layers to
build next generation big data processing system
● DSL is the way forward over Map/Reduce like API’s
● Having high level structured abstractions make libraries
coexist happily on a platform
References
1. http://spark.apache.org/news/spark-1-0-0-released.html
2. https://www.youtube.com/watch?v=ckX6fT3kYG0
3. https://www.youtube.com/watch?v=iKOGBr-kOks
4. https://www.youtube.com/watch?v=7nIMpD5TyNs
5. https://www.youtube.com/watch?v=jErEhxP8LYQ

More Related Content

What's hot

Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
Kriangkrai Chaonithi
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
Edureka!
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Migrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQLMigrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQL
Umair Mansoob
 
Apache spark
Apache sparkApache spark
Apache spark
shima jafari
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Elastic search Walkthrough
Elastic search WalkthroughElastic search Walkthrough
Elastic search Walkthrough
Suhel Meman
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
Yugabyte
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
ScyllaDB
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
Ashwani Kumar
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptx
Surya937648
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
Amazon Web Services
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
Amazon Web Services
 
03 spark rdd operations
03 spark rdd operations03 spark rdd operations
03 spark rdd operations
Venkat Datla
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
Fabio Fumarola
 
Log analytics with ELK stack
Log analytics with ELK stackLog analytics with ELK stack
Log analytics with ELK stack
AWS User Group Bengaluru
 

What's hot (20)

Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Migrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQLMigrating Oracle database to PostgreSQL
Migrating Oracle database to PostgreSQL
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Elastic search Walkthrough
Elastic search WalkthroughElastic search Walkthrough
Elastic search Walkthrough
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
Introduction to MongoDB.pptx
Introduction to MongoDB.pptxIntroduction to MongoDB.pptx
Introduction to MongoDB.pptx
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
03 spark rdd operations
03 spark rdd operations03 spark rdd operations
03 spark rdd operations
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
 
Log analytics with ELK stack
Log analytics with ELK stackLog analytics with ELK stack
Log analytics with ELK stack
 

Viewers also liked

Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
datamantra
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
datamantra
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
datamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
Ramez Al-Fayez
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
datamantra
 
Mesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overviewMesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overview
Krishna-Kumar
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcare
Taposh Roy
 
Ranking the Web with Spark
Ranking the Web with SparkRanking the Web with Spark
Ranking the Web with Spark
Sylvain Zimmer
 

Viewers also liked (20)

Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Mesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overviewMesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overview
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcare
 
Ranking the Web with Spark
Ranking the Web with SparkRanking the Web with Spark
Ranking the Web with Spark
 

Similar to Evolution of apache spark

Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
datamantra
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Spark 101
Spark 101Spark 101
Apache spark on Hadoop Yarn Resource Manager
Apache spark on Hadoop Yarn Resource ManagerApache spark on Hadoop Yarn Resource Manager
Apache spark on Hadoop Yarn Resource Manager
haridasnss
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013
scorlosquet
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
Holden Karau
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
Phaneendra Chiruvella
 

Similar to Evolution of apache spark (20)

Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark 101
Spark 101Spark 101
Spark 101
 
Apache spark on Hadoop Yarn Resource Manager
Apache spark on Hadoop Yarn Resource ManagerApache spark on Hadoop Yarn Resource Manager
Apache spark on Hadoop Yarn Resource Manager
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013The Semantic Web and Drupal 7 - Loja 2013
The Semantic Web and Drupal 7 - Loja 2013
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
 
Making the big data ecosystem work together with python apache arrow, spark,...
Making the big data ecosystem work together with python  apache arrow, spark,...Making the big data ecosystem work together with python  apache arrow, spark,...
Making the big data ecosystem work together with python apache arrow, spark,...
 
Spark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark GroupSpark Streaming and MLlib - Hyderabad Spark Group
Spark Streaming and MLlib - Hyderabad Spark Group
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
datamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
datamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 

More from datamantra (20)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Recently uploaded

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 

Evolution of apache spark

  • 1. Evolution of Apache Spark Journey of Spark in 1.x series
  • 2. ● Madhukara Phatak ● Technical Lead at Tellius ● Consultant and Trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Spark 1.0 ● State of Big data ● Change in ecosystem ● Dawn of structured data ● Working with structured sources ● Dawn of custom memory management ● Evolution of Libraries
  • 4. Spark 1.0 ● Release on May 2014 [1] ● First production ready, backward compatible release ● Contains ○ Spark batch ○ Spark streaming ○ Shark ○ MLLib and Graphx ● Developed over 4 years ● Better hadoop
  • 5. State of Big data Industry ● Map/Reduce was the way to do big data processing ● HDFS was primary source of the data ● Tools like Sqoop developed for moving data to hdfs and hdfs acted like single point of source ● Every data by default assumed to be unstructured and structure was laid on top of it ● Hive and Pig were popular ways to do structured and semi structured data processing on top of Map/Reduce
  • 6. Spark 1.0 Ideas ● RDD abstraction was supported to do Map/Reduce style programming ● Primary source supported was HDFS and memory as the speedup layer ● Spark-streaming viewed as faster batch processing rather than as streaming ● To support Hive, Shark was created to generate RDD code rather than Map/Reduce
  • 7. Changes from 2014 ● Big data industry has gone through many radical changes in thinking in last two years ● Some of those changes started in spark and some other are influenced by other frameworks ● These changes are important to understand why Spark 2.0 abstractions are radically different than Spark 1.0 ● Many of these are already discussed in earlier meetups, links to the videos are in reference
  • 9. Usage of Big data in 2014. ● Most of the people were using higher level tools like Hive and Pig to process data rather using Map/Reduce ● Most of the data was residing in the RDBMS databases and user ETL data from mysql to hive to query ● So lot of use cases were analysing structured data rather than basic assumption of unstructured in big data world ● Huge time is consumed for ETL and non optimized workflows from Hive
  • 10. Spark with Structured Data in 1.2 ● Spark recognised need of structured data in the market and started to evolve the platform to support that use case ● First attempt was to have a specialised RDD called SchemaRDD in Spark 1.2 which represented that schema ● But this approach was not clean ● Also even though there was InputFormat to read from structured data, there was no direct API to read from Spark
  • 11. DataSource API in Spark 1.3 ● First API to provide an unified API to read from structured and semi structured sources ● Can read from RDBMS, NoSql databases like Mongodb,Cassandra etc ● Advanced API like InputFormat which gives lot of control to source to optimize locality of data ● So in Spark 1.3, spark addressed the need of structured data being first class in Big data ecosystem ● For more info refer to, Anatomy of DataSource API talk[2]
  • 12. DataFrame abstraction in Spark ● Spark understood modifying the RDD abstraction is not good enough ● Many frameworks like Hive, Pig tried and failed mapping querying efficiently on Map/Reduce ● So Spark came up with Dataframe abstraction which goes through a complete different pipeline that of RDD which is highly optimized ● For more info refer to, Anatomy of DataFrame API talk [3]
  • 13. Evolution of InMemory processing
  • 14. In memory in Spark 1.0 ● Spark was the first open source big data framework to embrace in memory computing ● With cheaper hardware and abstractions like RDD allowed spark to exploit memory in efficient way than all other hadoop ecosystem projects ● The first implementation of in memory computing followed typical cache approach of keeping serialized java bytes ● This proved to be challenging in future
  • 15. Challenges of in memory in Java ● As more and more big data frameworks started to exploit memory, soon they realised few limitation of Java memory model ● Java memory is tuned for short lived objects and complete control of memory is given to JVM ● But big data system started using JVM for long term storage, JVM memory model started feel inadequate ● Also as java heap grew, to cache more data, GC pauses started to kill performance
  • 16. Custom memory management ● Apache Flink is first big data system to implement custom memory management in java ● Flink follows Dataframe like API with custom memory model ● The custom memory model with non GC based approach proved to be highly successful ● By observing trends in community, optly Spark also adopted same in Spark 1.4
  • 17. Tungsten in Spark 1.4 ● Spark release first version of custom memory management in 1.4 version ● It was only supported DF as they need custom memory model ● Custom memory management greatly improved use of spark in higher vm size and fewer GC paused ● Solved OOM issues which plagued earlier versions of spark ● For more info refer to, Anatomy of In memory management in Spark talk [4]
  • 18. DSL’s for data processing
  • 19. RDD and Map/Reduce API API ● RDD API of spark follows functional programming paradigm which is similar to Map/Reduce ● RDD API passes around opaque function objects which is great for programming but bad for system based optimization ● Map/Reduce API of Java also follows same patterns but less elegant than scala ones ● Hard to optimise compared to Pig/Hive ● So we saw a steady increase in custom DSL’s in hadoop world
  • 20. Need of DSL’s in Hadoop ● DSL’s like Pig or Hive are much more easier to understand compare to Java API ● Less error prone and helps to be very specific ● Can be easily optimised, as DSL only focuses on what to do not how to do ● As Java Map/Reduce mixes what with how, it’s hard to optimize compare to Hive and Pig ● So more and more people prefered these DSL over platform level API’s
  • 21. Challenges of DSL in Hadoop ● Hive and Pig DSL do not integrate well with Map/Reduce API’s ● DSL often lack the flexibility of complete programming language ● Hive/Pig DSL don’t define single abstraction to share so you will be not able mix ● DSL are powerful for optimization but soon become limited in terms of functionality
  • 22. Scala as language to host DSL ● Scala is one of the first language to embrace DSL as the first class citizens ● Scala features like implicits, higher order functions, structured types etc allow easily build DSL’s and integrate with language ● This allows any library on scala to integrate DSL and harness full power of language ● Many libraries define their own DSL outside big data. Ex : Slick, Akka-http, Sbt
  • 23. DF DSL and Spark SQL DSL ● To harness power of custom memory management and hive like optimizes spark encourages to write DF and spark sql DSL over spark RDD code ● Whenever we write this DSL, all the features of scala language and its libraries are available,which makes it more powerful that Pig/ Hive ● Other frameworks like Flink, Beam follow same ideas on scala, Java 8 etc ● You can easily mix and match DSL with RDD API
  • 24. Dataset DSL in Spark 1.6 ● Dataframe DSL introduced in 1.4 and stabilised in 1.5 ● As spark observed the user and performance benefits of DSL based programming, it wanted to make as import pillar of Spark ● So in Spark 1.6, Spark released Dataset DSL which is poised to complete RDD API from user land ● This indicates a big shift in thinking as we are more and more moving away from 1.0 Map/Reduce and unstructured mindset.
  • 26. Evolution of libraries vs frameworks ● Spark is one of the first big data framework to build platform rather than collection of frameworks ● Single abstraction results in multiple libraries not multiple frameworks ● All these libraries get benefits from the improvements in run time ● This made spark to build lot of ecosystem in very less time ● To understand the meaning of platform, refer to Introduction to Flink talk [5]
  • 27. Data exchange between Libraries ● As more and more libraries are added to spark, having common way to exchange data became important ● Initially libraries started using RDD as data exchange format, but soon discovered some limitations ● Limitations of RDD as data exchange format is ○ No defined schema. Need to come up with domain object for each library ○ Too low level ○ Custom serialization is hard to integrate
  • 28. DataFrame as data exchange format ● From last few release, spark is making Dataframe as new data exchange format of Spark ● Dataframe has schema and can be easily passed around between libraries ● Dataframe is higher level abstraction compared RDD ● As Dataframe are serialized using platform specific code generation, all libraries will be following same serialization ● Dataset will follow the same advantages
  • 29. Learnings from Spark 1.x ● Structured/Semi structured data is the first class of Big data processing system ● Custom memory management and code generated serialization gives best performance on JVM ● DataFrame/ Dataset are the new abstraction layers to build next generation big data processing system ● DSL is the way forward over Map/Reduce like API’s ● Having high level structured abstractions make libraries coexist happily on a platform
  • 30. References 1. http://spark.apache.org/news/spark-1-0-0-released.html 2. https://www.youtube.com/watch?v=ckX6fT3kYG0 3. https://www.youtube.com/watch?v=iKOGBr-kOks 4. https://www.youtube.com/watch?v=7nIMpD5TyNs 5. https://www.youtube.com/watch?v=jErEhxP8LYQ