SlideShare a Scribd company logo
Chetan Khatri, Lead - Data Science. Accionlabs India.
Paris Scala User Group (PSUG), Paris - France.
23rd May, 2019.
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
M.Sc. - Computer Science from University of Kachchh, India.
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Alternative to spark default sort
● Why dropDuplicates() doesn’t result consistency, What is alternative
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Why not to use Scala Concurrent ‘Future’ explicitly!
● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data
processing.
● It provides high level API for Scala, Java, Python and R and optimized engine that supports general
execution graphs.
Structured Data / SQL - Spark SQL Graph Processing - GraphX
Machine Learning - MLlib Streaming - Spark Streaming,
Structured Streaming
RDD RDD RDD RDD
Logical Model Across Distributed Storage on Cluster
HDFS, S3
RDD RDD RDD
T T
RDD -> T -> RDD -> T -> RDD
T = Transformation
Integer RDD
String or Text RDD
Double or Binary RDD
RDD RDD RDD
T T
RDD RDD RDD
T A
RDD - T - RDD - T - RDD - T - RDD - A - RDD
T = Transformation
A = Action
Operations
Transformation
Action
TRANSFORMATIONSACTIONS
General Math / Statistical Set Theory / Relational Data Structure / I/O
map
gilter
flatMap
mapPartitions
mapPartitionsWithIndex
groupBy
sortBy
sample
randomSplit
union
intersection
subtract
distinct
cartesian
zip
keyBy
zipWithIndex
zipWithUniqueID
zipPartitions
coalesce
repartition
repartitionAndSortWithinPartitions
pipe
reduce
collect
aggregate
fold
first
take
forEach
top
treeAggregate
treeReduce
forEachPartition
collectAsMap
count
takeSample
max
min
sum
histogram
mean
variance
stdev
sampleVariance
countApprox
countApproxDistinct
takeOrdered
saveAsTextFile
saveAsSequenceFile
saveAsObjectFile
saveAsHadoopDataset
saveAsHadoopFile
saveAsNewAPIHadoopDataset
saveAsNewAPIHadoopFile
You care about control of dataset and knows how data looks like, you care
about low level API.
Don’t care about lot’s of lambda functions than DSL.
Don’t care about Schema or Structure of Data.
Don’t care about optimization, performance & inefficiencies!
Very slow for non-JVM languages like Python, R.
Don’t care about Inadvertent inefficiencies.
parsedRDD.filter { case (project, sprint, numStories) => project == "finance" }.
map { case (_, sprint, numStories) => (sprint, numStories) }.
reduceByKey(_ + _).
filter { case (sprint, _) => !isSpecialSprint(sprint) }.
take(100).foreach { case (project, stories) => println(s"project: $stories") }
val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
case class Employee(name: String, age: Int)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain
objects with compiled lambda
functions.
DataFrames
Datasets
Strongly Typing
Ability to use powerful lambda functions.
Spark SQL’s optimized execution engine (catalyst, tungsten)
Can be constructed from JVM objects & manipulated using Functional
transformations (map, filter, flatMap etc)
A DataFrame is a Dataset organized into named columns
DataFrame is simply a type alias of Dataset[Row]
SQL DataFrames Datasets
Syntax Errors Runtime Compile Time Compile Time
Analysis Errors Runtime Runtime Compile Time
Analysis errors are caught before a job runs on cluster
DataFrame
Dataset
Untyped API
Typed API
Dataset
(2016)
DataFrame = Dataset [Row]
Alias
Dataset [T]
// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
//filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance").
groupBy($"sprint").
agg(sum($"numStories").as("count")).
limit(100).
show(100)
project sprint numStories
finance 3 20
finance 4 22
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
SQL AST
DataFrame
Datasets
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
Physical
Plans
CostModel
Selected
Physical
Plan
RDD
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
events file
employees
table
join
filter
Logical Plan
scan
(employees)
filter
Scan
(events)
join
Physical Plan
Optimized
scan
(events)
Optimized
scan
(employees)
join
Physical Plan
With Predicate Pushdown
and Column Pruning
Source: Databricks
Source: Databricks
Executors
Cores
Containers
Stage
Job
Task
Job - Each transformation and action mapping in Spark would create a separate jobs.
Stage - A Set of task in each job which can run parallel using ThreadPoolExecutor.
Task - Lowest level of Concurrent and Parallel execution Unit.
Each stage is split into #number-of-partitions tasks,
i.e Number of Tasks = stage * number of partitions in the stage
yarn.scheduler.minimum-allocation-vcores = 1
Yarn.scheduler.maximum-allocation-vcores = 6
Yarn.scheduler.minimum-allocation-mb = 4096
Yarn.scheduler.maximum-allocation-mb = 28832
Yarn.nodemanager.resource.memory-mb = 54000
Number of max containers you can run = (Yarn.nodemanager.resource.memory-mb = 54000 /
Yarn.scheduler.minimum-allocation-mb = 4096) = 13
What happens when you run this code?
What would be the impact at Database engine side?
JoinSelection execution planning strategy uses
spark.sql.autoBroadcastJoinThreshold property (default: 10M) to control the size
of a dataset before broadcasting it to all worker nodes when performing a join.
val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt
scala> threshold / 1024 / 1024
res0: Int = 10
// logical plan with tree numbered
sampleDF.queryExecution.logical.numberedTreeString
// Query plan
sampleDF.explain
Repartition: Boost the Parallelism, by increasing the number of Partitions. Partition on Joins, to get
same key joins faster.
// Reduce number of partitions without shuffling, where repartition does equal data shuffling across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, In case of bulk JDBC write. Parameter "bulkCopyBatchSize" -> "2500", means Dataframe has 10 partitions
and each partition will write 2500 records Parallely.
Reduce: Impact on Network Communication, File I/O, Network I/O, Bandwidth I/O etc.
1. // disable autoBroadcastJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // Order doesn't matter
table1.leftjoin(table2) or table2.leftjoin(table1)
3. force broadcast, if one DataFrame is not small!
4. Minimize shuffling & Boost Parallism, Partitioning, Bucketing, coalesce, repartition,
HashPartitioner
Code!
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production

More Related Content

What's hot

Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Spark Summit
 
Robert Meyer- pypet
Robert Meyer- pypetRobert Meyer- pypet
Robert Meyer- pypet
PyData
 
Machine learning using spark
Machine learning using sparkMachine learning using spark
Machine learning using spark
Ran Silberman
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
Amund Tveit
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
Purna Chander
 
Dr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumDr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with Palladium
PyData
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
Sakthi Dasans
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
Kelly Technologies
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Alexander Hendorf
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
Padma shree. T
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
Sigmoid
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
Julian Hyde
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Steve Watt
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
Namgee Lee
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
AmanBhalla14
 

What's hot (20)

Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Robert Meyer- pypet
Robert Meyer- pypetRobert Meyer- pypet
Robert Meyer- pypet
 
Machine learning using spark
Machine learning using sparkMachine learning using spark
Machine learning using spark
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Dr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with PalladiumDr. Andreas Lattner- Setting up predictive services with Palladium
Dr. Andreas Lattner- Setting up predictive services with Palladium
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 

Similar to No more struggles with Apache Spark workloads in production

PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
Chetan Khatri
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
IndicThreads
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
ajay_ei
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
Vincent Poncet
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
Databricks
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
MarcoYuriFujiiMelo
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
Ravindra kumar
 

Similar to No more struggles with Apache Spark workloads in production (20)

PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 

More from Chetan Khatri

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Chetan Khatri
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Chetan Khatri
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
Chetan Khatri
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
Chetan Khatri
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
Chetan Khatri
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
Chetan Khatri
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
Chetan Khatri
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
Chetan Khatri
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
Chetan Khatri
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
Chetan Khatri
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
Chetan Khatri
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
Chetan Khatri
 
Internet of things initiative-cskskv
Internet of things   initiative-cskskvInternet of things   initiative-cskskv
Internet of things initiative-cskskv
Chetan Khatri
 
High level architecture solar power plant
High level architecture solar power plantHigh level architecture solar power plant
High level architecture solar power plant
Chetan Khatri
 
Alumni talk-university-of-kachchh
Alumni talk-university-of-kachchhAlumni talk-university-of-kachchh
Alumni talk-university-of-kachchh
Chetan Khatri
 
Pycon 2016-open-space
Pycon 2016-open-spacePycon 2016-open-space
Pycon 2016-open-space
Chetan Khatri
 
Pycon india-2016-success-story
Pycon india-2016-success-storyPycon india-2016-success-story
Pycon india-2016-success-story
Chetan Khatri
 

More from Chetan Khatri (20)

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
 
Internet of things initiative-cskskv
Internet of things   initiative-cskskvInternet of things   initiative-cskskv
Internet of things initiative-cskskv
 
High level architecture solar power plant
High level architecture solar power plantHigh level architecture solar power plant
High level architecture solar power plant
 
Alumni talk-university-of-kachchh
Alumni talk-university-of-kachchhAlumni talk-university-of-kachchh
Alumni talk-university-of-kachchh
 
Pycon 2016-open-space
Pycon 2016-open-spacePycon 2016-open-space
Pycon 2016-open-space
 
Pycon india-2016-success-story
Pycon india-2016-success-storyPycon india-2016-success-story
Pycon india-2016-success-story
 

Recently uploaded

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 

Recently uploaded (20)

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

No more struggles with Apache Spark workloads in production

  • 1. Chetan Khatri, Lead - Data Science. Accionlabs India. Paris Scala User Group (PSUG), Paris - France. 23rd May, 2019. Twitter: @khatri_chetan, Email: chetan.khatri@live.com chetan.khatri@accionlabs.com LinkedIn: https://www.linkedin.com/in/chetkhatri Github: chetkhatri
  • 2. Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. M.Sc. - Computer Science from University of Kachchh, India.
  • 3. ● Apache Spark ● Primary data structures (RDD, DataSet, Dataframe) ● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark. ● Parallel read from JDBC: Challenges and best practices. ● Bulk Load API vs JDBC write ● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin ● Avoid unnecessary shuffle ● Alternative to spark default sort ● Why dropDuplicates() doesn’t result consistency, What is alternative ● Optimize Spark stage generation plan ● Predicate pushdown with partitioning and bucketing ● Why not to use Scala Concurrent ‘Future’ explicitly!
  • 4. ● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data processing. ● It provides high level API for Scala, Java, Python and R and optimized engine that supports general execution graphs. Structured Data / SQL - Spark SQL Graph Processing - GraphX Machine Learning - MLlib Streaming - Spark Streaming, Structured Streaming
  • 5.
  • 6. RDD RDD RDD RDD Logical Model Across Distributed Storage on Cluster HDFS, S3
  • 7. RDD RDD RDD T T RDD -> T -> RDD -> T -> RDD T = Transformation
  • 8. Integer RDD String or Text RDD Double or Binary RDD
  • 9. RDD RDD RDD T T RDD RDD RDD T A RDD - T - RDD - T - RDD - T - RDD - A - RDD T = Transformation A = Action
  • 11. TRANSFORMATIONSACTIONS General Math / Statistical Set Theory / Relational Data Structure / I/O map gilter flatMap mapPartitions mapPartitionsWithIndex groupBy sortBy sample randomSplit union intersection subtract distinct cartesian zip keyBy zipWithIndex zipWithUniqueID zipPartitions coalesce repartition repartitionAndSortWithinPartitions pipe reduce collect aggregate fold first take forEach top treeAggregate treeReduce forEachPartition collectAsMap count takeSample max min sum histogram mean variance stdev sampleVariance countApprox countApproxDistinct takeOrdered saveAsTextFile saveAsSequenceFile saveAsObjectFile saveAsHadoopDataset saveAsHadoopFile saveAsNewAPIHadoopDataset saveAsNewAPIHadoopFile
  • 12. You care about control of dataset and knows how data looks like, you care about low level API. Don’t care about lot’s of lambda functions than DSL. Don’t care about Schema or Structure of Data. Don’t care about optimization, performance & inefficiencies! Very slow for non-JVM languages like Python, R. Don’t care about Inadvertent inefficiencies.
  • 13. parsedRDD.filter { case (project, sprint, numStories) => project == "finance" }. map { case (_, sprint, numStories) => (sprint, numStories) }. reduceByKey(_ + _). filter { case (sprint, _) => !isSpecialSprint(sprint) }. take(100).foreach { case (project, stories) => println(s"project: $stories") }
  • 14. val employeesDF = spark.read.json("employees.json") // Convert data to domain objects. case class Employee(name: String, age: Int) val employeesDS: Dataset[Employee] = employeesDF.as[Employee] val filterDS = employeesDS.filter(p => p.age > 3) Type-safe: operate on domain objects with compiled lambda functions.
  • 16. Strongly Typing Ability to use powerful lambda functions. Spark SQL’s optimized execution engine (catalyst, tungsten) Can be constructed from JVM objects & manipulated using Functional transformations (map, filter, flatMap etc) A DataFrame is a Dataset organized into named columns DataFrame is simply a type alias of Dataset[Row]
  • 17. SQL DataFrames Datasets Syntax Errors Runtime Compile Time Compile Time Analysis Errors Runtime Runtime Compile Time Analysis errors are caught before a job runs on cluster
  • 19. // convert RDD -> DF with column names val parsedDF = parsedRDD.toDF("project", "sprint", "numStories") //filter, groupBy, sum, and then agg() parsedDF.filter($"project" === "finance"). groupBy($"sprint"). agg(sum($"numStories").as("count")). limit(100). show(100) project sprint numStories finance 3 20 finance 4 22
  • 20. parsedDF.createOrReplaceTempView("audits") val results = spark.sql( """SELECT sprint, sum(numStories) AS count FROM audits WHERE project = 'finance' GROUP BY sprint LIMIT 100""") results.show(100) project sprint numStories finance 3 20 finance 4 22
  • 21. // DataFrame data.groupBy("dept").avg("age") // SQL select dept, avg(age) from data group by 1 // RDD data.map { case (dept, age) => dept -> (age, 1) } .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) } .map { case (dept, (age, c)) => dept -> age / c }
  • 22. SQL AST DataFrame Datasets Unresolved Logical Plan Logical Plan Optimized Logical Plan Physical Plans CostModel Selected Physical Plan RDD
  • 23. employees.join(events, employees("id") === events("eid")) .filter(events("date") > "2015-01-01") events file employees table join filter Logical Plan scan (employees) filter Scan (events) join Physical Plan Optimized scan (events) Optimized scan (employees) join Physical Plan With Predicate Pushdown and Column Pruning
  • 26.
  • 28. Job - Each transformation and action mapping in Spark would create a separate jobs. Stage - A Set of task in each job which can run parallel using ThreadPoolExecutor. Task - Lowest level of Concurrent and Parallel execution Unit. Each stage is split into #number-of-partitions tasks, i.e Number of Tasks = stage * number of partitions in the stage
  • 29.
  • 30.
  • 31.
  • 32.
  • 33. yarn.scheduler.minimum-allocation-vcores = 1 Yarn.scheduler.maximum-allocation-vcores = 6 Yarn.scheduler.minimum-allocation-mb = 4096 Yarn.scheduler.maximum-allocation-mb = 28832 Yarn.nodemanager.resource.memory-mb = 54000 Number of max containers you can run = (Yarn.nodemanager.resource.memory-mb = 54000 / Yarn.scheduler.minimum-allocation-mb = 4096) = 13
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41. What happens when you run this code? What would be the impact at Database engine side?
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51. JoinSelection execution planning strategy uses spark.sql.autoBroadcastJoinThreshold property (default: 10M) to control the size of a dataset before broadcasting it to all worker nodes when performing a join. val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt scala> threshold / 1024 / 1024 res0: Int = 10 // logical plan with tree numbered sampleDF.queryExecution.logical.numberedTreeString // Query plan sampleDF.explain
  • 52. Repartition: Boost the Parallelism, by increasing the number of Partitions. Partition on Joins, to get same key joins faster. // Reduce number of partitions without shuffling, where repartition does equal data shuffling across the cluster. employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT")) For example, In case of bulk JDBC write. Parameter "bulkCopyBatchSize" -> "2500", means Dataframe has 10 partitions and each partition will write 2500 records Parallely. Reduce: Impact on Network Communication, File I/O, Network I/O, Bandwidth I/O etc.
  • 53. 1. // disable autoBroadcastJoin spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) 2. // Order doesn't matter table1.leftjoin(table2) or table2.leftjoin(table1) 3. force broadcast, if one DataFrame is not small! 4. Minimize shuffling & Boost Parallism, Partitioning, Bucketing, coalesce, repartition, HashPartitioner
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 61.
  • 62.
  • 63.
  • 64. Code!