SlideShare a Scribd company logo
No more struggles with Apache Spark (PySpark)
workloads in production
Chetan Khatri, Solution Architect - Data Science.
Accionlabs India.
PyconZA 2019, The Wanderers Club in Illovo.
Johannesburg, South Africa
11th Oct, 2019
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
Who am I?
Solution Architect - Data Science @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Ex - Data Engineering @: Nazara Games, Eccella Corporation.
Masters - Computer Science from University of Kachchh, India.
Daily Activity?
Functional Programming, Distributed Computing, Python, Scala, Haskell, Data
Science, Product Development
Helping organizations create innovative
product and solutions using the emerging
technologies
An Innovation Focused
Technology Services
Firm
Employees
Clients
Accelerators
Global
Offices
Development
Centers
2300+
75+
20+
12+
7
Accion Labs - Introduction
● A Global Technology Services firm focussed Emerging Technologies
○ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients
● Profitable, venture-backed company
○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership
● Flexible Outcome-based Engagement Models
○ Projects, Extended teams, Shared IP, Co-development, Professional Services
● Framework Based Approach to Accelerate Digital Transformation
○ A collection of tools and frameworks, Breeze Digital Blueprint helps gain 25-30% efficiency
● Action-oriented Leadership Team
○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
4
Accion’s Emerging Tech Capabilities
Adaptive UI, UX Engineering
NLP, Voice Interface &
Chat Bots
Artificial Intelligence and
Machine Learning
Data Lake &
Big Data Analytics
Blockchain, Payment
Technologies
Cloud Strategy and
Transformation
Mobile Development
MicroServices and
Serverless Computing
QA Engineering, RPA and
DevOps Automation
SFDC, ServiceNow, IBM
Solutions, Azure
5
Agenda
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Airflow DAG scheduling for Apache Spark worflow. - Design, Architecture, Demo.
What is Apache Spark?
● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data
processing.
● It provides high level API for Scala, Java, Python and R and optimized engine that supports general
execution graphs.
Structured Data / SQL - Spark SQL Graph Processing - GraphX
Machine Learning - MLlib Streaming - Spark Streaming,
Structured Streaming
What are RDDs ?
1. Distributed Data Abstraction
RDD RDD RDD RDD
Logical Model Across Distributed Storage on Cluster
HDFS, S3
2. Resilient & Immutable
RDD RDD RDD
T T
RDD -> T -> RDD -> T -> RDD
T = Transformation
3. Compile-time Type Safe / Strongly type inference
Integer RDD
String or Text RDD
Double or Binary RDD
4. Lazy evaluation
RDD RDD RDD
T T
RDD RDD RDD
T A
RDD - T - RDD - T - RDD - T - RDD - A - RDD
T = Transformation
A = Action
Apache Spark Operations
Operations
Transformation
Action
Essential Spark Operations
TRANSFORMATIONSACTIONS
General Math / Statistical Set Theory / Relational Data Structure / I/O
map
gilter
flatMap
mapPartitions
mapPartitionsWithIndex
groupBy
sortBy
sample
randomSplit
union
intersection
subtract
distinct
cartesian
zip
keyBy
zipWithIndex
zipWithUniqueID
zipPartitions
coalesce
repartition
repartitionAndSortWithinPartitions
pipe
reduce
collect
aggregate
fold
first
take
forEach
top
treeAggregate
treeReduce
forEachPartition
collectAsMap
count
takeSample
max
min
sum
histogram
mean
variance
stdev
sampleVariance
countApprox
countApproxDistinct
takeOrdered
saveAsTextFile
saveAsSequenceFile
saveAsObjectFile
saveAsHadoopDataset
saveAsHadoopFile
saveAsNewAPIHadoopDataset
saveAsNewAPIHadoopFile
When to use RDDs ?
You care about control of dataset and knows how data looks like, you care
about low level API.
Don’t care about lot’s of lambda functions than DSL.
Don’t care about Schema or Structure of Data.
Don’t care about optimization, performance & inefficiencies!
Very slow for non-JVM languages like Python, R.
Don’t care about Inadvertent inefficiencies.
Inadvertent inefficiencies in RDDs
Structured in Spark
DataFrames
Datasets
Structured APIs in Apache Spark
SQL DataFrames Datasets
Syntax Errors Runtime Compile Time Compile Time
Analysis Errors Runtime Runtime Compile Time
Analysis errors are caught before a job runs on cluster
DataFrame API Code
// convert RDD -> DF with column names
parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
// filter, groupBy, sum, and then agg()
parsedDF.filter(lambda x: x[1] === "finance")
.groupBy("sprint")
.agg(sum("numStories").as("count"))
.limit(100)
.show(100)
project sprint numStories
finance 3 20
finance 4 22
DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
Catalyst in Spark
SQL AST
DataFrame
Datasets
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
Physical
Plans
CostModel
Selected
Physical
Plan
RDD
Example: DataFrame Optimization
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
events file
employees
table
join
filter
Logical Plan
scan
(employees)
filter
Scan
(events)
join
Physical Plan
Optimized
scan
(events)
Optimized
scan
(employees)
join
Physical Plan
With Predicate Pushdown
and Column Pruning
DataFrames are Faster than RDDs
Source: Databricks
Pragmatic
Approach
Executors
Cores
Containers
Stage
Job
Task
Spark Internals terminology
Job - Each transformation and action mapping in Spark would create a separate jobs.
Stage - A Set of task in each job which can run parallel using ThreadPoolExecutor.
Task - Lowest level of Concurrent and Parallel execution Unit.
Each stage is split into #number-of-partitions tasks,
i.e Number of Tasks = stage * number of partitions in the stage
Spark Internals: Jobs
Spark Internals: Stage
Spark Internals: Stage
Spark Internals: Tasks
Spark on Yarn Internals terminology
yarn.scheduler.minimum-allocation-vcores = 1
Yarn.scheduler.maximum-allocation-vcores = 6
Yarn.scheduler.minimum-allocation-mb = 4096
Yarn.scheduler.maximum-allocation-mb = 28832
Yarn.nodemanager.resource.memory-mb = 54000
Number of max containers you can run = (Yarn.nodemanager.resource.memory-mb = 54000 /
Yarn.scheduler.minimum-allocation-mb = 4096) = 13
Spark on Yarn Internals terminology
Resource Manager (Yarn) Tuning
Resource Manager (Yarn) Tuning
Resource Manager (Yarn) Tuning
Resource Manager (Yarn) Tuning
Spark Scheduler FIFO to FAIR
Parallel read from JDBC: Challenges
and best practices.
Spark JDBC Read
What happens when you run this code?
What would be the impact at Database engine side?
Spark JDBC Read: Impact on Database engine e.g MSSQL Server
Spark JDBC Read: Impact on Database engine e.g MSSQL Server
Spark Parallel JDBC Read
Spark Parallel JDBC Read
Spark Parallel JDBC Read
Impact on Database after Spark Parallel Read
Bulk Load API vs JDBC write
Bulk Load API vs JDBC write
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
JoinSelection execution planning strategy uses
spark.sql.autoBroadcastJoinThreshold property (default: 10M) to control the size
of a dataset before broadcasting it to all worker nodes when performing a join.
# check broadcast join threshold
>>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) / 1024 / 1024
10
# logical plan with tree numbered
sampleDF.queryExecution.logical.numberedTreeString
# Query plan
sampleDF.explain
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Repartition: Boost the Parallelism, by increasing the number of Partitions. Partition on Joins, to get
same key joins faster.
// Reduce number of partitions without shuffling, where repartition does equal data shuffling across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, In case of bulk JDBC write. Parameter "bulkCopyBatchSize" -> "2500", means Dataframe has 10 partitions
and each partition will write 2500 records Parallely.
Reduce: Impact on Network Communication, File I/O, Network I/O, Bandwidth I/O etc.
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
1. // disable autoBroadcastJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // Order doesn't matter
table1.leftjoin(table2) or table2.leftjoin(table1)
3. force broadcast, if one DataFrame is not small!
4. Minimize shuffling & Boost Parallism, Partitioning, Bucketing, coalesce, repartition,
HashPartitioner
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Spark Submit Hyper-parameters and Dynamic Allocation
./bin/spark-submit 
--conf spark.yarn.maxAppAttempts=1 
--name PyConLT19 
--master yarn 
--deploy-mode cluster 
--driver-memory 18g 
--executor-memory 24g 
--num-executors 4 
--executor-cores 6 
--conf spark.yarn.maxAppAttempts=1 
--conf spark.speculation=false 
--conf spark.broadcast.compress=true 
--conf spark.sql.broadcastTimeout=36000 
--conf spark.network.timeout=2500s 
--conf spark.dynamicAllocation.executorAllocationRatio=1 
--conf spark.executor.heartbeatInterval=30s 
--conf spark.dynamicAllocation.executorIdleTimeout=60s 
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s 
--conf spark.network.timeout=1200s 
--conf spark.dynamicAllocation.schedulerBacklogTimeout=15s 
--conf spark.yarn.maxAppAttempts=1 
--conf spark.shuffle.service.enabled=true 
--conf spark.dynamicAllocation.enabled=True 
--conf spark.dynamicAllocation.minExecutors=2 
--conf spark.dynamicAllocation.initialExecutors=2 
--conf spark.dynamicAllocation.maxExecutors=6 
examples/src/main/python/pi.py
Case Study: High Level Architecture
OLTP
Shadow
Data
Source
Apache
Spark
Spark
SQL
Sqoop
HDFS
Parquet
Yarn Cluster manager
Customer
Specific
Reporting
DB
Bulk
Load
Parallelism Orchestration: Airflow
Spark Streaming
Code!
Ref.
https://github.com/chetkhatri/getting-started-airflow-for-spark/blob/master/spark
_streaming_kafka.py
Key role of Apache Airflow for Scheduling Data
Pipelines
Codebase: https://github.com/chetkhatri/getting-started-airflow-for-spark
Trigger the Airflow DAG from API
curl -d ' {"conf":"{"retail_id":"29" , "env_type":"dev", "size_is":"medium"}", "run_id": "retailer_1111"}' -H "Content-Type:
application/json" -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs
Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin
Spark Submit Operator inherited from BashOperator
https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_s
ubmit_operator.py
Airflow - config.txt
Airflow spark_config.txt
Airflow - spark_hyperparameters.json
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_master_tables_subdag
Airflow - nextgen_data_master_tables_subdag
Airflow - common_util
Airflow - common_util
Airflow - common_util
Airflow - common_util
Airflow - common_util
Airflow - common_util
References
[1] How to Setup Airflow Multi-Node Cluster with Celery & RabbitMQ.
[URL]
https://medium.com/@khatri_chetan/challenges-and-struggle-while-setting-up-multi-node-airflow-clu
ster-7f19e998ebb
[2] Setup and Configure Multi Node Airflow Cluster with HDP Ambari and Celery for Data Pipelines.
[URL]
https://medium.com/@khatri_chetan/setup-and-configure-multi-node-airflow-cluster-with-hdp-ambari-
and-celery-for-data-pipelines-dc1e96f3d773
[3] Challenges and Struggle while Setting up Multi-Node Airflow Cluster
[URL]
https://medium.com/@khatri_chetan/how-to-setup-airflow-multi-node-cluster-with-celery-rabbitmq-cf
de7756bb6a
[4] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks.
https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-run
ning-tasks/
Questions ?
Thank you!
PyCon ZA Organizers and South Africa Python Community.

More Related Content

What's hot

Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Edureka!
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
aftab alam
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
Holden Karau
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Simplilearn
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
Craig Chao
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Databricks
 
Spark sql
Spark sqlSpark sql
Spark sql
Freeman Zhang
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
prateek kumar
 
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Edureka!
 
Apache spark
Apache sparkApache spark
Apache spark
TEJPAL GAUTAM
 
Improving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowImproving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache Arrow
Li Jin
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Mammoth Data
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital Decisions
Steven Gustafson
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
Jagadeesan A S
 

What's hot (20)

Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
Spark sql
Spark sqlSpark sql
Spark sql
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Improving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache ArrowImproving Pandas and PySpark interoperability with Apache Arrow
Improving Pandas and PySpark interoperability with Apache Arrow
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital Decisions
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 

Similar to PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - Spark
Steve Blackmon
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Data Con LA
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
Chetan Khatri
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
Sri Ambati
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Michael Rys
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 

Similar to PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow (20)

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Austin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - SparkAustin Data Meetup 092014 - Spark
Austin Data Meetup 092014 - Spark
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
 
Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015Machine Learning with H2O, Spark, and Python at Strata 2015
Machine Learning with H2O, Spark, and Python at Strata 2015
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 

More from Chetan Khatri

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Chetan Khatri
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Chetan Khatri
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
Chetan Khatri
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
Chetan Khatri
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
Chetan Khatri
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
Chetan Khatri
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
Chetan Khatri
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
Chetan Khatri
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
Chetan Khatri
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
Chetan Khatri
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
Chetan Khatri
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
Chetan Khatri
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
Chetan Khatri
 
Internet of things initiative-cskskv
Internet of things   initiative-cskskvInternet of things   initiative-cskskv
Internet of things initiative-cskskv
Chetan Khatri
 
High level architecture solar power plant
High level architecture solar power plantHigh level architecture solar power plant
High level architecture solar power plant
Chetan Khatri
 
Alumni talk-university-of-kachchh
Alumni talk-university-of-kachchhAlumni talk-university-of-kachchh
Alumni talk-university-of-kachchh
Chetan Khatri
 

More from Chetan Khatri (20)

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
 
Internet of things initiative-cskskv
Internet of things   initiative-cskskvInternet of things   initiative-cskskv
Internet of things initiative-cskskv
 
High level architecture solar power plant
High level architecture solar power plantHigh level architecture solar power plant
High level architecture solar power plant
 
Alumni talk-university-of-kachchh
Alumni talk-university-of-kachchhAlumni talk-university-of-kachchh
Alumni talk-university-of-kachchh
 

Recently uploaded

Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
janvikumar4133
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
Joel Ngushwai
 
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdfCMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
IndranilDasgupta19
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
kittycrispy617
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
tanupasswan6
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
45unexpected
 
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
kinni singh$A17
 
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
6459astrid
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
fatima shekh$A17
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
kinni singh$A17
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
revolutionary575
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Alexander Teggin
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
girewiy968
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
huseindihon
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Samuel Jackson
 
ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
MinThetLwin1
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
NABLAS株式会社
 

Recently uploaded (20)

Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
 
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdfCMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
CMO MRM_May 2024 WITH BREAKDOWN AND IMPROVEMENTDATA.pdf
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
 
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
Female Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service A...
 
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
 
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Premium Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
 
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
Noida Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Ava...
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
 
ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
 

PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow

  • 1. No more struggles with Apache Spark (PySpark) workloads in production Chetan Khatri, Solution Architect - Data Science. Accionlabs India. PyconZA 2019, The Wanderers Club in Illovo. Johannesburg, South Africa 11th Oct, 2019 Twitter: @khatri_chetan, Email: chetan.khatri@live.com chetan.khatri@accionlabs.com LinkedIn: https://www.linkedin.com/in/chetkhatri Github: chetkhatri
  • 2. Who am I? Solution Architect - Data Science @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang. Co-Authored University Curriculum @ University of Kachchh, India. Ex - Data Engineering @: Nazara Games, Eccella Corporation. Masters - Computer Science from University of Kachchh, India. Daily Activity? Functional Programming, Distributed Computing, Python, Scala, Haskell, Data Science, Product Development
  • 3. Helping organizations create innovative product and solutions using the emerging technologies An Innovation Focused Technology Services Firm Employees Clients Accelerators Global Offices Development Centers 2300+ 75+ 20+ 12+ 7
  • 4. Accion Labs - Introduction ● A Global Technology Services firm focussed Emerging Technologies ○ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients ● Profitable, venture-backed company ○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership ● Flexible Outcome-based Engagement Models ○ Projects, Extended teams, Shared IP, Co-development, Professional Services ● Framework Based Approach to Accelerate Digital Transformation ○ A collection of tools and frameworks, Breeze Digital Blueprint helps gain 25-30% efficiency ● Action-oriented Leadership Team ○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018 4
  • 5. Accion’s Emerging Tech Capabilities Adaptive UI, UX Engineering NLP, Voice Interface & Chat Bots Artificial Intelligence and Machine Learning Data Lake & Big Data Analytics Blockchain, Payment Technologies Cloud Strategy and Transformation Mobile Development MicroServices and Serverless Computing QA Engineering, RPA and DevOps Automation SFDC, ServiceNow, IBM Solutions, Azure 5
  • 6. Agenda ● Apache Spark ● Primary data structures (RDD, DataSet, Dataframe) ● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark. ● Parallel read from JDBC: Challenges and best practices. ● Bulk Load API vs JDBC write ● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin ● Avoid unnecessary shuffle ● Optimize Spark stage generation plan ● Predicate pushdown with partitioning and bucketing ● Airflow DAG scheduling for Apache Spark worflow. - Design, Architecture, Demo.
  • 7. What is Apache Spark? ● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data processing. ● It provides high level API for Scala, Java, Python and R and optimized engine that supports general execution graphs. Structured Data / SQL - Spark SQL Graph Processing - GraphX Machine Learning - MLlib Streaming - Spark Streaming, Structured Streaming
  • 9. 1. Distributed Data Abstraction RDD RDD RDD RDD Logical Model Across Distributed Storage on Cluster HDFS, S3
  • 10. 2. Resilient & Immutable RDD RDD RDD T T RDD -> T -> RDD -> T -> RDD T = Transformation
  • 11. 3. Compile-time Type Safe / Strongly type inference Integer RDD String or Text RDD Double or Binary RDD
  • 12. 4. Lazy evaluation RDD RDD RDD T T RDD RDD RDD T A RDD - T - RDD - T - RDD - T - RDD - A - RDD T = Transformation A = Action
  • 14. Essential Spark Operations TRANSFORMATIONSACTIONS General Math / Statistical Set Theory / Relational Data Structure / I/O map gilter flatMap mapPartitions mapPartitionsWithIndex groupBy sortBy sample randomSplit union intersection subtract distinct cartesian zip keyBy zipWithIndex zipWithUniqueID zipPartitions coalesce repartition repartitionAndSortWithinPartitions pipe reduce collect aggregate fold first take forEach top treeAggregate treeReduce forEachPartition collectAsMap count takeSample max min sum histogram mean variance stdev sampleVariance countApprox countApproxDistinct takeOrdered saveAsTextFile saveAsSequenceFile saveAsObjectFile saveAsHadoopDataset saveAsHadoopFile saveAsNewAPIHadoopDataset saveAsNewAPIHadoopFile
  • 15. When to use RDDs ? You care about control of dataset and knows how data looks like, you care about low level API. Don’t care about lot’s of lambda functions than DSL. Don’t care about Schema or Structure of Data. Don’t care about optimization, performance & inefficiencies! Very slow for non-JVM languages like Python, R. Don’t care about Inadvertent inefficiencies.
  • 18. Structured APIs in Apache Spark SQL DataFrames Datasets Syntax Errors Runtime Compile Time Compile Time Analysis Errors Runtime Runtime Compile Time Analysis errors are caught before a job runs on cluster
  • 19. DataFrame API Code // convert RDD -> DF with column names parsedDF = parsedRDD.toDF("project", "sprint", "numStories") // filter, groupBy, sum, and then agg() parsedDF.filter(lambda x: x[1] === "finance") .groupBy("sprint") .agg(sum("numStories").as("count")) .limit(100) .show(100) project sprint numStories finance 3 20 finance 4 22
  • 20. DataFrame -> SQL View -> SQL Query parsedDF.createOrReplaceTempView("audits") results = spark.sql( """SELECT sprint, sum(numStories) AS count FROM audits WHERE project = 'finance' GROUP BY sprint LIMIT 100""") results.show(100) project sprint numStories finance 3 20 finance 4 22
  • 21. Catalyst in Spark SQL AST DataFrame Datasets Unresolved Logical Plan Logical Plan Optimized Logical Plan Physical Plans CostModel Selected Physical Plan RDD
  • 22. Example: DataFrame Optimization employees.join(events, employees("id") === events("eid")) .filter(events("date") > "2015-01-01") events file employees table join filter Logical Plan scan (employees) filter Scan (events) join Physical Plan Optimized scan (events) Optimized scan (employees) join Physical Plan With Predicate Pushdown and Column Pruning
  • 23. DataFrames are Faster than RDDs Source: Databricks
  • 25. Spark Internals terminology Job - Each transformation and action mapping in Spark would create a separate jobs. Stage - A Set of task in each job which can run parallel using ThreadPoolExecutor. Task - Lowest level of Concurrent and Parallel execution Unit. Each stage is split into #number-of-partitions tasks, i.e Number of Tasks = stage * number of partitions in the stage
  • 30. Spark on Yarn Internals terminology yarn.scheduler.minimum-allocation-vcores = 1 Yarn.scheduler.maximum-allocation-vcores = 6 Yarn.scheduler.minimum-allocation-mb = 4096 Yarn.scheduler.maximum-allocation-mb = 28832 Yarn.nodemanager.resource.memory-mb = 54000 Number of max containers you can run = (Yarn.nodemanager.resource.memory-mb = 54000 / Yarn.scheduler.minimum-allocation-mb = 4096) = 13
  • 31. Spark on Yarn Internals terminology
  • 37. Parallel read from JDBC: Challenges and best practices.
  • 38. Spark JDBC Read What happens when you run this code? What would be the impact at Database engine side?
  • 39. Spark JDBC Read: Impact on Database engine e.g MSSQL Server
  • 40. Spark JDBC Read: Impact on Database engine e.g MSSQL Server
  • 44. Impact on Database after Spark Parallel Read
  • 45. Bulk Load API vs JDBC write
  • 46. Bulk Load API vs JDBC write
  • 47. Bulk Load API vs JDBC write
  • 48. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin JoinSelection execution planning strategy uses spark.sql.autoBroadcastJoinThreshold property (default: 10M) to control the size of a dataset before broadcasting it to all worker nodes when performing a join. # check broadcast join threshold >>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) / 1024 / 1024 10 # logical plan with tree numbered sampleDF.queryExecution.logical.numberedTreeString # Query plan sampleDF.explain
  • 49. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin Repartition: Boost the Parallelism, by increasing the number of Partitions. Partition on Joins, to get same key joins faster. // Reduce number of partitions without shuffling, where repartition does equal data shuffling across the cluster. employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT")) For example, In case of bulk JDBC write. Parameter "bulkCopyBatchSize" -> "2500", means Dataframe has 10 partitions and each partition will write 2500 records Parallely. Reduce: Impact on Network Communication, File I/O, Network I/O, Bandwidth I/O etc.
  • 50. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin 1. // disable autoBroadcastJoin spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) 2. // Order doesn't matter table1.leftjoin(table2) or table2.leftjoin(table1) 3. force broadcast, if one DataFrame is not small! 4. Minimize shuffling & Boost Parallism, Partitioning, Bucketing, coalesce, repartition, HashPartitioner
  • 51. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
  • 57. Spark Submit Hyper-parameters and Dynamic Allocation ./bin/spark-submit --conf spark.yarn.maxAppAttempts=1 --name PyConLT19 --master yarn --deploy-mode cluster --driver-memory 18g --executor-memory 24g --num-executors 4 --executor-cores 6 --conf spark.yarn.maxAppAttempts=1 --conf spark.speculation=false --conf spark.broadcast.compress=true --conf spark.sql.broadcastTimeout=36000 --conf spark.network.timeout=2500s --conf spark.dynamicAllocation.executorAllocationRatio=1 --conf spark.executor.heartbeatInterval=30s --conf spark.dynamicAllocation.executorIdleTimeout=60s --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s --conf spark.network.timeout=1200s --conf spark.dynamicAllocation.schedulerBacklogTimeout=15s --conf spark.yarn.maxAppAttempts=1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=True --conf spark.dynamicAllocation.minExecutors=2 --conf spark.dynamicAllocation.initialExecutors=2 --conf spark.dynamicAllocation.maxExecutors=6 examples/src/main/python/pi.py
  • 58. Case Study: High Level Architecture OLTP Shadow Data Source Apache Spark Spark SQL Sqoop HDFS Parquet Yarn Cluster manager Customer Specific Reporting DB Bulk Load Parallelism Orchestration: Airflow
  • 60. Key role of Apache Airflow for Scheduling Data Pipelines Codebase: https://github.com/chetkhatri/getting-started-airflow-for-spark
  • 61. Trigger the Airflow DAG from API curl -d ' {"conf":"{"retail_id":"29" , "env_type":"dev", "size_is":"medium"}", "run_id": "retailer_1111"}' -H "Content-Type: application/json" -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin Spark Submit Operator inherited from BashOperator https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_s ubmit_operator.py
  • 81. References [1] How to Setup Airflow Multi-Node Cluster with Celery & RabbitMQ. [URL] https://medium.com/@khatri_chetan/challenges-and-struggle-while-setting-up-multi-node-airflow-clu ster-7f19e998ebb [2] Setup and Configure Multi Node Airflow Cluster with HDP Ambari and Celery for Data Pipelines. [URL] https://medium.com/@khatri_chetan/setup-and-configure-multi-node-airflow-cluster-with-hdp-ambari- and-celery-for-data-pipelines-dc1e96f3d773 [3] Challenges and Struggle while Setting up Multi-Node Airflow Cluster [URL] https://medium.com/@khatri_chetan/how-to-setup-airflow-multi-node-cluster-with-celery-rabbitmq-cf de7756bb6a [4] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks. https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-run ning-tasks/
  • 83. Thank you! PyCon ZA Organizers and South Africa Python Community.