SlideShare a Scribd company logo
Chetan Khatri, Lead - Data Science. Accionlabs India.
Scala Toronto Group
24th July, 2019.
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
M.Sc. - Computer Science from University of Kachchh, India.
An Innovation Focused
Technology Services
Firm
2100+
70+
20+
12+
6
● A Global Technology Services firm focussed Emerging Technologies
○ 12 offices, 6 dev centers, 2100+ employees, 70+ active clients
● Profitable, venture-backed company
○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership
● Flexible Outcome-based Engagement Models
○ Projects, Extended teams, Shared IP, Co-development, Professional Services
● Framework Based Approach to Accelerate Digital Transformation
○ A collection of tools and frameworks, Breeze Digital Blueprint helps gain 25-30% efficiency
● Action-oriented Leadership Team
○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Alternative to spark default sort
● Why dropDuplicates() doesn’t result consistency, What is alternative
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Why not to use Scala Concurrent ‘Future’ explicitly!
● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data
processing.
● It provides high level API for Scala, Java, Python and R and optimized engine that supports general
execution graphs.
Structured Data / SQL - Spark SQL Graph Processing - GraphX
Machine Learning - MLlib Streaming - Spark Streaming,
Structured Streaming
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
RDD RDD RDD RDD
Logical Model Across Distributed Storage on Cluster
HDFS, S3
RDD RDD RDD
T T
RDD -> T -> RDD -> T -> RDD
T = Transformation
Integer RDD
String or Text RDD
Double or Binary RDD
RDD RDD RDD
T T
RDD RDD RDD
T A
RDD - T - RDD - T - RDD - T - RDD - A - RDD
T = Transformation
A = Action
Operations
Transformation
Action
TRANSFORMATIONSACTIONS
General Math / Statistical Set Theory / Relational Data Structure / I/O
map
gilter
flatMap
mapPartitions
mapPartitionsWithIndex
groupBy
sortBy
sample
randomSplit
union
intersection
subtract
distinct
cartesian
zip
keyBy
zipWithIndex
zipWithUniqueID
zipPartitions
coalesce
repartition
repartitionAndSortWithinPartitions
pipe
reduce
collect
aggregate
fold
first
take
forEach
top
treeAggregate
treeReduce
forEachPartition
collectAsMap
count
takeSample
max
min
sum
histogram
mean
variance
stdev
sampleVariance
countApprox
countApproxDistinct
takeOrdered
saveAsTextFile
saveAsSequenceFile
saveAsObjectFile
saveAsHadoopDataset
saveAsHadoopFile
saveAsNewAPIHadoopDataset
saveAsNewAPIHadoopFile
You care about control of dataset and knows how data looks like, you care
about low level API.
Don’t care about lot’s of lambda functions than DSL.
Don’t care about Schema or Structure of Data.
Don’t care about optimization, performance & inefficiencies!
Very slow for non-JVM languages like Python, R.
Don’t care about Inadvertent inefficiencies.
parsedRDD.filter { case (project, sprint, numStories) => project == "finance" }.
map { case (_, sprint, numStories) => (sprint, numStories) }.
reduceByKey(_ + _).
filter { case (sprint, _) => !isSpecialSprint(sprint) }.
take(100).foreach { case (project, stories) => println(s"project: $stories") }
val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
case class Employee(name: String, age: Int)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain
objects with compiled lambda
functions.
DataFrames
Datasets
Strongly Typing
Ability to use powerful lambda functions.
Spark SQL’s optimized execution engine (catalyst, tungsten)
Can be constructed from JVM objects & manipulated using Functional
transformations (map, filter, flatMap etc)
A DataFrame is a Dataset organized into named columns
DataFrame is simply a type alias of Dataset[Row]
SQL DataFrames Datasets
Syntax Errors Runtime Compile Time Compile Time
Analysis Errors Runtime Runtime Compile Time
Analysis errors are caught before a job runs on cluster
DataFrame
Dataset
Untyped API
Typed API
Dataset
(2016)
DataFrame = Dataset [Row]
Alias
Dataset [T]
// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
//filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance").
groupBy($"sprint").
agg(sum($"numStories").as("count")).
limit(100).
show(100)
project sprint numStories
finance 3 20
finance 4 22
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
SQL AST
DataFrame
Datasets
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
Physical
Plans
CostModel
Selected
Physical
Plan
RDD
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
events file
employees
table
join
filter
Logical Plan
scan
(employees)
filter
Scan
(events)
join
Physical Plan
Optimized
scan
(events)
Optimized
scan
(employees)
join
Physical Plan
With Predicate Pushdown
and Column Pruning
Source: Databricks
Source: Databricks
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Executors
Cores
Containers
Stage
Job
Task
Job - Each transformation and action mapping in Spark would create a separate jobs.
Stage - A Set of task in each job which can run parallel using ThreadPoolExecutor.
Task - Lowest level of Concurrent and Parallel execution Unit.
Each stage is split into #number-of-partitions tasks,
i.e Number of Tasks = stage * number of partitions in the stage
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
yarn.scheduler.minimum-allocation-vcores = 1
Yarn.scheduler.maximum-allocation-vcores = 6
Yarn.scheduler.minimum-allocation-mb = 4096
Yarn.scheduler.maximum-allocation-mb = 28832
Yarn.nodemanager.resource.memory-mb = 54000
Number of max containers you can run = (Yarn.nodemanager.resource.memory-mb = 54000 /
Yarn.scheduler.minimum-allocation-mb = 4096) = 13
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
~14 Years old written ETL platform in 2 .sql large files (with 20,000 lines each) to
Scala, Spark and Airflow based Highly Scalable, Non-Blocking reengineering
Implementation.
Technologies Stack:
Programming Languages Scala, Python
Data Access Library Spark-Core, Spark-SQL, Sqoop
Data Processing engine Apache Spark
Orchestration Airflow
Cluster Management Yarn
Storage HDFS (Parquet)
Hadoop Distribution Hortonworks
High level Architecture
OLTP
Shadow
Data
Source
Apache
Spark
Spark
SQL
Sqoop
HDFS
Parquet
Yarn Cluster manager
Customer
Specific
Reporting
DB
Bulk
Load
Parallelism Orchestration: Airflow
Apache Airflow - DAG Overview
Avoid Joins (Before ..)
Avoid Joins (After ..)
Join plans
1. Size of the
Dataframe
2. Number of records
in Dataframe
Compare to primary
DataFrame.
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
What happens when you run this code?
What would be the impact at Database engine side?
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
JoinSelection execution planning strategy uses
spark.sql.autoBroadcastJoinThreshold property (default: 10M) to control the size
of a dataset before broadcasting it to all worker nodes when performing a join.
val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt
scala> threshold / 1024 / 1024
res0: Int = 10
// logical plan with tree numbered
sampleDF.queryExecution.logical.numberedTreeString
// Query plan
sampleDF.explain
Repartition: Boost the Parallelism, by increasing the number of Partitions. Partition on Joins, to get
same key joins faster.
// Reduce number of partitions without shuffling, where repartition does equal data shuffling across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, In case of bulk JDBC write. Parameter "bulkCopyBatchSize" -> "2500", means Dataframe has 10 partitions
and each partition will write 2500 records Parallely.
Reduce: Impact on Network Communication, File I/O, Network I/O, Bandwidth I/O etc.
1. // disable autoBroadcastJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // Order doesn't matter
table1.leftjoin(table2) or table2.leftjoin(table1)
3. force broadcast, if one DataFrame is not small!
4. Minimize shuffling & Boost Parallelism, Partitioning, Bucketing, coalesce, repartition,
HashPartitioner
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Task - controlling possibly lazy & asynchronous computations, useful for controlling side-effects.
Coeval - controlling synchronous, possibly lazy evaluation, useful for describing lazy expressions
and for controlling side-effects.
Code!
Cats Family
Http4 Doobie Circe FS2 PureConfig
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

More Related Content

What's hot

Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
Carol McDonald
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
Paco Nathan
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
Yan Zhou
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
Data Con LA
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
Spark Summit
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?
ArangoDB Database
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
jeykottalam
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Apache Spark Side of Funnels
Apache Spark Side of FunnelsApache Spark Side of Funnels
Apache Spark Side of Funnels
Databricks
 
Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...
Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...
Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...
Databricks
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
Paco Nathan
 
Custom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBCustom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDB
ArangoDB Database
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Databricks
 
Pivotal OSS meetup - MADlib and PivotalR
Pivotal OSS meetup - MADlib and PivotalRPivotal OSS meetup - MADlib and PivotalR
Pivotal OSS meetup - MADlib and PivotalR
go-pivotal
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Paco Nathan
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Julian Hyde
 

What's hot (20)

Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Apache Spark Side of Funnels
Apache Spark Side of FunnelsApache Spark Side of Funnels
Apache Spark Side of Funnels
 
Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...
Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...
Neo4j Morpheus: Interweaving Documents, Tables and and Graph Data in Spark wi...
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Custom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDBCustom Pregel Algorithms in ArangoDB
Custom Pregel Algorithms in ArangoDB
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
 
Pivotal OSS meetup - MADlib and PivotalR
Pivotal OSS meetup - MADlib and PivotalRPivotal OSS meetup - MADlib and PivotalR
Pivotal OSS meetup - MADlib and PivotalR
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 

Similar to ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
IndicThreads
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
Khalid Salama
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
Data Science Thailand
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
Databricks
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
DataStax Academy
 

Similar to ScalaTo July 2019 - No more struggles with Apache Spark workloads in production (20)

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 

More from Chetan Khatri

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Chetan Khatri
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Chetan Khatri
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
Chetan Khatri
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
Chetan Khatri
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
Chetan Khatri
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
Chetan Khatri
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
Chetan Khatri
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
Chetan Khatri
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
Chetan Khatri
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
Chetan Khatri
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
Chetan Khatri
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
Chetan Khatri
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
Chetan Khatri
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
Chetan Khatri
 
Internet of things initiative-cskskv
Internet of things   initiative-cskskvInternet of things   initiative-cskskv
Internet of things initiative-cskskv
Chetan Khatri
 
High level architecture solar power plant
High level architecture solar power plantHigh level architecture solar power plant
High level architecture solar power plant
Chetan Khatri
 
Alumni talk-university-of-kachchh
Alumni talk-university-of-kachchhAlumni talk-university-of-kachchh
Alumni talk-university-of-kachchh
Chetan Khatri
 

More from Chetan Khatri (20)

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
 
Internet of things initiative-cskskv
Internet of things   initiative-cskskvInternet of things   initiative-cskskv
Internet of things initiative-cskskv
 
High level architecture solar power plant
High level architecture solar power plantHigh level architecture solar power plant
High level architecture solar power plant
 
Alumni talk-university-of-kachchh
Alumni talk-university-of-kachchhAlumni talk-university-of-kachchh
Alumni talk-university-of-kachchh
 

Recently uploaded

🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
kuldeepsharmaks8120
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
uapta
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
girewiy968
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
vrvipin164
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
zyqedad
 
High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...
High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...
High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...
saadkhan1485265
 
Universidad Camilo José Cela degree offer diploma Transcript
Universidad Camilo José Cela  degree offer diploma TranscriptUniversidad Camilo José Cela  degree offer diploma Transcript
Universidad Camilo José Cela degree offer diploma Transcript
taqyea
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)
Alireza Kamrani
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
huseindihon
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
birajmohan012
 
Data analytics and Access Program Recommendations
Data analytics and Access Program RecommendationsData analytics and Access Program Recommendations
Data analytics and Access Program Recommendations
hemantsharmaus
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
GaneshGanesh399816
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
chetankumar9855
 
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).docbai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
PhngThLmHnh
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
Universidad de Valladolid degree offer diploma Transcript
Universidad de Valladolid  degree offer diploma TranscriptUniversidad de Valladolid  degree offer diploma Transcript
Universidad de Valladolid degree offer diploma Transcript
taqyea
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
ASISHSABAT3
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
kihus38
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
TARIKU ENDALE
 

Recently uploaded (20)

🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
CHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptxCHAPTER-1-Introduction-to-Marketing.pptx
CHAPTER-1-Introduction-to-Marketing.pptx
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
 
High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...
High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...
High Girls Call Nagpur 000XX00000 Provide Best And Top Girl Service And No1 i...
 
Universidad Camilo José Cela degree offer diploma Transcript
Universidad Camilo José Cela  degree offer diploma TranscriptUniversidad Camilo José Cela  degree offer diploma Transcript
Universidad Camilo José Cela degree offer diploma Transcript
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)Oracle Database Desupported Features on 23ai (Part A)
Oracle Database Desupported Features on 23ai (Part A)
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
 
Data analytics and Access Program Recommendations
Data analytics and Access Program RecommendationsData analytics and Access Program Recommendations
Data analytics and Access Program Recommendations
 
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECTMUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
MUMBAI MONTHLY RAINFALL CAPSTONE PROJECT
 
Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...Amul goes international: Desi dairy giant to launch fresh ...
Amul goes international: Desi dairy giant to launch fresh ...
 
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).docbai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
bai-tap-tieng-anh-lop-12-unit-4-the-mass-media (1).doc
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
Universidad de Valladolid degree offer diploma Transcript
Universidad de Valladolid  degree offer diploma TranscriptUniversidad de Valladolid  degree offer diploma Transcript
Universidad de Valladolid degree offer diploma Transcript
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
 

ScalaTo July 2019 - No more struggles with Apache Spark workloads in production

  • 1. Chetan Khatri, Lead - Data Science. Accionlabs India. Scala Toronto Group 24th July, 2019. Twitter: @khatri_chetan, Email: chetan.khatri@live.com chetan.khatri@accionlabs.com LinkedIn: https://www.linkedin.com/in/chetkhatri Github: chetkhatri
  • 2. Lead - Data Science, Technology Evangelist @ Accion labs India Pvt. Ltd. Contributor @ Apache Spark, Apache HBase, Elixir Lang. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. M.Sc. - Computer Science from University of Kachchh, India.
  • 3. An Innovation Focused Technology Services Firm 2100+ 70+ 20+ 12+ 6
  • 4. ● A Global Technology Services firm focussed Emerging Technologies ○ 12 offices, 6 dev centers, 2100+ employees, 70+ active clients ● Profitable, venture-backed company ○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership ● Flexible Outcome-based Engagement Models ○ Projects, Extended teams, Shared IP, Co-development, Professional Services ● Framework Based Approach to Accelerate Digital Transformation ○ A collection of tools and frameworks, Breeze Digital Blueprint helps gain 25-30% efficiency ● Action-oriented Leadership Team ○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
  • 5. ● Apache Spark ● Primary data structures (RDD, DataSet, Dataframe) ● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark. ● Parallel read from JDBC: Challenges and best practices. ● Bulk Load API vs JDBC write ● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin ● Avoid unnecessary shuffle ● Alternative to spark default sort ● Why dropDuplicates() doesn’t result consistency, What is alternative ● Optimize Spark stage generation plan ● Predicate pushdown with partitioning and bucketing ● Why not to use Scala Concurrent ‘Future’ explicitly!
  • 6. ● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data processing. ● It provides high level API for Scala, Java, Python and R and optimized engine that supports general execution graphs. Structured Data / SQL - Spark SQL Graph Processing - GraphX Machine Learning - MLlib Streaming - Spark Streaming, Structured Streaming
  • 8. RDD RDD RDD RDD Logical Model Across Distributed Storage on Cluster HDFS, S3
  • 9. RDD RDD RDD T T RDD -> T -> RDD -> T -> RDD T = Transformation
  • 10. Integer RDD String or Text RDD Double or Binary RDD
  • 11. RDD RDD RDD T T RDD RDD RDD T A RDD - T - RDD - T - RDD - T - RDD - A - RDD T = Transformation A = Action
  • 13. TRANSFORMATIONSACTIONS General Math / Statistical Set Theory / Relational Data Structure / I/O map gilter flatMap mapPartitions mapPartitionsWithIndex groupBy sortBy sample randomSplit union intersection subtract distinct cartesian zip keyBy zipWithIndex zipWithUniqueID zipPartitions coalesce repartition repartitionAndSortWithinPartitions pipe reduce collect aggregate fold first take forEach top treeAggregate treeReduce forEachPartition collectAsMap count takeSample max min sum histogram mean variance stdev sampleVariance countApprox countApproxDistinct takeOrdered saveAsTextFile saveAsSequenceFile saveAsObjectFile saveAsHadoopDataset saveAsHadoopFile saveAsNewAPIHadoopDataset saveAsNewAPIHadoopFile
  • 14. You care about control of dataset and knows how data looks like, you care about low level API. Don’t care about lot’s of lambda functions than DSL. Don’t care about Schema or Structure of Data. Don’t care about optimization, performance & inefficiencies! Very slow for non-JVM languages like Python, R. Don’t care about Inadvertent inefficiencies.
  • 15. parsedRDD.filter { case (project, sprint, numStories) => project == "finance" }. map { case (_, sprint, numStories) => (sprint, numStories) }. reduceByKey(_ + _). filter { case (sprint, _) => !isSpecialSprint(sprint) }. take(100).foreach { case (project, stories) => println(s"project: $stories") }
  • 16. val employeesDF = spark.read.json("employees.json") // Convert data to domain objects. case class Employee(name: String, age: Int) val employeesDS: Dataset[Employee] = employeesDF.as[Employee] val filterDS = employeesDS.filter(p => p.age > 3) Type-safe: operate on domain objects with compiled lambda functions.
  • 18. Strongly Typing Ability to use powerful lambda functions. Spark SQL’s optimized execution engine (catalyst, tungsten) Can be constructed from JVM objects & manipulated using Functional transformations (map, filter, flatMap etc) A DataFrame is a Dataset organized into named columns DataFrame is simply a type alias of Dataset[Row]
  • 19. SQL DataFrames Datasets Syntax Errors Runtime Compile Time Compile Time Analysis Errors Runtime Runtime Compile Time Analysis errors are caught before a job runs on cluster
  • 21. // convert RDD -> DF with column names val parsedDF = parsedRDD.toDF("project", "sprint", "numStories") //filter, groupBy, sum, and then agg() parsedDF.filter($"project" === "finance"). groupBy($"sprint"). agg(sum($"numStories").as("count")). limit(100). show(100) project sprint numStories finance 3 20 finance 4 22
  • 22. parsedDF.createOrReplaceTempView("audits") val results = spark.sql( """SELECT sprint, sum(numStories) AS count FROM audits WHERE project = 'finance' GROUP BY sprint LIMIT 100""") results.show(100) project sprint numStories finance 3 20 finance 4 22
  • 23. // DataFrame data.groupBy("dept").avg("age") // SQL select dept, avg(age) from data group by 1 // RDD data.map { case (dept, age) => dept -> (age, 1) } .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) } .map { case (dept, (age, c)) => dept -> age / c }
  • 24. SQL AST DataFrame Datasets Unresolved Logical Plan Logical Plan Optimized Logical Plan Physical Plans CostModel Selected Physical Plan RDD
  • 25. employees.join(events, employees("id") === events("eid")) .filter(events("date") > "2015-01-01") events file employees table join filter Logical Plan scan (employees) filter Scan (events) join Physical Plan Optimized scan (events) Optimized scan (employees) join Physical Plan With Predicate Pushdown and Column Pruning
  • 30. Job - Each transformation and action mapping in Spark would create a separate jobs. Stage - A Set of task in each job which can run parallel using ThreadPoolExecutor. Task - Lowest level of Concurrent and Parallel execution Unit. Each stage is split into #number-of-partitions tasks, i.e Number of Tasks = stage * number of partitions in the stage
  • 35. yarn.scheduler.minimum-allocation-vcores = 1 Yarn.scheduler.maximum-allocation-vcores = 6 Yarn.scheduler.minimum-allocation-mb = 4096 Yarn.scheduler.maximum-allocation-mb = 28832 Yarn.nodemanager.resource.memory-mb = 54000 Number of max containers you can run = (Yarn.nodemanager.resource.memory-mb = 54000 / Yarn.scheduler.minimum-allocation-mb = 4096) = 13
  • 42. ~14 Years old written ETL platform in 2 .sql large files (with 20,000 lines each) to Scala, Spark and Airflow based Highly Scalable, Non-Blocking reengineering Implementation.
  • 43. Technologies Stack: Programming Languages Scala, Python Data Access Library Spark-Core, Spark-SQL, Sqoop Data Processing engine Apache Spark Orchestration Airflow Cluster Management Yarn Storage HDFS (Parquet) Hadoop Distribution Hortonworks
  • 44. High level Architecture OLTP Shadow Data Source Apache Spark Spark SQL Sqoop HDFS Parquet Yarn Cluster manager Customer Specific Reporting DB Bulk Load Parallelism Orchestration: Airflow
  • 45. Apache Airflow - DAG Overview
  • 48. Join plans 1. Size of the Dataframe 2. Number of records in Dataframe Compare to primary DataFrame.
  • 50. What happens when you run this code? What would be the impact at Database engine side?
  • 60. JoinSelection execution planning strategy uses spark.sql.autoBroadcastJoinThreshold property (default: 10M) to control the size of a dataset before broadcasting it to all worker nodes when performing a join. val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold").toInt scala> threshold / 1024 / 1024 res0: Int = 10 // logical plan with tree numbered sampleDF.queryExecution.logical.numberedTreeString // Query plan sampleDF.explain
  • 61. Repartition: Boost the Parallelism, by increasing the number of Partitions. Partition on Joins, to get same key joins faster. // Reduce number of partitions without shuffling, where repartition does equal data shuffling across the cluster. employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT")) For example, In case of bulk JDBC write. Parameter "bulkCopyBatchSize" -> "2500", means Dataframe has 10 partitions and each partition will write 2500 records Parallely. Reduce: Impact on Network Communication, File I/O, Network I/O, Bandwidth I/O etc.
  • 62. 1. // disable autoBroadcastJoin spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) 2. // Order doesn't matter table1.leftjoin(table2) or table2.leftjoin(table1) 3. force broadcast, if one DataFrame is not small! 4. Minimize shuffling & Boost Parallelism, Partitioning, Bucketing, coalesce, repartition, HashPartitioner
  • 73. Task - controlling possibly lazy & asynchronous computations, useful for controlling side-effects. Coeval - controlling synchronous, possibly lazy evaluation, useful for describing lazy expressions and for controlling side-effects. Code!
  • 74. Cats Family Http4 Doobie Circe FS2 PureConfig