Luca Canali, CERN
Apache Spark for RDBMS
Practitioners:
How I Learned to Stop Worrying
and Love to Scale
#SAISDev11
About Luca
• Data Engineer and team lead at CERN
– Hadoop and Spark service, database services
– 18+ years of experience with data(base) services
– Performance, architecture, tools, internals
• Sharing and community
– Blog, notes, tools, contributions to Apache Spark
@LucaCanaliDB – http://cern.ch/canali
2#SAISDev11
The Large Hadron Collider (LHC)
3#SAISDev11
LHC and Data
4#SAISDev11
LHC data processing
teams (WLCG) have
built custom solutions
able to operate at
very large scale (data
scale and world-wide
operations)
Overview of Data at CERN
• Physics Data ~ 300 PB -> EB
• “Big Data” on Spark and Hadoop ~10 PB
– Analytics for accelerator controls and logging
– Monitoring use cases, including the use of Spark Streaming
– Analytics on aggregated logs
– Explorations on the use of Spark for high energy physics
• Relational databases ~ 2 PB
– Control systems, metadata for physics data
processing, administration and general-purpose
5#SAISDev11
Apache Spark @ CERN
6#SAISDev11
Cluster Name – Configuration – Software Version
Accelerator logging: 20 nodes (480 cores, 8 TB RAM, 5 PB storage, 96 GB SSD), Spark 2.2.2 – 2.3.1
General Purpose: 48 nodes (892 cores, 7.5 TB RAM, 6 PB storage), Spark 2.2.2 – 2.3.1
Development cluster: 14 nodes (196 cores, 768 GB RAM, 2.15 PB storage), Spark 2.2.2 – 2.3.1
ATLAS Event Index: 18 nodes (288 cores, 912 GB RAM, 1.29 PB storage), Spark 2.2.2 – 2.3.1
• Spark on YARN/HDFS
• In total ~1850 physical cores and 15 PB capacity
• Spark on Kubernetes
• Deployed on CERN cloud using OpenStack
• See also, at this conference: “Experience of Running Spark on
Kubernetes on OpenStack for High Energy Physics Workloads”
Motivations/Outline of the
Presentation
• Report on the experience of using Apache Spark in the CERN DB group
– Value of Spark for data coming from DBs
• Highlights of learning experience
7#SAISDev11
[Diagram: the Data Engineer role bridging DBA/Dev and Big Data]
Goals
• Entice more DBAs and RDBMS Devs into joining the Spark community
• Focus on added value of scalable platforms and
experience on RDBMS/data practices
– SQL/DataFrame API very valuable for many data workloads
– Performance instrumentation of the code is key: use metrics and tools beyond simple time measurement
8#SAISDev11
Spark, Swiss Army Knife of Big Data
One tool, many uses
• Data extraction and manipulation
• Large ecosystem of data sources
• Engine for distributed computing
• Runs SQL
• Streaming
• Machine Learning
• GraphX
• ..
9#SAISDev11
Image Credit: Vector Pocket Knife from Clipart.me
Problem We Want to Solve
• Running reports on relational data
– DB optimized for transactions and online workloads
– Reports can overload the storage system
• RDBMS are specialized
– for random I/O, CPU, storage and memory
– HW and SW are optimized, at a premium, for performance and availability
10#SAISDev11
Offload from RDBMS
• Building hybrid transactional and reporting
systems is expensive
– Solution: offload to a system optimized for capacity
– In our case: move data to Hadoop + SQL-based engine
11#SAISDev11
[Diagram: transactions run on the RDBMS; ETL queries move data to the offloaded system, where report queries run]
Spark Read from RDBMS via JDBC
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@server:port/service")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "DBTABLE")
  .option("user", "XX").option("password", "YY")
  .option("fetchsize", 10000)
  .option("sessionInitStatement", preambleSQL)
  .load()
12#SAISDev11
Optional DB-specific
Optimizations
SPARK-21519
option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""")
Challenges Reading from RDBMS
• A significant amount of CPU is consumed by serialization/deserialization when transferring to/from the RDBMS data format
– Particularly important for tables with short rows
• Need to take care of data partitioning, parallelism of the data extraction queries, etc. (see the sketch after this slide)
– A few operational challenges
– Configure parallelism to run at speed
– But also be careful not to overload the source when reading
13#SAISDev11
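A minimal sketch of a partitioned JDBC read, to illustrate the parallelism point above; these are standard Spark JDBC options, but the column name and bound values are hypothetical and need tuning for the actual source DB:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@server:port/service")
  .option("dbtable", "DBTABLE")
  .option("user", "XX").option("password", "YY")
  // Split the read into parallel tasks on a numeric (or date/timestamp) column:
  .option("partitionColumn", "ID")   // hypothetical column name
  .option("lowerBound", "1")         // min value of partitionColumn
  .option("upperBound", "1000000")   // max value of partitionColumn
  .option("numPartitions", "8")      // keep low enough not to overload the source
  .load()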
Apache Sqoop to Copy Data
• Sqoop is optimized to read from RDBMS
• Sqoop connector for Oracle DBs has several key
optimizations
• We use it for incremental refresh of data from
production RDBMS to HDFS “data lake”
– Typically we copy latest partition of the data
– Users run reporting jobs on offloaded data from YARN/HDFS
14#SAISDev11
Sqoop to Copy Data (from Oracle)
sqoop import \
--connect jdbc:oracle:thin:@server.cern.ch:port/service \
--username .. \
-P \
-Doraoop.chunk.method=ROWID \
--direct \
--fetch-size 10000 \
--num-mappers XX \
--target-dir XXXX/mySqoopDestDir \
--table TABLENAME \
--map-column-java FILEID=Integer,JOBID=Integer,CREATIONDATE=String \
--compress --compression-codec snappy \
--as-parquetfile
15#SAISDev11
Sqoop has a rich choice of options to customize the data transfer
Oracle-specific optimizations
Streaming and ETL
• Besides traditional ETL, streaming is also important and used in production use cases
16#SAISDev11
[Diagram: streaming data into Apache Kafka; one common Spark consumption pattern is sketched below]
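A minimal Structured Streaming sketch of consuming such a Kafka stream with Spark and landing it as Parquet; broker, topic and paths are hypothetical, while readStream/writeStream and the kafka source are standard Spark APIs:

val stream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // hypothetical broker
  .option("subscribe", "mytopic")                     // hypothetical topic
  .load()

// Land the stream as Parquet, e.g. on HDFS
stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/data/landing")                    // hypothetical path
  .option("checkpointLocation", "/data/checkpoint")
  .start()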
Spark and Data Formats
• Data formats are key for performance
• Columnar formats are a big boost for many
reporting use cases
– Parquet and ORC
– Column pruning
– Easier compression/encoding
• Partitioning and bucketing are also important (see the sketch below)
17#SAISDev11
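A write-side sketch of partitioning and bucketing with the standard DataFrameWriter API; paths, column names and the bucket count are hypothetical:

// Write Parquet partitioned by a date-like column
df.write
  .partitionBy("day")
  .parquet("/data/reports/mytable")

// Bucketing requires saveAsTable (the table goes through the metastore)
df.write
  .bucketBy(32, "id")
  .sortBy("id")
  .saveAsTable("mytable_bucketed")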
Data Access Optimizations
Use Apache Parquet or ORC
• Profit from column projection, filter push-down, compression and encoding, and partition pruning
18#SAISDev11
[Diagram: columnar layout (Col1..Col4) within Partition=1, with per-column min/max/count metadata; a read-side sketch follows]
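A quick read-side sketch of the same ideas (hypothetical path and column names): selecting few columns exploits column projection, and the where clause is a candidate for filter push-down and partition pruning; PushedFilters and PartitionFilters show up in the execution plan.

val sel = spark.read.parquet("/data/reports/mytable")
  .select("id", "value")          // column projection
  .where("day = '2018-10-01'")    // candidate for partition pruning / filter push-down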
Parquet Metadata Exploration
Tools for Parquet metadata exploration, useful for
learning and troubleshooting
19#SAISDev11
Useful tools:
parquet-tools
parquet-reader
Info at:
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Tools_Parquet_Diagnostics.md
SQL Now
After moving the data (ETL),
add the queries to the new system
Spark SQL/DataFrame API is a powerful engine
– Optimized for Parquet (ORC support recently improved)
– Partitioning, filter push-down
– Feature rich: code generation, and also a CBO (cost-based optimizer)
– Note: it's a rich ecosystem, options are available (we also used Impala and evaluated Presto)
20#SAISDev11
Spark SQL Interfaces: Many Choices
• PySpark, spark-shell,
Spark jobs
– in Python/Scala/Java
• Notebooks
– Jupyter
– Web service for data analysis at CERN: swan.cern.ch
• Thrift server
21#SAISDev11
Lessons Learned
Reading Parquet is CPU intensive
• Parquet format uses encoding and compression
• Processing Parquet is hungry for CPU cycles and memory bandwidth
• Test on a dual-socket server
– Up to 3.4 GB/s read throughput
22#SAISDev11
Benchmarking Spark SQL
Running benchmarks can be useful (and fun):
• See how the system behaves at scale
• Stress-test new infrastructure
• Experiment with monitoring and troubleshooting tools
TPCDS benchmark
• Popular and easy to set up, can run at scale
• See https://github.com/databricks/spark-sql-perf
23#SAISDev11
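For quick experiments outside the full spark-sql-perf harness, SparkSession.time can time a single query; the table name below is a TPCDS table, assuming the schema and data have already been generated:

spark.time {
  spark.sql("SELECT count(*) FROM store_sales").collect()  // prints the elapsed time
}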
Example from TPCDS Benchmark
24#SAISDev11
Lessons Learned from Running
TPCDS
Useful for commissioning new systems
• It has helped us catch some system parameter misconfigurations before production
• Comparing with other similar systems
• Many of the long-running TPCDS queries
– Use shuffle intensively
– Can profit from using SSDs for local dirs (shuffle)
• There are improvements (and regressions) in TPCDS queries across Spark versions
25#SAISDev11
Learning Something New About SQL
Similarities but also differences with RDBMS:
• Dataframe abstraction, vs. database table/view
• SQL or DataFrame operation syntax
// DataFrame operations
val df = spark.read.parquet("..path..")
df.select("id").groupBy("id").count().show()

// SQL
df.createOrReplaceTempView("t1")
spark.sql("select id, count(*) from t1 group by id").show()
26#SAISDev11
..and Something OLD About SQL
Using the SQL or DataFrame operations API is equivalent
• Just different ways to express the computation
• Look at execution plans to see what Spark executes (a quick explain() sketch follows the plan below)
– The Spark Catalyst optimizer will generate the same plan for the two examples
== Physical Plan ==
*(2) HashAggregate(keys=[id#98L], functions=[count(1)])
+- Exchange hashpartitioning(id#98L, 200)
+- *(1) HashAggregate(keys=[id#98L], functions=[partial_count(1)])
+- *(1) FileScan parquet [id#98L] Batched: true, Format: Parquet,
Location: InMemoryFileIndex[file:/tmp/t1.prq], PartitionFilters: [],
PushedFilters: [], ReadSchema: struct<id:bigint>
27#SAISDev11
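A quick way to print such plans from either API, with df and t1 as defined in the previous example:

// DataFrame version
df.select("id").groupBy("id").count().explain()

// SQL version: prints the same physical plan
spark.sql("select id, count(*) from t1 group by id").explain()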
Monitoring and Performance
Troubleshooting
Performance is key for data processing at scale
• Lessons learned from DB systems:
– Use metrics and time-based profiling
– They are a boon for rapidly pinpointing bottleneck resources
• Instrumentation and tools are essential
28#SAISDev11
Spark and Monitoring Tools
• Spark instrumentation
– Web UI
– REST API
– Eventlog
– Executor/Task Metrics
– Dropwizard metrics library
• Complement with
– OS tools
– For large clusters, deploy tools that ease working at cluster-level
• https://spark.apache.org/docs/latest/monitoring.html
29#SAISDev11
WEB UI
• Part of Spark: Info on Jobs, Stages, Executors, Metrics,
SQL,..
– Start with: point a web browser to driver_host, port 4040
30#SAISDev11
Spark Executor Metrics System
Metrics collected at execution time
Exposed by the Web UI / History Server, the EventLog, and custom tools such as
https://github.com/cerndb/sparkMeasure (usage sketched below)
31#SAISDev11
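A minimal sparkMeasure sketch, following the usage documented in the cerndb/sparkMeasure README (requires the sparkMeasure package on the classpath; the query is just a toy workload):

val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure {
  spark.sql("select count(*) from range(1000) cross join range(1000)").show()
}
// Prints aggregated stage metrics (run time, CPU time, shuffle metrics, ...) for the workload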
Spark Executor Metrics System
What is available with Spark Metrics, SPARK-25170
32#SAISDev11
Spark Executor Task Metric name – Short description
executorRunTime – Elapsed time the executor spent running this task. This includes time fetching shuffle data. The value is expressed in milliseconds.
executorCpuTime – CPU time the executor spent running this task. This includes time fetching shuffle data. The value is expressed in nanoseconds.
executorDeserializeTime – Elapsed time spent to deserialize this task. The value is expressed in milliseconds.
executorDeserializeCpuTime – CPU time taken on the executor to deserialize this task. The value is expressed in nanoseconds.
resultSize – The number of bytes this task transmitted back to the driver as the TaskResult.
jvmGCTime – Elapsed time the JVM spent in garbage collection while executing this task. The value is expressed in milliseconds.
+ I/O metrics, shuffle metrics, …
26 metrics available, see https://github.com/apache/spark/blob/master/docs/monitoring.md
Spark Metrics - Dashboards
Use the Dropwizard metrics library to produce dashboards measuring online:
• Number of active tasks, JVM metrics, CPU usage (SPARK-25228), Spark task metrics (SPARK-22190), etc.
33#SAISDev11
[Diagram: Dropwizard metrics exported via the Graphite sink]
Configure: --conf spark.metrics.conf=metrics.properties
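A minimal metrics.properties sketch enabling the Graphite sink; the sink class and parameter names are from the Spark monitoring documentation, while host and prefix values are placeholders:

# metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.mydomain.org
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark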
More Use Cases: Scale Up!
• Spark proved valuable and flexible
• Ecosystem and tooling in place
• Opens the way to exploring more use cases and… scaling up!
34#SAISDev11
Use Case: Spark for Physics
• Spark APIs for physics data analysis
• POC: processed 1 PB of data in 10 hours
35#SAISDev11
[Pipeline: input physics data in CERN/HEP storage -> read into Spark (spark-root connector) -> apply computation at scale (UDF) -> data reduction (for further analysis) -> histograms and plots of interest]
See also, at this conference: “HEP Data Processing with Apache Spark” and “CERN’s Next Generation Data
Analysis Platform with Apache Spark”
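A hedged sketch of the "apply computation at scale (UDF)" step in the pipeline above; the events DataFrame, column names and formula are hypothetical stand-ins, as the real physics computation is far more involved:

import org.apache.spark.sql.functions.{col, udf}

// Toy UDF standing in for a physics computation, applied row by row at scale
val toyMass = udf((pt1: Double, pt2: Double) => math.sqrt(2.0 * pt1 * pt2))
val reduced = events                        // events: DataFrame read via spark-root (hypothetical)
  .withColumn("mass", toyMass(col("pt1"), col("pt2")))
  .filter(col("mass") > 10.0)               // data reduction step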
Use Case: Machine Learning at
Scale
36#SAISDev11
[Pipeline: input labeled data and DL models from physics -> feature engineering at scale -> hyperparameter optimization (random/grid search) -> distributed model training -> output: particle selector model, AUC = 0.9867]
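A sketch of the grid-search pattern using the standard Spark ML tuning APIs; the actual work used DL models and feature engineering at scale, so the estimator and grid values here are purely illustrative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Grid of hyperparameter values to search over
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()

// Cross-validation selects the best combination by AUC
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderROC"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// val model = cv.fit(trainingData)  // trainingData: labeled DataFrame (hypothetical)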
Summary and High-Level View of
the Lessons Learned
37#SAISDev11
Transferrable Skills, Spark - DB
Familiar ground coming from RDBMS world
• SQL
• CBO
• Execution plans, hints, joins, grouping, windows, etc
• Partitioning
Performance troubleshooting, instrumentation, SQL
optimizations
• Carry over ideas and practices from DBs
38#SAISDev11
Learning Curve Coming from DB
• Manage a heterogeneous environment and usage: Python, Scala, Java, batch jobs, notebooks, etc.
• Spark is fundamentally a library, not a server component like an RDBMS engine
• Understand the differences between tables and DataFrame APIs
• Understand data formats
• Understand clustering and the environment (YARN, Kubernetes, …)
• Understand tooling and utilities in the ecosystem
39#SAISDev11
New Dimensions and Value
• With Spark you can easily scale up your SQL workloads
– Integrate with many sources (DBs and more) and storage (cloud storage solutions or Hadoop)
• Run DB and ML workloads at scale
- CERN and HEP are developing solutions with Spark for physics data analysis and for ML/DL at scale
• Quickly growing and open community
– You can comment on, propose and implement features, issues and patches
40#SAISDev11
Conclusions
- Apache Spark is very useful and versatile
- for many data workloads, including DB-like workloads
- DBAs and RDBMS Devs
- Can profit from Spark and its ecosystem to scale up some of their workloads from DBs and do more!
- Can bring experience and good practices over from the DB world
41#SAISDev11
Acknowledgements
• Members of Hadoop and Spark service at CERN
and HEP users community, CMS Big Data project,
CERN openlab
• Many lessons learned over the years from the
RDBMS community, notably www.oaktable.net
• Apache Spark community: helpful discussions and
support with PRs
• Links
– More info: @LucaCanaliDB – http://cern.ch/canali
42#SAISDev11
More Related Content

What's hot

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingMichael Rainey
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Continuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueContinuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueDatabricks
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuningSimon Huang
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Awr + 12c performance tuning
Awr + 12c performance tuningAwr + 12c performance tuning
Awr + 12c performance tuningAiougVizagChapter
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
 

What's hot (20)

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Continuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert XueContinuous Application with FAIR Scheduler with Robert Xue
Continuous Application with FAIR Scheduler with Robert Xue
 
Oracle db performance tuning
Oracle db performance tuningOracle db performance tuning
Oracle db performance tuning
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Awr + 12c performance tuning
Awr + 12c performance tuningAwr + 12c performance tuning
Awr + 12c performance tuning
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
 

Similar to Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale with Luca Canali

Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiDatabricks
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsDataWorks Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...DataStax
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platformmartinbpeters
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...Tim Vaillancourt
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 

Similar to Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale with Luca Canali (20)

Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Recently uploaded (20)

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale with Luca Canali

  • 1. Luca Canali, CERN Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale #SAISDev11
  • 2. About Luca • Data Engineer and team lead at CERN – Hadoop and Spark service, database services – 18+ years of experience with data(base) services – Performance, architecture, tools, internals • Sharing and community – Blog, notes, tools, contributions to Apache Spark @LucaCanaliDB – http://cern.ch/canali 2#SAISDev11
  • 3. 3#SAISDev11 The Large Hadron Collider (LHC) 3
  • 4. LHC and Data 4#SAISDev11 LHC data processing teams (WLCG) have built custom solutions able to operate at very large scale (data scale and world-wide operations)
  • 5. Overview of Data at CERN • Physics Data ~ 300 PB -> EB • “Big Data” on Spark and Hadoop ~10 PB – Analytics for accelerator controls and logging – Monitoring use cases, this includes use of Spark streaming – Analytics on aggregated logs – Explorations on the use of Spark for high energy physics • Relational databases ~ 2 PB – Control systems, metadata for physics data processing, administration and general-purpose 5#SAISDev11
  • 6. Apache Spark @ CERN 6#SAISDev11 Cluster Name Configuration Software Version Accelerator logging 20 nodes (Cores 480, Mem - 8 TB, Storage – 5 PB, 96GB in SSD) Spark 2.2.2 – 2.3.1 General Purpose 48 nodes (Cores – 892,Mem – 7.5TB,Storage – 6 PB) Spark 2.2.2 – 2.3.1 Development cluster 14 nodes (Cores – 196,Mem – 768GB,Storage – 2.15 PB) Spark 2.2.2 – 2.3.1 ATLAS Event Index 18 nodes (Cores – 288,Mem – 912GB,Storage – 1.29 PB) Spark 2.2.2 – 2.3.1 • Spark on YARN/HDFS • In total ~1850 physical cores and 15 PB capacity • Spark on Kubernetes • Deployed on CERN cloud using OpenStack • See also, at this conference: “Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics Workloads”
  • 7. Motivations/Outline of the Presentation • Report experience of using Apache Spark at CERN DB group – Value of Spark for data coming from DBs • Highlights of learning experience 7#SAISDev11 Data Engineer DBA/Dev Big Data
  • 8. Goals • Entice more DBAs and RDBMS Devs into entering Spark community • Focus on added value of scalable platforms and experience on RDBMS/data practices – SQL/DataFrame API very valuable for many data workloads – Performance instrumentation of the code is key: use metric and tools beyond simple time measurement 8#SAISDev11
  • 9. Spark, Swiss Army Knife of Big Data One tool, many uses • Data extraction and manipulation • Large ecosystem of data sources • Engine for distributed computing • Runs SQL • Streaming • Machine Learning • GraphX • .. 9#SAISDev11 Image Credit: Vector Pocket Knife from Clipart.me
  • 10. Problem We Want to Solve • Running reports on relational DATA – DB optimized for transactions and online – Reports can overload storage system • RDBMS specialized – for random I/O, CPU, storage and memory – HW and SW are optimized at a premium for performance and availability 10#SAISDev11
  • 11. Offload from RDBMS • Building hybrid transactional and reporting systems is expensive – Solution: offload to a system optimized for capacity – In our case: move data to Hadoop + SQL-based engine 11#SAISDev11 ETL Queries Report Queries Transactions on RDBMS
  • 12. Spark Read from RDBMS via JDBC val df = spark.read.format("jdbc") .option("url","jdbc:oracle:thin:@server:port/service”) .option("driver", "oracle.jdbc.driver.OracleDriver") .option("dbtable", "DBTABLE") .option("user", “XX").option("password", "YY") .option("fetchsize",10000) .option("sessionInitStatement",preambleSQL) .load() 12#SAISDev11 Optional DB-specific Optimizations SPARK-21519 option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""")
  • 13. Challenges Reading from RDMBs • Important amount of CPU consumed by serialization- deserialization transferring to/from RDMS data format – Particularly important for tables with short rows • Need to take care of data partitioning, parallelism of the data extraction queries, etc – A few operational challenges – Configure parallelism to run at speed – But also be careful not to overload the source when reading 13#SAISDev11
  • 14. Apache Sqoop to Copy Data • Scoop is optimized to read from RDBMS • Sqoop connector for Oracle DBs has several key optimizations • We use it for incremental refresh of data from production RDBMS to HDFS “data lake” – Typically we copy latest partition of the data – Users run reporting jobs on offloaded data from YARN/HDFS 14#SAISDev11
  • 15. Sqoop to Copy Data (from Oracle) sqoop import --connect jdbc:oracle:thin:@server.cern.ch:port/service --username .. -P -Doraoop.chunk.method=ROWID --direct --fetch-size 10000 --num-mappers XX --target-dir XXXX/mySqoopDestDir --table TABLESNAME --map-column-java FILEID=Integer,JOBID=Integer,CREATIONDATE=String --compress --compression-codec snappy --as-parquetfile 15#SAISDev11 Sqoop has a rich choice of options to customize the data transfer Oracle-specific optimizations
  • 16. Streaming and ETL • Besides traditional ETL also streaming important and used by production use cases 16#SAISDev11 Streaming data into Apache Kafka
  • 17. Spark and Data Formats • Data formats are key for performance • Columnar formats are a big boost for many reporting use cases – Parquet and ORC – Column pruning – Easier compression/encoding • Partitioning and bucketing also important 17#SAISDev11
  • 18. Data Access Optimizations Use Apache Parquet or ORC • Profit of column projection, filter push down, compression and encoding, partition pruning 18#SAISDev11 Col1 Col2 Col3 Col4 Partition=1 Min Max Count ..
  • 19. Parquet Metadata Exploration Tools for Parquet metadata exploration, useful for learning and troubleshooting 19#SAISDev11 Useful tools: parquet-tools parquet-reader Info at: https://github.com/LucaCanali/Misc ellaneous/blob/master/Spark_Note s/Tools_Parquet_Diagnostics.md
  • 20. SQL Now After moving the data (ETL), add the queries to the new system Spark SQL/DataFrame API is a powerful engine – Optimized for Parquet (recently improved ORC) – Partitioning, Filter push down – Feature rich, code generation, also has CBO (cost based optimizer) – Note: it’s a rich ecosystem, options are available (we also used Impala and evaluated Presto) 20#SAISDev11
  • 21. Spark SQL Interfaces: Many Choices • PySpark, spark-shell, Spark jobs – in Python/Scala/Java • Notebooks – Jupyter – Web service for data analysis at CERN swan.cern.ch • Thrift server 21#SAISDev11
  • 22. Lessons Learned Reading Parquet is CPU intensive • The Parquet format uses encoding and compression • Processing Parquet is hungry for CPU cycles and memory bandwidth • Test on a dual-socket CPU – Up to 3.4 GB/s read rate 22#SAISDev11
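  One way to observe this is to time a scan that actually decodes column data; a minimal sketch, where the path and the column name "id" are placeholders (a bare count() could be served largely from Parquet metadata, so an aggregate over a real column is used instead):

  import org.apache.spark.sql.functions.sum
  // spark.time prints the elapsed time of the enclosed action
  spark.time {
    spark.read.parquet("/path/to/table").agg(sum("id")).show()
  }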
  • 23. Benchmarking Spark SQL Running benchmarks can be useful (and fun): • See how the system behaves at scale • Stress-test new infrastructure • Experiment with monitoring and troubleshooting tools TPCDS benchmark • Popular and easy to set up, can run at scale • See https://github.com/databricks/spark-sql-perf 23#SAISDev11
  • 24. Example from TPCDS Benchmark 24#SAISDev11
  • 25. Lessons Learned from Running TPCDS Useful for commissioning of new systems • It has helped us catch system parameter misconfigurations before production • Comparing with other similar systems • Many of the long-running TPCDS queries – Use shuffle intensively – Can profit from using SSDs for local dirs (shuffle) • There are improvements (and regressions) of TPCDS queries across Spark versions 25#SAISDev11
  • 26. Learning Something New About SQL Similarities but also differences with RDBMS: • DataFrame abstraction vs. database table/view • SQL or DataFrame operation syntax
  // DataFrame operations
  val df = spark.read.parquet("..path..")
  df.select("id").groupBy("id").count().show()

  // SQL
  df.createOrReplaceTempView("t1")
  spark.sql("select id, count(*) from t1 group by id").show()
  26#SAISDev11
  • 27. ..and Something OLD About SQL Using the SQL or DataFrame operations API is equivalent • Just different ways to express the computation • Look at execution plans to see what Spark executes – The Spark Catalyst optimizer generates the same plan for the two examples
  == Physical Plan ==
  *(2) HashAggregate(keys=[id#98L], functions=[count(1)])
  +- Exchange hashpartitioning(id#98L, 200)
     +- *(1) HashAggregate(keys=[id#98L], functions=[partial_count(1)])
        +- *(1) FileScan parquet [id#98L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/t1.prq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
  27#SAISDev11
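  Plans like the one above can be printed directly with explain(), continuing the example from the previous slide:

  df.select("id").groupBy("id").count().explain()                 // DataFrame version
  spark.sql("select id, count(*) from t1 group by id").explain()  // SQL version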
  • 28. Monitoring and Performance Troubleshooting Performance is key for data processing at scale • Lessons learned from DB systems: – Use metrics and time-based profiling – They are a boon for rapidly pinpointing bottlenecked resources • Instrumentation and tools are essential 28#SAISDev11
  • 29. Spark and Monitoring Tools • Spark instrumentation – Web UI – REST API – Eventlog – Executor/Task Metrics – Dropwizard metrics library • Complement with – OS tools – For large clusters, deploy tools that ease working at cluster-level • https://spark.apache.org/docs/latest/monitoring.html 29#SAISDev11
  • 30. WEB UI • Part of Spark: info on Jobs, Stages, Executors, Metrics, SQL, .. – Start with: point a web browser at the driver host, port 4040 30#SAISDev11
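  The same information is exposed programmatically by the REST API on the driver; for example (driver_host and <app-id> are placeholders):

  curl http://driver_host:4040/api/v1/applications                   # list applications
  curl http://driver_host:4040/api/v1/applications/<app-id>/stages   # per-stage metrics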
  • 31. Spark Executor Metrics System Metrics collected at execution time • Exposed by the WEB UI / History Server, the EventLog, and custom tools such as https://github.com/cerndb/sparkMeasure 31#SAISDev11
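  A minimal sketch of collecting these metrics with sparkMeasure's Scala API (the sparkMeasure package must be on the classpath; the measured query is just an example):

  // aggregates task metrics at stage granularity and prints a report
  val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
  stageMetrics.runAndMeasure {
    spark.sql("select count(*) from range(1000)").show()
  }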
  • 32. Spark Executor Metrics System What is available with Spark Metrics, SPARK-25170 32#SAISDev11
  Spark Executor Task Metric name – Short description
  executorRunTime – Elapsed time the executor spent running this task, including time fetching shuffle data. Expressed in milliseconds.
  executorCpuTime – CPU time the executor spent running this task, including time fetching shuffle data. Expressed in nanoseconds.
  executorDeserializeTime – Elapsed time spent to deserialize this task. Expressed in milliseconds.
  executorDeserializeCpuTime – CPU time taken on the executor to deserialize this task. Expressed in nanoseconds.
  resultSize – The number of bytes this task transmitted back to the driver as the TaskResult.
  jvmGCTime – Elapsed time the JVM spent in garbage collection while executing this task. Expressed in milliseconds.
  + I/O metrics, shuffle metrics, … 26 metrics available, see https://github.com/apache/spark/blob/master/docs/monitoring.md
  • 33. Spark Metrics - Dashboards Use to produce dashboards measuring online: • Number of active tasks, JVM metrics, CPU usage (SPARK-25228), Spark task metrics (SPARK-22190), etc. 33#SAISDev11 Dropwizard metrics with a Graphite sink; configure with: --conf spark.metrics.conf=metrics.properties
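  A sketch of the corresponding metrics.properties enabling the Graphite sink (the sink class and property names are Spark's standard ones; host and prefix are placeholders):

  # Graphite sink configuration (host and prefix are placeholders)
  *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
  *.sink.graphite.host=graphite-host
  *.sink.graphite.port=2003
  *.sink.graphite.period=10
  *.sink.graphite.unit=seconds
  *.sink.graphite.prefix=spark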
  • 34. More Use Cases: Scale Up! • Spark proved valuable and flexible • Ecosystem and tooling in place • Opens to exploring more use cases and.. scaling up! 34#SAISDev11
  • 35. Use Case: Spark for Physics • Spark APIs for physics data analysis • POC: processed 1 PB of data in 10 hours 35#SAISDev11 (Pipeline diagram: input physics data in CERN/HEP storage → read into Spark via the spark-root connector → apply computation at scale (UDF) → data reduction for further analysis → histograms and plots of interest) See also, at this conference: “HEP Data Processing with Apache Spark” and “CERN’s Next Generation Data Analysis Platform with Apache Spark”
  • 36. Use Case: Machine Learning at Scale 36#SAISDev11 (Pipeline diagram: input labeled data and DL models from physics → feature engineering at scale → hyperparameter optimization (random/grid search) → distributed model training → output: particle selector model, AUC = 0.9867)
  • 37. Summary and High-Level View of the Lessons Learned 37#SAISDev11
  • 38. Transferrable Skills, Spark - DB Familiar ground coming from RDBMS world • SQL • CBO • Execution plans, hints, joins, grouping, windows, etc • Partitioning Performance troubleshooting, instrumentation, SQL optimizations • Carry over ideas and practices from DBs 38#SAISDev11
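  As an illustration (not from the slides), familiar analytic SQL such as window functions runs unchanged in Spark SQL; the table and column names below are hypothetical:

  // window function over a hypothetical "events" table registered in Spark
  spark.sql("""
    select id, value,
           row_number() over (partition by id order by value desc) as rn
    from events
  """).show()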
  • 39. Learning Curve Coming from DB • Manage a heterogeneous environment and usage: Python, Scala, Java, batch jobs, notebooks, etc • Spark is fundamentally a library, not a server component like an RDBMS engine • Understand the differences between tables and DataFrame APIs • Understand data formats • Understand clustering and the environment (YARN, Kubernetes, …) • Understand tooling and utilities in the ecosystem 39#SAISDev11
  • 40. New Dimensions and Value • With Spark you can easily scale up your SQL workloads – Integrate with many sources (DBs and more) and storage (cloud storage solutions or Hadoop) • Run DB and ML workloads at scale - CERN and HEP are developing solutions with Spark for physics data analysis use cases and for ML/DL at scale • Quickly growing and open community – You will be able to comment on, propose, and implement features, issues and patches 40#SAISDev11
  • 41. Conclusions - Apache Spark is very useful and versatile - for many data workloads, including DB-like workloads - DBAs and RDBMS Devs - Can profit from Spark and its ecosystem to scale up some of their workloads from DBs and do more! - Can bring experience and good practices over from the DB world 41#SAISDev11
  • 42. Acknowledgements • Members of the Hadoop and Spark service at CERN, the HEP users community, the CMS Big Data project, and CERN openlab • Many lessons learned over the years from the RDBMS community, notably www.oaktable.net • Apache Spark community: helpful discussions and support with PRs • Links – More info: @LucaCanaliDB – http://cern.ch/canali 42#SAISDev11