When Apache Spark meets TiDB
=> TiSpark
maxiaoyu@pingcap.com
Who am I
● Shawn Ma@PingCAP
● Tech Lead of OLAP Team
● Working on OLAP related products and features
● Previously tech lead of Big Data infra team@Netease
● Focus on SQL on Hadoop and Big Data related stuff
Agenda
● A little bit about TiDB / TiKV
● What is TiSpark
● Architecture
● Benefit
● What’s Next
What’s TiDB
● Open source distributed RDBMS
● Inspired by Google Spanner
● Horizontal Scalability
● ACID Transaction
● High Availability
● Auto-Failover
● SQL at scale
● Widely used across industries, including Internet, Gaming,
Banking, Finance, Manufacturing and more (200+ users)
A little bit about TiDB and TiKV
Architecture diagram: a stateless SQL layer of TiDB nodes sits on top of a distributed storage layer of TiKV nodes replicated via Raft; the Placement Driver (PD) handles the control flow (balance / failover, metadata / timestamp requests); all components talk over gRPC.
TiKV: The whole picture
Diagram: a client issues RPCs to four TiKV nodes (Store 1-4), each holding multiple Regions (Region 1-5); replicas of the same Region on different stores form a Raft group; a three-node Placement Driver cluster (PD 1, PD 2, PD 3) manages placement.
TiKV is powered by RocksDB
What is TiSpark
● TiSpark = Spark SQL on TiKV
○ Spark SQL runs directly on top of a distributed database storage layer
● Hybrid Transactional/Analytical Processing (HTAP) rocks
○ Provides strong OLAP capability alongside TiDB
What is TiSpark
● Complex Calculation Pushdown
● Key Range pruning
● Index support
○ Clustered index / Non-Clustered index
○ Index Only Query
● Cost Based Optimization
○ Histogram
○ Pick the right access path
Architecture
Diagram: the Spark Driver and each Spark Executor embed TiSpark; the driver retrieves data locations from the Placement Driver (PD) over gRPC, and the executors retrieve the real data from the TiKV nodes in the distributed storage layer, also over gRPC.
Architecture
● On the Spark Driver
○ Translate metadata from TiDB into Spark meta info (a sketch of this translation follows below)
○ Transform the Spark SQL logical plan, pick out the parts that can be leveraged by the storage layer (TiKV), and rewrite the plan
○ Locate data based on Region info from the Placement Driver and split partitions accordingly
● On the Spark Executors
○ Encode the Spark SQL plan into TiKV coprocessor requests
○ Decode TiKV / coprocessor results and transform them into Spark SQL Rows
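For the metadata translation step, a minimal sketch of mapping a (hypothetical) TiDB column description onto a Spark SQL schema; TiColumn, the type names and the mapping below are illustrative assumptions, not TiSpark's actual code:

import org.apache.spark.sql.types._

// Illustrative description of a TiDB column as seen by a client.
case class TiColumn(name: String, tiType: String, nullable: Boolean)

def toSparkField(c: TiColumn): StructField = {
  val dt = c.tiType match {
    case "BIGINT"  => LongType
    case "VARCHAR" => StringType
    case "DECIMAL" => DataTypes.createDecimalType(38, 10)
    case _         => StringType // illustrative fallback
  }
  StructField(c.name, dt, c.nullable)
}

// Build the Spark SQL schema for the student table used later in the talk.
val studentSchema = StructType(Seq(
  TiColumn("studentId", "BIGINT", nullable = false),
  TiColumn("name", "VARCHAR", nullable = true),
  TiColumn("school", "VARCHAR", nullable = true),
  TiColumn("class", "VARCHAR", nullable = true),
  TiColumn("score", "BIGINT", nullable = true)
).map(toSparkField))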
How everything is made possible
● Extension points in Spark SQL internals
● Extra strategies allow us to inject our own physical operators, and that is what TiSpark leverages (a minimal sketch follows)
● We try our best to keep Spark internals untouched to avoid compatibility issues
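The injection point is Spark's experimental extra-strategies hook. A minimal sketch, assuming Spark 2.x; MyTiStrategy is an illustrative placeholder, not TiSpark's actual TiStrategy:

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Illustrative strategy: a real one would match plan fragments that can be
// pushed down to TiKV and emit custom physical operators for them.
object MyTiStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil // Nil = fall back to built-in strategies
}

val spark = SparkSession.builder().appName("tispark-sketch").getOrCreate()
// Extra strategies are consulted before Spark's built-in planning strategies.
spark.experimental.extraStrategies ++= Seq(MyTiStrategy)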
How everything is made possible
● A fat Java client module, paying the price of bypassing TiDB
○ Schema parsing, type system, encoding / decoding, coprocessor support
○ An almost fully featured TiKV client (without write support for now)
○ Predicate / index key-range related logic
○ Aggregate pushdown logic
○ Limit, order and statistics related logic
● A thin layer inside Spark SQL
○ TiStrategy for Spark SQL plan transformation
○ Other utilities for mapping Spark SQL concepts to the TiKV client library
○ Physical operators such as IndexScan
○ Thin enough not to cause much compatibility trouble with Spark SQL
Too Abstract? Let’s get concrete
select class, avg(score) from student
WHERE school = 'engineering' and lottery(name) = 'picked'
and studentId >= 8000 and studentId < 10100
group by class;
● The query above runs against a TiDB table named student
● There is a clustered index on studentId and a secondary index on the school column
● lottery is a Spark SQL UDF that takes a name and outputs 'picked' if an RNG decides so (a sketch of registering such a UDF follows)
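A minimal sketch of how such a UDF might be registered and the query run, assuming a SparkSession named spark; the 10% pick rate is an arbitrary illustration:

import scala.util.Random

// Hypothetical UDF from the example: "picks" a student name at random.
spark.udf.register("lottery", (name: String) =>
  if (Random.nextDouble() < 0.1) "picked" else "not picked")

spark.sql(
  """select class, avg(score) from student
    |where school = 'engineering' and lottery(name) = 'picked'
    |  and studentId >= 8000 and studentId < 10100
    |group by class""".stripMargin).show()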
Construct Tasks
Predicates Processing
WHERE school = 'engineering' and lottery(name) = 'picked'
and studentId >= 8000 and studentId < 10100
Regions by studentId: Region 1 [0, 5000), Region 2 [5000, 10000), Region 3 [10000, 15000)
studentId >= 8000 and studentId < 10100 become the key range [8000, 10100)
Predicates are converted into key ranges based on indexes
The key range is cut along Region boundaries into tasks:
Spark Task 1 → Region 2, [8000, 10000), sent as a COP (coprocessor) request
Spark Task 2 → Region 3, [10000, 10100), sent as a COP (coprocessor) request
1. Append the remaining predicates if the coprocessor supports them
2. Push back whatever needs to be computed by Spark SQL, e.g. UDFs and prefix-index predicates
3. Cut the work into tasks according to Region / key range (see the sketch below)
4. Encode each task into a coprocessor request
The coprocessor requests are sent over gRPC via the Spark workers; school = 'engineering' is pushed down with them, while lottery(name) = 'picked' stays in Spark SQL.
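As referenced above, a minimal sketch of cutting a key range along Region boundaries into per-Region tasks; the Region and Task types are illustrative, not TiSpark's actual classes:

// Illustrative types: half-open ranges [start, end).
case class Region(id: Int, start: Long, end: Long)
case class Task(regionId: Int, start: Long, end: Long)

// Intersect the requested key range with each Region; every non-empty
// intersection becomes one task (and later one coprocessor request).
def splitByRegion(rangeStart: Long, rangeEnd: Long, regions: Seq[Region]): Seq[Task] =
  regions.flatMap { r =>
    val s = math.max(rangeStart, r.start)
    val e = math.min(rangeEnd, r.end)
    if (s < e) Some(Task(r.id, s, e)) else None
  }

val regions = Seq(Region(1, 0, 5000), Region(2, 5000, 10000), Region(3, 10000, 15000))
splitByRegion(8000, 10100, regions)
// => Seq(Task(2, 8000, 10000), Task(3, 10000, 10100))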
Index Scan
WHERE school = 'engineering' and lottery(name) = 'picked'
and (studentId >= 8000 and studentId < 10100)
Diagram: the executor batch-scans the index data for student_school in the TiKV Regions according to the predicate range, then sorts the returned row keys and cuts them into ranges according to the key range in each Region before reading the row data for student.
● A secondary index is encoded as a key-value pair
○ The key is a comparable-bytes encoding of all indexed columns in their defined order
○ The value is the row ID pointing to the table row data
● Reading data via a secondary index usually requires a double read
○ First, read the secondary index over the range, just like reading primary keys on the previous slide
○ Shuffle row IDs according to Region
○ Sort all row IDs retrieved and combine them into ranges where possible (see the sketch below)
○ Encode row IDs into row keys for the table
○ Send those mini-requests in batches, concurrently
● The second read can be optimized away
○ If all required columns are already covered by the index itself
Example: the executor merges the retrieved row IDs 1, 2, 3, 4, 5, 7, 8, 10, 88 into contiguous ranges before the second read.
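A minimal sketch of that merge step, assuming the row IDs arrive already sorted; this is illustrative only, not TiSpark's actual implementation:

// Merge sorted row IDs into contiguous half-open [start, end) ranges so the
// second read can issue a few range scans instead of many point lookups.
def rowIdsToRanges(sortedIds: Seq[Long]): Seq[(Long, Long)] = {
  val ranges = scala.collection.mutable.ArrayBuffer.empty[(Long, Long)]
  for (id <- sortedIds) {
    ranges.lastOption match {
      case Some((start, end)) if id == end => ranges(ranges.size - 1) = (start, end + 1)
      case _                               => ranges += ((id, id + 1))
    }
  }
  ranges.toSeq
}

rowIdsToRanges(Seq(1, 2, 3, 4, 5, 7, 8, 10, 88))
// => Seq((1,6), (7,9), (10,11), (88,89))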
Index Selection
WHERE school = 'engineering' and lottery(name) = 'picked'
and (studentId >= 8000 and studentId < 10100) or studentId in
(10323, 10327)
Using the histogram:
Clustered index on studentId + the predicates related to studentId → about 1K matching rows
Secondary index on school + the predicates related to school → about 800 matching rows
The access path is picked by comparing 1K * clustered index access cost against 800 * secondary index access cost.
● If the columns referred to are all covered by an index, then instead of retrieving actual rows we apply an index-only query, and the cost function differs accordingly (see the sketch below)
● If no histogram exists, TiSpark falls back to pseudo selection logic
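A minimal sketch of this style of cost comparison; AccessPath, the constants and the double-read penalty are illustrative assumptions, not TiSpark's actual cost model:

case class AccessPath(name: String, estimatedRows: Double, perRowCost: Double, coveringIndex: Boolean)

def pathCost(p: AccessPath): Double = {
  // An index-only (covering) scan avoids the second read of the table row.
  val doubleReadPenalty = if (p.coveringIndex) 1.0 else 2.0
  p.estimatedRows * p.perRowCost * doubleReadPenalty
}

val candidates = Seq(
  AccessPath("clustered(studentId)", estimatedRows = 1000, perRowCost = 1.0, coveringIndex = false),
  AccessPath("secondary(school)",    estimatedRows = 800,  perRowCost = 1.5, coveringIndex = false)
)
val chosen = candidates.minBy(pathCost) // pick the cheapest estimated access path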
Construct Schema Transformation Rules
TiDB has a completely different type system and type-inference rules from Spark SQL
Aggregates Processing
select class, avg(score) from student
…….
group by class;
The Spark SQL plan is received in TiStrategy; AVG(score) GROUP BY class is rewritten into SUM(score) and COUNT(score) GROUP BY class, since AVG is rewritten into SUM and COUNT (the final average is SUM / COUNT).
Map tasks are sent per Region (Region 1 studentId [0, 5000), Region 2 [5000, 10000), Region 3 [10000, 15000)) as coprocessor requests over gRPC via the Spark workers.
Spark derives its own schema through its type-inference rules: [SUM, COUNT, class]; the TiKV result schema is mapped to the Spark schema as [group-by keys as bytes, SUM as Decimal, COUNT as BigInt].
Reduce tasks then combine the partial results.
● After the coprocessor preprocessing, TiSpark still relies on Spark's normal aggregation strategy (a minimal sketch of the AVG rewrite follows)
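A minimal sketch of the AVG-to-SUM/COUNT rewrite expressed with the DataFrame API, assuming a DataFrame named student; in TiSpark the partial SUM and COUNT come back from the coprocessor and are only combined in Spark:

import org.apache.spark.sql.functions.{col, count, sum}

// Partial aggregation (conceptually what the coprocessor returns per Region).
val partial = student
  .groupBy(col("class"))
  .agg(sum(col("score")).as("sum_score"), count(col("score")).as("cnt_score"))

// Final step in Spark: AVG = SUM / COUNT.
val avgByClass = partial.select(col("class"), (col("sum_score") / col("cnt_score")).as("avg_score"))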
Benefit
● Analytical and transactional support all on one platform
○ No need for ETL; query data in real time
○ High-throughput, consistent snapshot reads from the database
○ Simplify your platform and reduce maintenance cost
● Embrace Apache Spark and its ecosystem
○ Support for complex transformation and analytics beyond SQL
○ Cooperate with other projects in the ecosystem (like Apache Zeppelin)
○ Apache Spark bridges your data sources
Ease of Use
● Works on your existing Spark cluster
○ Just a single jar, like other Spark connectors
● Works as a standalone application, or from spark-shell, thrift-server, pyspark and R
● Works just like another data source
val ti = new org.apache.spark.sql.TiContext(spark)
// Map all TiDB tables from database sampleDB as Spark SQL tables
ti.tidbMapDatabase("sampleDB")
spark.sql("select count(*) from sampleTable").show
What's Next
● Batch write support (writing directly in TiKV's native format)
● JSON type support (TiDB already supports it)
● Partitioned table support (both Range and Hash)
● Join optimization based on ranges and partitioned tables
● (Maybe) join reorder using TiDB's own histograms
● A separate columnar storage project using Spark as its execution engine (not released yet)
Thanks!
Contact me:
maxiaoyu@pingcap.com
www.pingcap.com
https://github.com/pingcap/tispark
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv