SlideShare a Scribd company logo
1 of 30
Download to read offline
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Brandon Hamric, Data Engineer at Eventbrite
Beck Cronin-Dixon, Data Engineer at Eventbrite
Near Real-Time Analytics
with Apache Spark
#UnifiedAnalytics #SparkAISummit
Who are we
3#UnifiedAnalytics #SparkAISummit
Brandon Hamric
Principal Data Engineer
Eventbrite
Beck Cronin-Dixon
Senior Data Engineer
Eventbrite
Overview
• Components of near real-time data
• Requirements
• Data ingestion approaches
• Benchmarks
• Considerations
4#UnifiedAnalytics #SparkAISummit
Components of Near Real-time Data
5#UnifiedAnalytics #SparkAISummit
• Data
– Size
– Mutability
– Schema evolution
• Deduplication
• Infrastructure
– Ad-hoc queries
– Batch
– Streaming queries
• Storage
– Folder/file partitioning and bucketing
– Format
• Compression
– at-rest
– in-transit
6
Components of Near Real-time Data
Common Requirements
7#UnifiedAnalytics #SparkAISummit
• Latency is specific to use case
• Most stakeholders ask for latency in minutes
• Latency of dependencies*
• Some implementations fit certain latency better than others
• Mixed grain*
Eventbrite Requirements
8#UnifiedAnalytics #SparkAISummit
• Web interaction data: billions of records per month
• Real-time DB: Multiple TB
• Event Organizer reporting requirements
• Business and accounting reporting
• Mixed Grain - Seconds to Daily
• Stack: Spark, Presto, S3, HDFS, Kafka
• Parquet
Ingestion Approaches
• Full overwrite
• Batch incremental merge
• Append only
• Key/value store
• Hybrid batch/stream view
9
Approaches: Full Overwrite
• Batch Spark process
• Overwrite entire table every run
• Complete copy of real-time DB
• Direct ETL
10
Approaches: Full Overwrite
• The Good
– Simple to implement
– Ad Hoc Queries are fast/simple
• The Bad
– Significant load on real-time DB
– High Latency
– High write I/O requirement
11
12
Full Overwrite Logic
Approaches: Batch Incremental Merge
• Get new/changed rows
• Union new rows to base
table
• Deduplicate to get latest
rows
• Overwrite entire table
• limited latency to how fast
you can write to data lake
13
Approaches: Batch Incremental Merge
14
• The Good
– Lower load on real-time DB
– Queries are fast/simple
• The Bad
– Relatively high Latency
– High write I/O requirement
– Requires reliable incremental fields
15
Batch Incremental Merge Logic
Approaches: Append-Only Tables
• Query the real-time db for
new/changed rows
• Coalesce and write new part
files
• Run compaction hourly,
then daily
16
Approaches: Append Only
17
• The Good
– Latency in minutes
– Ingestion is easy to implement
– Simplifies the data lake
– Great for immutable source tables
• The Bad
– Requires a compaction process
– Extra logic in the queries
– May require views
18
Append-Only Logic
Approaches: Key/Value Store
19
• Query the real-time db for
new/changed rows
• Upsert rows
Approaches: Key/Value Store
20
• The Good
– Straightforward to implement
– Built in idempotency
– Good bridge between data lake
and web services
• The Bad
– Batch writes to a key/value store
are slower than using HDFS
– Not optimized for large scans
21
Key/Value Store Logic
Approaches: Hybrid Batch/Stream View
22
• Batch ingest from DB transaction
logs in kafka
• Batch merge new rows to base rows
• Store transaction ID in the base table
• Ad-hoc queries merge base table
and latest transactions and
deduplicate on-the-fly
Approaches: Hybrid Batch/Stream View
23
• The Good
– Data within seconds*
– Batch job is relatively easy to implement
– Spark can do both tasks
• The Bad
– Streaming merge is complex
– Processing required on read
24
Hybrid Batch/Stream View
So Delta Lake is open source now...
• ACID transactions in Spark!!!!!!
• Frequent Ingestion Processes
• Simpler than other incremental
merge approaches
• Hybrid approach still bridges
latency gap
25
Read/Write Benchmark
26
Where is Spark Streaming?
27
• More frequent incremental writes
• Less stream to stream joins
• Decreasing batch intervals gives us more stream to batch joins
• Less stream to stream joins, means less memory and faster
joins
Considerations
• Use case
• Latency needs
• Data size
• Deduplication
• Storage
28
29
sample mysql jdbc reader
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
The picture can't be displayed.

More Related Content

What's hot

What's hot (20)

Apache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthApache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in Depth
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
 

Similar to Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Queries

Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Databricks
 
Building & Testing Scalable Rails Applications
Building & Testing Scalable Rails ApplicationsBuilding & Testing Scalable Rails Applications
Building & Testing Scalable Rails Applications
evilmike
 

Similar to Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Queries (20)

Operational-Analytics
Operational-AnalyticsOperational-Analytics
Operational-Analytics
 
Webinar: How MongoDB is Used to Manage Reference Data - May 2014
Webinar: How MongoDB is Used to Manage Reference Data - May 2014Webinar: How MongoDB is Used to Manage Reference Data - May 2014
Webinar: How MongoDB is Used to Manage Reference Data - May 2014
 
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
 
MongoBD London 2013: Real World MongoDB: Use Cases from Financial Services pr...
MongoBD London 2013: Real World MongoDB: Use Cases from Financial Services pr...MongoBD London 2013: Real World MongoDB: Use Cases from Financial Services pr...
MongoBD London 2013: Real World MongoDB: Use Cases from Financial Services pr...
 
(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance(ATS6-PLAT06) Maximizing AEP Performance
(ATS6-PLAT06) Maximizing AEP Performance
 
Webinar: Achieving Customer Centricity and High Margins in Financial Services...
Webinar: Achieving Customer Centricity and High Margins in Financial Services...Webinar: Achieving Customer Centricity and High Margins in Financial Services...
Webinar: Achieving Customer Centricity and High Margins in Financial Services...
 
SharePoint Saturday The Conference 2011 - SP2010 Performance
SharePoint Saturday The Conference 2011 - SP2010 PerformanceSharePoint Saturday The Conference 2011 - SP2010 Performance
SharePoint Saturday The Conference 2011 - SP2010 Performance
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
SharePoint Saturday San Antonio: SharePoint 2010 Performance
SharePoint Saturday San Antonio: SharePoint 2010 PerformanceSharePoint Saturday San Antonio: SharePoint 2010 Performance
SharePoint Saturday San Antonio: SharePoint 2010 Performance
 
General 05 integration design vs migration design
General 05   integration design vs migration designGeneral 05   integration design vs migration design
General 05 integration design vs migration design
 
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...
 
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
 
Baha Mar's All in Bet on Red: The Story of Integrating Data and Master Data w...
Baha Mar's All in Bet on Red: The Story of Integrating Data and Master Data w...Baha Mar's All in Bet on Red: The Story of Integrating Data and Master Data w...
Baha Mar's All in Bet on Red: The Story of Integrating Data and Master Data w...
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Background processing with hangfire
Background processing with hangfireBackground processing with hangfire
Background processing with hangfire
 
Redis TimeSeries
Redis TimeSeries Redis TimeSeries
Redis TimeSeries
 
Building & Testing Scalable Rails Applications
Building & Testing Scalable Rails ApplicationsBuilding & Testing Scalable Rails Applications
Building & Testing Scalable Rails Applications
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion Days
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
LuisMiguelPaz5
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
23050636
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
yulianti213969
 

Recently uploaded (20)

Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSDBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AI
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 

Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Queries

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Brandon Hamric, Data Engineer at Eventbrite Beck Cronin-Dixon, Data Engineer at Eventbrite Near Real-Time Analytics with Apache Spark #UnifiedAnalytics #SparkAISummit
  • 3. Who are we 3#UnifiedAnalytics #SparkAISummit Brandon Hamric Principal Data Engineer Eventbrite Beck Cronin-Dixon Senior Data Engineer Eventbrite
  • 4. Overview • Components of near real-time data • Requirements • Data ingestion approaches • Benchmarks • Considerations 4#UnifiedAnalytics #SparkAISummit
  • 5. Components of Near Real-time Data 5#UnifiedAnalytics #SparkAISummit • Data – Size – Mutability – Schema evolution • Deduplication • Infrastructure – Ad-hoc queries – Batch – Streaming queries
  • 6. • Storage – Folder/file partitioning and bucketing – Format • Compression – at-rest – in-transit 6 Components of Near Real-time Data
  • 7. Common Requirements 7#UnifiedAnalytics #SparkAISummit • Latency is specific to use case • Most stakeholders ask for latency in minutes • Latency of dependencies* • Some implementations fit certain latency better than others • Mixed grain*
  • 8. Eventbrite Requirements 8#UnifiedAnalytics #SparkAISummit • Web interaction data: billions of records per month • Real-time DB: Multiple TB • Event Organizer reporting requirements • Business and accounting reporting • Mixed Grain - Seconds to Daily • Stack: Spark, Presto, S3, HDFS, Kafka • Parquet
  • 9. Ingestion Approaches • Full overwrite • Batch incremental merge • Append only • Key/value store • Hybrid batch/stream view 9
  • 10. Approaches: Full Overwrite • Batch Spark process • Overwrite entire table every run • Complete copy of real-time DB • Direct ETL 10
  • 11. Approaches: Full Overwrite • The Good – Simple to implement – Ad Hoc Queries are fast/simple • The Bad – Significant load on real-time DB – High Latency – High write I/O requirement 11
  • 13. Approaches: Batch Incremental Merge • Get new/changed rows • Union new rows to base table • Deduplicate to get latest rows • Overwrite entire table • limited latency to how fast you can write to data lake 13
  • 14. Approaches: Batch Incremental Merge 14 • The Good – Lower load on real-time DB – Queries are fast/simple • The Bad – Relatively high Latency – High write I/O requirement – Requires reliable incremental fields
  • 16. Approaches: Append-Only Tables • Query the real-time db for new/changed rows • Coalesce and write new part files • Run compaction hourly, then daily 16
  • 17. Approaches: Append Only 17 • The Good – Latency in minutes – Ingestion is easy to implement – Simplifies the data lake – Great for immutable source tables • The Bad – Requires a compaction process – Extra logic in the queries – May require views
  • 19. Approaches: Key/Value Store 19 • Query the real-time db for new/changed rows • Upsert rows
  • 20. Approaches: Key/Value Store 20 • The Good – Straightforward to implement – Built in idempotency – Good bridge between data lake and web services • The Bad – Batch writes to a key/value store are slower than using HDFS – Not optimized for large scans
  • 22. Approaches: Hybrid Batch/Stream View 22 • Batch ingest from DB transaction logs in kafka • Batch merge new rows to base rows • Store transaction ID in the base table • Ad-hoc queries merge base table and latest transactions and deduplicate on-the-fly
  • 23. Approaches: Hybrid Batch/Stream View 23 • The Good – Data within seconds* – Batch job is relatively easy to implement – Spark can do both tasks • The Bad – Streaming merge is complex – Processing required on read
  • 25. So Delta Lake is open source now... • ACID transactions in Spark!!!!!! • Frequent Ingestion Processes • Simpler than other incremental merge approaches • Hybrid approach still bridges latency gap 25
  • 27. Where is Spark Streaming? 27 • More frequent incremental writes • Less stream to stream joins • Decreasing batch intervals gives us more stream to batch joins • Less stream to stream joins, means less memory and faster joins
  • 28. Considerations • Use case • Latency needs • Data size • Deduplication • Storage 28
  • 30. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT The picture can't be displayed.