SlideShare a Scribd company logo
Scaling ETL on Hadoop: Bridging OLTP with OLAP
Agenda
 Data Ecosystem @ LinkedIn
 Problem : Bridging OLTP with OLAP
 Solution
 Details
 Conclusion and Future Work
2
Data Ecosystem @ LinkedIn
3
Data Ecosystem - Overview
4
Serving App
Online Stores
Espresso
Oracle
MySQL
Logs
Analytics Infra
Business
Engines
Serving
OLAP
Data Ecosystem – Data
5
 Tracking Data
 Tracks user activity at web site
 Append only
 Example: Page View
 Database Data
 Member provided data in online-stores
 Inserts, Updates and Deletes
 Example: Member Profiles, Likes, Comments
Problem
Scaling ETL on Hadoop
6
Bridging OLTP to OLAP
7
OLTP OLAP
 Integrating site-serving data stores with Hadoop
at scale with low latency.
 Critical to LinkedIn’s
 Member engagement
 Business decision making
Kafka
Engines
Serving
OLAP
Databases
Tracking Data
Espresso
Oracle
MySQL
Challenge - Scalable ETL
8
 600+ Tracking topics
 500+ Database tables
 XXX TB of Data at rest
 X TB of new data generated per day
 5000 Nodes, Several Hadoop clusters
Kafka
Engines
Serving
OLAP
Databases
Tracking Data
Espresso
Oracle
MySQL
OLTP OLAP
Challenge – Consistent Snapshot with SLA
9
 Apply updates, deletes
 Copy full tables
 But, resource overheads
 Small fraction of data changes
Kafka
Engines
Serving
OLAP
Databases
Tracking Data
Espresso
Oracle
MySQL
OLTP OLAP
Engines
Requirements
10
OLTP
Oracle Espresso
OLAP
 Refresh data on HDFS frequently
 Seamless handling of schema evolution
 Optimal resource usage
 Handle multi data centers
 Efficient change capture on source
 Ensure Last-Update semantics
 Handle deletes
Serving
OLAP
Database Data
Tracking Data
Solution
11
Lumos
12
Data Capture
 Can use commit logs
 Delta processing
 Latencies in minutes
 Schema agnostic framework
Databus
Others
Hadoop : Data Center
DB
Extract
Files
Data Center
Colo-1
Databases
Colo-2
Databases
Lumos
databases
(HDFS)
dbchanges
(HDFS)
Lumos – Multi-Datacenter
13
Data Capture
 Handle multi-datacenter stores
 Resolve updates via commit order
Databus
Others
Hadoop : Data Center
DB
Extract
Files
Data Center
Colo-1
Databases
Colo-2
Databases
Lumos
databases
(HDFS)
dbchanges
(HDFS)
Lumos – Data Organization
14
-
Virtual Snapshot
HDFS Layout
InputFormat
Pig&Hive
Loaders
 Database Snapshot
- Entire database on HDFS
- With added latency
 Database Virtual Snapshot
- Previous Snapshot + Delta
- Enables faster refresh
/db/table/snapshot-0
_delta
dir-1
dir-2
dir-3
Lumos - High Level Architecture
15

Virtual
Snapshot
Builder
ETL Hadoop Cluster
Staging
(internal)
Lazy
Snapshot
Builder
User
Jobs
HDFS
Published
Virtual
Snapshot
MR/Pig/Hiv
e
Loaders
Compactor
Change
Captur
e Increments
Pre-
Process
Full Drops
Alternative Approaches
 Sqoop
 Hbase
 Hive Streaming
16
Details
17
Change Capture – File Based
18
 File Format
 Compressed CSV
 Metadata
 Full Drop
 Via Fast Reader (Oracle, MySQL)
 Via MySQL backups (Espresso)
 Runs for hours with Dirty reads
 Increments
 Via SQL
 Transactional
Full Drop
1am 4am
Inc
h-1
Inc
h-2
Inc
h-3
2am 3am
Prev.
HW
New
High-water mark
DB
Files
Web
Service
HDFS
HTTPS
Pulls
Inc
H-4
Change Capture – Databus Based
19
Databus
Relay
Mapper
Databus
Consumer
dbchanges
(HDFS)
Reducer
Database
Mapper
Databus
Consumer
Reducer
 Reads Database commit logs
 Multi datacenter via Databus Relay
 Runs as MR Job
 Output : date-time partitioned with multiple versions
 True change capture (including hard deletes)
Databus
RelayDatabase
Hadoop
Pre-Processing
20
 Data format conversion
 Field level transformations
 Privacy
 Cleansing – Eg. Remove recursive schema
 Metadata annotation
 Add row counts for data validation
 Virtual
Snapshot
Builder
(HDFS)
Internal
Staging
Lazy
Snapshot
Builder
User Jobs
(HDFS)
Published
Virtual
Snapshot
MR/Pig/Hive
Loaders
Compactor
Change
Capture Increments
Pre-
Process
Full Drops
Snapshotting – Lazy Materializer
21
 One MR job per table, consumes full drops
 Supports dirty reads.
 Hash Partition on primary key
 Number of partitions based on data size
 Sorts on primary key
 Results published into staging directory
 Virtual
Snapshot
Builder
(HDFS)
Internal
Staging
Lazy
Snapshot
Builder
User Jobs
(HDFS)
Published
Virtual
Snapshot
MR/Pig/Hive
Loaders
Compactor
Change
Capture Increments
Pre-
Process
Full Drops
Snapshotting – Virtual Snapshot Builder
22
 One MR Job for all tables
 Identifies all existing snapshots, both published and staged
 Creates appropriate delta partitions for every snapshot
 Delta partition count equals Snapshot partition count
 Club multiple partition in one file
 Outputs latest row using delta column
 Publishes staged snapshots with new deltas
 Previously published snapshots updated with new deltas
 Virtual
Snapshot
Builder
(HDFS)
Internal
Staging
Lazy
Snapshot
Builder
User Jobs
(HDFS)
Published
Virtual
Snapshot
MR/Pig/Hive
Loaders
Compactor
Change
Capture Increments
Pre-
Process
Full Drops
Snapshotting – Virtual Snapshot Builder
23
/db/table/snapshot-0
(10 partitions, 10 Avro files)
_delta
inc-1
(10 partitions, 2 Avro file)
Part-0 . .
.Part-9
Index files
Inc-2
(10 partitions, 2 Avro file)
Part-0
Part-5
Part-0
 Incremental data is small
 Rolls increments
 Avoid creating small files
 Equi-partitions INC as Snapshot
 Seek and Read a partition
Partition-0
Part-0.avro File
Partition-4
Partition-5
Partition-9
Index file
Index files
Part-5
Index file
Part-5.avro File
Snapshotting – Loaders
24
 Custom InputFormat (MR)
 Uses the Index file to create Splits
 RecordReader merges partition-0 of Snapshot and
Delta
 Returns latest row from Delta if present
 Masks row if deleted
 Otherwise returns row from snapshot
 Pig Loader enables reading virtual snapshot via Pig
 Storage handler enables reading virtual snapshot via Hive
Snapshotting – Loaders (2)
25
/db/table/snapshot-0
(10 partitions, 10 Avro files)
_delta
Part-0
Part-9
Delta-1
(10 partitions, 2 Avro file)
Part-5
Part-0
Custom
InputFormat
Index files
Part-1
Part-2 . .
.
Mapper-0
Custom
InputFormat
Mapper-9
 Delta-1.Part-0 contains partitions 0 to 4
 Delta-2.Part-5 contains partitions 5 to 9
 Snapshot-0.Part-0 contains partition 0
 Both sorted on primary key
Snapshotting – Compactor
26
 Required when partition size exceeds threshold
 Materializes Virtual Snapshot to Snapshot
 With more partitions
 MR job with Reducer
 Virtual
Snapshot
Builder
(HDFS)
Internal
Staging
Lazy
Snapshot
Builder
User Jobs
(HDFS)
Published
Virtual
Snapshot
MR/Pig/Hive
Loaders
Compactor
Change
Capture
Increments
Pre-
Process
Full Drops
Operating billions of rows per day
 Dude, where’s my row?
– Automatic Data validation
 When data misses the bus
– Handling late data
– Look back window
 Cluster downtime
– Restart-ability
– Active-active
– Idempotent processing
27
Conclusion and Future Work
 Conclusion
 Lumos : Scalable ETL framework
 Battle tested in production
 Future Work
 Unify Internal and External data
 Open source
28
Q & A
29
Questions?
Appendix
30

More Related Content

What's hot

Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
BigDataCloud
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
Daniel Abadi
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
hadooparchbook
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
Jstorm introduction-0.9.6
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6
longda feng
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
martinbpeters
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Asis Mohanty
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Bigdatapump
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
 
The Heterogeneous Data lake
The Heterogeneous Data lakeThe Heterogeneous Data lake
The Heterogeneous Data lake
DataWorks Summit/Hadoop Summit
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
eakasit_dpu
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 

What's hot (20)

Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Jstorm introduction-0.9.6
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
The Heterogeneous Data lake
The Heterogeneous Data lakeThe Heterogeneous Data lake
The Heterogeneous Data lake
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 

Viewers also liked

Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Shirshanka Das
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
Sunil Nagaraj
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
Amy W. Tang
 
Bridging the gap: e-learning research
Bridging the gap: e-learning researchBridging the gap: e-learning research
Bridging the gap: e-learning researchgrainne
 
Log ingestion kafka -- impala using apex
Log ingestion   kafka -- impala using apexLog ingestion   kafka -- impala using apex
Log ingestion kafka -- impala using apex
Apache Apex
 
Cs intro-ca
Cs intro-caCs intro-ca
Cs intro-ca
aniketbijwe143
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
In-Memory Computing Summit
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
Regunath B
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
MapR Technologies
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
mattlieber
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
Ivo Vachkov
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Yahoo Developer Network
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
The Ecosystem is too damn big
The Ecosystem is too damn big The Ecosystem is too damn big
The Ecosystem is too damn big
DataWorks Summit/Hadoop Summit
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
5 Reasons Why Healthcare Data is Unique and Difficult to Measure
5 Reasons Why Healthcare Data is Unique and Difficult to Measure5 Reasons Why Healthcare Data is Unique and Difficult to Measure
5 Reasons Why Healthcare Data is Unique and Difficult to Measure
Health Catalyst
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
Databricks
 
Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark) Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark) Matt Turck
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to Databus
Amy W. Tang
 

Viewers also liked (20)

Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Bridging the gap: e-learning research
Bridging the gap: e-learning researchBridging the gap: e-learning research
Bridging the gap: e-learning research
 
Log ingestion kafka -- impala using apex
Log ingestion   kafka -- impala using apexLog ingestion   kafka -- impala using apex
Log ingestion kafka -- impala using apex
 
Cs intro-ca
Cs intro-caCs intro-ca
Cs intro-ca
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
 
Aesop change data propagation
Aesop change data propagationAesop change data propagation
Aesop change data propagation
 
Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014Gobblin' Big Data With Ease @ QConSF 2014
Gobblin' Big Data With Ease @ QConSF 2014
 
Free Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBaseFree Code Friday - Spark Streaming with HBase
Free Code Friday - Spark Streaming with HBase
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
The Ecosystem is too damn big
The Ecosystem is too damn big The Ecosystem is too damn big
The Ecosystem is too damn big
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
5 Reasons Why Healthcare Data is Unique and Difficult to Measure
5 Reasons Why Healthcare Data is Unique and Difficult to Measure5 Reasons Why Healthcare Data is Unique and Difficult to Measure
5 Reasons Why Healthcare Data is Unique and Difficult to Measure
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
 
Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark) Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark)
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to Databus
 

Similar to Bringing OLTP woth OLAP: Lumos on Hadoop

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
Samuel Rash
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
HANA SITSP 2011
HANA SITSP 2011HANA SITSP 2011
HANA SITSP 2011
Henrique Pinto
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedIn
rajappaiyer
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
DataWorks Summit
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
Delhi/NCR HUG
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
SharePoint 2010 Boost your farm performance!
SharePoint 2010 Boost your farm performance!SharePoint 2010 Boost your farm performance!
SharePoint 2010 Boost your farm performance!
Brian Culver
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
Jonathan Holloway
 

Similar to Bringing OLTP woth OLAP: Lumos on Hadoop (20)

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
HANA SITSP 2011
HANA SITSP 2011HANA SITSP 2011
HANA SITSP 2011
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedIn
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
SharePoint 2010 Boost your farm performance!
SharePoint 2010 Boost your farm performance!SharePoint 2010 Boost your farm performance!
SharePoint 2010 Boost your farm performance!
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
Globus
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
UiPathCommunity
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..UiPath Community Day Dubai: AI at Work..
UiPath Community Day Dubai: AI at Work..
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 

Bringing OLTP woth OLAP: Lumos on Hadoop

  • 1. Scaling ETL on Hadoop: Bridging OLTP with OLAP
  • 2. Agenda  Data Ecosystem @ LinkedIn  Problem : Bridging OLTP with OLAP  Solution  Details  Conclusion and Future Work 2
  • 3. Data Ecosystem @ LinkedIn 3
  • 4. Data Ecosystem - Overview 4 Serving App Online Stores Espresso Oracle MySQL Logs Analytics Infra Business Engines Serving OLAP
  • 5. Data Ecosystem – Data 5  Tracking Data  Tracks user activity at web site  Append only  Example: Page View  Database Data  Member provided data in online-stores  Inserts, Updates and Deletes  Example: Member Profiles, Likes, Comments
  • 7. Bridging OLTP to OLAP 7 OLTP OLAP  Integrating site-serving data stores with Hadoop at scale with low latency.  Critical to LinkedIn’s  Member engagement  Business decision making Kafka Engines Serving OLAP Databases Tracking Data Espresso Oracle MySQL
  • 8. Challenge - Scalable ETL 8  600+ Tracking topics  500+ Database tables  XXX TB of Data at rest  X TB of new data generated per day  5000 Nodes, Several Hadoop clusters Kafka Engines Serving OLAP Databases Tracking Data Espresso Oracle MySQL OLTP OLAP
  • 9. Challenge – Consistent Snapshot with SLA 9  Apply updates, deletes  Copy full tables  But, resource overheads  Small fraction of data changes Kafka Engines Serving OLAP Databases Tracking Data Espresso Oracle MySQL OLTP OLAP
  • 10. Engines Requirements 10 OLTP Oracle Espresso OLAP  Refresh data on HDFS frequently  Seamless handling of schema evolution  Optimal resource usage  Handle multi data centers  Efficient change capture on source  Ensure Last-Update semantics  Handle deletes Serving OLAP Database Data Tracking Data
  • 12. Lumos 12 Data Capture  Can use commit logs  Delta processing  Latencies in minutes  Schema agnostic framework Databus Others Hadoop : Data Center DB Extract Files Data Center Colo-1 Databases Colo-2 Databases Lumos databases (HDFS) dbchanges (HDFS)
  • 13. Lumos – Multi-Datacenter 13 Data Capture  Handle multi-datacenter stores  Resolve updates via commit order Databus Others Hadoop : Data Center DB Extract Files Data Center Colo-1 Databases Colo-2 Databases Lumos databases (HDFS) dbchanges (HDFS)
  • 14. Lumos – Data Organization 14 - Virtual Snapshot HDFS Layout InputFormat Pig&Hive Loaders  Database Snapshot - Entire database on HDFS - With added latency  Database Virtual Snapshot - Previous Snapshot + Delta - Enables faster refresh /db/table/snapshot-0 _delta dir-1 dir-2 dir-3
  • 15. Lumos - High Level Architecture 15 Virtual Snapshot Builder ETL Hadoop Cluster Staging (internal) Lazy Snapshot Builder User Jobs HDFS Published Virtual Snapshot MR/Pig/Hiv e Loaders Compactor Change Captur e Increments Pre- Process Full Drops
  • 16. Alternative Approaches  Sqoop  Hbase  Hive Streaming 16
  • 18. Change Capture – File Based 18  File Format  Compressed CSV  Metadata  Full Drop  Via Fast Reader (Oracle, MySQL)  Via MySQL backups (Espresso)  Runs for hours with Dirty reads  Increments  Via SQL  Transactional Full Drop 1am 4am Inc h-1 Inc h-2 Inc h-3 2am 3am Prev. HW New High-water mark DB Files Web Service HDFS HTTPS Pulls Inc H-4
  • 19. Change Capture – Databus Based 19 Databus Relay Mapper Databus Consumer dbchanges (HDFS) Reducer Database Mapper Databus Consumer Reducer  Reads Database commit logs  Multi datacenter via Databus Relay  Runs as MR Job  Output : date-time partitioned with multiple versions  True change capture (including hard deletes) Databus RelayDatabase Hadoop
  • 20. Pre-Processing 20  Data format conversion  Field level transformations  Privacy  Cleansing – Eg. Remove recursive schema  Metadata annotation  Add row counts for data validation Virtual Snapshot Builder (HDFS) Internal Staging Lazy Snapshot Builder User Jobs (HDFS) Published Virtual Snapshot MR/Pig/Hive Loaders Compactor Change Capture Increments Pre- Process Full Drops
  • 21. Snapshotting – Lazy Materializer 21  One MR job per table, consumes full drops  Supports dirty reads.  Hash Partition on primary key  Number of partitions based on data size  Sorts on primary key  Results published into staging directory Virtual Snapshot Builder (HDFS) Internal Staging Lazy Snapshot Builder User Jobs (HDFS) Published Virtual Snapshot MR/Pig/Hive Loaders Compactor Change Capture Increments Pre- Process Full Drops
  • 22. Snapshotting – Virtual Snapshot Builder 22  One MR Job for all tables  Identifies all existing snapshots, both published and staged  Creates appropriate delta partitions for every snapshot  Delta partition count equals Snapshot partition count  Club multiple partition in one file  Outputs latest row using delta column  Publishes staged snapshots with new deltas  Previously published snapshots updated with new deltas Virtual Snapshot Builder (HDFS) Internal Staging Lazy Snapshot Builder User Jobs (HDFS) Published Virtual Snapshot MR/Pig/Hive Loaders Compactor Change Capture Increments Pre- Process Full Drops
  • 23. Snapshotting – Virtual Snapshot Builder 23 /db/table/snapshot-0 (10 partitions, 10 Avro files) _delta inc-1 (10 partitions, 2 Avro file) Part-0 . . .Part-9 Index files Inc-2 (10 partitions, 2 Avro file) Part-0 Part-5 Part-0  Incremental data is small  Rolls increments  Avoid creating small files  Equi-partitions INC as Snapshot  Seek and Read a partition Partition-0 Part-0.avro File Partition-4 Partition-5 Partition-9 Index file Index files Part-5 Index file Part-5.avro File
  • 24. Snapshotting – Loaders 24  Custom InputFormat (MR)  Uses the Index file to create Splits  RecordReader merges partition-0 of Snapshot and Delta  Returns latest row from Delta if present  Masks row if deleted  Otherwise returns row from snapshot  Pig Loader enables reading virtual snapshot via Pig  Storage handler enables reading virtual snapshot via Hive
  • 25. Snapshotting – Loaders (2) 25 /db/table/snapshot-0 (10 partitions, 10 Avro files) _delta Part-0 Part-9 Delta-1 (10 partitions, 2 Avro file) Part-5 Part-0 Custom InputFormat Index files Part-1 Part-2 . . . Mapper-0 Custom InputFormat Mapper-9  Delta-1.Part-0 contains partitions 0 to 4  Delta-2.Part-5 contains partitions 5 to 9  Snapshot-0.Part-0 contains partition 0  Both sorted on primary key
  • 26. Snapshotting – Compactor 26  Required when partition size exceeds threshold  Materializes Virtual Snapshot to Snapshot  With more partitions  MR job with Reducer Virtual Snapshot Builder (HDFS) Internal Staging Lazy Snapshot Builder User Jobs (HDFS) Published Virtual Snapshot MR/Pig/Hive Loaders Compactor Change Capture Increments Pre- Process Full Drops
  • 27. Operating billions of rows per day  Dude, where’s my row? – Automatic Data validation  When data misses the bus – Handling late data – Look back window  Cluster downtime – Restart-ability – Active-active – Idempotent processing 27
  • 28. Conclusion and Future Work  Conclusion  Lumos : Scalable ETL framework  Battle tested in production  Future Work  Unify Internal and External data  Open source 28

Editor's Notes

  1. Today, Talk about Scaling ETL in order to consolidate and democratize data and analytics on Hadoop at LinkedIn.
  2. Let’s start with the overall Data Ecosystem Then focus on the specific problem of integrating online data-stores with Hadoop and go over the solution
  3. Members interact with the site apps And they generate actions and data mutations Which gets persisted in LOGS store and ONLINE data stores Espresso, MySQL and Oracle are primary online data stores. Espresso is a document oriented partitioned data store with transactional support. It is home grown. Kafka is used as the LOG store. Online Data sources are periodically replicated to hadoop for creating cubes & enrichments. Cubes are used externally on the site as well as internally on the reports/insights for analysts. (Eg: “Who viewed your profile”, “Campaign performance reports”, Member sign-up reports) Cubes are delivered via Cube serving Engines. There are primarily 3 cube serving stack. Voldemort is a key-value store : used to deliver static reports with pre-computed metrics. Pinot : search technology : used for delivering some what dynamic reports with pre-compute metrics (drill) Finally, the traditional BI stack comprised of TD + Tableau + MSTR: deliver insights to business users.
  4. Explain interactively what action generated what data  real use case. Tracking: User activity at the site turns into tracking data Example -> Tracking -> PageView, AdClick Append -> each user activity generates new data Immutable -> Once generated, does not change but grows over time Usually organized by time and accessed over time range Database: is user provided data stored in online stores. This data is mutable over time Example -> Member Profile, Education Organized as full table as of some time and accessed in full
  5. The problem is simply replicating the data from ONLINE to HADOOP But, LNKD has 300m members and generates lots of data => humongous amount of data Fresh data directly impacts the member engagement and business decision making
  6. PROD data center that is accessible from outside HADOOP is CORP data center
  7. Deletes for compliance Move the data entirely, but it puts load on the source system, network and hadoop resources
  8. Commit time or Since tracking data generates is append only, it is easier to handler and arrange them in time window. DB data can have updates or deletes, and reflecting that on HDFS in low latency and with optimal resouce usage is a challenge
  9. TALK about schema evaluation
  10. TALK about schema evaluation
  11. This is not HDFS snaphsot not HBASE snapshot
  12. Schema changes + rewrite the complete data Sqoop: Cross-colo database connections are not allowed Sqoop: May put load on the production databases Hbase Write the change logs and periodically do a snapshot and replicate not all companies run Hbase as part of the standard deployment not clear if this will meet the low-latency requirement Hive Streaming looks similar to what we do caveat: it only supports ORCA
  13. Change to Data Extract
  14. Bottom right
  15. TODO: cluster of databases and Relay Reading off of databus With a picture Checkpoint  Scn to time mapping Backup slides towards the end
  16. Db Dump format to Avro Oracle data types Map-Only Job Field Level transformation Eliminate recursive schema Avro Schema Attribute JSON Meta info Key and delta column begin_date, end_date, drop_date, full_drop date Row counts
  17. Db Dump format to Avro Oracle data types Map-Only Job Field Level transformation Eliminate recursive schema Avro Schema Attribute JSON Meta info Key and delta column begin_date, end_date, drop_date, full_drop date Row counts
  18. Db Dump format to Avro Oracle data types Map-Only Job Field Level transformation Eliminate recursive schema Avro Schema Attribute JSON Meta info Key and delta column begin_date, end_date, drop_date, full_drop date Row counts
  19. Db Dump format to Avro Oracle data types Map-Only Job Field Level transformation Eliminate recursive schema Avro Schema Attribute JSON Meta info Key and delta column begin_date, end_date, drop_date, full_drop date Row counts
  20. Change to Data Extract