How to build
a data warehouse?
Dmytro Popovych, SE @ Tubular
Theory vs practice
Quote #441422
While engineers in white lab coats are bolting a beautiful engine onto a
perfect wing, a gang of dishevelled fools led by a mad adventurer flies
over them on a contraption made of a minivan, a fence and two industrial
hair dryers, heading for the second round of investment.
Beautiful projects don't take off because they run out of time to take off.
Agenda
• Problem statement:
• Data Ingestion
• Data Normalisation
• Data Access
• Our way to solve the problem :)
About us
• Video intelligence for the cross-platform world
• 30+ video platforms including YouTube, Facebook, Instagram
• 7M creators
• 3B videos
• 2 TB of newly ingested data a day
• 150 TB of data in the warehouse
What is a data warehouse?
A central repository of data collected from disparate sources.
ANALYST
ENGINEER
SERVICE
DATA
WAREHOUSE
Key features
Ingestion
Store raw data extracted from disparate data sources
Normalisation
Cleanup / combine raw data
Access
Help users retrieve data
What problems does it solve in Tubular?
• For engineers / analysts:
• data discovery
• prototyping / analysis
• For services:
• data exchange
Data Ingestion
Data Ingestion Problems
• Real time data:
• tweets, comments, shares, views
• Periodic snapshots:
• dump of real time data
• results of the data analysis
• databases from internal services (in some cases)
Real time data
DATABUS / event log / message queue
Powered by KAFKA
Data serialised with AVRO
Keeps all events for the last N days
SERVICE #1 SERVICE #2
PERMANENT
STORAGE
...
Why did we choose Kafka?
• Stores streams of records in a fault-tolerant way
• Designed to serve multiple consumers per topic
• Allows keeping the last N days of records
• Battle-tested at very large companies: LinkedIn, Twitter, Uber, Airbnb...
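The databus properties listed above (several independent consumers per topic, retention of the last N days) can be illustrated with a small in-memory sketch. This is not Kafka's API: the `EventLog` class, its method names and the consumer names are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class EventLog:
    """Toy in-memory stand-in for one databus topic (hypothetical, not Kafka's API)."""
    retention_seconds: float
    # Each entry is (offset, timestamp, payload).
    _events: List[Tuple[int, float, Any]] = field(default_factory=list)
    _next_offset: int = 0
    # Consumer name -> next offset that consumer should read.
    _committed: Dict[str, int] = field(default_factory=dict)

    def append(self, payload: Any, timestamp: float) -> int:
        """Append an event and return its offset."""
        offset = self._next_offset
        self._events.append((offset, timestamp, payload))
        self._next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        """Drop events that are older than the retention window."""
        cutoff = now - self.retention_seconds
        self._events = [e for e in self._events if e[1] >= cutoff]

    def poll(self, consumer: str) -> List[Any]:
        """Return events this consumer has not seen yet and advance its offset."""
        start = self._committed.get(consumer, 0)
        batch = [(o, p) for (o, _, p) in self._events if o >= start]
        if batch:
            self._committed[consumer] = batch[-1][0] + 1
        return [p for _, p in batch]


log = EventLog(retention_seconds=3 * 24 * 3600)  # N = 3 days, an arbitrary choice
log.append({"type": "view", "video_id": "v1"}, timestamp=0.0)
log.append({"type": "comment", "video_id": "v1"}, timestamp=10.0)

first_for_service_1 = log.poll("service_1")        # both events
second_for_service_1 = log.poll("service_1")       # nothing new for this consumer
first_for_storage = log.poll("permanent_storage")  # both events again, separate offset
```

Each consumer tracks its own position, so a slow consumer such as the permanent storage writer does not affect the services reading the same topic.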
Why did we choose Avro?
• Strict schema definition
• Safe schema evolution
• Compact (binary serialisation format)
• Cross-technology format (Java, Python, …)
• Has an ecosystem around it (Schema Registry, CLI consumers, …)
• Hadoop-friendly
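"Safe schema evolution" can be sketched as follows. Under Avro's schema resolution rules, a reader fills a field that is missing from the written data with the reader schema's default, and ignores fields it does not know. The sketch below imitates that rule in plain Python without using an Avro library; the `VideoEvent` schema and the `resolve` helper are hypothetical.

```python
from typing import Any, Dict

# Reader schema written as a plain dict in Avro's JSON schema style.
# The record and field names are made up for illustration.
READER_SCHEMA: Dict[str, Any] = {
    "type": "record",
    "name": "VideoEvent",
    "fields": [
        {"name": "video_id", "type": "string"},
        {"name": "views", "type": "long"},
        # Field added in a later schema version; the default keeps old data readable.
        {"name": "platform", "type": "string", "default": "unknown"},
    ],
}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a decoded record onto the reader schema.

    Missing fields take the reader's default, unknown fields are dropped,
    and a missing field without a default is an error.
    """
    resolved: Dict[str, Any] = {}
    for field_def in reader_schema["fields"]:
        name = field_def["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field_def:
            resolved[name] = field_def["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


# A record written with an older schema: no "platform", plus a field since removed.
old_record = {"video_id": "v1", "views": 10, "legacy_flag": True}
new_view = resolve(old_record, READER_SCHEMA)
```

A real deployment would rely on an Avro library and a schema registry for this; the sketch only shows why adding a field with a default does not break consumers of older events.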
Periodic Snapshots
DATABUS
SERVICE #1
Powered by ELASTIC
PERMANENT STORAGE
SERVICE #2
Powered by CASSANDRA
SERVICE #3
Powered by MYSQL
...
DATA IMPORT TOOL
Powered by S3
Data serialised with PARQUET
Powered by SPARK
Why did we choose S3?
• No need to maintain it ourselves
• Compatible with Hadoop ecosystem
• Relatively stable & cheap
Why did we choose Parquet?
• Column-oriented format (perfect for analytics and partial reads)
• Supports complex data structures
• Compatible with Hadoop ecosystem
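Why a column-oriented layout suits analytics and partial reads can be shown with a tiny example. This is not Parquet itself, only a sketch of the layout idea in plain Python; the sample rows and the `to_columns` helper are hypothetical, and all rows are assumed to share the same keys.

```python
from typing import Any, Dict, List

# Row-oriented data: each record carries every field.
rows: List[Dict[str, Any]] = [
    {"video_id": "v1", "platform": "youtube", "views": 100},
    {"video_id": "v2", "platform": "facebook", "views": 250},
    {"video_id": "v3", "platform": "youtube", "views": 50},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field touches a single column
# instead of scanning every field of every row.
total_views = sum(columns["views"])
```

In a columnar file format the same effect applies on disk: a query that needs one column can skip the bytes of the others.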
Why did we choose Spark?
• Scalable data processing engine
• Faster than Hadoop MapReduce
• Has connectors for popular storage systems: JDBC, Elastic, Cassandra, Kafka
• Has Python bindings
• Built-in Parquet support
Data Normalisation
Data Normalisation Problems
• Clean up duplicates
• Partition by year / month / date / hour
• Join various data sources
Normalisation of real time data (example)
SERVICE #1
Powered by ELASTIC
DATABUS
UI
PERMANENT STORAGE
The service joins multiple data streams by sending
partial updates to Elastic.
Note: this isn't the only way to implement a real-time
join; a more generic solution could be implemented
with Apache Samza.
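The join-by-partial-updates idea can be sketched with a toy document store: every stream writes only the fields it knows about into the document with the same id, and the merged document is the joined result. The `DocumentStore` class below is a hypothetical in-memory stand-in, not the Elastic client API, and the stream contents are invented.

```python
from typing import Any, Dict, List, Tuple


class DocumentStore:
    """Toy key-value document store with partial updates (hypothetical)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def partial_update(self, doc_id: str, fields: Dict[str, Any]) -> None:
        """Merge the given fields into the document, creating it if needed."""
        self._docs.setdefault(doc_id, {}).update(fields)

    def get(self, doc_id: str) -> Dict[str, Any]:
        """Return a copy of the document, or an empty dict if it does not exist."""
        return dict(self._docs.get(doc_id, {}))


# Two independent streams keyed by video id.
views_stream: List[Tuple[str, Dict[str, Any]]] = [
    ("v1", {"views": 100}),
    ("v2", {"views": 5}),
]
comments_stream: List[Tuple[str, Dict[str, Any]]] = [
    ("v1", {"comments": 7}),
]

store = DocumentStore()
for doc_id, fields in views_stream + comments_stream:
    store.partial_update(doc_id, fields)

joined_v1 = store.get("v1")  # fields from both streams end up in one document
```

Because each update only touches its own fields, the result in this sketch does not depend on which stream arrives first.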
Why did we choose Elastic?
• Provides real time search and analytics
• Has relatively cheap partial updates
• Easy to scale
Normalisation of previously imported data
DATA NORMALISATION TOOL
PERMANENT STORAGE
Powered by Spark
Joins various datasets
Removes duplicates
Creates partitions by time range buckets
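The deduplication and time-bucket partitioning steps can be sketched in plain Python. The real tool runs on Spark; this example only shows the logic on a handful of records. The `event_id` and `ts` field names, the partition path layout and the helper functions are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Any, Dict, List


def partition_key(ts: int) -> str:
    """Build an hourly partition path from a Unix timestamp (UTC)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/hour={dt.hour:02d}"


def normalise(events: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Drop duplicate events (same event_id) and group the rest by hour."""
    seen_ids = set()
    partitions: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # keep the first occurrence only
        seen_ids.add(event["event_id"])
        partitions[partition_key(event["ts"])].append(event)
    return dict(partitions)


events = [
    {"event_id": "e1", "ts": 1500000000, "type": "view"},     # 2017-07-14 02:40 UTC
    {"event_id": "e1", "ts": 1500000000, "type": "view"},     # duplicate delivery
    {"event_id": "e2", "ts": 1500003600, "type": "comment"},  # one hour later
]

partitions = normalise(events)
```

Partitioning by time bucket lets later jobs read only the hours or days they need instead of the whole dataset.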
Why did we choose Spark?
• Scalable data processing engine
• Has a built-in SQL API to transform data (perfect for joins and deduplication)
Data Access
Data Access Problems
• Dataset discovery
• Unified data access interface
Metadata Storage
PERMANENT STORAGE
Parquet
# 1
Avro
#1
CSV
#1
Parquet
# 2
...
METADATA STORAGE
Table # 1
Table # 2
Table # 3
...
Powered by Hive Metastore
Why did we choose Hive Metastore?
• Supported by Hadoop ecosystem
• Simple (Thrift API on top of a MySQL database)
• Supported by Hue (UI to access metadata)
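The role of the metadata storage, mapping table names to files of different formats in permanent storage, can be sketched as a small catalog. This is not the Hive Metastore API: the `TableInfo` and `Catalog` classes, the table names and the `s3://example-bucket/...` locations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TableInfo:
    """Where a table lives and how it is stored."""
    name: str
    data_format: str  # e.g. "parquet", "avro", "csv"
    location: str


class Catalog:
    """Toy metadata catalog (hypothetical stand-in for a metastore)."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableInfo] = {}

    def register(self, table: TableInfo) -> None:
        """Add or replace a table definition."""
        self._tables[table.name] = table

    def list_tables(self) -> List[str]:
        """Dataset discovery: list every known table name."""
        return sorted(self._tables)

    def describe(self, name: str) -> TableInfo:
        """Unified access: resolve a table name to its format and location."""
        return self._tables[name]


catalog = Catalog()
catalog.register(TableInfo("video_stats", "parquet", "s3://example-bucket/video_stats/"))
catalog.register(TableInfo("raw_events", "avro", "s3://example-bucket/raw_events/"))
catalog.register(TableInfo("platform_list", "csv", "s3://example-bucket/platform_list/"))

table_names = catalog.list_tables()
events_table = catalog.describe("raw_events")
```

Users ask for a table by name and the catalog tells the reader which format and location to use, so nobody needs to remember storage paths.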
Let's summarize...
System Overview
ANALYST,
ENGINEER
PERMANENT STORAGE
DATABUS
METADATA
STORAGE
IMPORT
TOOL
NORMALISATION
TOOL
WAREHOUSE
SERVICES
* Data flows for Metadata Storage were explained verbally (too many arrows to draw).
Thanks! Questions?
Check this out: https://github.com/Tubular/sparkly
