SlideShare a Scribd company logo
Introduction Our Environments Conclusions
PostgreSQL as a Big Data Platform
Chris Travers
May 10, 2019
Introduction Our Environments Conclusions
About Adjust
Adjust provides mobile
advertisement attribution and
analytics on mobile application
events. On average we track
advertisement performance for 10
applications for each smart phone
in use. We focus on fairness and
transparency.
Introduction Our Environments Conclusions
About Me
• Long time software developoer
• New Contributor to PostgreSQL
• Working with PostgreSQL since 1999
• Head of PostgreSQL team at Adjust GmbH
Introduction Our Environments Conclusions
My Team
• Research and Development
• Database environments supporting diverse products
• PB scale deployments
Introduction Our Environments Conclusions
Assumptions in Relational Model
• Optimized for mathematical manipulations via Set Theory
• Real implementations fall short (bags vs sets etc)
• Data modelled as tuples (ideally short) where a tuple element
is assumed to be atomic.
• Think accounting or order management software.
Introduction Our Environments Conclusions
Traditional business intelligence (relational)
• Same approach mathematically as OLTP
• Typically data is historical meaning that it can be inserted
periodically.
• Sales over time per country is a standard example.
• PostgreSQL struggles a bit with BI due to table structure.
Introduction Our Environments Conclusions
Industry Trends
• Wider variety of source data for analysis (Variety)
• Real-time analytics on streams of events (Velocity)
• More data than can be managed gracefully on one machine
(Volume)
Collectively these are known as ”Big Data” and might include
analysis of published research, facebook posts, internet-of-things
events, or MMO game data.
Introduction Our Environments Conclusions
What Big Data Is/Is Not
Big Data Is:
• About V3 Problems
• A set of techniques
• About Attention to Detail
Big Data Is Not
• A set of products
• A set of technologies
• About following recipes.
At Adjust we apply big data techniques to large, high velocity data
sets using vanilla Postgres and a lot of our custom software.
Introduction Our Environments Conclusions
Introducing our KPI Service Pipeline
• Environment Approaching 1PB
• Delivers near-realtime analytics on user behavior
• 100-300k requests a second
• Delivers to dashboard and external API users
• Different pieces have different availability considerations
Introduction Our Environments Conclusions
Big Data Characteristics
• High Volume and Velocity
• High availability requirements for Ingestion
• Distributed data warehouse queries
• Data has has very large clusters of values, making ordinary
sharding difficult.
Introduction Our Environments Conclusions
Engineering Approach
• Pipeline of Data
• Highly redundant initial processing nodes
• Modestly redundant customer-facing shards
• Data moves through a pipeline.
Introduction Our Environments Conclusions
Architecture
• Initial processing systems log their results
• MapReduce to customer-facing shard databases
• MapReduce again in delivering data to client
• Covered in ”PostgreSQL At 20TB and Beyond”
Introduction Our Environments Conclusions
PostgreSQL Challenges
• PostgreSQL FDW too latency sensitive to use between
datacenters.
• Multiple inheritance used for some advanced features makes
data schema changes difficult.
• Our shards’ WAL traffic is measured in the TB/day.
Introduction Our Environments Conclusions
Introducing Bagger
• Elastic Search Replacement
• High velocity ingestion (1M+ data points/sec)
• Very high volume (10PB)
• Free form data (JSON documents)
• Retention for Limited Time
Introduction Our Environments Conclusions
Big Data Characteristics
• Very high velocity (up to 1M items per second ingestion)
• Very high volume (10PB of data, capped by quantity
currently).
• Could include all kinds of new data at any time, so must
handle variety of semi-structured data quite gracefully.
Introduction Our Environments Conclusions
Engineering Approach
• Optimize for bulk storage and linear writes
• Use PostgreSQL JSONB and similar indexes
• Data partitioned by hour and dropped when disks are near full.
• Client-side sharding, so dedicated client
Introduction Our Environments Conclusions
Architecture
• Data arrives by Kafka, partitioned for the dbs
• Data partitioned by query pattern and hour
• Partitions tracked on master databases
• Client written in Perl, which queries appropriate partitions and
concatenates data
Introduction Our Environments Conclusions
PostgreSQL Challenges
• Marshalling JSONB can be expensive
• System catalogs on ZFS on spinning disk are slow
• Requires significant custom C code in triggers to keep the
system fast.
• Exception handling in PostgreSQL has been a source of bugs
in the past.
Introduction Our Environments Conclusions
Introducing Audience Builder
• Retargetting platform (describe use case)
• Only 12TB but expect to grow
• Non-typical query and access patterns
• Feature requests that could push this into PB range
Introduction Our Environments Conclusions
Big Data Characteristics
• High enough velocity that saturation of NVME storage is a
concern.
• Queries touch very large amounts of data
• Expect these issues to become far worse.
Introduction Our Environments Conclusions
Engineering Approach
• Separation of storage and query
• Settled on Parquet as storage format
• Columnar data storage useful but most software does not
support our access patterns well.
• Evaluated a few of alternatives to PostgreSQL here.
• Prioritized predictability and extensibility over peak
performance
Introduction Our Environments Conclusions
Architecture
• Parquet files on CephFS
• PostgreSQL as query engine only
• Wrote Parquet FDW for PostgreSQL
• With tuning and optimization, as fast as native files.
• Data arrives via Kafka and is written to Parquet files and
these are registered with the database.
• Pluggable storage might be of interest here.
https://github.com/zilder/parquet fdw
Introduction Our Environments Conclusions
PostgreSQL Challenges
• Hundreds of thousands of tables
• Performance requires tables to be physically sorted
• Heavy reliance on streaming APIs (COPY)
Introduction Our Environments Conclusions
Key Takeaways
• Big Data is about technique, not technology
• PostgreSQL is quite capable in this area
• Careful attention to requirements is important
• Every big data system is different.
Introduction Our Environments Conclusions
Major Open Source Software We Use
• Apache Kafka
• PostgreSQL
• Redis
• Go
• Apache Flink
• CephFS
• Apache Spark
• Gentoo Linux
Introduction Our Environments Conclusions
Thank You
Thanks for coming. Any questions?
chris.travers@adjust.com

More Related Content

What's hot

Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking VN
 
Getting started with Hadoop, Hive, Spark and Kafka
Getting started with Hadoop, Hive, Spark and KafkaGetting started with Hadoop, Hive, Spark and Kafka
Getting started with Hadoop, Hive, Spark and Kafka
Edelweiss Kammermann
 
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDBHow Discord Migrated Trillions of Messages from Cassandra to ScyllaDB
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB
ScyllaDB
 
HTTP2 and gRPC
HTTP2 and gRPCHTTP2 and gRPC
HTTP2 and gRPC
Guo Jing
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Databricks
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
Lynn Langit
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Databricks
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
chrislusf
 
Road to NODES - Blazing Fast Ingest with Apache Arrow
Road to NODES - Blazing Fast Ingest with Apache ArrowRoad to NODES - Blazing Fast Ingest with Apache Arrow
Road to NODES - Blazing Fast Ingest with Apache Arrow
Neo4j
 
The spring ecosystem in 50 min
The spring ecosystem in 50 minThe spring ecosystem in 50 min
The spring ecosystem in 50 min
Jeroen Sterken
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
DataWorks Summit/Hadoop Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
Corey Huinker
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
confluent
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeDataWorks Summit
 
ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...
ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...
ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...
XinliShang1
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
VictoriaMetrics
 

What's hot (20)

Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
Getting started with Hadoop, Hive, Spark and Kafka
Getting started with Hadoop, Hive, Spark and KafkaGetting started with Hadoop, Hive, Spark and Kafka
Getting started with Hadoop, Hive, Spark and Kafka
 
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDBHow Discord Migrated Trillions of Messages from Cassandra to ScyllaDB
How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB
 
HTTP2 and gRPC
HTTP2 and gRPCHTTP2 and gRPC
HTTP2 and gRPC
 
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
 
Google Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
Road to NODES - Blazing Fast Ingest with Apache Arrow
Road to NODES - Blazing Fast Ingest with Apache ArrowRoad to NODES - Blazing Fast Ingest with Apache Arrow
Road to NODES - Blazing Fast Ingest with Apache Arrow
 
The spring ecosystem in 50 min
The spring ecosystem in 50 minThe spring ecosystem in 50 min
The spring ecosystem in 50 min
 
HDFS Erasure Coding in Action
HDFS Erasure Coding in Action HDFS Erasure Coding in Action
HDFS Erasure Coding in Action
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Analyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-timeAnalyzing 1.2 Million Network Packets per Second in Real-time
Analyzing 1.2 Million Network Packets per Second in Real-time
 
ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...
ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...
ApacheCon 2022: From Column-Level to Cell-Level_ Towards Finer-grained Encryp...
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 

Similar to PostgreSQL as a Big Data Platform

Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Hakka Labs
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open Source
EDB
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
DataWorks Summit/Hadoop Summit
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
AWS Chicago
 
Lessons from lhc
Lessons from lhcLessons from lhc
Lessons from lhc
drsm79
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
Ashnikbiz
 
Save money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinuxSave money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinux
EDB
 
Endeca Performance Considerations
Endeca Performance ConsiderationsEndeca Performance Considerations
Endeca Performance Considerations
Cirrus10
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Lucas Jellema
 
Technical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPAS
Ashnikbiz
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Gabriele Bartolini
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
VMware Tanzu Korea
 
PostgreSQL 10: What to Look For
PostgreSQL 10: What to Look ForPostgreSQL 10: What to Look For
PostgreSQL 10: What to Look For
Amit Langote
 
Optymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroli
Optymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroliOptymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroli
Optymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroli
EDB
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
Amazon Web Services
 
Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus
Ashnikbiz
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
Tung Nguyen
 

Similar to PostgreSQL as a Big Data Platform (20)

Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big DataDataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open Source
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 
Lessons from lhc
Lessons from lhcLessons from lhc
Lessons from lhc
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
Save money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinuxSave money with Postgres on IBM PowerLinux
Save money with Postgres on IBM PowerLinux
 
Endeca Performance Considerations
Endeca Performance ConsiderationsEndeca Performance Considerations
Endeca Performance Considerations
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
Introducing NoSQL and MongoDB to complement Relational Databases (AMIS SIG 14...
 
Technical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPASTechnical Introduction to PostgreSQL and PPAS
Technical Introduction to PostgreSQL and PPAS
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
PostgreSQL 10: What to Look For
PostgreSQL 10: What to Look ForPostgreSQL 10: What to Look For
PostgreSQL 10: What to Look For
 
Optymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroli
Optymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroliOptymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroli
Optymalizacja środowiska Open Source w celu zwiększenia oszczędności i kontroli
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus Powering GIS Application with PostgreSQL and Postgres Plus
Powering GIS Application with PostgreSQL and Postgres Plus
 
An overview of modern scalable web development
An overview of modern scalable web developmentAn overview of modern scalable web development
An overview of modern scalable web development
 

Recently uploaded

Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
abdulrafaychaudhry
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Game Development with Unity3D (Game Development lecture 3)
Game Development  with Unity3D (Game Development lecture 3)Game Development  with Unity3D (Game Development lecture 3)
Game Development with Unity3D (Game Development lecture 3)
abdulrafaychaudhry
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 

Recently uploaded (20)

Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Game Development with Unity3D (Game Development lecture 3)
Game Development  with Unity3D (Game Development lecture 3)Game Development  with Unity3D (Game Development lecture 3)
Game Development with Unity3D (Game Development lecture 3)
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 

PostgreSQL as a Big Data Platform

  • 1. Introduction Our Environments Conclusions PostgreSQL as a Big Data Platform Chris Travers May 10, 2019
  • 2. Introduction Our Environments Conclusions About Adjust Adjust provides mobile advertisement attribution and analytics on mobile application events. On average we track advertisement performance for 10 applications for each smart phone in use. We focus on fairness and transparency.
  • 3. Introduction Our Environments Conclusions About Me • Long time software developoer • New Contributor to PostgreSQL • Working with PostgreSQL since 1999 • Head of PostgreSQL team at Adjust GmbH
  • 4. Introduction Our Environments Conclusions My Team • Research and Development • Database environments supporting diverse products • PB scale deployments
  • 5. Introduction Our Environments Conclusions Assumptions in Relational Model • Optimized for mathematical manipulations via Set Theory • Real implementations fall short (bags vs sets etc) • Data modelled as tuples (ideally short) where a tuple element is assumed to be atomic. • Think accounting or order management software.
  • 6. Introduction Our Environments Conclusions Traditional business intelligence (relational) • Same approach mathematically as OLTP • Typically data is historical meaning that it can be inserted periodically. • Sales over time per country is a standard example. • PostgreSQL struggles a bit with BI due to table structure.
  • 7. Introduction Our Environments Conclusions Industry Trends • Wider variety of source data for analysis (Variety) • Real-time analytics on streams of events (Velocity) • More data than can be managed gracefully on one machine (Volume) Collectively these are known as ”Big Data” and might include analysis of published research, facebook posts, internet-of-things events, or MMO game data.
  • 8. Introduction Our Environments Conclusions What Big Data Is/Is Not Big Data Is: • About V3 Problems • A set of techniques • About Attention to Detail Big Data Is Not • A set of products • A set of technologies • About following recipes. At Adjust we apply big data techniques to large, high velocity data sets using vanilla Postgres and a lot of our custom software.
  • 9. Introduction Our Environments Conclusions Introducing our KPI Service Pipeline • Environment Approaching 1PB • Delivers near-realtime analytics on user behavior • 100-300k requests a second • Delivers to dashboard and external API users • Different pieces have different availability considerations
  • 10. Introduction Our Environments Conclusions Big Data Characteristics • High Volume and Velocity • High availability requirements for Ingestion • Distributed data warehouse queries • Data has has very large clusters of values, making ordinary sharding difficult.
  • 11. Introduction Our Environments Conclusions Engineering Approach • Pipeline of Data • Highly redundant initial processing nodes • Modestly redundant customer-facing shards • Data moves through a pipeline.
  • 12. Introduction Our Environments Conclusions Architecture • Initial processing systems log their results • MapReduce to customer-facing shard databases • MapReduce again in delivering data to client • Covered in ”PostgreSQL At 20TB and Beyond”
  • 13. Introduction Our Environments Conclusions PostgreSQL Challenges • PostgreSQL FDW too latency sensitive to use between datacenters. • Multiple inheritance used for some advanced features makes data schema changes difficult. • Our shards’ WAL traffic is measured in the TB/day.
  • 14. Introduction Our Environments Conclusions Introducing Bagger • Elastic Search Replacement • High velocity ingestion (1M+ data points/sec) • Very high volume (10PB) • Free form data (JSON documents) • Retention for Limited Time
  • 15. Introduction Our Environments Conclusions Big Data Characteristics • Very high velocity (up to 1M items per second ingestion) • Very high volume (10PB of data, capped by quantity currently). • Could include all kinds of new data at any time, so must handle variety of semi-structured data quite gracefully.
  • 16. Introduction Our Environments Conclusions Engineering Approach • Optimize for bulk storage and linear writes • Use PostgreSQL JSONB and similar indexes • Data partitioned by hour and dropped when disks are near full. • Client-side sharding, so dedicated client
  • 17. Introduction Our Environments Conclusions Architecture • Data arrives by Kafka, partitioned for the dbs • Data partitioned by query pattern and hour • Partitions tracked on master databases • Client written in Perl, which queries appropriate partitions and concatenates data
  • 18. Introduction Our Environments Conclusions PostgreSQL Challenges • Marshalling JSONB can be expensive • System catalogs on ZFS on spinning disk are slow • Requires significant custom C code in triggers to keep the system fast. • Exception handling in PostgreSQL has been a source of bugs in the past.
  • 19. Introduction Our Environments Conclusions Introducing Audience Builder • Retargetting platform (describe use case) • Only 12TB but expect to grow • Non-typical query and access patterns • Feature requests that could push this into PB range
  • 20. Introduction Our Environments Conclusions Big Data Characteristics • High enough velocity that saturation of NVME storage is a concern. • Queries touch very large amounts of data • Expect these issues to become far worse.
  • 21. Introduction Our Environments Conclusions Engineering Approach • Separation of storage and query • Settled on Parquet as storage format • Columnar data storage useful but most software does not support our access patterns well. • Evaluated a few of alternatives to PostgreSQL here. • Prioritized predictability and extensibility over peak performance
  • 22. Introduction Our Environments Conclusions Architecture • Parquet files on CephFS • PostgreSQL as query engine only • Wrote Parquet FDW for PostgreSQL • With tuning and optimization, as fast as native files. • Data arrives via Kafka and is written to Parquet files and these are registered with the database. • Pluggable storage might be of interest here. https://github.com/zilder/parquet fdw
  • 23. Introduction Our Environments Conclusions PostgreSQL Challenges • Hundreds of thousands of tables • Performance requires tables to be physically sorted • Heavy reliance on streaming APIs (COPY)
  • 24. Introduction Our Environments Conclusions Key Takeaways • Big Data is about technique, not technology • PostgreSQL is quite capable in this area • Careful attention to requirements is important • Every big data system is different.
  • 25. Introduction Our Environments Conclusions Major Open Source Software We Use • Apache Kafka • PostgreSQL • Redis • Go • Apache Flink • CephFS • Apache Spark • Gentoo Linux
  • 26. Introduction Our Environments Conclusions Thank You Thanks for coming. Any questions? chris.travers@adjust.com