SlideShare a Scribd company logo
Things might go wrong in a
data-intensive application
Petertc Chu | PyConline AU 2021
Scope
Applications deal with huge volumes of data
- Web applications, mobile apps, IoT...
Challenges
- “the quantity of data, the complexity of data, the speed
at which it is changing”
Key factors
- Scalability, Reliability
(dataintensive.net)
About me
Research engineer and Pythonista from Taiwan
Working on data infrastructures for ten years
kiwislife.com
The case
Host and manage UGC (User-generated content) with various usage patterns
- Streaming, IoT data aggregation, file distribution, archiving...
- ~10PiB raw capacity
- Processing several TiBs per day
We can cover a football field if we put all our disks on the ground
Structured data store
Sharding / partitioning,
RDMBS clusters,
NoSQL...
Concepts
Cache layer
Unstructured data store
Various kind of DFSs,
heterogeneous storage
media
Application
servers
Job processing
systems,
Other
subsystems
Various usage patterns
Incident #1
What happened?
Thousands of IoT devices push data to
our cluster 24-7-365, got
- error rate: ~30%
- Avg RTT: 39.005s
The build up
DB race condition
- Optimistic locking doesn’t help in this pattern (W >> R)
databases
IoT
devices
application
servers
contention
occurred! 😱
😡
The build up
Pessimistic locking is too expensive for other usage patterns
databases
IoT
devices
application
servers
Implement global
locking
🚘🚘
🚘
🚘🚘
🚘
🚘
🚘
🚘
other users
😡
😡
😡
👍
The build up
Final: a hybrid / adaptive approach
- Only do pessimistic locking for specific operations
- Do locking in local by default
- Switch to global locking for specific resource automatically while collision detected
- (switch back after a certain duration)
- Keep using optimistic locking otherwise
The build up
Final: a hybrid / adaptive approach
databases
IoT
devices
application
servers
local lock
local lock
local lock
(Global lock)
other users
👍
👍
👍
👍
Root cause #scalability
We don’t design for a usage pattern and workload like that
Action taken
- Test concurrency scenarios before each release
- Introduce observability and proactive monitoring systems for quick incident
detection and diagnosis
Incident #2
What
happened?
We have an advanced data management feature
- Not production ready, just a prototype
- No one use it for several years
One day, a user discovered it and made a million
times more requests to this subsystem!!
The build up
We needed some kind of distributed solution to handle this.
- resque: a Redis-backed framework for creating background jobs
https://github.blog/2009-11-03-introducing-resque/ https://gist.github.com/defunkt/225369
Root cause #scalability
Load exceeds expectations
Action taken
- All batch processing subsystems are now implemented in a distributed way
Incident #3
What
happened?
A supplier built a data protection subsystem for us
...after we deployed it...
Users complain data corruption!!
The build up
Defective padding in the encryption process
Example 1:
Input data: “DD” * 12
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD 04 04 04 04 |
Example 2:
Input data: “DD” * 16
Expected result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
| 16 16 16 16 16 16 16 16 | 16 16 16 16 16 16 16 16 |
Incorrect result:
| DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD |
(If the length of the original data is an integer multiple of the block size B,
then an extra block of bytes with value B is added. B is 16 in this case.)
The build up
Design a process to fix all affected data
- List all affected records from DBs
- Read corresponding data with an “incorrect” decryption algorithm
- Write data back with a correct encryption algorithm
Id Size Encryption method Version number Data reference key
1 32 (Not encrypted) 0 aaa
2 6 Indefective algorithm 0 bbb
3 5 (not affected) Defective algorithm 0 ccc
4 32 (affected) Defective algorithm 1 (fixed) ddd
5 64 (affected) Defective algorithm 0 (not yet fixed) eee
Only the last one needs a fix (block size = 16)
The build up
Just a silly bug, if it didn’t affect…
- Millions of user records
We set up a job processing system to correct all affected data in our system
gearman [Gearman Job Server] https://github.com/Yelp/python-gearman
Root cause #reliability #softwareFaults
1. Unreliable solution provider
2. Less than 1% possibility to find the bug by testing
Action taken
- Not outsourcing anymore
- More comprehensive tests with various kinds of scenarios
- ~10 TiB test dataset
Incident #4
What
happened?
To keep reliability, we
- Replicate user data multiple times
- Distribute replicas to different failure domains
(different host/data center)
Data still lost!!
http://dx.doi.org/10.6861/tanet.201810.0398
The build up
Our system balances loading by writing data into nodes that have more resource
- A newly added node has more resource in general
- Result in data tend to be placed on new nodes
Data are written to unreliable newly added nodes and lost even though they are
distributed in different failure domains.
Topic: Electronic/Electrical Reliability (cmu.edu)
Root cause #reliability #hardwareFaults
It’s hard to prevent data loss completely
- Modeling or simulation cannot truly reflect situations in
real world
Action taken
- Do more stability tests on new coming nodes
- Add a batch of new nodes each time, so it has less
opportunity to write data into an unreliable node
http://dx.doi.org/10.6861/tanet.201810.0398
What do we learn
from these
incidents?🤔
#1 “There is unfortunately no easy fix for
making applications reliable, scalable”
- No way to enumerate all possible reliability causes (hardware faults,
software faults, human errors)
- Usage pattern and load keep changing while your business
expanded, cannot have an ultimate scalability design beforehand
#2 Before trying to build a faultless
architecture, think twice
- Consider maintainability
- We need a team to sustain a large-scale system, not just a talented engineer
(dataintensive.net)
#3 Service = human beings + machines
Thank you! 🙏🙏🙏
@petertc_chu

More Related Content

What's hot

Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
Anant Corporation
 
Software cracking and patching
Software cracking and patchingSoftware cracking and patching
Software cracking and patching
Mayank Gavri
 
Distributed Tracing with OpenTracing, ZipKin and Kubernetes
Distributed Tracing with OpenTracing, ZipKin and KubernetesDistributed Tracing with OpenTracing, ZipKin and Kubernetes
Distributed Tracing with OpenTracing, ZipKin and Kubernetes
Container Solutions
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
University of California, San Diego
 
EKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern FragmentsEKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern Fragments
Ruben Taelman
 
Self driving computers active learning workflows with human interpretable ve...
Self driving computers  active learning workflows with human interpretable ve...Self driving computers  active learning workflows with human interpretable ve...
Self driving computers active learning workflows with human interpretable ve...
Adam Gibson
 
Advanced deeplearning4j features
Advanced deeplearning4j featuresAdvanced deeplearning4j features
Advanced deeplearning4j features
Adam Gibson
 
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDBHow a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
InfluxData
 
Boolan machine learning summit
Boolan machine learning summitBoolan machine learning summit
Boolan machine learning summit
Adam Gibson
 
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
InfluxData
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
Ndjido Ardo BAR
 
Multidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderMultidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with Order
Ruben Taelman
 
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
MLconf
 
IoT Data Connector Fluent Bit
IoT Data Connector Fluent BitIoT Data Connector Fluent Bit
IoT Data Connector Fluent Bit
Toru Takahashi
 
Open Tracing, to order and understand your mess. - ApiConf 2017
Open Tracing, to order and understand your mess. - ApiConf 2017Open Tracing, to order and understand your mess. - ApiConf 2017
Open Tracing, to order and understand your mess. - ApiConf 2017
Gianluca Arbezzano
 
Performance monitoring and call tracing in microservice environments
Performance monitoring and call tracing in microservice environmentsPerformance monitoring and call tracing in microservice environments
Performance monitoring and call tracing in microservice environments
Martin Gutenbrunner
 
Brief introduction to Distributed Deep Learning
Brief introduction to Distributed Deep LearningBrief introduction to Distributed Deep Learning
Brief introduction to Distributed Deep Learning
Adam Gibson
 
Fluent-bit
Fluent-bitFluent-bit
Fluent-bit
eventdotsjp
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talk
Danny Yuan
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
MLconf
 

What's hot (20)

Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
 
Software cracking and patching
Software cracking and patchingSoftware cracking and patching
Software cracking and patching
 
Distributed Tracing with OpenTracing, ZipKin and Kubernetes
Distributed Tracing with OpenTracing, ZipKin and KubernetesDistributed Tracing with OpenTracing, ZipKin and Kubernetes
Distributed Tracing with OpenTracing, ZipKin and Kubernetes
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
 
EKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern FragmentsEKAW - Publishing with Triple Pattern Fragments
EKAW - Publishing with Triple Pattern Fragments
 
Self driving computers active learning workflows with human interpretable ve...
Self driving computers  active learning workflows with human interpretable ve...Self driving computers  active learning workflows with human interpretable ve...
Self driving computers active learning workflows with human interpretable ve...
 
Advanced deeplearning4j features
Advanced deeplearning4j featuresAdvanced deeplearning4j features
Advanced deeplearning4j features
 
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDBHow a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
How a Particle Accelerator Monitors Scientific Experiments Using InfluxDB
 
Boolan machine learning summit
Boolan machine learning summitBoolan machine learning summit
Boolan machine learning summit
 
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
Frossie Economou & Angelo Fausti [Vera C. Rubin Observatory] | How InfluxDB H...
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
Multidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with OrderMultidimensional Interfaces for Selecting Data with Order
Multidimensional Interfaces for Selecting Data with Order
 
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consult...
 
IoT Data Connector Fluent Bit
IoT Data Connector Fluent BitIoT Data Connector Fluent Bit
IoT Data Connector Fluent Bit
 
Open Tracing, to order and understand your mess. - ApiConf 2017
Open Tracing, to order and understand your mess. - ApiConf 2017Open Tracing, to order and understand your mess. - ApiConf 2017
Open Tracing, to order and understand your mess. - ApiConf 2017
 
Performance monitoring and call tracing in microservice environments
Performance monitoring and call tracing in microservice environmentsPerformance monitoring and call tracing in microservice environments
Performance monitoring and call tracing in microservice environments
 
Brief introduction to Distributed Deep Learning
Brief introduction to Distributed Deep LearningBrief introduction to Distributed Deep Learning
Brief introduction to Distributed Deep Learning
 
Fluent-bit
Fluent-bitFluent-bit
Fluent-bit
 
netflix-real-time-data-strata-talk
netflix-real-time-data-strata-talknetflix-real-time-data-strata-talk
netflix-real-time-data-strata-talk
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 

Similar to PyConline AU 2021 - Things might go wrong in a data-intensive application

Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?
Swiss Data Forum Swiss Data Forum
 
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
exponential-inc
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DATAVERSITY
 
The Growth Of Data Centers
The Growth Of Data CentersThe Growth Of Data Centers
The Growth Of Data Centers
Gina Buck
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
Denodo
 
Production debugging web applications
Production debugging web applicationsProduction debugging web applications
Production debugging web applications
Ido Flatow
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Guglielmo Iozzia
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
HostedbyConfluent
 
Data Virtualization: revolutionizing database cloning
Data Virtualization: revolutionizing database cloningData Virtualization: revolutionizing database cloning
Data Virtualization: revolutionizing database cloningKyle Hailey
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Redis Labs
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Dayprogrammermag
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Demi Ben-Ari
 
Nyoug delphix slideshare
Nyoug delphix slideshareNyoug delphix slideshare
Nyoug delphix slideshareKyle Hailey
 
BIG DATA
BIG DATABIG DATA
BIG DATA
Shashank Shetty
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
The Incredible Automation Day
 
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and OptimizeISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
Christoph Adler
 
Production Debugging War Stories
Production Debugging War StoriesProduction Debugging War Stories
Production Debugging War Stories
Ido Flatow
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
Denodo
 
RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011
Gerardo Pardo-Castellote
 
Roberto minerva 20181130
Roberto minerva 20181130  Roberto minerva 20181130
Roberto minerva 20181130
Roberto Minerva
 

Similar to PyConline AU 2021 - Things might go wrong in a data-intensive application (20)

Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?Big Data and Fast Data combined – is it possible?
Big Data and Fast Data combined – is it possible?
 
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
Keynote Address at 2013 CloudCon: Future of Big Data by Richard McDougall (In...
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
 
The Growth Of Data Centers
The Growth Of Data CentersThe Growth Of Data Centers
The Growth Of Data Centers
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 
Production debugging web applications
Production debugging web applicationsProduction debugging web applications
Production debugging web applications
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Data Virtualization: revolutionizing database cloning
Data Virtualization: revolutionizing database cloningData Virtualization: revolutionizing database cloning
Data Virtualization: revolutionizing database cloning
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Nyoug delphix slideshare
Nyoug delphix slideshareNyoug delphix slideshare
Nyoug delphix slideshare
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and OptimizeISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
ISBG 2015 - Infrastructure Assessment - Analyze, Visualize and Optimize
 
Production Debugging War Stories
Production Debugging War StoriesProduction Debugging War Stories
Production Debugging War Stories
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011RTI Data-Distribution Service (DDS) Master Class 2011
RTI Data-Distribution Service (DDS) Master Class 2011
 
Roberto minerva 20181130
Roberto minerva 20181130  Roberto minerva 20181130
Roberto minerva 20181130
 

Recently uploaded

Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
Hornet Dynamics
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 

Recently uploaded (20)

Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
E-commerce Application Development Company.pdf
E-commerce Application Development Company.pdfE-commerce Application Development Company.pdf
E-commerce Application Development Company.pdf
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 

PyConline AU 2021 - Things might go wrong in a data-intensive application

  • 1. Things might go wrong in a data-intensive application Petertc Chu | PyConline AU 2021
  • 2. Scope Applications deal with huge volumes of data - Web applications, mobile apps, IoT... Challenges - “the quantity of data, the complexity of data, the speed at which it is changing” Key factors - Scalability, Reliability (dataintensive.net)
  • 3. About me Research engineer and Pythonista from Taiwan Working on data infrastructures for ten years kiwislife.com
  • 4. The case Host and manage UGC (User-generated content) with various usage patterns - Streaming, IoT data aggregation, file distribution, archiving... - ~10PiB raw capacity - Processing several TiBs per day We can cover a football field if we put all our disks on the ground
  • 5. Structured data store Sharding / partitioning, RDMBS clusters, NoSQL... Concepts Cache layer Unstructured data store Various kind of DFSs, heterogeneous storage media Application servers Job processing systems, Other subsystems Various usage patterns
  • 7. What happened? Thousands of IoT devices push data to our cluster 24-7-365, got - error rate: ~30% - Avg RTT: 39.005s
  • 8. The build up DB race condition - Optimistic locking doesn’t help in this pattern (W >> R) databases IoT devices application servers contention occurred! 😱 😡
  • 9. The build up Pessimistic locking is too expensive for other usage patterns databases IoT devices application servers Implement global locking 🚘🚘 🚘 🚘🚘 🚘 🚘 🚘 🚘 other users 😡 😡 😡 👍
  • 10. The build up Final: a hybrid / adaptive approach - Only do pessimistic locking for specific operations - Do locking in local by default - Switch to global locking for specific resource automatically while collision detected - (switch back after a certain duration) - Keep using optimistic locking otherwise
  • 11. The build up Final: a hybrid / adaptive approach databases IoT devices application servers local lock local lock local lock (Global lock) other users 👍 👍 👍 👍
  • 12. Root cause #scalability We don’t design for a usage pattern and workload like that Action taken - Test concurrency scenarios before each release - Introduce observability and proactive monitoring systems for quick incident detection and diagnosis
  • 14. What happened? We have an advanced data management feature - Not production ready, just a prototype - No one use it for several years One day, a user discovered it and made a million times more requests to this subsystem!!
  • 15. The build up We needed some kind of distributed solution to handle this. - resque: a Redis-backed framework for creating background jobs https://github.blog/2009-11-03-introducing-resque/ https://gist.github.com/defunkt/225369
  • 16. Root cause #scalability Load exceeds expectations Action taken - All batch processing subsystems are now implemented in a distributed way
  • 18. What happened? A supplier built a data protection subsystem for us ...after we deployed it... Users complain data corruption!!
  • 19. The build up Defective padding in the encryption process Example 1: Input data: “DD” * 12 Expected result: | DD DD DD DD DD DD DD DD | DD DD DD DD 04 04 04 04 | Example 2: Input data: “DD” * 16 Expected result: | DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD | | 16 16 16 16 16 16 16 16 | 16 16 16 16 16 16 16 16 | Incorrect result: | DD DD DD DD DD DD DD DD | DD DD DD DD DD DD DD DD | (If the length of the original data is an integer multiple of the block size B, then an extra block of bytes with value B is added. B is 16 in this case.)
  • 20. The build up Design a process to fix all affected data - List all affected records from DBs - Read corresponding data with an “incorrect” decryption algorithm - Write data back with a correct encryption algorithm Id Size Encryption method Version number Data reference key 1 32 (Not encrypted) 0 aaa 2 6 Indefective algorithm 0 bbb 3 5 (not affected) Defective algorithm 0 ccc 4 32 (affected) Defective algorithm 1 (fixed) ddd 5 64 (affected) Defective algorithm 0 (not yet fixed) eee Only the last one needs a fix (block size = 16)
  • 21. The build up Just a silly bug, if it didn’t affect… - Millions of user records We set up a job processing system to correct all affected data in our system gearman [Gearman Job Server] https://github.com/Yelp/python-gearman
  • 22. Root cause #reliability #softwareFaults 1. Unreliable solution provider 2. Less than 1% possibility to find the bug by testing Action taken - Not outsourcing anymore - More comprehensive tests with various kinds of scenarios - ~10 TiB test dataset
  • 24. What happened? To keep reliability, we - Replicate user data multiple times - Distribute replicas to different failure domains (different host/data center) Data still lost!! http://dx.doi.org/10.6861/tanet.201810.0398
  • 25. The build up Our system balances loading by writing data into nodes that have more resource - A newly added node has more resource in general - Result in data tend to be placed on new nodes Data are written to unreliable newly added nodes and lost even though they are distributed in different failure domains. Topic: Electronic/Electrical Reliability (cmu.edu)
  • 26. Root cause #reliability #hardwareFaults It’s hard to prevent data loss completely - Modeling or simulation cannot truly reflect situations in real world Action taken - Do more stability tests on new coming nodes - Add a batch of new nodes each time, so it has less opportunity to write data into an unreliable node http://dx.doi.org/10.6861/tanet.201810.0398
  • 27. What do we learn from these incidents?🤔
  • 28. #1 “There is unfortunately no easy fix for making applications reliable, scalable” - No way to enumerate all possible reliability causes (hardware faults, software faults, human errors) - Usage pattern and load keep changing while your business expanded, cannot have an ultimate scalability design beforehand
  • 29. #2 Before trying to build a faultless architecture, think twice - Consider maintainability - We need a team to sustain a large-scale system, not just a talented engineer (dataintensive.net)
  • 30. #3 Service = human beings + machines