Lessons Learned
Designing Data Ingest Systems
Abraham Elmahrek (abe@apache.org)
Agenda
1. Overview of Big Data Ingest
2. Real world examples with lessons interleaved
3. A summary of lessons learned and extra ideas
Big Data Ingest
Data sources: Ingesting from different data sources is the goal.
Schema: Data sources have different structures, but it is mostly the schemas that vary.
Speed: Batch and real-time ingest both have their places.
Data sources
Structured: Relational databases, spreadsheets, object databases
Semi-structured: XML, JSON, EDI, etc.
Eventually structured: Audio, video, email, etc.
Schema
Number of schemas: One schema with a relatively flat structure or many schemas with nested structures.
Mutability: Immutable schemas can’t be changed. Mutable schemas can evolve. Nested schemas can also have mutability properties.
Inference: Schema inference upon writing, reading, or offline.
Real Time vs Batch
Batch: Transfer data from A -> B on demand.
Push model: Push data from A -> B consistently. Poll on data sources or act upon reception.
Pull model: Clients pull data from A to write to B. Often an intermediate storage system like Kafka is used to achieve this.
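The difference between the two models can be sketched in a few lines. This is only an illustration: `Buffer` is a hypothetical in-memory stand-in for an intermediate store such as Kafka, and `push` and `pull` are made-up helper names.

```python
from collections import deque

class Buffer:
    """Hypothetical in-memory stand-in for an intermediate storage system."""

    def __init__(self):
        self._items = deque()

    def append(self, item):
        self._items.append(item)

    def poll(self, max_items):
        # Hand out at most max_items records, oldest first.
        batch = []
        while self._items and len(batch) < max_items:
            batch.append(self._items.popleft())
        return batch

def push(source, sink):
    # Push model: the source side drives delivery, record by record.
    for record in source:
        sink.append(record)

def pull(buffer, sink, batch_size):
    # Pull model: the client asks for data at its own pace.
    while True:
        batch = buffer.poll(batch_size)
        if not batch:
            break
        sink.extend(batch)

buffer = Buffer()
push(["a", "b", "c"], buffer)
destination = []
pull(buffer, destination, batch_size=2)
```

In the pull case the consumer chooses the batch size, which is one reason an intermediate store can protect a slow destination.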
Real world scenario: Form generator
• GOAL: Generate different forms for websites
• Store user information
• Forms cannot change over time
Lesson #1: Structure endpoint wisely
Layout with one table per form:
Form Definition: id, form name, form metadata
Form 1: id, <field 1>, <field 2>, <field 3>
Form 2: id, <field 1>, <field 2>, <field 3>
Layout with generic field tables:
Form Definition: id, form name
Field Definition: id, form id, field name, type
Field Values: id, field id, value
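A minimal sketch of the layout built from Form Definition, Field Definition, and Field Values, using Python's built-in sqlite3 module. The table and column names follow the slide; the form name, field names, and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE form_definition (id INTEGER PRIMARY KEY, form_name TEXT);
CREATE TABLE field_definition (
    id INTEGER PRIMARY KEY,
    form_id INTEGER REFERENCES form_definition(id),
    field_name TEXT,
    type TEXT
);
CREATE TABLE field_values (
    id INTEGER PRIMARY KEY,
    field_id INTEGER REFERENCES field_definition(id),
    value TEXT
);
""")

# Hypothetical form with two fields; adding a new form needs rows, not tables.
conn.execute("INSERT INTO form_definition (id, form_name) VALUES (1, 'signup')")
conn.executemany(
    "INSERT INTO field_definition (id, form_id, field_name, type) VALUES (?, ?, ?, ?)",
    [(1, 1, "email", "string"), (2, 1, "age", "int")],
)
conn.executemany(
    "INSERT INTO field_values (field_id, value) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "30")],
)

# Reassemble the submitted values for form 1 by joining back to the field names.
rows = conn.execute("""
    SELECT fd.field_name, fv.value
    FROM field_values fv
    JOIN field_definition fd ON fd.id = fv.field_id
    WHERE fd.form_id = 1
    ORDER BY fd.id
""").fetchall()
```

Every value is stored as text here, so the `type` column is what a reader would use to cast values back.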
Real world scenario: Scrape GitHub
• GOAL: Generate a list of active contributors on a repository and general stats about a repository relative to all other repositories.
• Scheduled batch Change Data Capture (CDC).
My implementation (naive)
Lesson #2: At least once is acceptable
• Ingesting data twice doesn’t matter in a lot of cases.
• The cost of re-processing or re-ingesting a few records is normally pretty low.
• It’s easy to manage and implement.
• Exactly once semantics, in contrast, is not feasible
– Usually requires some de-duping
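One common way to make redelivery harmless is to write idempotently, keyed on a record identifier. The sketch below assumes each record carries a unique `id` field and uses a plain dict as a stand-in for the destination store; both are assumptions for illustration.

```python
def ingest(batch, store):
    """Upsert each record under its id, so replaying a batch changes nothing."""
    for record in batch:
        store[record["id"]] = record

store = {}
batch = [{"id": "a", "clicks": 1}, {"id": "b", "clicks": 2}]

ingest(batch, store)
ingest(batch, store)  # simulated redelivery after a retry
```

After the second call the store still holds two records, which is why a duplicate delivery costs little in this setup.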
A better implementation
My favorite implementation
Lesson #3: CDC is hard
• Change Data Capture (CDC) without a change log or an easy way to calculate differences is hard.
• Almost always requires some customized effort.
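When no change log exists, one customized approach is to compare full snapshots. This is a sketch of that idea only, not the implementation from the talk; it assumes each snapshot fits in memory as a mapping from primary key to row, and the names `fingerprint` and `diff_snapshots` are hypothetical.

```python
import hashlib
import json

def fingerprint(row):
    """Stable hash of a row; sort_keys keeps the hash independent of key order."""
    encoded = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def diff_snapshots(previous, current):
    """Return the keys that were inserted, updated, or deleted between snapshots."""
    prev_fp = {key: fingerprint(row) for key, row in previous.items()}
    curr_fp = {key: fingerprint(row) for key, row in current.items()}
    return {
        "inserted": sorted(k for k in curr_fp if k not in prev_fp),
        "updated": sorted(
            k for k in curr_fp if k in prev_fp and curr_fp[k] != prev_fp[k]
        ),
        "deleted": sorted(k for k in prev_fp if k not in curr_fp),
    }

# Hypothetical contributor stats from two scheduled runs.
yesterday = {"alice": {"commits": 10}, "bob": {"commits": 3}}
today = {"alice": {"commits": 12}, "carol": {"commits": 1}}
changes = diff_snapshots(yesterday, today)
```

The cost is a full read of the source on every run, which is part of what makes CDC without a change log hard.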
Real world scenario: Ad attribution system
• GOAL: Gather impression and click information. Attribute to different vendors based on impressions and clicks.
• Expose a view for customers to understand their usage.
• Near-real-time (NRT) ingest with batch error checking.
Lesson #4: Know thy SLA
• What is the incidence of errors?
• How frequently should errors be checked?
• Is data loss acceptable?
• Is duplication acceptable?
Push version
[Diagram components: Click Logs, Impression Logs, VIP, three Scribe Masters, HBase, MySQL]
Push version analysis
• Negatives
– Scribe would lose data in some edge cases. That’s not good for attribution systems (money is involved).
– The volume of messages written to HBase caused major compactions on a weekly basis, halting the pipeline.
• Positives
– Latency was very low.
– Relatively easy to maintain, given the Scribe configuration.
* Flume would have been a better choice! It has better reliability guarantees!
Pull version
[Diagram components: Click Logs, Impression Logs, VIP, two Producers, RabbitMQ, two Consumers, HBase, MySQL]
Pull version analysis
• Negatives
– Requires more management and configuration.
• Positives
– You can choose between at-most-once semantics (possible data loss) and at-least-once semantics (possible duplicates).
– Intermediate storage relieves HBase.
* Kafka would have been a cool choice! It has better data retention and scalability!
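The choice between the two delivery semantics often comes down to when the consumer acknowledges a message. The sketch below uses a plain list as the queue and a hypothetical `FlakySink` that fails on its first write; it does not use the RabbitMQ or Kafka client APIs.

```python
class TransientError(Exception):
    """Hypothetical error type for a write that may succeed when retried."""

class FlakySink:
    """Hypothetical destination whose first write fails."""

    def __init__(self):
        self.rows = []
        self.failed_once = False

    def write(self, message):
        if not self.failed_once:
            self.failed_once = True
            raise TransientError("first write fails")
        self.rows.append(message)

def consume_at_most_once(queue, sink):
    while queue:
        message = queue.pop(0)  # acknowledge (remove) before processing
        try:
            sink.write(message)
        except TransientError:
            pass  # the message is already gone: data loss

def consume_at_least_once(queue, sink):
    while queue:
        message = queue[0]  # peek without acknowledging
        try:
            sink.write(message)
        except TransientError:
            continue  # leave it on the queue and retry
        queue.pop(0)  # acknowledge only after a successful write

lossy = FlakySink()
consume_at_most_once(["m1", "m2"], lossy)

safe = FlakySink()
consume_at_least_once(["m1", "m2"], safe)
```

With acknowledge-after-write, a crash between the write and the acknowledgement would deliver the message again, which is where duplicates come from.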
Lesson #5: Record format and schema
1. Structureless (or simple structure) and schemaless
a. Log file (e.g. uuid|val1|val2|val3|...)
2. Structured without explicit schema
a. JSON (e.g. {"key1": "val1", ...})
3. Structured with explicit schema
a. Avro (e.g. {"key1": "val1", ...}, but with schema)
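The three categories can be compared side by side. This sketch uses only the standard library; the `SCHEMA` dict and `validate` function are a hypothetical stand-in for an explicit schema and are not the Avro API.

```python
import json

# 1. Delimited log line: field meaning lives only in the reader's code.
line = "u-1|10|20"
uuid, val1, val2 = line.split("|")  # every value arrives as a string

# 2. JSON: keys are self-describing, but nothing declares the expected types.
doc = json.loads('{"uuid": "u-1", "val1": 10, "val2": 20}')

# 3. Explicit schema: field names and types are declared up front.
SCHEMA = {"uuid": str, "val1": int, "val2": int}

def validate(record, schema):
    """Reject records whose fields or types differ from the declared schema."""
    if set(record) != set(schema):
        raise ValueError("unexpected or missing fields")
    for name, expected in schema.items():
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} should be {expected.__name__}")
    return record

checked = validate(doc, SCHEMA)
```

Passing the split log-line values through `validate` would fail on `val1`, because the delimited format carries no type information.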
Issues with structure
• Verbosity is directly related to human readability.
• Verbosity impacts the performance of systems.
• Verbose and readable formats: XML, YAML, JSON, etc.
• Less verbose and less readable formats: MessagePack, Protobuf, Avro binary, Parquet, etc.
• Sufficient tooling can make human readability less necessary.
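The size cost of verbosity is easy to see by encoding one record two ways. The record and its field names are invented for the example, and the fixed binary layout built with the standard `struct` module stands in for a compact format; it is not MessagePack, Protobuf, or Avro.

```python
import json
import struct

record = {"user_id": 12345, "clicks": 7, "ctr": 0.25}

# Readable encoding: field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Compact encoding: "<IHd" is a little-endian 4-byte unsigned int,
# a 2-byte unsigned short, and an 8-byte double. Field names and order
# must be known to the reader in advance.
as_binary = struct.pack("<IHd", record["user_id"], record["clicks"], record["ctr"])

decoded = struct.unpack("<IHd", as_binary)
```

The binary form is 14 bytes and unreadable without the layout string, while the JSON form is several times larger but can be inspected with any text tool.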
Issues with schema
• Flexibility and structure are inversely related.
• A flexible schema
– Doesn’t require an upfront definition
– Allows you to make and validate assumptions about the data.
– Easy to extend, but difficult to track changes
– May have nested structures
■ e.g. uuid|val1|val2|{"field1": "value1", ...}|...
• A structured schema
– Easier for everyone (human and computer) to understand
– Saves time when serializing/deserializing
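A record that mixes a fixed prefix with a nested, flexible payload can be parsed as sketched below. The sketch assumes three fixed fields followed by a JSON object as the last field, which is a simplification of the slide's example; `parse_line` is a hypothetical name.

```python
import json

def parse_line(line):
    """Split off three fixed fields, then parse the rest as a JSON object."""
    # maxsplit=3 keeps any "|" inside the JSON payload intact.
    uuid, val1, val2, payload = line.split("|", 3)
    return {"uuid": uuid, "val1": val1, "val2": val2, "extra": json.loads(payload)}

parsed = parse_line('u-1|10|20|{"field1": "value1", "note": "a|b"}')
```

New keys can appear in `extra` without any change to the parser, which shows both the ease of extension and the difficulty of tracking what changed.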
Lesson #6: Record lineage
• Where is the data coming from?
• How has it changed as it enters the system?
• Snapshots?
• Who touched the data?
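One way to answer these questions is to carry a lineage trail alongside each record. This is a minimal sketch under the assumption that records are wrapped in an envelope; the envelope layout, the `with_lineage` helper, and the source and step names are all hypothetical.

```python
from datetime import datetime, timezone

def with_lineage(payload, source, step, previous=None):
    """Wrap a payload and append one lineage entry describing this step."""
    history = list(previous["lineage"]) if previous else []
    history.append(
        {
            "source": source,  # where the data came from
            "step": step,      # which stage touched it
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return {"payload": payload, "lineage": history}

raw = with_lineage({"click": 1}, source="click-logs", step="ingest")
enriched = with_lineage(
    {"click": 1, "vendor": "v1"},
    source="click-logs",
    step="enrich",
    previous=raw,
)
steps = [entry["step"] for entry in enriched["lineage"]]
```

Copying the history instead of mutating it keeps the earlier envelope usable as a snapshot of the record before the change.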
Summary of lessons
1. Structure endpoints wisely
2. At least once semantics is easy and acceptable
3. CDC is hard
4. Know thy SLA
5. Record format and schema should be thought through
6. Record lineage (provenance)
Extra ideas
1. Keep track of erroneous records
a. Anomalies lead to more knowledge about data source
b. Improves debugging
2. Keep transformations to a minimum
a. Schema inference makes sense
b. Massive computations can slow down the ingest process and cause back pressure in the pipeline
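Keeping track of erroneous records can be as simple as routing them to a side channel instead of dropping them. In this sketch the input is assumed to be one JSON object per line, and the `ingest` function and the rejected-record layout are hypothetical.

```python
import json

def ingest(lines):
    """Parse each line; keep failures with enough context to debug them later."""
    good, rejected = [], []
    for number, line in enumerate(lines, start=1):
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError as exc:
            rejected.append({"line": number, "raw": line, "error": str(exc)})
    return good, rejected

good, rejected = ingest(['{"a": 1}', "not json", '{"a": 2}'])
```

The parsing step stays cheap, and the rejected list preserves the raw input so anomalies in the data source can be studied offline.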
Check out http://ingest.tips for general ingest
Thank you
Licensing
Public Domain
1. https://commons.wikimedia.org/wiki/File:West_Texas_Pumpjack.JPG
2. https://commons.wikimedia.org/wiki/File%3ABulls_Ishikawa%2C_Okinawa_2007.jpg
3. https://commons.wikimedia.org/wiki/File:Hammer_Ace_SATCOM_Antenna.jpg
4. https://commons.wikimedia.org/wiki/File:Shanghai_Shimao_Plaza_Construction.jpg
5. https://pixabay.com/p-111058/?no_redirect
6. https://pixabay.com/p-70908/?no_redirect
7. https://commons.wikimedia.org/wiki/File:The_Sun_by_the_Atmospheric_Imaging_Assembly_of_NASA's_Solar_Dynamics_Observatory_-_20100819.jpg
8. http://www.freestockphotos.biz/stockphoto/16694
9. https://pixabay.com/en/github-logo-favicon-mascot-button-154769/
Creative Commons V3
10. https://commons.wikimedia.org/wiki/File:Star-schema.png
Creative Commons V2
11. https://www.flickr.com/photos/the_pink_princess/370896536/
12. https://www.flickr.com/photos/digitaljourney/5424241457

Big Data Day LA 2015 - Lessons Learned Designing Data Ingest Systems
