A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
Building a Data Pipeline from Scratch - Joe Crobak
16. DATA PIPELINE
A Data Pipeline is a unified system for capturing events for analysis and building products.
17. DATA PIPELINE
[Diagram] Event sources (click data, user events, web visits, email sends, …) flow into a Data Warehouse, which feeds:
• Product Features
• Ad Hoc analysis: Counting, Machine Learning, Extract Transform Load (ETL)
22. COARSE-GRAINED EVENTS
127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969
(IP Address | Timestamp | Action | Status)
• Events are captured as a side effect of serving requests.
• Stored in log files, primarily for debugging and secondarily for analysis.
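Turning a coarse-grained log line like the one above into a structured event means parsing it at analysis time. A minimal sketch using Python's standard library, assuming the common access-log layout shown (real log formats vary by server configuration):

```python
import re

# Regex for the access-log layout shown above: IP, two unused fields,
# bracketed timestamp, quoted request ("Action"), status code, response size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Extract IP address, timestamp, action, and status from one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not fit the expected layout
    return match.groupdict()

event = parse_log_line(
    '127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969'
)
# event["ip"] == "127.0.0.1", event["status"] == "200"
```

Writing such parsers by hand for every log format is exactly the kind of toil the fine-grained, schema-based approach later in the deck avoids.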
23. COARSE-GRAINED EVENTS
Implicit tracking, i.e. a "page load" event is a proxy for ≥1 other events.

e.g. the event GET /newsfeed corresponds to:
• App Load (but only if this is the first time the app has loaded this session)
• Timeline load, where the user is in "group A" of an A/B test
These implementation details have to be known at analysis time.
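The expansion described above can be sketched in code. This is a hypothetical illustration (the event and session dicts, field names, and rules are invented for the example, not taken from any production system) of how a coarse GET /newsfeed event must be unpacked into its implicit fine-grained events at analysis time:

```python
def expand_newsfeed_event(event, session):
    """Derive the implicit events behind a coarse GET /newsfeed request.

    `event` and `session` are hypothetical dicts; the expansion rules
    below mirror the slide's example, nothing more.
    """
    derived = []
    # App Load, but only the first time the app is loaded this session.
    if not session.get("app_loaded"):
        derived.append({"type": "app_load", "user": event["user"]})
        session["app_loaded"] = True
    # Timeline load, tagged with the user's A/B test group.
    derived.append({
        "type": "timeline_load",
        "user": event["user"],
        "ab_group": session.get("ab_group", "A"),
    })
    return derived

session = {"ab_group": "A"}
first = expand_newsfeed_event({"user": 42, "path": "/newsfeed"}, session)
second = expand_newsfeed_event({"user": 42, "path": "/newsfeed"}, session)
# first: app_load + timeline_load; second: timeline_load only
```

The point of the slide is that this expansion logic lives in analysts' heads or scripts, which is fragile; fine-grained events record app_load and timeline_load explicitly at capture time.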
27. FINE-GRAINED EVENTS
A couple of schema-less formats are popular (e.g. JSON and CSV), but they have drawbacks:
• harder to change schemas
• inefficient
• require writing parsers
28. SCHEMA
Used to describe data, providing a contract about fields and their types.

Two schemas are compatible if you can read data written in schema 1 with schema 2.
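A toy illustration of that compatibility rule: a much-simplified version of the schema-resolution logic that serialization frameworks such as Avro implement. Here a schema is just a dict of field names to type names, and `defaults` names reader fields that may be absent from the writer's data; real frameworks handle far more (type promotion, unions, nested records):

```python
def can_read(writer_schema, reader_schema, defaults=()):
    """True if data written with writer_schema is readable with reader_schema."""
    for field, ftype in reader_schema.items():
        if field in writer_schema:
            if writer_schema[field] != ftype:
                return False  # same field, incompatible type
        elif field not in defaults:
            return False  # reader requires a field the writer never wrote
    return True

v1 = {"user_id": "long", "page": "string"}
v2 = {"user_id": "long", "page": "string", "referrer": "string"}

can_read(v1, v2)                          # False: "referrer" has no default
can_read(v1, v2, defaults=("referrer",))  # True: missing field filled in
can_read(v2, v1)                          # True: reader ignores extra field
```

This is why schema evolution usually demands that new fields carry defaults: old data stays readable with the new schema.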
41. SERIALIZATION FRAMEWORK
Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
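The contract a serialization framework provides boils down to a pair of inverse functions. A minimal sketch: JSON over UTF-8 is used here only to keep the example self-contained, whereas the frameworks this slide has in mind (e.g. Avro, Thrift, Protocol Buffers) use compact, schema-driven binary encodings:

```python
import json

def serialize(event: dict) -> bytes:
    """Convert an event to bytes for the message bus or disk."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def deserialize(data: bytes) -> dict:
    """Recover the event from its byte representation."""
    return json.loads(data.decode("utf-8"))

event = {"type": "page_view", "user_id": 42}
assert deserialize(serialize(event)) == event  # round-trip is lossless
```

Whatever the encoding, the invariant is the same: deserialize(serialize(e)) == e, across languages and across schema versions that are compatible.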
46. NEXT STEPS
This architecture opens up a lot of possibilities:
• Near-real-time computation: Apache Storm, Apache Samza (incubating), Apache Spark Streaming
• Sharing information between services asynchronously, e.g. to augment user profile information
• Cross-datacenter replication
• Columnar storage
47. LAMBDA ARCHITECTURE
Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing.

Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
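That batch/real-time split can be sketched as a query-time merge. This toy example (page-view counts stand in for any model or insight; the names are invented for illustration) shows the essential shape: the batch view is authoritative, the speed layer covers only events since the last batch run, and a completed batch run subsumes the speed layer:

```python
batch_view = {"/home": 1000, "/newsfeed": 500}    # recomputed from all data
realtime_view = {"/newsfeed": 7, "/settings": 2}  # incremental, since last batch

def query(page):
    """Merge the batch view (source of truth) with recent real-time updates."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

query("/newsfeed")  # 507

def on_batch_complete(new_batch_view):
    """A finished batch run covers everything the speed layer held."""
    batch_view.clear()
    batch_view.update(new_batch_view)
    realtime_view.clear()
```

Because the batch layer periodically recomputes from raw events, any approximation or bug in the speed layer is corrected at the next batch run.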
49. SUMMARY
• Data Pipelines are everywhere.
• Useful to think of data as events.
• A unified data pipeline is very powerful.
• Plethora of open-source tools to build a data pipeline.
50. FURTHER READING
• The Unified Logging Infrastructure for Data Analytics at Twitter
• The Log: What every software engineer should know about real-time data's unifying abstraction (Jay Kreps, LinkedIn)
• Big Data by Nathan Marz and James Warren
• Implementing Microservice Architectures