Copy-on-write storage can be powerful when applied to big data workloads. Combine it with a data-aware analytics engine and you get efficient, flexible processing for petabyte-scale data workflows. That's exactly what we've built at Pachyderm.
3. Storage, Scheduler, Packaging
Docker
• Open source
• Generalized for different use cases
• End-to-end solution
• Leverages Docker ecosystem
Pachyderm Pipeline System (pps)
Pachyderm File System (pfs)
AdRoll's Architecture for everyone else
4. What is PFS?
A copy-on-write distributed file system
Core storage for Pachyderm
6. Why is this cool?
• View diffs of your data
• Instantly revert to previous state
• Immutability
• Reduce storage needs
• Branching
[Figure: a sequence of commits, Commit 0 through Commit 4]
Git for huge data sets
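To make the "Git for huge data sets" idea concrete, here is a minimal sketch of a copy-on-write commit store. It is not how pfs is implemented: the `Repo` class, its methods, and the file paths are all hypothetical, chosen only to illustrate diffs, reverts, immutability, shared storage, and branching.

```python
import hashlib


class Repo:
    """Toy copy-on-write repo (hypothetical, not the pfs API)."""

    def __init__(self):
        self.blobs = {}      # content hash -> bytes; each distinct content stored once
        self.commits = []    # commit id -> {path: content hash}; never modified
        self.branches = {}   # branch name -> commit id (just a pointer)

    def commit(self, changes, parent=None):
        """Create a new commit from `parent` plus `changes` ({path: bytes})."""
        # Copy only the small path -> hash map; the data blobs are shared.
        snapshot = dict(self.commits[parent]) if parent is not None else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def diff(self, old, new):
        """Paths whose content differs between two commits."""
        a, b = self.commits[old], self.commits[new]
        return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))

    def read(self, commit_id, path):
        return self.blobs[self.commits[commit_id][path]]


repo = Repo()
c0 = repo.commit({"logs/a.txt": b"day 1", "logs/b.txt": b"day 1"})
c1 = repo.commit({"logs/a.txt": b"day 2"}, parent=c0)

repo.branches["master"] = c1
repo.branches["experiment"] = c0   # branching: a new pointer, no data copied
repo.branches["master"] = c0       # reverting: move the pointer back
```

In this sketch, `repo.diff(c0, c1)` reports only `logs/a.txt`, and the older commit still reads back its original contents because commits are never modified.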
8. PPS + PFS is…
Efficient: incremental processing
[Figure: commits 0 to 3 feed a Data Analysis DAG (Tasks 1 to 6) that ends in a Dashboard. A second view, labeled "1% more data", shows only Task 4, Task 6, and the Dashboard.]
Only process jobs that rely on the data that changed
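The idea of rerunning only the jobs that depend on changed data can be sketched as a graph traversal. The DAG below is hypothetical: the slide does not spell out the edges between tasks, so the wiring and the `input_a` / `input_b` names are assumptions made for illustration, not the pps scheduler.

```python
from collections import deque

# Hypothetical DAG: each node maps to the tasks that consume its output.
DOWNSTREAM = {
    "input_a": ["task1"],
    "input_b": ["task4"],
    "task1": ["task2", "task3"],
    "task2": ["task5"],
    "task3": ["task5"],
    "task4": ["task6"],
    "task5": ["dashboard"],
    "task6": ["dashboard"],
    "dashboard": [],
}


def tasks_to_rerun(changed_inputs):
    """Return every task reachable from the changed inputs (breadth-first)."""
    seen = set()
    queue = deque(changed_inputs)
    while queue:
        node = queue.popleft()
        for task in DOWNSTREAM.get(node, []):
            if task not in seen:
                seen.add(task)
                queue.append(task)
    return sorted(seen)


print(tasks_to_rerun(["input_b"]))  # ['dashboard', 'task4', 'task6']
```

With this wiring, a change to `input_b` leaves Tasks 1, 2, 3, and 5 untouched; their previous outputs can be reused as-is.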
9. PPS + PFS is…
Flexible: both batched pipelines and streaming
Daily batched pipelines
[Figure: commits 0 to 2, with ∆Time = 1 day, feed the Data Analysis DAG (Tasks 1 to 6) and the Dashboard.]
Large batched DAG that processes all the new data each day
10. PPS + PFS is…
Flexible: both batched pipelines and streaming
Streaming updates
[Figure: commits 0 to 4, with ∆Time = 1 second, feed the Data Analysis DAG (Tasks 1 to 6) and the Dashboard.]
Micro-batches that update constantly as new data streams in
Commits are insanely cheap so you can take one every second
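One way to see why the same pipeline can serve both modes: if each commit triggers processing of only its new data, then the commit interval alone decides whether you get daily batches or per-second micro-batches. The sketch below assumes a trivial running-sum "pipeline" and hypothetical function names; it is an illustration, not Pachyderm code.

```python
from itertools import groupby


def commits_by_interval(records, interval_seconds):
    """Group (timestamp, value) records into commits covering a fixed time span."""
    ordered = sorted(records)
    return [
        [value for _, value in group]
        for _, group in groupby(ordered, key=lambda r: r[0] // interval_seconds)
    ]


def run_pipeline(records, interval_seconds):
    """Update a running total once per commit, using only that commit's new data."""
    total = 0
    totals = []
    for new_data in commits_by_interval(records, interval_seconds):
        total += sum(new_data)
        totals.append(total)
    return totals


records = [(0, 1), (1, 2), (2, 3), (86400, 4)]
print(run_pipeline(records, 86400))  # daily commits:      [6, 10]
print(run_pipeline(records, 1))      # per-second commits: [1, 3, 6, 10]
```

Both runs end at the same total; the shorter interval simply surfaces intermediate results sooner.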
PPS knows what data has changed and only recomputes tasks that rely on that diff
Use Docker monitoring tools
Cost-effective becomes important when you’re managing a huge cluster
And you can back the storage with S3 to get the best of both worlds.