Pachyderm
Big Data Applications with Containers
Joe Doliner
Founder & CEO
jd@pachyderm.io
About me
What is Pachyderm?
Big Data with Containers
• Version control for data
• Uses containers for data processing
• Batch and streaming processing
• Data lives in object storage (S3, GCS, Ceph)
• Shares no code with Hadoop
Intro to Containers
• What are containers, and why are they useful for Big Data applications?
• What is Kubernetes, and why is it useful for Big Data applications?
Version Control for Data
• View diffs of data
• Collaboration
• Data provenance
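The bullets above map onto the pachctl CLI. A minimal sketch follows; it assumes a running Pachyderm cluster, exact command syntax varies between Pachyderm versions, and the repo and file names are hypothetical:

```shell
# Create a versioned data repository (names here are illustrative)
pachctl create repo sales

# Each put file lands in a commit on a branch, much like a git commit
pachctl put file sales@master:/2024-01.csv -f 2024-01.csv

# Inspect the commit history of the branch
pachctl list commit sales@master

# Diff the newest commit against its parent to see what data changed
pachctl diff file sales@master:/ sales@master^:/
```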
[Diagram: a linear chain of commits: Commit 0 → Commit 1 → Commit 2 → Commit 3 → Commit 4]
Git for huge data sets
Containerized Data Pipelines
• DAG of jobs
• Pipelines triggered by data changes
• Processing efficiency
• Incrementality
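A Pachyderm pipeline is defined by a declarative spec. Below is a minimal sketch of one; field names follow more recent Pachyderm releases and may differ in older versions, and the repo, image, and command are hypothetical. The glob pattern controls how input data is split into datums, which is what enables incremental processing:

```json
{
  "pipeline": { "name": "wordcount" },
  "transform": {
    "image": "alpine:3.18",
    "cmd": ["sh", "-c", "wc -w /pfs/input/* > /pfs/out/counts.txt"]
  },
  "input": {
    "pfs": { "repo": "input", "glob": "/*" }
  }
}
```

Committing new files to the input repo triggers the pipeline, and because each file matching the glob is treated as a separate datum, only new or changed files need to be reprocessed.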
[Diagram: an example DAG of jobs (Task 1 through Task 6) with a Dashboard node]
Data Lake Use Case
[Diagram: production databases and an application feed data into Pachyderm; the Pachyderm File System stores it in object storage, and Pachyderm Pipelines load a data warehouse]
Demo:
• Start Pachyderm on K8s
• Add data
• Create a pipeline
• Incrementally update streaming results
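The demo steps above could look roughly like this from the command line. This is a hedged sketch: it assumes a Kubernetes cluster with kubectl and helm already configured, it uses the Helm-based deployment of recent Pachyderm releases rather than the CLI deploy of earlier versions, and the repo, file, and spec names are hypothetical:

```shell
# 1. Start Pachyderm on K8s (Helm chart from recent releases)
helm repo add pachyderm https://helm.pachyderm.com
helm install pachd pachyderm/pachyderm --set deployTarget=LOCAL

# 2. Add data: every put file creates a new commit
pachctl create repo input
pachctl put file input@master:/data.txt -f data.txt

# 3. Create a pipeline from a spec; it runs whenever the input data changes
pachctl create pipeline -f pipeline.json

# 4. Watch jobs update results incrementally as new commits arrive
pachctl list job
```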
Streaming data pipelines
Thank You!
Questions?
pachyderm.io
jd@pachyderm.io
github.com/pachyderm/pachyderm