
Pachyderm: Building a Big Data Beast On Kubernetes



Pachyderm is a containerized data analytics solution deployed entirely on Kubernetes. We take the tools and potential of the container ecosystem and apply them to massive-scale data processing. In this talk we'll show you how to leverage Docker, Kubernetes, and Pachyderm to build robust and scalable data infrastructure. We'll start by discussing the key components of a modern data-driven company and how your infrastructure choices can have a massive impact on your product and scalability roadmap. We'll then dive into some architecture details to show how Kubernetes, Docker, and Pachyderm work in tandem to create a cohesive data infrastructure stack. Finally, we will demonstrate some high-level use cases and the benefits you get from the architecture we've outlined.

KubeCon schedule link:

Published in: Technology


  1. Pachyderm: Building a Big Data Beast on Kubernetes. Joe Doliner, Founder & CEO
  2. About me
  3. The origin story: Wanted to analyze chess games with Hadoop
  4. Let’s build a modern Hadoop! Oh shit! First I need to build 15 years of distributed systems…
  5. Distributed systems are hard: Bet it all on the container ecosystem
  6. Pachyderm’s Architecture. Diagram labels: Kubernetes; User Analysis; Pachyderm Pipeline System; Services; Jobs; Pachyderm File System; User Data
  7. Pachyderm File System: A copy-on-write distributed file system. Copy-on-write is the paradigm that “powers” technologies like Docker and Spark. Core storage for Pachyderm
  8. Why is this cool? Git for huge data sets: • View diffs • Instant revert • Reduce storage needs • Reliability. Diagram labels: Commit 0 through Commit 4
  9. Pachyderm Pipeline System: Data-aware container scheduler • Runs k8s jobs over PFS • Jobs triggered by commits • Understands job dependencies • Leverages copy-on-write storage. Diagram labels: Task 1 through Task 6; Dashboard
  10. Pachyderm is… Resilient: K8s jobs can be restarted. Diagram labels: Task 1 through Task 6; Dashboard. Monitoring output: “$ Task 2 failed”, “$ Task 4 and 6 waiting…”, “… Fixing code …”, “$ Task 2 resuming...”, “$ Task 2 complete”, “$ Task 4 starting…”
  11. Pachyderm is… Efficient: incremental processing. Diagram labels: Data (3, 2, 1, 0); Analysis (Task 1 through Task 6; Dashboard); 1% more data (Task 4; Task 6; Dashboard)
  12. Pachyderm is… Cost-effective: resource management. Diagram labels: PFS storage nodes (copy-on-write storage nodes); PPS (elastically scaling computation nodes); d2.8xlarge; Spot
  13. Summary: • Kubernetes is a game-changer for distributed systems • Copy-on-write data is really powerful • Pachyderm unlocks the power of Kubernetes for big data
  14. Thank You! Questions?
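
The slides on the Pachyderm File System and the Pachyderm Pipeline System describe commits over copy-on-write storage, diffs between commits, and jobs that are triggered by commits and understand their dependencies. The following is a minimal Python sketch of those ideas only. It is not Pachyderm's implementation or API: the Repo class, the tasks_to_run function, the task names, and the file paths are all hypothetical, and the code assumes Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter


class Repo:
    """Toy copy-on-write store: every commit is an immutable snapshot."""

    def __init__(self):
        # Commit 0 is the empty snapshot.
        self.commits = [{}]

    def commit(self, changes):
        """Record a new commit. A value of None deletes the path."""
        # Shallow copy: unchanged paths still reference the same content
        # objects as the parent commit, so only changed files add content.
        snapshot = dict(self.commits[-1])
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)
            else:
                snapshot[path] = content
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def diff(self, old, new):
        """Return the paths added, changed, or removed between two commits."""
        a, b = self.commits[old], self.commits[new]
        return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))

    def revert(self, commit_id):
        """Make an earlier snapshot the newest commit without copying data."""
        self.commits.append(self.commits[commit_id])
        return len(self.commits) - 1


def tasks_to_run(deps, inputs, changed_paths):
    """Return the tasks to rerun, in dependency order, for a set of changes.

    deps:   task name -> set of upstream task names
    inputs: task name -> path prefix the task reads directly from the repo
    """
    dirty = []
    for task in TopologicalSorter(deps).static_order():
        prefix = inputs.get(task)
        reads_change = prefix is not None and any(
            path.startswith(prefix) for path in changed_paths
        )
        upstream_ran = any(up in dirty for up in deps.get(task, ()))
        if reads_change or upstream_ran:
            dirty.append(task)
    return dirty


if __name__ == "__main__":
    repo = Repo()
    c1 = repo.commit({"games/a.pgn": "e4 e5", "logs/x.txt": "boot"})
    c2 = repo.commit({"games/b.pgn": "d4 d5"})

    # Hypothetical pipeline: parse -> stats -> dashboard, plus an unrelated audit.
    deps = {"parse": set(), "stats": {"parse"}, "dashboard": {"stats"}, "audit": set()}
    inputs = {"parse": "games/", "audit": "logs/"}

    changed = repo.diff(c1, c2)
    print(changed)
    print(tasks_to_run(deps, inputs, changed))
```

In this sketch the new commit touches only a path under games/, so the audit task, which reads logs/, is skipped while parse and everything downstream of it are rerun. That is the toy analogue of the incremental processing slide: a small amount of new data reruns only the affected part of the graph.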