Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage

© 2018 PURE STORAGE INC. PURE PROPRIETARY
Ivan Jibaja
Software Engineer

Our Log Analytics Pipeline in Numbers
✓ 1.5 - 2M events / second
✓ 0.5 - 1 PB of data / day
✓ 5-second SLA
✓ Six 9s of reliability (99.9999%)

Data Pipeline – Early Stages
[Figure: early-stage pipeline architecture. Sources: 1,000+ VMs, 100+ FBs, 10+ Jenkins, 400+ clients, 20,000+ tests. Components labeled: rsyslog, custom code. Size labels: 6G, 6T, 18T.]

Data Pipeline - Now
[Figure: current pipeline architecture. Sources: 2,500+ VMs, 350+ FBs, 20+ Jenkins, 1,000+ clients, 120,000+ tests / day. Component labeled: rsyslog. Size labels: 72T, 24T, 800G, 200T, 189T, 90G, 50G.]
✓ Duplicate bug
✓ Infrastructure failure
✓ Performance regression
✓ Low-level details
✓ Easy-to-read graphs

Reliability, Scalability, Flexibility

Software Crashes
You need to be able to restart each stage of your pipeline without affecting correctness
Idempotency
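
A minimal sketch of what idempotency can look like for a single stage, assuming each event carries a unique ID (the names `process_batch`, `sink`, and the event shape are hypothetical, not taken from the talk). Results are written under the event's own ID, so replaying a batch after a crash does not create duplicates:

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (event_id, payload).
Event = Tuple[str, str]


def process_batch(events: Iterable[Event], sink: Dict[str, str]) -> int:
    """Process events and return how many new results were written.

    Each result is stored under its event ID, so a replayed event
    is skipped instead of being appended a second time.
    """
    written = 0
    for event_id, payload in events:
        if event_id in sink:
            continue  # already processed before the restart
        sink[event_id] = payload.upper()  # stand-in for real processing
        written += 1
    return written


sink: Dict[str, str] = {}
batch = [("e1", "boot ok"), ("e2", "disk full")]

first = process_batch(batch, sink)
# Simulate a crash and restart: the same batch is delivered again.
second = process_batch(batch, sink)
```

After the replay, `sink` holds exactly one result per event, which is the property that makes a stage safe to restart.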

Growth
Each stage of your pipeline may grow at a different speed
Orchestration
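
To illustrate why stages that grow at different speeds call for an orchestrator, here is a hedged sketch that sizes each stage independently. The stage names, per-worker capacities, and the 20% headroom are made-up assumptions; only the 2M events / second figure comes from the slides. An orchestrator would be handed a plan like this rather than a fixed machine layout:

```python
from dataclasses import dataclass


@dataclass
class StageLoad:
    name: str
    events_per_second: int    # observed input rate for this stage
    per_worker_capacity: int  # events/second one worker handles (assumed)


def desired_workers(stage: StageLoad, headroom_pct: int = 20) -> int:
    """Return the worker count needed for the load plus headroom.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    numerator = stage.events_per_second * (100 + headroom_pct)
    denominator = stage.per_worker_capacity * 100
    return max(1, -(-numerator // denominator))


# Hypothetical stages and capacities.
stages = [
    StageLoad("ingest", 2_000_000, 250_000),
    StageLoad("parse", 2_000_000, 100_000),
    StageLoad("index", 500_000, 50_000),
]

plan = {stage.name: desired_workers(stage) for stage in stages}
```

Each stage gets its own count, so a jump in parse load changes one number in the plan instead of forcing a redesign of the whole pipeline.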

Efficiency and Flexibility
1. There is an application stack for every kind of problem, and each is easy to set up
2. Application silos are inefficient and increase operational cost
3. Scale may require re-architecting a given stage
Decouple compute and storage
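
One way to picture decoupled compute and storage is a worker that keeps no local state and reads and writes everything through a shared store. This is a sketch under stated assumptions: `InMemoryStore` stands in for an off-node storage service, and `Worker`, the key names, and the log lines are hypothetical. Because all state lives in the store, a replacement worker on any node can pick up where the previous one stopped:

```python
from typing import Dict, List


class InMemoryStore:
    """Stand-in for shared storage; a real one would live off the compute node."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str, default: int = 0) -> int:
        return self._data.get(key, default)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class Worker:
    """Stateless compute: holds only a handle to the store, no data of its own."""

    def __init__(self, store: InMemoryStore) -> None:
        self.store = store

    def run(self, log: List[str], limit: int) -> int:
        """Count ERROR lines in up to `limit` lines, resuming from the stored offset."""
        offset = self.store.get("offset")
        end = min(offset + limit, len(log))
        count = self.store.get("error_count")
        for line in log[offset:end]:
            if "ERROR" in line:
                count += 1
        # Simplification: a real system would commit these two values atomically.
        self.store.put("error_count", count)
        self.store.put("offset", end)
        return end


store = InMemoryStore()
log = ["ok", "ERROR a", "ERROR b", "ok", "ERROR c"]

Worker(store).run(log, limit=2)   # first worker handles two lines, then goes away
Worker(store).run(log, limit=10)  # a fresh worker resumes from the stored offset
```

Compute can then be added, removed, or rescheduled without moving data, and storage can grow without touching the workers.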

Technologies we use
• Docker: Containers
• Nomad: Orchestration
• Prometheus: Monitoring
• Grafana: Dashboards
• Consul: Service discovery
• Chef: Container build
• Jenkins: Continuous Integration
• Kafka Manager: Kafka Interface
• Artifactory: Image repository
• Ansible: Configuring servers

Takeaways
• Reliability: Idempotency
• Scalability: Orchestration
• Flexibility and Efficiency: Decoupled compute and storage

QUESTIONS?
