We at Robin Systems help customers deploy data pipelines both on-premises and in the cloud. In this session, we will walk through examples of data pipelines designed for different data sets and use cases, such as Twitter feeds, IoT sensor data, and recommendations, and show how to deploy them using Docker containers. We will also discuss considerations that simplify management of a data pipeline in terms of performance, availability, recoverability, scaling, etc.
Event: Santa Clara Docker Meetup
2. ROBIN FREE COMMUNITY EDITION
CONFIDENTIAL – RESTRICTED DISTRIBUTION
https://robinsystems.com/get-robin/
Free Forever (Robin Licenses)
Up to 5 EC2 Instances
Auto-deploy to AWS using ‘gorobin’ tool
Use pre-defined bundles or bring your own app
jon@robinsystems.com
adeesh@robinsystems.com
3. LET’S PLAY BUZZWORD BINGO!!
Image credit: https://www.shutterstock.com/image-illustration/word-cloud-predictive-analytics-related-tags-218879485
6. DATA PIPELINE
› A data pipeline is an automated process that runs at regular intervals to ingest, cleanse, transform, and/or aggregate an incoming data feed, producing an output dataset in a format suitable for downstream processing, with no manual intervention.
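The stages named in this definition can be sketched as a chain of small functions. This is only an illustration: the input format (`sensor_id,reading` text lines), the Celsius-to-Fahrenheit step, and all function names are assumptions made for the example, not something taken from the talk.

```python
from collections import defaultdict


def ingest(raw_lines):
    """Parse 'sensor_id,reading' lines; skip lines that do not have two fields."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            yield parts[0].strip(), parts[1].strip()


def cleanse(records):
    """Drop records whose reading is not a number."""
    for sensor_id, reading in records:
        try:
            yield sensor_id, float(reading)
        except ValueError:
            continue


def transform(records):
    """Convert each reading from Celsius to Fahrenheit (hypothetical step)."""
    for sensor_id, celsius in records:
        yield sensor_id, celsius * 9 / 5 + 32


def aggregate(records):
    """Average the transformed readings per sensor."""
    totals = defaultdict(lambda: [0.0, 0])  # sensor_id -> [sum, count]
    for sensor_id, value in records:
        totals[sensor_id][0] += value
        totals[sensor_id][1] += 1
    return {sensor_id: s / n for sensor_id, (s, n) in totals.items()}


def run_pipeline(raw_lines):
    """Run all stages end to end with no manual step in between."""
    return aggregate(transform(cleanse(ingest(raw_lines))))
```

In a real deployment a scheduler would call something like `run_pipeline` at each interval, and each stage would typically be a separate clustered service rather than a function.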
[Chart: interest in "data pipeline" (Worldwide, Last 5 Years)]
8. WHAT IS ROBIN?
› Container-based Compute Plane
› Integrated Scale-Out Block Storage
› Integrated Networking
› Works Anywhere: BareMetal, VM or Cloud
› With All Apps: No changes to apps or workflows
[Diagram: Storage Nodes, Compute Nodes, Converged Nodes]
9. WHAT IS ROBIN?
Big Data NoSQL RDBMS Other
Custom Apps
Application Aware Workflow Manager
Upgrade, Clone, Deploy, Share, Scale
Access Control, High Availability, QoS Control, Placement Control, Security
10. DATA PIPELINE ON DOCKER??
11. EASY EXAMPLE: TWITTER STREAMING USING ELK
[Diagram: cluster of Master Nodes and Data Nodes]
12. DEMO 1
› Tweet now with the word ‘robinsystems’ in it
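As a rough illustration of what happens to such tweets on their way into Elasticsearch, the sketch below keeps only tweets that mention the keyword and formats them as a newline-delimited body for the Elasticsearch bulk API. The deck does not say how the demo ingests tweets (in an ELK stack this is typically Logstash's job), so the index name `tweets`, the tweet dictionary shape, and the function names are assumptions.

```python
import json

KEYWORD = "robinsystems"  # the word attendees are asked to tweet
INDEX_NAME = "tweets"     # hypothetical index name, not from the talk


def matches(tweet, keyword=KEYWORD):
    """Case-insensitive check for the keyword in the tweet text."""
    return keyword.lower() in tweet.get("text", "").lower()


def to_bulk_body(tweets, index=INDEX_NAME):
    """Build a bulk-API body: one action line plus one document line per tweet."""
    lines = []
    for tweet in tweets:
        if matches(tweet):
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps({"user": tweet.get("user"), "text": tweet["text"]}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

The returned string would be sent to the cluster's `_bulk` endpoint; Kibana can then chart the indexed tweets.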
13. HOW WE DID IT?
• 3 EC2 instances, 6 EBS volumes
• 6 Docker containers, 3 independent images
• OVS bridge for cross-host networking
• Private IP for each container
• Virtual volumes mounted on each container
• Ports mapped for ES and Kibana for external access
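To make the per-container settings above concrete, here is a small sketch that assembles a `docker run` command line with a fixed IP, a mounted volume, and a published port. It is not how Robin deploys containers: a plain user-defined Docker network stands in for the OVS bridge, and the container name, image name, network name, IP address, and volume name are all hypothetical. Port 9200 is the default Elasticsearch HTTP port (Kibana's default is 5601).

```python
def docker_run_args(name, image, network=None, ip=None, volumes=(), ports=()):
    """Build the argument list for a detached `docker run` invocation."""
    args = ["docker", "run", "-d", "--name", name]
    if network:
        args += ["--network", network]
        if ip:
            # A static --ip only works on a user-defined network with a subnet.
            args += ["--ip", ip]
    for source, target in volumes:
        args += ["-v", f"{source}:{target}"]
    for host_port, container_port in ports:
        args += ["-p", f"{host_port}:{container_port}"]
    args.append(image)
    return args


# Hypothetical Elasticsearch data-node container.
es_args = docker_run_args(
    name="es-data-1",
    image="example/elasticsearch",
    network="pipeline-net",
    ip="10.0.0.11",
    volumes=[("es-data-1-vol", "/usr/share/elasticsearch/data")],
    ports=[(9200, 9200)],
)
print(" ".join(es_args))
```

Returning a list rather than a shell string lets the arguments be passed directly to `subprocess.run` without quoting issues.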
[Diagram: EC2 M4.2xl instances with EBS volumes; Robin Storage Plane (Virtual Volumes); Robin Compute Plane (Containers, Primary and Secondary Private IPs); an OVS Bridge per instance]
14. REAL-WORLD EXAMPLE: SECURITY ANALYSIS
[Architecture diagram. Layers: Stream Sources, Data Store, Security Layer, Other Services. Components: KNIME-spark gateway, HDP, HDFS, Ranger (AuthZ), Kerberos (AD), SSL, Data at Rest (Encryption)]
15. HOW WE DID IT?
• 30 physical servers
• 2 multitenant clusters: Dev (800 TB) and InfoSec (1.5 PB)
• 100+ Docker containers
• ~1 service per container
• OVS bridge for cross-host networking
• Routable IP address for each container
• Virtual volumes mounted on each container
[Diagram: Physical Servers with HDDs; Storage Pool (Virtual Volumes); Compute Pool (Containers); an OVS Bridge per server; 2 Multitenant clusters]
16. BENEFITS
› Rapid Deploy
› Deploy time/cluster = 40 mins (Originally 2 weeks)
› No need to size hardware by App
› Decouple compute and storage
› HDP recommends keeping only 48-96 TB per data node
› Improved server and storage utilization (~40%)
› Enforce data locality for performance
› Multitenancy for any application with performance isolation
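A quick calculation shows why the 48-96 TB per data node guideline matters when compute and storage are coupled. The sketch below applies it to the two cluster sizes quoted on the previous slide. It assumes decimal units (1 PB = 1000 TB) and ignores replication, since the deck does not say whether the quoted capacities are raw or usable.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units

# Per-node capacity range quoted on the slide.
MIN_TB_PER_NODE = 48
MAX_TB_PER_NODE = 96


def data_nodes_needed(capacity_tb, tb_per_node):
    """Number of data nodes needed to hold capacity_tb at tb_per_node each."""
    return math.ceil(capacity_tb / tb_per_node)


clusters = {"Dev": 800, "InfoSec": 1.5 * TB_PER_PB}

for name, capacity_tb in clusters.items():
    fewest = data_nodes_needed(capacity_tb, MAX_TB_PER_NODE)
    most = data_nodes_needed(capacity_tb, MIN_TB_PER_NODE)
    print(f"{name}: {fewest} to {most} data nodes for {capacity_tb:g} TB")
```

When storage is decoupled from compute, the number of compute containers no longer has to track these node counts.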
18. DATA PIPELINE - CHALLENGES
DATA PIPELINE IS A CLUSTER OF MULTIPLE CLUSTERED APPLICATIONS EXPECTED TO WORK IN UNISON
› How do I quickly deploy my entire pipeline?
› How do you handle spikes and growth?
› How do I provide developers access to data?
› Can I run multiple pipelines on the same setup without compromising performance?
› When data sets and workloads grow, can I avoid under-provisioning?
› How do I avoid under- or over-provisioning resources?
19. CHALLENGES WITH MANAGING DATA PIPELINES
8 Billion: Exploding Data Volume
› 24 feeds into Elasticsearch
› 8 billion security events per day
› 53 billion documents
1 Week: Poor Agility
› Week+ to provision clusters
› 10hrs+ to take a snapshot
$3 Million: High Cost
› Real-time traffic too much for VMs on commodity hardware
› Expensive all-flash servers to meet performance needs
› $3M+ hardware spend and growing
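To put the daily volume in perspective, the figures above can be converted to an average ingest rate. This is back-of-the-envelope arithmetic only: it assumes events arrive evenly through the day and are spread equally across the 24 feeds, neither of which the deck states, so real peaks would be higher.

```python
EVENTS_PER_DAY = 8_000_000_000  # from the slide
FEEDS = 24                      # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate across the whole cluster.
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Average rate per feed, assuming an even split (an assumption).
avg_events_per_second_per_feed = avg_events_per_second / FEEDS

print(f"Average: {avg_events_per_second:,.0f} events/s overall")
print(f"Average: {avg_events_per_second_per_feed:,.0f} events/s per feed")
```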
20. CHALLENGES WITH MANAGING DATA PIPELINES
“Getting Big Data projects to Production is a challenge … Only 15% of businesses reported deploying their Big Data project to Production”
– Gartner Big Data Survey, Oct 2016
21. Let Applications Drive Infrastructure
ROBIN BENEFITS
Big Data NoSQL RDBMS Other
Custom Apps
Instant Access | Lower Complexity | Lower Cost
Deploy Entire Pipelines in Minutes | Same Workflow across Apps | More Apps on Same Resources
Instant Sharing – No Data Copy | SLA Guarantees | Self-Service Dev, Automation for Ops
Test-before-Commit | Dynamic Scaling | Faster Time to Market
Simplify Ops 10x Lower TCO 2x 3x Faster Projects