DevOps for Big Data 
Enabling Continuous Delivery for 
data analytics applications based on 
Hadoop, Vertica, and Tableau 
Max Martynov, VP of Technology 
Grid Dynamics
Introductions 
• Grid Dynamics 
─ Solutions company, specializing in eCommerce 
─ Experts in mission-critical applications (IMDGs, Big Data) 
─ Implementing Continuous Integration and Continuous Delivery for 5+ years 
• Qubell 
─ Enterprise DevOps platform 
─ Focused on self-service environments, service orchestration, and continuous 
upgrades 
─ Targets web-scale and big data applications 
State of DevOps and Continuous Delivery 
Continuous Delivery Value 
• Agility 
• Transparency 
• Efficiency 
• Consistency 
• Quality 
• Control 
Findings from The 2014 State of 
DevOps Report 
• Strong IT performance is a 
competitive advantage 
• DevOps practices improve IT 
performance 
• Organizational culture matters 
• Job satisfaction is the No. 1 
predictor of organizational 
performance 
Continuous Delivery Infrastructure 
• Environments 
─ Reliable and repeatable deployment automation 
─ Database schema management 
─ Data management 
─ Application properties management 
─ Dynamic environments 
• Quality 
─ Test automation 
─ Test data management (again) 
─ Code analysis and review 
• Process 
─ Source code management, branching strategy 
─ Agile requirements and project management 
─ CICD pipeline 
* Big Data applications bring additional challenges in these areas
because of large data volumes, complex business logic, and
large-scale environments.
Implementing Continuous Delivery for Big Data: 
Initial State of the Project 
• Medium size distributed development team 
• Diverse technology stack – Hadoop + Vertica + Tableau 
• Only one environment existed and it was production 
• Delivery pipeline: Development Team → Production
• Procurement of hardware for a new environment took months
Development in Production 
It is fun until somebody misses the nail
Hadoop Analytical Application 
Cluster components (diagram): Master, Database, Slaves 1 - N, Manager
10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers 
How to quickly reproduce this environment for dev-test purposes?
1. Stop Gap Measure 
• Same hardware, different logical “zones” implemented on the file system 
• Automated build and deployment 
• Delivery pipeline (diagram): Development Team → Production cluster, with zones /test1-N, /stage, and /prod
1. Stop Gap Measure: Pros and Cons 
Pros 
• Better than before: code can be 
tested before it goes to production 
• All logical environments have 
access to the same production 
data 
• Zero additional environment costs 
Cons 
• Stability, security, and compliance 
issues: dev, test, and prod 
environments share the same 
hardware 
• Performance issues: tests affect 
production performance 
• Impossible to run “destructive” 
tests that affect shared production 
data 
• Impossible to test upgrades of 
middleware (new versions of H* 
components) 
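One way to limit the "destructive test" risk inside the stop-gap setup is to make every job resolve its paths through its own zone and refuse writes elsewhere. The following Python sketch is an illustration only: the `ZONES` mapping, `resolve`, and `can_write` are hypothetical names, and `test1` stands in for one of the /test1-N zones. It guards against accidental writes; it does not address the shared-hardware or performance issues listed above.

```python
from pathlib import PurePosixPath

# Hypothetical zone roots, modeled on the /test1-N, /stage and /prod zones.
ZONES = {
    "test1": PurePosixPath("/test1"),
    "stage": PurePosixPath("/stage"),
    "prod": PurePosixPath("/prod"),
}


def resolve(zone: str, relative: str) -> PurePosixPath:
    """Return the absolute path of `relative` inside `zone`, refusing escapes."""
    path = PurePosixPath(relative)
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"{relative!r} must stay inside zone {zone!r}")
    return ZONES[zone] / path


def can_write(zone: str, target: PurePosixPath) -> bool:
    """A job may write only at or below its own zone root."""
    root = ZONES[zone]
    return target == root or root in target.parents
```

With this guard, a test job in `test1` can still read shared data under /prod, but a write check such as `can_write("test1", PurePosixPath("/prod/data"))` returns `False`.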
2. Hadoop Dynamic Environments 
• Environment building blocks (diagram): Data, Components, Custom Application, Services, Environment Policies
• Environment types: Dev, QA, Stage, Prod
• Request flow: Dev/QA/Ops → Request Environment → Orchestrate environment provisioning and application deployment → Environment
2. Hadoop Dynamic Environments (continued) 
• Dev/QA/Ops teams got a self-service portal to 
─ provision environments 
─ deploy applications 
• A new environment can be created from scratch in 2-3 hours 
─ single-node dev sandbox 
─ multi-node QA 
─ big clusters for scalability and performance 
• An application can be deployed to an environment within 10 minutes 
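A self-service request of this kind can be thought of as picking a named profile and letting the platform do the rest. The Python sketch below is a minimal illustration, not the portal's actual API: `EnvironmentSpec`, `PROFILES`, and `request_environment` are hypothetical, and the node counts for the QA and performance profiles are assumed values (only the single-node dev sandbox comes from the slide).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentSpec:
    name: str
    purpose: str
    nodes: int


# Hypothetical profiles; only the single-node sandbox size is from the slide.
PROFILES = {
    "dev": EnvironmentSpec("dev-sandbox", "development", 1),
    "qa": EnvironmentSpec("qa", "functional testing", 3),        # assumed size
    "perf": EnvironmentSpec("perf", "scalability and performance", 10),  # assumed size
}


def request_environment(profile: str, owner: str) -> dict:
    """Build the request a portal could hand to an orchestrator."""
    spec = PROFILES[profile]
    return {
        "owner": owner,
        "name": f"{owner}-{spec.name}",
        "nodes": spec.nodes,
        "purpose": spec.purpose,
    }
```

Keeping the profiles in one table means developers choose a size by name and cannot request an arbitrary cluster shape by accident.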
3. Vertica and Tableau Dynamic Environments 
• Environment building blocks (diagram): Components, Data, UDF, VSQL, Config, Services, Environment Policies, Shared service
• Environment types: Dev, QA, Stage, Prod
• Request flow: Dev/QA/Ops → Request Environment → Orchestrate environment provisioning and application deployment → Environment
4. Tests & Test Data 
• Dev and QA teams implemented automated tests 
• Two options to handle data on dev-test environments: 
1. Tests generate data for themselves 
2. A reduced representative snapshot of obfuscated production data (10TB -> 10GB) 
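The second option, a reduced and obfuscated snapshot, can be sketched in Python as deterministic sampling plus salted hashing. This is an assumption about how such a reduction could be done, not the project's actual tooling: `keep_row`, `obfuscate`, and `reduce_snapshot` are hypothetical helpers, and the default ratio of 1 in 1000 simply mirrors the 10TB -> 10GB figure. Salted hashing is pseudonymization; whether it satisfies a given compliance requirement has to be checked separately.

```python
import hashlib


def keep_row(key: str, keep_one_in: int = 1000) -> bool:
    """Deterministically keep roughly one row in `keep_one_in`, by key hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % keep_one_in == 0


def obfuscate(value: str, salt: str) -> str:
    """Replace a sensitive value with a salted hash; equal inputs stay equal."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def reduce_snapshot(rows, key_field, sensitive_fields, salt, keep_one_in=1000):
    """Yield the sampled rows with sensitive fields obfuscated."""
    for row in rows:
        if keep_row(str(row[key_field]), keep_one_in):
            yield {
                field: obfuscate(str(value), salt) if field in sensitive_fields else value
                for field, value in row.items()
            }
```

Sampling by a hash of the key, rather than at random, keeps the same entities in every table that shares the key, so joins in the reduced snapshot still find their matches.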
Test pyramid (top to bottom):
• Exploratory Tests: manual tests; snapshot of production data
• Integration Tests (integration with data): auto tests on “API” level, validating job output; snapshot of production data
• Component Tests: auto tests on “API” level, testing job output; test-generated data
• Unit Tests: Java code, auto-generated data; build-time validation
5. CICD pipeline 
With all components ready, implementing the CICD pipeline is easy:
1. Develop & experiment (Development Team, Dev Sandbox)
2. Commit (GitHub Flow)
3. Build & unit test
4. Deploy (QA Environment)
5. Test (QA Environment)
6. Release
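The essential behavior of such a pipeline is that stages run in order and a failure stops everything after it, so nothing is released without passing the earlier steps. The Python sketch below illustrates only that control flow; `run_pipeline` and the stage callables are hypothetical and do not represent any particular CI server's API.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, action in stages:
        if not action():
            break
        completed.append(name)
    return completed
```

For example, if the stages are build & unit test, deploy, test, and release, and the test stage returns `False`, the release stage is never invoked.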
6. Release Button 
Release flow (diagram): Release Candidate → Release (Ops/RE) → Production
Assembly Line 
Results 
• Reduced risk and higher quality 
─ No more development in production 
─ Developers have sandboxes, tests are run on separate environments 
─ Features are deployed to production only after validation 
• Increased efficiency 
─ A new environment can be provisioned within 2 hours 
─ Developers can freely experiment with new changes 
─ No resource contention 
• Reduced costs 
─ No need to procure in-house hardware and manage in-house datacenter 
─ Dynamic environments save money because they run only when they are needed 
Enabling Technologies 
• Agile Software Factory: Software Engineering Assembly Line (griddynamics.com)
• Qubell: Enterprise DevOps Platform (qubell.com)
OCTOBER 14 
Thank You 
Max Martynov, VP of Technology, Grid Dynamics 
mmartynov@griddynamics.com 
Victoria Livschitz, CEO and Founder, Qubell 
vlivschitz@qubell.com

DevOps for Big Data - Data 360 2014 Conference
