Last Conference 2017: Big Data in a Production Environment: Lessons Learnt

Featured Project:
Marina Bay Sands Casino Resort, Singapore
Connecting teams project-wide
Big Data in a production environment:
Lessons Learnt
LAST Conference 2017
Mark Grebler - Aconex


What does Big Data mean to you?

Summary
• What is the Insights project
• Big Data for Data Science
• Big Data in a production, user-facing environment
• Lessons Learnt
• Problems still to solve

What’s an Aconex?
Pronounced: Ay-conn-ex

Aconex has flexible data
Highly flexible and customisable data model with low-level concepts = useful for many types of projects
The Insights Project

Flexible data needs transformation
Highly flexible and customisable data model with low-level concepts = difficult to produce meaningful customer reports

What does it look like?

Typical Big Data Architectures

Insights architecture
Looks similar to the other architectures

But differences exist
Quotes from Data Engineers interviewed:
● “We use the AWS console to deploy new infrastructure”
● “I add new hardware by buying a new box and connecting it to the network”
● “We deploy by copying the jar file to the cluster”
● “We don’t have any CI, I just build it on my box”
● “We test by running it over some data and ensuring it doesn’t crash”
● “We have some rudimentary tests”

What are the differences?
Other Big Data Projects
● Internal client
● Simple authentication
● For Data Scientists
● Single environment
○ Sometimes 2 or 3
● Manual infrastructure management
● Sanity testing
● Manual integration
● Manual deployment
● Unrestricted data access
Insights Project
● External client
● Integrated authentication
● For end users
● Multiple environments
○ Due to data sovereignty (10)
● Infrastructure as code
● Unit → end-to-end testing
● Continuous integration
● Single-step deployment
● Data access restrictions
It’s not always so black and white, but the “Other Big Data Projects” list represents quite a lot of other projects I’ve seen.

Lessons Learnt
● VPN to control data access
● Autoscaling application server
● Network independence
● Zero-downtime deployments with automatic rollback
○ Elastic Beanstalk provides this

Lessons Learnt: Infrastructure-as-code
● Must be easily reproducible because we need to do it 10+ times
● Automation of infrastructure management
○ Infrastructure is a core part of the Big Data project, so it must be treated as being as important as our application code
○ Terraform is used to manage the infrastructure, including:
■ Networking and VPN management
■ Security
■ Provisioning VMs and other infrastructure
■ Replication and ingestion of data from Data Centres
■ Database Administration and Automation
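
A minimal sketch of what "reproducible 10+ times" can look like in practice: one shared Terraform configuration applied once per environment. The environment names, the `envs/<name>.tfvars` layout, and the use of one Terraform workspace per environment are all assumptions for illustration; the talk does not describe how the Insights configuration is organised. The sketch only builds the command lines and does not run them.

```python
from typing import List

# Hypothetical environment names; the talk only says there are 10+ environments.
ENVIRONMENTS = ["au-prod", "uk-prod", "us-prod"]


def terraform_commands(env: str) -> List[List[str]]:
    """Build the CLI invocations that apply the shared config to one environment.

    Assumes one Terraform workspace and one variables file per environment.
    """
    return [
        ["terraform", "workspace", "select", env],
        ["terraform", "apply", f"-var-file=envs/{env}.tfvars", "-auto-approve"],
    ]


def plan_all(envs: List[str]) -> List[List[str]]:
    """Flatten the per-environment commands into one ordered run list."""
    commands: List[List[str]] = []
    for env in envs:
        commands.extend(terraform_commands(env))
    return commands


if __name__ == "__main__":
    # Print the commands rather than executing them.
    for cmd in plan_all(ENVIRONMENTS):
        print(" ".join(cmd))
```

Because every environment goes through the same function, adding an eleventh environment is a one-line change to the list rather than a new manual procedure.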

Lessons Learnt: Access segregation
● Different accounts for testing and production
● Separate VPCs for each environment
● Multiple user roles allow fine-grained control of access
● VPN used as a further level to restrict data access
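
The layering above (roles first, VPN as an additional gate) can be sketched as a small access check. This is an illustrative model only: the `Role` and `Request` types, the role names, and the environment names are hypothetical, and in a real deployment these rules would live in the cloud provider's identity and network configuration rather than in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Role:
    name: str
    environments: FrozenSet[str]  # environments this role may read data from


@dataclass(frozen=True)
class Request:
    role: Role
    environment: str
    via_vpn: bool  # True when the request arrives over the VPN


def is_allowed(req: Request) -> bool:
    """Both layers must pass: the VPN gate and the role's environment grant."""
    if not req.via_vpn:
        return False
    return req.environment in req.role.environments


# Hypothetical roles for illustration.
tester = Role("tester", frozenset({"test"}))
operator = Role("operator", frozenset({"test", "au-prod"}))
```

For example, `is_allowed(Request(tester, "au-prod", via_vpn=True))` is `False` because the role lacks the grant, and `is_allowed(Request(operator, "au-prod", via_vpn=False))` is `False` because the request bypasses the VPN.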

Lessons Learnt: Integration and deployment
Continuous Integration
Once built, versioned artifacts are pushed to S3 buckets
Deployments
Ansible is used to roll out new versions of the application and transformations
Infrastructure
Terraform controls the base infrastructure
● Deployments run in parallel across environments
● Docker image used for deployments to control dependencies
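
A rough sketch of the two ideas on this slide: a versioned artifact key, and a deployment that fans out to every environment in parallel. The key layout, the `.jar` extension, and the `deploy` stub are assumptions for illustration; in the real pipeline the stub would be replaced by the Ansible run, whose details the talk does not give.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def artifact_key(app: str, version: str) -> str:
    """Hypothetical S3 key layout for a versioned build artifact."""
    return f"{app}/{version}/{app}-{version}.jar"


def deploy(env: str, key: str) -> str:
    """Stand-in for the real rollout step (e.g. invoking Ansible for one environment)."""
    return f"{env}: deployed {key}"


def deploy_all(envs: List[str], key: str) -> Dict[str, str]:
    """Deploy the same artifact to every environment concurrently."""
    # max(1, ...) keeps the executor valid even for an empty environment list.
    with ThreadPoolExecutor(max_workers=max(1, len(envs))) as pool:
        results = pool.map(lambda env: deploy(env, key), envs)
        return dict(zip(envs, results))
```

Deploying one immutable, versioned artifact everywhere means every environment runs the same build, which is what makes a single-step deployment across many environments practical.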

Lessons Learnt: Automate Testing
● Big Data testing is hard
● Automated unit tests to ensure transformations are correct
○ We pair with our QA to generate the data, and validate the expected output for the unit tests
○ TDD-ish, but testing is often done after development
● Automated Integration tests using a large data set
○ To ensure regressions haven’t occurred
● Manual end-to-end sanity tests
○ This should be automated in the future
● Manual exploratory testing

Problems to resolve
● Testing
○ Big Data testing is time consuming
■ Particularly around data generation
○ How to effectively automate testing of the infrastructure
○ How to automate end-to-end sanity testing
● Infrastructure
○ CI/CD with Terraform
○ So many moving parts make management difficult
● Ingestion and transformations
○ How to move from batch processing to incremental or streaming
○ Removing the database clones
● Effectively communicating to the business what we’re doing and why
○ Why are things so slow?
