AWS provides cloud computing services that let companies scale infrastructure up or down with demand, avoiding both over-provisioning for peak usage and running out of capacity during spikes. AWS offers compute, storage, database, analytics, machine learning, Internet of Things, mobile, security, hybrid, and enterprise application services that businesses can use to build scalable applications.
Let’s start with our customer’s life before AWS. There were 6 clusters in a hardware DC. Each cluster had its own master MySQL database, a set of slave DBs, a set of web apps, its own portal app, tools, etc.
Drivers for change:
BF preparation was too expensive/complicated
New cluster creation and disaster recovery were sequences of documented manual actions (runbooks)
A few outages during 2008 pushed us toward more redundancy and a failover DC
AWS technologies: none
Two AWS DCs + master on-premises hardware DC
AWS Benefits:
Rolling deployment: take a DC out of the pool and deploy
Fast and “unlimited” capacity scaling up/down
Lessons learned:
AWS is flexible and scalable
A lot of infra changes needed to be made
All components need to be cloud-ready (built to fail)
We needed to change and adapt to the cloud.
Moving forward:
The slaves routinely fell behind when we had to ingest lots of new data, sometimes by 10 hours or more. We needed to rethink our entire stack.
AWS technologies: EC2, EIP
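The slides don’t say how a DC was pulled from the pool, but with only EC2 and EIP in play at this stage, traffic shifts were presumably done by remapping Elastic IPs. A minimal boto3 sketch under that assumption (all IDs are made up):

```python
# Hypothetical: shift public traffic off a DC by remapping its Elastic IP to a
# standby instance, then deploy to the now-idle one. IDs below are made up.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def shift_traffic(allocation_id, standby_instance_id):
    # Re-pointing the EIP takes the old instance out of the serving pool.
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        AllowReassociation=True,
    )

shift_traffic("eipalloc-0123456789abcdef0", "i-0123456789abcdef0")
```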
Full platform/organizational re-architecture. Agile/DevOps/…
Re-architecture drivers:
* app/organizational changes
* next level of flexibility, performance, and reliability
* solve our multi-region replication problem
* get rid of our individual clusters
App/Org changes:
* Monolithic Java app was broken up into a set of small services, each supported by a decentralized engineering team.
* The teams were responsible for the entire service life-cycle, from Development to QA to Operations.
* Engineering adopted Agile as a development methodology, where previously we were waterfall driven.
Shorter release cycle: from a coordinated release every 8-12 weeks to weekly releases, with coordinated releases possible at any time
Tools changes:
Puppet adoption
From Zabbix/Nagios to Datadog
From distributed to centralized logging
Data stack:
For our DB system of record, we chose Cassandra for its multi-region replication abilities (DynamoDB did not have this feature) and cloud-native operational qualities.
Elasticsearch replaced Solr for similar reasons.
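To make the multi-region replication point concrete, here is a minimal sketch with the DataStax Python driver: a keyspace replicated across two data centers. The contact point and DC names are illustrative, not the customer’s actual topology.

```python
# Minimal sketch of the property that drove the Cassandra choice: a keyspace
# replicated across two data centers via NetworkTopologyStrategy.
# Contact point and DC names ("us_east", "eu_west") are invented.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS records
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us_east': 3,
        'eu_west': 3
    }
""")
```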
AWS technologies:
IaC: CloudFormation (example after this list)
3 VPCs (dev/qa/prod), 3 regions with 3 AZs each
Public and Private ELBs
AutoScaling
EBS – early adopter of big volumes
MySQL RDS
Route53
SWF/SQS/SNS/SES
ECS – clustered Docker container orchestration service
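As a flavor of the CloudFormation-based IaC above, a hedged sketch: a tiny YAML template (an SQS queue, picked from the service list) deployed with boto3. Stack and queue names are made up.

```python
# Minimal IaC sketch: an inline YAML CloudFormation template deployed with
# boto3. In practice templates live in version control, not in code.
import boto3

TEMPLATE = """
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  WorkQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: demo-work-queue
"""

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(StackName="demo-queue-stack", TemplateBody=TEMPLATE)
cfn.get_waiter("stack_create_complete").wait(StackName="demo-queue-stack")
```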
Early ECS adoption:
Early ECS with an HAProxy balancer layer + Consul (service discovery) + Consul-template (dynamic balancing via HAProxy) + Registrator (service registration in Consul) + a custom deployment tool (based on Thor + the AWS Ruby SDK)
Complex and hard to manage/troubleshoot
The extra layers added cost
Missing features
Current ECS implementation:
ALB with host-based and URL-based target rules (see the sketch after this list)
Clear and simple deployment process via YAML CFN templates
New features as Docker labels
Missing: multiple ELBs and Service Discovery
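The host- and URL-based routing above maps onto ALB listener rules. A boto3 sketch with placeholder ARNs and hostnames, not the actual service layout:

```python
# Placeholder ARNs; the real listener/target groups come from the CFN stack.
import boto3

elbv2 = boto3.client("elbv2")
LISTENER = "arn:aws:elasticloadbalancing:us-east-1:111111111111:listener/app/demo/..."

# Host-based rule: route api.example.com to the API service's target group.
elbv2.create_rule(
    ListenerArn=LISTENER,
    Priority=10,
    Conditions=[{"Field": "host-header", "Values": ["api.example.com"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": "arn:aws:...:targetgroup/api/..."}],
)

# URL-based rule: route /images/* to the image service's target group.
elbv2.create_rule(
    ListenerArn=LISTENER,
    Priority=20,
    Conditions=[{"Field": "path-pattern", "Values": ["/images/*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": "arn:aws:...:targetgroup/images/..."}],
)
```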
Lambda + API Gateway + EFS + ECS example
Task: run an ML TensorFlow image trainer for a particular image class on request
AWS Batch does not support EFS – sad
API Gateway + Lambda can launch a task on the ECS cluster, but what if there are no free resources?
Lambda increases the ASG size as well (sketch after this list)
CloudWatch decreases the ASG size
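A hedged sketch of that Lambda, assuming hypothetical cluster, task, and ASG names: it tries to start the trainer task and, when ECS reports a resource shortage, bumps the ASG so a retry can succeed.

```python
# Sketch only: start the TensorFlow trainer task on ECS; if the cluster has no
# free resources, grow the ASG. Cluster/task/ASG names are hypothetical.
import boto3

ecs = boto3.client("ecs")
asg = boto3.client("autoscaling")

def handler(event, context):
    resp = ecs.run_task(
        cluster="ml-cluster",
        taskDefinition="tf-image-trainer",
        overrides={"containerOverrides": [
            {"name": "trainer", "environment": [
                {"name": "IMAGE_CLASS", "value": event["image_class"]},
            ]},
        ]},
    )
    if resp["failures"]:  # e.g. RESOURCE:MEMORY when no container instance fits
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=["ml-cluster-asg"]
        )["AutoScalingGroups"][0]
        asg.set_desired_capacity(
            AutoScalingGroupName="ml-cluster-asg",
            DesiredCapacity=group["DesiredCapacity"] + 1,
        )
        return {"status": "scaling", "reason": resp["failures"][0]["reason"]}
    return {"status": "started", "taskArn": resp["tasks"][0]["taskArn"]}
```

Scale-down is left to a CloudWatch alarm on low cluster reservation, per the last point above.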
Tokenizer ECS DEMO
Data Pipeline example
Customer’s actions -> S3 parquet data -> EMR cluster -> Spark steps (joins, sorts, aggregations) -> S3 parquet data -> Hive indexing -> ES (the EMR stage is sketched after the feature list)
DP features:
Preconditions
Intermediate actions (scripts)
AMI with frameworks
Spots
Flexible scaling
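The pipeline itself is orchestrated by AWS Data Pipeline; as a self-contained illustration of the EMR stage it drives (Spot instances plus one Spark step), here is a direct boto3 sketch. Bucket paths, instance types, and the bid price are invented.

```python
# Illustrative boto3 equivalent of the pipeline's EMR stage: a transient
# cluster on Spot core nodes running one Spark step over S3 parquet data.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="actions-aggregation",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge",
             "InstanceCount": 4, "Market": "SPOT", "BidPrice": "0.20"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster, terminates when done
    },
    Steps=[{
        "Name": "joins-sorts-aggregations",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py",
                     "--in", "s3://my-bucket/actions/",
                     "--out", "s3://my-bucket/aggregated/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```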
Additional talk – cost saving:
Spot/Reserved instances
T2 instances
S3 lifecycle policies + storage classes (sketch below)
Resource inspection
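For the S3 item, a minimal lifecycle-policy sketch (bucket and prefix are made up): transition aging objects to cheaper storage classes, then expire them.

```python
# Cost-saving sketch: move objects under logs/ to Standard-IA after 30 days,
# to Glacier after 90, and delete them after a year. Bucket name is invented.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "tier-and-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }]},
)
```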