2. AWS Athena: New Features
New JDBC/ODBC drivers released, which are
2-5x faster
Supports CTAS (CREATE TABLE AS SELECT; see the
sketch after this list)
The drivers integrate with MS Active Directory for
access control (in lieu of access keys)
Supports Views
Introduced Athena Work Groups (in beta)
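A minimal sketch of kicking off a CTAS query from Python with boto3; the bucket, database, and table names are hypothetical placeholders.

# Minimal sketch: run an Athena CTAS query via boto3.
# Bucket, database, and table names are hypothetical.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE sales_parquet
WITH (format = 'PARQUET',
      external_location = 's3://my-bucket/sales_parquet/') AS
SELECT * FROM sales_raw
"""

resp = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])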
3. Athena Work Groups
Can be defined to identify different types of
workload or teams
Integrated with CloudWatch; allows collection
of metrics at the workgroup level
Cost control: can set a per-workgroup query threshold
on data scanned (GB) or execution time (see the
sketch after this list)
Can trigger alerts when a threshold is breached;
option to cancel the query as well
Option to disable workgroups
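A minimal sketch of the cost controls above, creating a workgroup with a per-query scan cutoff via boto3; the workgroup name, bucket, and limit are hypothetical.

# Minimal sketch: create an Athena workgroup with a per-query
# data-scanned cutoff and CloudWatch metrics enabled.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",                            # hypothetical name
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-bucket/results/"},
        "BytesScannedCutoffPerQuery": 10 * 1024**3,   # cancel queries scanning > 10 GB
        "PublishCloudWatchMetricsEnabled": True,      # workgroup-level metrics
    },
)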
5. New Features in AWS
Feature: Description
EMR Notebooks: A "serverless" Jupyter notebook executed
on an EMR cluster and managed via the EMR console.
Can attach a notebook to any EMR cluster; notebooks
are stored in S3
AWS Glue: Support for Hive, Spark & Presto
S3 Select: Available in Spark, Java, Python. Objects must
be in CSV, JSON, or Parquet with UTF-8 encoding (see the
sketch after this table)
AWS Textract (in beta): Automatically extracts text and
data from scanned documents, including data stored in
tables & forms
Predictive Scaling: ML-based feature that tries to predict
the workload and scale accordingly
Serverless Aurora (in beta): Serverless offering of
Aurora, similar to Athena
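A minimal sketch of S3 Select from Python, pushing a SQL filter down to S3 so only matching rows come back; the bucket and key are hypothetical.

# Minimal sketch: S3 Select against a CSV object in S3.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",                  # hypothetical bucket/key
    Key="data/users.csv",
    ExpressionType="SQL",
    Expression="SELECT s.name FROM s3object s WHERE CAST(s.age AS INT) > 30",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:            # the response is an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode())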
6. AWS Glue
Crawlers for automatic data discovery; auto-generate
schema & partitions (see the sketch after this list)
Generates Python/Scala code which can be customized
Job bookmarks – keep track of data that is already
processed
Has a built-in scheduler and is integrated with AWS
CloudWatch for notifications
Catalog can be shared with Athena & Redshift Spectrum
Fine-grained catalog permissions at the table or
connection level
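A minimal sketch of defining a crawler via boto3 that discovers schema and partitions under an S3 prefix on a daily schedule; all names and ARNs are hypothetical.

# Minimal sketch: a Glue crawler for automatic data discovery.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",   # built-in scheduler: daily at 02:00 UTC
)
glue.start_crawler(Name="sales-crawler")   # or wait for the schedule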
7. AWS Auto Scaling
Free service to scale EC2/ECS/Aurora/DynamoDB
Scaling can be defined on any CloudWatch metric,
including custom metrics
Typical metrics used: CPU Utilization, memory, incoming
traffic
Scaling Options: Manual/Scheduled/Dynamic
New option added: Predictive Scaling
Uses ML, based on historical patterns of resource
utilization
Daily forecasts at 60-minute granularity
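A minimal sketch of the dynamic option: a target tracking policy that scales an EC2 Auto Scaling group on CPU utilization. The group name is hypothetical.

# Minimal sketch: dynamic scaling on a CloudWatch-backed metric.
import boto3

asg = boto3.client("autoscaling")

asg.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,   # add/remove instances to hold ~60% CPU
    },
)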
8. Data Lake Architecture:
Best Practices
Build decoupled systems: future-proof
Design for the ability to scale indefinitely as the
business grows
Focus on core competencies, reduce dependencies on
managing or building infrastructure.
Be very cost conscious, build ‘pay-for-what-you-use’
architecture
Enable your application to leverage ML
9. Data Lake Architecture:
Best Practices
Design and build for multi-tenancy
Consolidate small files before loading into S3
For full data scans, use the Avro file format
Always preserve the raw data in S3 Infrequent Access
(IA) or Glacier
Use automated test suites on every
release/commit
10. Best Practices for
Data Lake Security
Encrypt data at rest (KMS) and in transit (SSL)
Set ownership of S3 buckets at the user/team level; it
reduces the attack surface
Disable S3 delete using IAM roles (see the sketch after
this list)
Buckets should always be created per business domain and
should have security policies baked in
Backup data across regions
Allow S3 access based on tags (e.g. Redshift, HIPAA query)
or IAM roles (dev/DS)
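The slide blocks deletes via IAM roles; a minimal sketch of the same intent expressed as a bucket policy instead is below, with a hypothetical bucket name.

# Minimal sketch: deny s3:DeleteObject on a bucket (a bucket-policy
# alternative to the IAM-role approach named on the slide).
import boto3, json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDelete",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:DeleteObject",
        "Resource": "arn:aws:s3:::my-bucket/*",   # hypothetical bucket
    }],
}

boto3.client("s3").put_bucket_policy(Bucket="my-bucket",
                                     Policy=json.dumps(policy))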
11. Best Practices for
Data Lake Security
Use AWS Config to detect & notify on any S3 policy
changes
Use AWS Macie to detect & classify PII and sensitive
data
Control data access through views, don’t expose the
core tables directly
Use & leverage centralized data catalogs
12. EMR Best Practices
Run stateless: the metastore should be
remote (MySQL or Glue)
Use a combination of spot instances to reduce cost
(design for re-runnability)
Don't specify the Availability Zone, to get the cheapest
instances
A single spot node termination will not interrupt the
cluster (new feature: graceful decommissioning)
Build instance fleets with a mix of instance types
(c5/r5, ...) and markets (spot/on-demand)
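A minimal sketch of such a fleet via boto3: mixed instance types, mixed spot/on-demand capacity, and multiple subnets so EMR can pick the cheapest Availability Zone. All names, subnets, and capacities are hypothetical.

# Minimal sketch: an EMR cluster built from instance fleets.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="etl-cluster",
    ReleaseLabel="emr-5.20.0",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceFleets": [
            {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 2,
             "TargetSpotCapacity": 6,     # mostly spot; design for re-runnability
             "InstanceTypeConfigs": [{"InstanceType": "c5.2xlarge"},
                                     {"InstanceType": "r5.2xlarge"}]},
        ],
        # Multiple subnets let EMR pick the cheapest Availability Zone
        "Ec2SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)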
13. Choosing the Right DB
The days of one-size-fits-all DBs are over
DBs have become specialized based on their use
case; different types:
E.g. it is better (faster & cheaper) to use a time-series
DB for storing & plotting time-based data than an
RDBMS.
Relational, Key-Value, MPP, In-Memory, Document,
Graph, Time-Series, Ledger, Columnar, Distributed Object
14. AWS Quantum Ledger DB
(QLDB)
Keeps a transparent, immutable, and
cryptographically verifiable transaction log over
distributed ledgers
Every entry is written into a journal and cannot be
changed. The journal is append-only and maintains two
states: Current & History
Each txn generates a digest using a cryptographic hash
function (SHA-256), which guarantees integrity (see the
sketch below)
Serverless, SQL support & ACID compliant
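A minimal sketch of the idea behind that verifiability, as an illustration only (not QLDB's API): each entry's SHA-256 digest chains over the previous digest, so altering any past entry changes every later digest and is detectable.

# Minimal sketch: a SHA-256 hash chain over journal entries.
import hashlib, json

def digest(prev: bytes, entry: dict) -> bytes:
    h = hashlib.sha256()
    h.update(prev)
    h.update(json.dumps(entry, sort_keys=True).encode())
    return h.digest()

d0 = b"\x00" * 32                            # genesis digest
d1 = digest(d0, {"txn": 1, "amount": 100})
d2 = digest(d1, {"txn": 2, "amount": -40})
print(d2.hex())   # verifying d2 requires the intact history behind it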
15. Machine Learning
Services
1. AWS SageMaker
Fully managed service to build/train/deploy
machine learning models
Out-of-the-box optimization for the following ML
packages: TensorFlow, Apache MXNet, PyTorch, Chainer,
Scikit-learn, SparkML, Horovod, Keras, and Gluon
Supports: supervised, unsupervised & reinforcement
learning
Supported by EC2 P3dn.24xlarge (8 Tesla V100 GPUs)
16. Machine Learning
Services (contd..)
Framework & model agnostic (use models from a pre-
trained model library or bring your own)
Integrated with Jupyter notebook
On demand and scalable training clusters
Integrated with different AWS services like
Lambda, API Gateway, CloudWatch, etc.
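A minimal sketch of the train-then-deploy flow with the SageMaker Python SDK, using v1-era argument names; the training script, role ARN, and bucket are hypothetical.

# Minimal sketch: train and deploy a TensorFlow model on SageMaker.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                  # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",     # on-demand training cluster
    framework_version="1.12",
    py_version="py3",
)
estimator.fit("s3://my-bucket/training-data/")
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.xlarge")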
17. Machine Learning
Services (contd..)
2. AWS SageMaker Ground Truth
Most ML effort is spent labeling the training
data
This package can significantly reduce the time,
effort, and cost required to create datasets for
training
3. AWS SageMaker Neo
Container to deploy ML models on any hardware or
application
18. AWS Data Services
Service: Description
Data Migration Service: Transfers data to the AWS cloud;
supports homogeneous migrations such as Oracle to
Oracle, and heterogeneous migrations such as Oracle to
Aurora
AWS Macie: Amazon Macie is a security service that uses
machine learning to automatically discover, classify,
and protect sensitive data in AWS such as PII and IP. It
provides you with dashboards and alerts
AWS Direct Connect: Lets you establish a dedicated
network connection between your network and one of
the AWS facilities. It saves on bandwidth cost and
provides consistent network performance
AWS Snowball: An 80 TB physical device shipped to the
client location, where the data is copied; it is then
shipped back to an AWS facility for data copy into AWS.
Good for petabyte-scale data copy, to save money on
transfer cost
19. Netflix Push Messaging
Case Study
Netflix had polling infrastructure to poll all its
client interfaces
Polling is inherently inefficient; they were able to
reduce web traffic by 12% by switching to push
They have open-sourced the complete push
messaging framework: Zuul
Available on GitHub:
https://github.com/Netflix/zuul/wiki/Push-Messaging
20. Netflix Case Study
Interesting Challenges
Zuul uses persistent connections, which make it stateful;
this makes deployments difficult
Solved by using a cluster swap; however, this created the
'thundering herd' problem (everyone trying to connect at
the same time)
Resolved by introducing a connection lifetime (~30 min)
and randomizing that lifetime (see the sketch after this
list)
A Zuul push cluster can auto-scale based on the number
of open connections
AWS allows auto-scaling based on any metric defined in
CloudWatch
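A minimal sketch of the randomized-lifetime fix: each client closes and reconnects after a base lifetime plus random jitter, so reconnects spread out instead of synchronizing into a herd. The jitter fraction is a hypothetical choice.

# Minimal sketch: randomized connection lifetime to avoid a herd.
import random

BASE_LIFETIME_S = 30 * 60    # ~30 minutes, per the slide

def connection_lifetime_s(jitter_fraction: float = 0.2) -> float:
    # Base lifetime +/- up to 20% random jitter
    jitter = random.uniform(-jitter_fraction, jitter_fraction)
    return BASE_LIFETIME_S * (1 + jitter)

print(connection_lifetime_s())   # e.g. ~1620.0 (27 min) on one draw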
21. AWS Forecast
(in beta)
Predicts future points in a time series given historical data
Uses deep learning models developed by Amazon
Accuracy is the cornerstone of any forecasting; this
service is 50% more accurate than traditional methods
It comes with 8 pre-packaged models: 5 custom-built
algorithms and 3 traditional ones for benchmarking
Inputs: historical time series (e.g. electricity consumption
per year), any related data (weather), metadata (location)
Output: ability to visualize the forecast and export via an
API
All forecasting is probabilistic, for a specific prediction
interval including a margin of error
22. AWS Forecast
Models
Traditional: Exponential Smoothing, ARIMA, Prophet
Amazon: Auto-Regressive LSTM, Spline Quantile
Forecaster, Multi-Horizon Quantile RNN (MQ-RNN)
Pricing
Generated forecasts: $0.60 per 1,000 forecasts
Data storage: $0.088 per GB
Training hours: $0.238 per hour
23. New Terms Learnt
Term: Meaning
Dark Data: Data that is hidden in files or otherwise not
accessible to the enterprise
Data Ponds: Data that lives in silos across the enterprise
Blast Radius: The impact of a deployment, e.g. a
microservice deployment will have a smaller blast radius
than a monolithic API
Thundering Herd: When everyone tries to connect at the
same time, e.g. if your service goes down and, after it is
restored, all users try to reconnect simultaneously,
overwhelming the system
Data Decay: The value of data decreases over time; data is
most valuable near its creation
33. More Info
AWS Slide Deck
AWS Videos
AWS re:Invent Recap
In 2019: Dec 2 – Dec 6 @ Las Vegas
Editor's Notes
CTAS is huge, Work groups are essentially resource queue
Predictive Scaling needs up to two weeks of historical data
THE HIGHEST PERFORMING GPU INSTANCE in the cloud
Reinforcement: the model learns by interacting with real-world scenarios
For example, building a computer vision system that is reliable enough to identify objects - such as traffic lights, stop signs, and pedestrians - requires thousands of hours of video recordings that consist of hundreds of millions of video frames. Each one of these frames needs all of the important elements like the road, other cars, and signage to be labeled by a human before any work can begin on the model you want to develop.
Amazon SageMaker Ground Truth significantly reduces the time and effort required to create datasets for training to reduce costs. These savings are achieved by using machine learning to automatically label data. The model is able to get progressively better over time by continuously learning from labels created by human labelers.
They also have a best-fit model option, where AWS will choose the model based on the data