2. AWS Athena: New Features
New JDBC/ODBC drivers released, which are
2-5x faster
Supports CTAS (CREATE TABLE AS SELECT; see the
sketch after this list)
The drivers integrate with MS Active Directory for
access control (in lieu of access keys)
Supports Views
Introduced Athena Work Groups (in beta)
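A minimal sketch of kicking off a CTAS query from Python with boto3; the bucket, database, and table names are hypothetical placeholders.

# Minimal sketch: run an Athena CTAS query via boto3.
# Bucket, database, and table names are hypothetical.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE sales_parquet
WITH (format = 'PARQUET',
      external_location = 's3://my-bucket/sales_parquet/') AS
SELECT * FROM sales_raw
"""

resp = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])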
3. Athena Work Groups
Can be defined to identify different types of
workload or teams
Integrated with CloudWatch; allows collection
of metrics at the workgroup level
Cost control: can set a per-workgroup query threshold
on data scanned (GB) or execution time (see the
sketch after this list)
Can trigger alerts when a threshold is breached;
option to cancel the query as well
Option to disable workgroups
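A minimal sketch of the cost controls above, creating a workgroup with a per-query scan cutoff via boto3; the workgroup name, bucket, and limit are hypothetical.

# Minimal sketch: create an Athena workgroup with a per-query
# data-scanned cutoff and CloudWatch metrics enabled.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",                            # hypothetical name
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://my-bucket/results/"},
        "BytesScannedCutoffPerQuery": 10 * 1024**3,   # cancel queries scanning > 10 GB
        "PublishCloudWatchMetricsEnabled": True,      # workgroup-level metrics
    },
)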
5. New Features in AWS
Feature: Description
EMR Notebooks: A "serverless" Jupyter notebook executed
on an EMR cluster and managed via the EMR console.
Can attach a notebook to any EMR cluster; notebooks
are stored in S3
AWS Glue: Support for Hive, Spark & Presto
S3 Select: Available in Spark, Java, Python. Objects must
be in CSV, JSON, or Parquet with UTF-8 encoding (see the
sketch after this table)
AWS Textract (in beta): Automatically extracts text and
data from scanned documents, including data stored in
tables & forms
Predictive Scaling: ML-based feature that tries to predict
the workload and scale accordingly
Serverless Aurora (in beta): Serverless offering of
Aurora, similar to Athena
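A minimal sketch of S3 Select from Python, pushing a SQL filter down to S3 so only matching rows come back; the bucket and key are hypothetical.

# Minimal sketch: S3 Select against a CSV object in S3.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",                  # hypothetical bucket/key
    Key="data/users.csv",
    ExpressionType="SQL",
    Expression="SELECT s.name FROM s3object s WHERE CAST(s.age AS INT) > 30",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:            # the response is an event stream
    if "Records" in event:
        print(event["Records"]["Payload"].decode())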
6. AWS Glue
Crawlers for automatic data discovery; auto-generate
schema & partitions (see the sketch after this list)
Generates Python/Scala code which can be customized
Job bookmarks – keep track of data that is already
processed
Has a built-in scheduler and is integrated with AWS
CloudWatch for notifications
Catalog can be shared with Athena & Redshift Spectrum
Fine-grained catalog permissions at the table or
connection level
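A minimal sketch of defining a crawler via boto3 that discovers schema and partitions under an S3 prefix on a daily schedule; all names and ARNs are hypothetical.

# Minimal sketch: a Glue crawler for automatic data discovery.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",   # built-in scheduler: daily at 02:00 UTC
)
glue.start_crawler(Name="sales-crawler")   # or wait for the schedule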
7. AWS Auto Scaling
Free service to scale EC2/ECS/Aurora/DynamoDB
Scaling can be defined on any CloudWatch metric,
including custom metrics
Typical metrics used: CPU Utilization, memory, incoming
traffic
Scaling Options: Manual/Scheduled/Dynamic
New option added: Predictive Scaling
Uses ML, based on historical patterns of resource
utilization
Daily forecasts at 60-minute granularity
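A minimal sketch of the dynamic option: a target tracking policy that scales an EC2 Auto Scaling group on CPU utilization. The group name is hypothetical.

# Minimal sketch: dynamic scaling on a CloudWatch-backed metric.
import boto3

asg = boto3.client("autoscaling")

asg.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,   # add/remove instances to hold ~60% CPU
    },
)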
8. Data Lake Architecture:
Best Practices
Build decoupled systems: future-proof
Design for the ability to scale indefinitely as the
business grows
Focus on core competencies, reduce dependencies on
managing or building infrastructure.
Be very cost conscious, build ‘pay-for-what-you-use’
architecture
Enable your application to leverage ML
9. Data Lake Architecture:
Best Practices
Design and build for multi-tenancy
Consolidate small files before loading into S3
For full data scans, use the Avro file format
Always preserve the raw data in S3 Infrequent Access
(IA) or Glacier
Use automated test suites on every
release/commit
10. Best Practices for
Data Lake Security
Encrypt data at rest (KMS) and in transit (SSL)
Set ownership of S3 buckets at the user/team level; it
reduces the attack surface
Disable S3 delete using IAM roles (see the sketch after
this list)
Buckets should always be created per business domain and
should have security policies baked in
Backup data across regions
Allow S3 access based on tags (e.g. Redshift, HIPAA query)
or IAM roles (dev/DS)
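The slide blocks deletes via IAM roles; a minimal sketch of the same intent expressed as a bucket policy instead is below, with a hypothetical bucket name.

# Minimal sketch: deny s3:DeleteObject on a bucket (a bucket-policy
# alternative to the IAM-role approach named on the slide).
import boto3, json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDelete",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:DeleteObject",
        "Resource": "arn:aws:s3:::my-bucket/*",   # hypothetical bucket
    }],
}

boto3.client("s3").put_bucket_policy(Bucket="my-bucket",
                                     Policy=json.dumps(policy))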
11. Best Practices for
Data Lake Security
Use AWS Config to detect & notify on any S3 policy
changes
Use AWS Macie to detect & classify PII and sensitive
data
Control data access through views, don’t expose the
core tables directly
Use & leverage centralized data catalogs
12. EMR Best Practices
Run stateless: the metastore should be
remote (MySQL or Glue)
Use a combination of spot instances to reduce cost
(design for re-runnability)
Don't specify the Availability Zone, to get the cheapest
instances
A single spot node termination will not interrupt the
cluster (new feature: graceful decommissioning)
Build instance fleets with a mix of instance types
(c5/r5, ...) and markets (spot/on-demand)
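A minimal sketch of such a fleet via boto3: mixed instance types, mixed spot/on-demand capacity, and multiple subnets so EMR can pick the cheapest Availability Zone. All names, subnets, and capacities are hypothetical.

# Minimal sketch: an EMR cluster built from instance fleets.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="etl-cluster",
    ReleaseLabel="emr-5.20.0",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceFleets": [
            {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
            {"InstanceFleetType": "CORE",
             "TargetOnDemandCapacity": 2,
             "TargetSpotCapacity": 6,     # mostly spot; design for re-runnability
             "InstanceTypeConfigs": [{"InstanceType": "c5.2xlarge"},
                                     {"InstanceType": "r5.2xlarge"}]},
        ],
        # Multiple subnets let EMR pick the cheapest Availability Zone
        "Ec2SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)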
13. Choosing the Right DB
The days of one-size-fits-all DBs are over
DBs have become specialized based on their use
case; different types:
E.g. it is better (faster & cheaper) to use a time-series
DB for storing & plotting time-based data than an
RDBMS.
Relational, Key-Value, MPP, In-Memory, Document,
Graph, Time-Series, Ledger, Columnar, Distributed Object
14. AWS Quantum Ledger DB
(QLDB)
Keeps a transparent, immutable, and
cryptographically verifiable transaction log over
distributed ledgers
Every entry is written into a journal and cannot be
changed. The journal is append-only and maintains two
states: Current & History
Each txn generates a digest using a cryptographic hash
function (SHA-256), which guarantees integrity (see the
sketch below)
Serverless, SQL support & ACID compliant
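A minimal sketch of the idea behind that verifiability, as an illustration only (not QLDB's API): each entry's SHA-256 digest chains over the previous digest, so altering any past entry changes every later digest and is detectable.

# Minimal sketch: a SHA-256 hash chain over journal entries.
import hashlib, json

def digest(prev: bytes, entry: dict) -> bytes:
    h = hashlib.sha256()
    h.update(prev)
    h.update(json.dumps(entry, sort_keys=True).encode())
    return h.digest()

d0 = b"\x00" * 32                            # genesis digest
d1 = digest(d0, {"txn": 1, "amount": 100})
d2 = digest(d1, {"txn": 2, "amount": -40})
print(d2.hex())   # verifying d2 requires the intact history behind it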
15. Machine Learning
Services
1. AWS SageMaker
Fully managed service to build/train/deploy
machine learning models
Out-of-the-box optimization for the following ML
packages: TensorFlow, Apache MXNet, PyTorch, Chainer,
Scikit-learn, SparkML, Horovod, Keras, and Gluon
Supports: supervised, unsupervised & reinforcement
learning
Supported by EC2 P3dn.24xlarge (8 Tesla V100 GPUs)
16. Machine Learning
Services (contd..)
Framework & model agnostic (use models from a pre-
trained model library or bring your own)
Integrated with Jupyter notebook
On demand and scalable training clusters
Integrated with different AWS services like
Lambda, API Gateway, CloudWatch, etc.
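A minimal sketch of the train-then-deploy flow with the SageMaker Python SDK, using v1-era argument names; the training script, role ARN, and bucket are hypothetical.

# Minimal sketch: train and deploy a TensorFlow model on SageMaker.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                  # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    train_instance_count=1,
    train_instance_type="ml.p3.2xlarge",     # on-demand training cluster
    framework_version="1.12",
    py_version="py3",
)
estimator.fit("s3://my-bucket/training-data/")
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.xlarge")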
17. Machine Learning
Services (contd..)
2. AWS SageMaker Ground Truth
Most ML effort is spent labeling the training
data
This package can significantly reduce the time,
effort, and cost required to create datasets for
training
3. AWS SageMaker Neo
Container to deploy ML models on any hardware or
application
18. AWS Data Services
Service: Description
Data Migration Service: Transfers data to the AWS cloud;
supports homogeneous migrations such as Oracle to
Oracle, and heterogeneous migrations such as Oracle to
Aurora
AWS Macie: Amazon Macie is a security service that uses
machine learning to automatically discover, classify,
and protect sensitive data in AWS such as PII and IP. It
provides you with dashboards and alerts
AWS Direct Connect: Lets you establish a dedicated
network connection between your network and one of
the AWS facilities. It saves on bandwidth cost and
provides consistent network performance
AWS Snowball: An 80 TB physical device shipped to the
client location, where the data is copied; it is then
shipped back to an AWS facility for data copy into AWS.
Good for petabyte-scale data copy, to save money on
transfer cost
19. Netflix Push Messaging
Case Study
Netflix had polling infrastructure to poll all its
client interfaces
Polling is inherently inefficient; they were able to
reduce web traffic by 12% by switching to push
They have open-sourced the complete push
messaging framework: Zuul
Available on GitHub:
https://github.com/Netflix/zuul/wiki/Push-Messaging
20. Netflix Case Study
Interesting Challenges
Zuul uses persistent connections, which make it stateful;
this makes deployments difficult
Solved by using a cluster swap; however, this created the
'thundering herd' problem (everyone trying to connect at
the same time)
Resolved by introducing a connection lifetime (~30 min)
and randomizing that lifetime (see the sketch after this
list)
A Zuul push cluster can auto-scale based on the number
of open connections
AWS allows auto-scaling based on any metric defined in
CloudWatch
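A minimal sketch of the randomized-lifetime fix: each client closes and reconnects after a base lifetime plus random jitter, so reconnects spread out instead of synchronizing into a herd. The jitter fraction is a hypothetical choice.

# Minimal sketch: randomized connection lifetime to avoid a herd.
import random

BASE_LIFETIME_S = 30 * 60    # ~30 minutes, per the slide

def connection_lifetime_s(jitter_fraction: float = 0.2) -> float:
    # Base lifetime +/- up to 20% random jitter
    jitter = random.uniform(-jitter_fraction, jitter_fraction)
    return BASE_LIFETIME_S * (1 + jitter)

print(connection_lifetime_s())   # e.g. ~1620.0 (27 min) on one draw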
21. AWS Forecast
(in beta)
Predicts future points in a time series given historical data
Uses deep learning models developed by Amazon
Accuracy is the cornerstone of any forecasting; this
service is 50% more accurate than traditional methods
It comes with 8 pre-packaged models: 5 custom-built
algorithms and 3 traditional ones for benchmarking
Inputs: historical time series (e.g. electricity consumption
per year), any related data (weather), metadata (location)
Output: ability to visualize the forecast and export via an
API
All forecasting is probabilistic, for a specific prediction
interval including a margin of error
22. AWS Forecast
Models
Traditional: Exponential Smoothing, ARIMA, Prophet
Amazon: Auto-Regressive LSTM, Spline Quantile
Forecaster, Multi-Horizon Quantile RNN (MQ-RNN)
Pricing
Generated forecasts: $0.60 per 1,000 forecasts
Data storage: $0.088 per GB
Training hours: $0.238 per hour
23. New Terms Learnt
Term: Meaning
Dark Data: Data that is hidden in files or otherwise not
accessible to the enterprise
Data Ponds: Data that lives in silos across the enterprise
Blast Radius: The impact of a deployment, e.g. a
microservice deployment will have a smaller blast radius
than a monolithic API
Thundering Herd: When everyone tries to connect at the
same time, e.g. if your service goes down and, after it is
restored, all users try to reconnect simultaneously,
overwhelming the system
Data Decay: The value of data decreases over time; data is
most valuable near its creation
33. More Info
AWS Slide Deck
AWS Videos
AWS re:Invent Recap
In 2019: Dec 2 – Dec 6 @ Las Vegas
Editor's Notes
CTAS is huge, Work groups are essentially resource queue
Predictive Scaling needs up to two weeks of historical data
THE HIGHEST PERFORMING GPU INSTANCE in the cloud
Reinforcement: the model learns by interacting with real-world scenarios
For example, building a computer vision system that is reliable enough to identify objects - such as traffic lights, stop signs, and pedestrians - requires thousands of hours of video recordings that consist of hundreds of millions of video frames. Each one of these frames needs all of the important elements like the road, other cars, and signage to be labeled by a human before any work can begin on the model you want to develop.
Amazon SageMaker Ground Truth significantly reduces the time and effort required to create datasets for training to reduce costs. These savings are achieved by using machine learning to automatically label data. The model is able to get progressively better over time by continuously learning from labels created by human labelers.
They also have a best-fit model option, where AWS will choose the model based on the data