INTRO TO BIG DATA WITH AWS
Szilveszter Molnar, Senior Pre-Sales Engineer

@moszinet
GENERIC AWS
REGIONS AND AVAILABILITY ZONES
▸ Regions
▸ Availability Zones
▸ Edge Locations
GENERIC AWS
▸ Amazon EC2 - Elastic Compute Cloud (volume types, pricing model)
▸ Amazon S3 - Simple Storage Service
▸ IAM - Identity and Access Management
GLACIER
CONCEPTS
▸ Keep all your data at a much lower cost than S3
▸ Move data automatically from S3 to Glacier with lifecycle rules
▸ Meets compliance requirements for long-term data retention
▸ Vault Lock (policies use the IAM policy language)
KINESIS
AWS KINESIS
WHAT IS KINESIS?
▸ Platform for streaming data on AWS
▸ "sometimes TBs per hour"
AWS KINESIS
KINESIS COMPONENTS
▸ Streams
▸ Firehose
▸ Analytics
KINESIS STREAMS
AWS KINESIS STREAMS
CONCEPTS
▸ Streams - ordered sequence of data records
▸ Data record - Sequence Number, Partition Key, Data Blob
▸ 1MB max per data blob
▸ Retention period - 24 hours by default, up to 7 days
▸ Producers, Consumers (see the producer sketch below)
▸ Shards
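A minimal sketch of the producer side with boto3 (the stream name, partition key and payload are hypothetical):

import json
import boto3

kinesis = boto3.client("kinesis")

# Each record is a data blob (max 1MB) plus a partition key;
# the partition key decides which shard the record goes to.
response = kinesis.put_record(
    StreamName="clickstream",                    # hypothetical stream name
    Data=json.dumps({"user": "u-42", "action": "click"}),
    PartitionKey="u-42",
)
print(response["ShardId"], response["SequenceNumber"])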
AWS KINESIS STREAMS
KINESIS STREAMS - SHARDS
▸ Fixed unit of capacity
▸ Read
▸ 5 transactions / sec
▸ 2MB / sec
▸ Write
▸ 1000 records / sec
▸ 1MB / sec
KINESIS FIREHOSE
AWS KINESIS FIREHOSE
CONCEPTS
▸ Firehose delivery stream
▸ Record - 1MB max
▸ Data producer (see the producer sketch below)
▸ Buffer size and buffer interval control when data is delivered
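A minimal sketch of a data producer with boto3 (delivery stream name and payload are hypothetical); Firehose buffers the records until the buffer size or buffer interval is reached, then delivers them:

import json
import boto3

firehose = boto3.client("firehose")

# Records (max 1MB each) are buffered by size/interval before delivery.
firehose.put_record(
    DeliveryStreamName="events-to-s3",           # hypothetical delivery stream
    Record={"Data": json.dumps({"event": "signup", "user": "u-42"}) + "\n"},
)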
AWS KINESIS FIREHOSE
DATA DELIVERY
▸ Kinesis Stream (as a source)
▸ S3
▸ Redshift
▸ Elasticsearch
KINESIS ANALYTICS
AWS KINESIS ANALYTICS
TERMINOLOGY
▸ Input
▸ Application Code
▸ In-App-Streams
▸ Pumps
▸ Streaming SQL
▸ Output
AWS KINESIS ANALYTICS
STREAMING SQL
▸ Tumbling Window
[...] GROUP BY
FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND)
▸ Sliding Window
SELECT AVG(change) OVER W1 AS avg_change
FROM "SOURCE_SQL_STREAM_001"
WINDOW W1 AS (PARTITION BY ticker_symbol RANGE INTERVAL '10' SECOND PRECEDING)
AWS KINESIS ANALYTICS
STREAMING SQL - TUMBLING WINDOW
SQS
CONCEPTS
▸ Simple Queue Service
▸ Send, Store, Retrieve messages
▸ between applications
▸ Acts as a buffer
▸ At-least-once delivery
▸ FIFO queues are also supported (exactly-once processing)
SQS
CONCEPTS
▸ Queues are created in regions
▸ Message retention up to 14 days (4 days by default)
▸ No message priority - use separate queues (e.g. high and low priority) instead
SQS
EXAMPLE
Web servers put requests on a request queue; processor nodes consume them and put results on a response queue that the web servers read. A minimal sketch of this pattern follows.
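A minimal sketch of both sides with boto3 (queue URL and message bodies are hypothetical):

import boto3

sqs = boto3.client("sqs")
request_queue = "https://sqs.eu-west-1.amazonaws.com/123456789012/request-queue"  # hypothetical

# Web server side: enqueue a work item; SQS buffers it until a processor is free.
sqs.send_message(QueueUrl=request_queue, MessageBody='{"job_id": "42", "action": "render"}')

# Processor node side: receive, do the work, then delete the message.
# Delivery is at-least-once, so processing should be idempotent.
resp = sqs.receive_message(QueueUrl=request_queue, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    # ... process the job and send the result to the response queue ...
    sqs.delete_message(QueueUrl=request_queue, ReceiptHandle=msg["ReceiptHandle"])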
IOT
CONCEPTS
▸ Managed Cloud Platform for the Internet of Things
▸ Billions of devices, trillions of messages
▸ Messages can be routed to other AWS services and other
devices
IOT
IOT AND BIG DATA
▸ IoT devices produce data (see the publish sketch below)
▸ Analyze streaming data in real time
▸ Process and store data
▸ Don't worry about capacity, scaling or infrastructure
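A minimal sketch of publishing a device message through the AWS IoT data plane with boto3 (topic and payload are hypothetical; real devices usually publish over MQTT with the AWS IoT Device SDK and certificates):

import json
import boto3

iot_data = boto3.client("iot-data")

# Publish one telemetry message; a rule on this topic can route it to
# Kinesis, Firehose, DynamoDB, S3, SQS/SNS, Lambda, etc.
iot_data.publish(
    topic="sensors/room1/temperature",           # hypothetical MQTT topic
    qos=1,
    payload=json.dumps({"celsius": 21.5}),
)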
IOT
ARCHITECTURE
GENERIC IOT
Things publish to AWS IoT, which routes their messages on to services such as Elasticsearch, Kinesis Firehose, Kinesis Streams, DynamoDB, Machine Learning, CloudWatch, S3, SQS and SNS.
IOT
ARCHITECTURE
▸ AWS IoT SDK
▸ Authentication & authorization
▸ Device gateway
▸ Rule engine
▸ Device shadows
▸ Device registry
DATA PIPELINE
TERMINOLOGY
▸ A web service to process and move data between AWS compute and storage services or on-premises data sources
▸ ETL workflow
▸ Runs on an EC2 instance or EMR cluster that is provisioned automatically
DATA PIPELINE
ARCHITECTURE
▸ Data Nodes - the source or destination of your data
▸ Activity
▸ Hive
▸ Pig
▸ Shell script
▸ etc
▸ Preconditions
▸ Schedules
DYNAMO DB
CONCEPTS
▸ Fully managed NoSQL database
▸ No visible "servers"
▸ Single-digit millisecond latency
▸ Document & Key-Value models
▸ No storage limitations
▸ Runs on SSD
DYNAMO DB
CONCEPTS
▸ Collection of Tables
▸ Throughput is provisioned per table
▸ Write Capacity Units - writes per second, in 1KB blocks
▸ Read Capacity Units - reads per second, in 4KB blocks
▸ Eventually consistent by default (data replicated across three facilities within a region)
▸ Strongly consistent reads are supported
DYNAMO DB
CONCEPTS
▸ Schemaless
▸ Contains Items (rows) and Attributes (fields)
▸ Special Attributes (used in the sketch below)
▸ Partition Key
▸ Sort Key
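A minimal sketch with boto3, assuming a hypothetical Orders table whose partition key is customer_id and whose sort key is order_id:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")                 # hypothetical table

# Put an item: only the key attributes are fixed, the rest is schemaless.
table.put_item(Item={
    "customer_id": "c-123",                      # partition key
    "order_id": "o-456",                         # sort key
    "total": 42,
    "items": ["book", "pen"],
})

# Reads are eventually consistent by default; ask for a strongly consistent one.
resp = table.get_item(Key={"customer_id": "c-123", "order_id": "o-456"}, ConsistentRead=True)
print(resp.get("Item"))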
DYNAMO DB
DATA TYPES
▸ String
▸ Number
▸ Binary
▸ Bool
▸ Null
▸ Document (List/Map)
▸ Set
DYNAMO DB
DYNAMO DB IN THE AWS ECOSYSTEM
▸ On EMR, DynamoDB is integrated with Hive
▸ Copy data to/from S3 with Data Pipeline
▸ Lambda functions can be used as triggers (via DynamoDB Streams)
▸ Move data into Redshift with the COPY command (see the sketch below)
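A minimal sketch of the last point, run from Python with psycopg2 (cluster endpoint, credentials, table names and IAM role are all hypothetical):

import psycopg2

# Redshift speaks the PostgreSQL protocol, so psycopg2 can connect to it.
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn, conn.cursor() as cur:
    # COPY pulls the DynamoDB table in bulk; READRATIO limits how much of the
    # table's provisioned read capacity the load is allowed to use.
    cur.execute("""
        COPY product_catalog
        FROM 'dynamodb://ProductCatalog'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        READRATIO 50;
    """)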
EMR
CONCEPTS
▸ Managed Cluster Platform
▸ Hadoop, Spark, Presto, HBase, etc...
▸ Master, Core and Task Nodes
▸ A cluster runs in a single AZ
EMR
CORE NODE
▸ Like a slave node
▸ Runs tasks
▸ HDFS
▸ Data node, Node Manager, Application Master
EMR
TASK NODE
▸ Like a slave node
▸ No HDFS
▸ Can be added/removed from a running cluster
▸ Extra compute capacity
EMR
STORAGE OPTIONS
▸ Instance Store
▸ Ephemeral - data is lost when the instance terminates or fails
▸ use when High I/O performance is necessary
▸ EBS for HDFS
▸ EMRFS
EMR
EMRFS
▸ An implementation of the HDFS interface
▸ A wrapper over S3
▸ Resize or terminate clusters without losing data
▸ Multiple clusters can point to the same data in S3
▸ EMRFS & HDFS can be used side by side
▸ S3DistCp - copy data between S3 and HDFS (see the sketch below)
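A minimal sketch of submitting S3DistCp as a step on a running cluster with boto3 (cluster id, HDFS path and bucket are hypothetical):

import boto3

emr = boto3.client("emr")

# Run s3-dist-cp via command-runner.jar to copy HDFS output into S3.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                 # hypothetical cluster id
    Steps=[{
        "Name": "Copy HDFS output to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "hdfs:///output",
                     "--dest", "s3://my-bucket/output/"],    # hypothetical bucket
        },
    }],
)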
EMR
EMRFS - CONSISTENT VIEWS
▸ S3
▸ Read-after-write consistency for new objects
▸ Eventually consistent for overwrites & deletes
▸ Work around it by enabling EMRFS consistent view (tracks object metadata in a DynamoDB table)
EMR
INSTANCE TYPES
▸ Guidelines by Amazon
▸ MapReduce
▸ Batch Oriented - M3 & M4 instance types
▸ Machine Learning
▸ P2, C3, C4
▸ Spark
▸ R3, R4
▸ Large Instance store for HDFS
▸ I2, D2
LAMBDA
CONCEPTS
▸ Function as a Service
▸ Billed only for execution time (100ms increments)
▸ Isolated & Stateless
▸ Persistence - your responsibility
▸ Event-driven
LAMBDA
LIMITS
▸ /tmp space - 512MB
▸ threads - 1,024
▸ max execution duration - 300 seconds
LAMBDA
ARCHITECTURE
An event triggers the Lambda function, which processes the data and sends the result to a destination (another event or service).
LAMBDA
EXAMPLES
▸ S3 → Lambda: a new file in S3 triggers a function that adds a watermark and writes back the file with the watermark
▸ DynamoDB → Lambda: a change in the data triggers a function that sends an email with a report
A minimal handler sketch for the S3 example follows.
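A minimal sketch of the S3-triggered handler (the watermarking itself is a hypothetical placeholder; the event layout is the standard S3 notification format):

import boto3

s3 = boto3.client("s3")

def add_watermark(data):
    # Placeholder for the real image-processing step.
    return data

def handler(event, context):
    # One S3 notification event can contain several records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(Bucket=bucket, Key="watermarked/" + key, Body=add_watermark(body))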
REDSHIFT
CONCEPTS
▸ Fully managed petabyte scale data warehouse
▸ MPP (Massively Parallel Processing) database
▸ OLAP & BI applications
▸ ANSI SQL compatible
▸ Column oriented
▸ Single AZ
▸ Backups to S3
REDSHIFT
ARCHITECTURE
A leader node in front of multiple compute nodes.
REDSHIFT
LEADER NODE
▸ SQL Endpoint
▸ Coordinates parallel query execution
▸ Stores metadata
REDSHIFT
COMPUTE NODES
▸ Execute queries in parallel
▸ Scale out/in, up/down
REDSHIFT
COLUMNAR DATABASE
▸ Stores data on disk grouped by columns, not rows
Row-based storage vs. column-based storage (illustration)
REDSHIFT
BENEFITS
▸ Queries typically read only a few columns - less disk I/O
▸ Fast data aggregation (e.g. SUM(sales))
▸ Better compression (similar values are stored together)
QUICKSIGHT
CONCEPTS
▸ Cloud based analytics service
▸ Build visualizations
▸ Ad-hoc analysis
▸ Connect to AWS services, EC2 or on-premises DB
QUICKSIGHT
SUPPORTED DATA SOURCES
▸ Redshift
▸ Aurora
▸ Athena
▸ RDS (MariaDB, SQL Server, MySQL, PostgreSQL)
▸ Files
▸ Salesforce
AWS BIG DATA
READING MATERIALS
▸ Kinesis
▸ Kinesis Firehose Transformation with Lambda
▸ Implementing producers with Amazon Kinesis Producer Library
▸ EMR
▸ Best practices for Amazon EMR (2013)
▸ Lambda
▸ Big Data processing with serverless MapReduce
▸ Redshift
▸ Redshift Table Design
▸ Optimizing for Star Schemas and Interleaved Sorting
AWS BIG DATA
HOMEWORK - OPTIONAL
1. Create a transient EMR cluster that automatically starts
your application (Hive, Pig, Spark, etc...)
2. Create a Redshift cluster, experiment with the COPY
command (bulk load rows) vs INSERT statement
‣ 100 rows at least
Note: Make sure you delete the resources you created when you are done.
THANK YOU

Big Data on AWS