SlideShare a Scribd company logo
1 of 200
Download to read offline
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Shawn Gandhi
Head of Solutions Architecture, AWS Canada
August 16, 2016
Hello, Canada
@shawnagram
~10 Years Young
Amazon S3 launched: March 14th 2006
Our New Canadian Office
• 120 Bremner Blvd, 26th Floor
Who Are We Exactly?
Customer
Account
Manager
Solutions
Architect
Tech
Account
Manager
Pro
Services
Training &
Cert
Partner
Team
Biz Dev
Canadian System Integrators
AWS Global Infrastructure
12 Regions
32 Availability Zones
54 Edge Locations
AZ
AZ
AZ AZ AZ
What is a Region?
• Each datacenter has a purpose built
network
AZ
AZ
AZ AZ AZ
What is a Region?
• Metro-area DWDM links between AZs
• AZs <2ms apart & usually <1ms
• Each datacenter has a purpose built
network
Big Data Solutions Day
Introductory Session
Antoine Genereux
Solutions Architect, AWS
12:30 PM Introductory Session
2:00 PM Best Practices on Real-time Streaming Analytics
3:00 PM Break
3:15 PM Getting started with Amazon Machine Learning
4:15 PM Building your first big data application on AWS
Today
Thank you to our Sponsor
1,000,000+
Active Customers Per Month
AWS In 2016:
58%
YOY Growth
Q2 2015 TO Q2 20166
$11B+
Revenue run rate
Of 14 Others, Combined
Infrastructure
Intelligence
Harness data generated from
your systems and infrastructure
Advanced
Analytics
Anticipate future behaviors and
conduct what-if analysis
Largest Ecosystem of ISVs
Solutions vetted by the AWS Partner Competency Program
Data
Integration
Move, synchronize, cleanse,
and manage data
Data Analysis &
Visualization
Turn data into actionable insight
and enhance decision making
Largest Ecosystem of Integrators
Qualified consulting partners with AWS big data expertise
North America
Europe, Middle East and Africa
Asia Pacific and Japan
http://aws.analytic.business
Big Data Analytics in
AWS Marketplace
Thousands of products pre-integrated with the AWS Cloud.
290 specific to big data
Analysis & Visualization
Solutions are pre-configured and ready to run
on AWS
Advanced Analytics
Deploy when you need it, 1-Click launch
in multiple regions around the world
Metered pricing by the hour. Pay only for
what you use. Volume licensing available
Data Integration
Extract valuable information from your
historical and current data
Predict future business performance; location,
text, social, sentiment analysis
Extract, migrate, or prepare and clean
your data for accurate analysis
Proven Customer Success
The vast majority of big data use cases deployed in the
cloud today run on AWS.
The Evolution of Big Data Processing
Descriptive Real-time Predictive
Dashboards;
Traditional query &
reporting
Clickstream analysis;
Ad bidding;
streaming data
Inventory forecasting;
Fraud detection;
Recommendation
engines
What happened
before
What’s happening
now
Probability of “x”
happening
Volume
Velocity
Variety
Big Data is Breaking Traditional IT Infrastructure
- Too many tools to setup, manage &
scale
- Unsustainable costs
- Difficult & undifferentiated heavy
lifting just to get started
- New & expensive skills
- Bigger responsibility (sensitive data)
Big Data on AWS
1. Broadest & Deepest Functionality
2. Computational Power that is Second to None
3. Petabyte-scale Analytics Made Affordable
4. Easy & Powerful Tools for Migrating Data
5. Security You Can Trust
1. Broadest & Deepest Functionality
Big Data
Storage
Data
Warehousing
Real-time
Streaming
Distributed Analytics
(Hadoop, Spark, Presto)
NoSQL
Databases
Business
Intelligence
Relational
Databases
Internet of
Things (IoT)
Machine
Learning
Server-less
Compute
Elasticsearch
Core Components for
Big Data Workloads
Big Data Storage for Virtually All AWS Services
Store anything
Object storage
Scalable
99.999999999% durability
Extremely low cost
Amazon S3
Real-time processing
High throughput; elastic
Easy to use
Integration with S3, EMR, Redshift,
DynamoDB
Amazon
Kinesis
Real-time Streaming Data
Amazon Kinesis
Streams
• For Technical Developers
• Build your own custom
applications that process
or analyze streaming
data
Amazon Kinesis
Firehose
• For all developers, data
scientists
• Easily load massive
volumes of streaming data
into S3, Amazon Redshift
and Amazon Elasticsearch
Amazon Kinesis
Analytics
• For all developers, data
scientists
• Easily analyze data
streams using standard
SQL queries
• Now Available
Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver and process streams on AWS
Managed Hadoop framework
Spark, Presto, Hbase, Tez, Hive, etc.
Cost effective; Easy to use
On-demand and spot pricing
HDFS & S3 File systems
Amazon EMR
Distributed Analytics
NoSQL Database
Seamless scalability
Zero administration
Single digit millisecond latency
Amazon
DynamoDB
Fast & Flexible NoSQL Database Service
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; start at $0.25/hour
Amazon
Redshift
Fully Managed Petabyte-scale Data Warehouse
Amazon Redshift has security built-in
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM Support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
Easy to use, managed machine learning service
built for developers
Robust, powerful machine learning technology
based on Amazon’s internal systems
Create models using your data already stored in the
AWS cloud
Deploy models to production in seconds
Amazon
Machine Learning
Machine Learning
Simple Model Tuning
Sample Use Cases
Fraud detection Detecting fraudulent transactions, filtering spam emails,
flagging suspicious reviews, …
Personalization Recommending content, predictive content loading,
improving user experience, …
Targeted marketing Matching customers and offers, choosing marketing
campaigns, cross-selling and up-selling, …
Content classification Categorizing documents, matching hiring managers and
resumes, …
Churn prediction Finding customers who are likely to stop using the
service, free-tier upgrade targeting, …
Customer support Predictive routing of customer emails, social media
listening, …
2. Computational Power that is Second to None
Over 2x computational instance types as any other cloud
platform to address a wide range of big data use cases.
Compute
Optimized
(C3, C4)
Memory
Optimized
(R3, X1)
GPU
Instances
(G2)
Storage
Optimized
(I2, D2)
Plus general purpose (M4), low cost burstable (T2), and dedicated instances
Intel® Processor Technologies
Intel® AVX – Dramatically increases performance for highly parallel HPC workloads
such as life science engineering, data mining, financial analysis, media processing
Intel® AES-NI – Enhances security with new encryption instructions that reduce the
performance penalty associated with encrypting/decrypting data
Intel® Turbo Boost Technology – Increases computing power with performance that
adapts to spikes in workloads
Intel Transactional Synchronization (TSX) Extensions – Enables execution of
transactions that are independent to accelerate throughput
P state & C state control – provides granular performance tuning for cores and sleep
states to improve overall application performance
3. Affordable Petabyte-scale Analytics
AWS helps customers maximize the value of Big Data
investments while reducing overall IT costs
Secure,
Highly Durable storage
$28.16 / TB / month
Data
Archiving
$7.16 / TB / month
Real-time
streaming data load
$0.035 / GB
10-node
Spark Cluster
$0.15 / hr
Petabyte-scale
Data Warehouse
$0.25 / hr
Amazon Glacier Amazon S3 Amazon RedshiftAmazon EMRAmazon Kinesis
NASDAQ migrated to Amazon Redshift, achieving cost
savings of 57% compared to legacy system
Volume
Avg 7B rows per day.
Peak of 20B rows
Performance
Queries running faster.
Loads finish before 11PM
Scale
Loading more data than
ever stored by legacy DWH
FINRA handles approximately 75 billion market events every
day to build a holistic picture of trading in the U.S.
Deter
misconduct by
enforcing the rules
Detect
and prevent wrongdoing
in the U.S. markets
Discipline
those who break
the rules
4. Easy, Powerful Tools for Migrating Data
AWS provides the broadest range of tools for easy, fast,
secure data movement to and from the AWS cloud.
AWS Direct
Connect
ISV
Connectors
Amazon
Kinesis
Firehose
Amazon S3
Transfer
Acceleration
AWS Storage
Gateway
Database
Migration
Service
AWS
Snowball
What is Snowball? Petabyte scale data transport
E-ink shipping
label
Ruggedized
case
“8.5G Impact”
All data encrypted
end-to-end
50TB & 80TB
10G network
Rain & dust
resistant
Tamper-resistant
case & electronics
Start your first migration in 10 minutes or less
Keep your apps running during the migration
Replicate within, to, or from Amazon EC2 or Amazon RDS
Move data to the same or different database engine
AWS
Database Migration
Service
5. Security & Compliance You Can Trust
• AWS manages 1800+ security controls so you don’t have to
• You benefit from an environment built for the most security sensitive
organizations
• You get to define the right security controls for your workload
sensitivity
• You always have full ownership and control of your data
Broadest Services to Secure Applications
NETWORKING
IDENTITY
ENCRYPTION
COMPLIANCE
Virtual
Private Cloud
Web Application
Firewall
IAM Active Directory
Integration
SAML Federation
KMS Cloud HSM Server-side
Encryption
Encryption
SDK
Service Catalog CloudTrail Config Config Rules Inspector
Key AWS Certifications and Assurance Programs
Design Patterns &
Architectures
Amazon Web Services
Time to Answer (Latency)
Throughput
collect store process /
analyze
consume /
visualize
data answers
Time to Answer (Latency)
Throughput
Cost
Simplify Big Data Processing
Architectural Principles
• Decoupled “data bus”
• Data → Store → Process → Answers
• Use the right tool for the job
• Data structure, latency, throughput, access patterns
• Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
• No/low admin
• Big data ≠ big cost
Storage Compute
• Decoupled “data bus”
• Completely autonomous components
• Independently scale storage & compute resources on-demand
Collect → Store → Process | Analyze → Consume
Decoupling Storage and Compute Resources
Multi-Stage Decoupled “Data Bus”
Multiple stages
Storage decouples multiple processing stages
Store Process Store ProcessData
process
store
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
Amazon Elasticsearch
Service
Apache Kafka
Amazon SQS
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
Amazon DynamoDB
Streams
HotHotWarm
FastSlowFast
BatchMessageInteractiveStreamML
SearchSQLNoSQLCacheFileQueueStream
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Amazon QuickSight
Apps & Services
Analysis&visualizationNotebooksIDEAPI
LoggingIoTApplicationsTransportMessaging
ETL
Multiple Stream Processing Applications Can Read
from Amazon Kinesis
Amazon
Kinesis
AWS Lambda
Amazon
DynamoDB
Amazon
Kinesis S3
Connector
process
store
Amazon S3
Which stream storage should I use?
Amazon
Kinesis
DynamoDB
Streams
Amazon SQS
Amazon SNS
Kafka
Managed Yes Yes Yes No
Ordering Yes Yes No Yes
Delivery at-least-once exactly-once at-least-once at-least-once
Lifetime 7 days 24 hours 14 days Configurable
Replication 3 AZ 3 AZ 3 AZ Configurable
Throughput No Limit No Limit No Limit ~ Nodes
Parallel Clients Yes Yes No (SQS) Yes
MapReduce Yes Yes No Yes
Record size 1MB 400KB 256KB Configurable
Cost Low Higher(table cost) Low-Medium Low (+admin)
Amazon EMR
Analysis Frameworks Can Read from Multiple
Data Stores
Amazon
Kinesis
AWS
Lambda
Amazon S3
Data
Amazon
DynamoDB
AnswerSpark
Streaming
Amazon
Kinesis S3
Connector
process
store
Spark
SQL
Amazon EMR
Real-time Analytics
Apache
Kafka
KCL
AWS Lambda
Spark
Streaming
Apache
Storm
Amazon
SNS
Amazon
ML
Notifications
Amazon
ElastiCache
(Redis)
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Alert
App state
Real-time prediction
KPI
process
store
DynamoDB
Streams
Amazon
Kinesis
Data
stream
What Stream Processing Technology Should I Use?
Spark Streaming Apache Storm Amazon Kinesis
Client Library
AWS Lambda Amazon EMR (Hive,
Pig)
Scale /
Throughput
~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes
Batch or Real-
time
Real-time Real-time Real-time Real-time Batch
Manageability Yes (Amazon EMR) Do it yourself Amazon EC2 +
Auto Scaling
AWS managed Yes (Amazon EMR)
Fault Tolerance Single AZ Configurable Multi-AZ Multi-AZ Single AZ
Programming
languages
Java, Python, Scala Any language
via Thrift
Java, via
MultiLangDaemon (
.Net, Python, Ruby,
Node.js)
Node.js, Java Hive, Pig, Streaming
languages
Query Latency
High
(Lower is better)
Case Study: Clickstream Analysis
Hearst Corporation monitors trending content for over 250 digital properties
worldwide and processes more than 30TB of data per day, using an architecture
that includes Amazon Kinesis and Spark running on Amazon EMR.
Store → Process | Analyze → Answers
Interactive
&
Batch
Analytics
Amazon S3
Amazon EMR
Hive
Pig
Spark
Amazon
ML
process
store
Consume
Amazon
Redshift
Amazon EMR
Presto
Spark
Batch
Interactive
Batch prediction
Real-time prediction
Amazon
Kinesis
Firehose
What Data Processing Technology Should I Use?
Amazon
Redshift
Impala Presto Spark Hive
Query
Latency
Low Low Low Low Medium (Tez) –
High (MapReduce)
Durability High High High High High
Data Volume 2 PB
Max
~Nodes ~Nodes ~Nodes ~Nodes
Managed Yes Yes (EMR) Yes (EMR) Yes (EMR) Yes (EMR)
Storage Native HDFS / S3A* HDFS / S3 HDFS / S3 HDFS / S3
SQL
Compatibility
High Medium High Low (SparkSQL) Medium (HQL)
Query Latency
High
(Low is better)
Medium
Batch layer
Amazon
Kinesis
Data
stream
process
store
Amazon
Kinesis S3
Connector
A
p
p
l
i
c
a
t
i
o
n
s
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
answer
Amazon S3
Speed layer
answer
Serving
layer
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
answer
Amazon
ML
KCL
AWS Lambda
Storm
Lambda Architecture
Spark Streaming
on Amazon EMR
Cache
SQL
Request Rate
High Low
Cost/GB
High Low
Latency
Low High
Data Volume
Low High
GlacierStructure
NoSQL
Hot Data Warm Data Cold Data
Low
High
Search
What Data Store Should I Use?
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
Aurora
Amazon
Elasticsearch
Amazon
EMR (HDFS)
Amazon S3 Amazon Glacier
Average
latency
ms ms ms, sec ms,sec sec,min,hrs ms,sec,min
(~ size)
hrs
Data volume GB GB–TBs
(no limit)
GB–TB
(64 TB
Max)
GB–TB GB–PB
(~nodes)
MB–PB
(no limit)
GB–PB
(no limit)
Item size B-KB KB
(400 KB
max)
KB
(64 KB)
KB
(1 MB max)
MB-GB KB-GB
(5 TB max)
GB
(40 TB max)
Request rate High -
Very High
Very High
(no limit)
High High Low – Very
High
Low –
Very High
(no limit)
Very Low
Storage cost
GB/month
$$ ¢¢ ¢¢ ¢¢ ¢ ¢ ¢/10
Durability Low -
Moderate
Very High Very High High High Very High Very High
Hot Data Warm Data Cold Data
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase
my team’s use of Amazon S3. Hoping you could answer
some questions. The current iteration of the design calls for
many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per month
300 2048 1483 777,600,000
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
https://aws.amazon.com/calculator
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
300 2,048 1,483 777,600,000
Amazon S3 or
Amazon
DynamoDB?
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
Scenario 1300 2,048 1,483 777,600,000
Scenario 2300 32,768 23,730 777,600,000
Amazon S3
Amazon DynamoDB
use
use
Case Study: On-demand Big Data Analytics
Redfin uses Amazon EMR with spot instances – dynamically spinning up & down
Apache Hadoop clusters – to perform large data transformations and deliver data
to internal and external customers.
Store → Process → Store → Analyze → Answers
Architectural Principles
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Use Lambda architecture ideas
• Immutable (append-only) log, batch/speed/serving layer
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Big data ≠ big cost
Building a Big Data
Application
Building a Big Data Application
web clients
mobile clients
DBMS
corporate data center
Getting Started
Building a Big Data Application
web clients
mobile clients
DBMS
Amazon Redshift
Amazon
QuickSight
AWS cloudcorporate data center
Adding a data warehouse
Building a Big Data Application
web clients
mobile clients
DBMS
Raw data
Amazon Redshift
Staging Data
Amazon
QuickSight
AWS cloud
Bringing in Log Data
corporate data center
Building a Big Data Application
web clients
mobile clients
DBMS
Raw data
Amazon Redshift
Staging Data
Orc/Parquet
(Query optimized)
Amazon
QuickSight
AWS cloud
Extending your DW to S3
corporate data center
Building a Big Data Application
web clients
mobile clients
DBMS
Raw data
Amazon Redshift
Staging Data
Orc/Parquet
(Query optimized)
Amazon
QuickSight
KinesisStreams
AWS cloud
Adding a real-time layer
corporate data center
Building a Big Data Application
web clients
mobile clients
DBMS
Raw data
Amazon Redshift
Staging Data
Orc/Parquet
(Query optimized)
Amazon
QuickSight
KinesisStreams
AWS cloud
Adding predictive analytics
corporate data center
Building a Big Data Application
web clients
mobile clients
DBMS
Raw data
Amazon Redshift
Staging Data
Orc/Parquet
(Query optimized)
Amazon
QuickSight
KinesisStreams
AWS cloud
Adding encryption at rest with AWS KMS
corporate data center
AWSKMS
Building a Big Data Application
web clients
mobile clients
DBMS
Raw data
Amazon Redshift
Staging Data
Orc/Parquet
(Query optimized)
Amazon
QuickSight
KinesisStreams
AWS cloud
AWSKMS
VPC subnet
SSL/TLS
SSL/TLS
Protecting Data in Transit & Adding Network Isolation
corporate data center
The Evolution of Big Data Processing
Descriptive Real-time Predictive
Dashboards;
Traditional query &
reporting
Clickstream analysis;
Ad bidding;
streaming data
Inventory forecasting;
Fraud detection;
Recommendation
engines
What happened
before
What’s happening
now
Probability of “x”
happening
Amazon Kinesis
Amazon EC2
AWS Lambda
Amazon DynamoDB
Amazon EMR
Amazon Redshift
Amazon QuickSight
Amazon EMR
Amazon Machine Learning
Amazon EMR
Learning Big Data on AWS
aws.amazon.com/big-data
Big Data Quest
Learn at your own pace and practice working with AWS
services for big data on qwikLABS. (3 Hours | Online)
Big Data on AWS
How to use AWS services to process data with Hadoop &
create big data environments (3 Days | Classroom )
Big Data Technology Fundamentals FREE!
Overview of AWS big data solutions for architects or data
scientists new to big data. (3 Hours | Online)
AWS Courses
Self-paced Online Labs
2:00 PM Best Practices on Real-time Streaming Analytics
3:00 PM Break
3:15 PM Getting started with Amazon Machine Learning
4:15 PM Building your first big data application on AWS
Next…
© 2016 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Real-time Streaming Data on AWS
Shawn Gandhi
Head of Solutions Architecture
AWS Canada
@shawnagram
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@32473
class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime”
– 412309129140
MQTT Record <R,AMZN,T,G,R1>
NASDAQ OMX Record
Generated data
Available for analysis
Data volume
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Abraham Wald (1902-1950)
Big data: best served fresh
Big data
Hourly server logs:
were your systems misbehaving one hour
ago
Weekly / Monthly Bill:
what you spent this billing cycle
Daily customer-preferences report from your
website’s click stream:
what deal or ad to try next time
Daily fraud reports:
was there fraud yesterday
Real-time big data
Amazon CloudWatch metrics:
what went wrong now
Real-time spending alerts/caps:
prevent overspending now
Real-time analysis:
what to offer the current customer now
Real-time detection:
block fraudulent use now
Streaming Data Scenarios Across Verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Responsive Data Analysis
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and
conversion
User engagement with
ads, optimized bid/buy
engines
IoT Sensor, device telemetry
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming Online data aggregation,
e.g., top 10 players
Massively multiplayer
online game (MMOG) live
dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics Metrics like impressions
and page views
Recommendation engines,
proactive care
Customer Use Cases
Sonos runs near real-time streaming
analytics on device data logs from
their connected hi-fi audio equipment.
Analyzing 30TB+ clickstream
data enabling real-time insights for
Publishers.
Glu Mobile collects billions of
gaming events data points from
millions of user devices in
real-time every single day.
Nordstorm
Nordstorm recommendation team built
online stylist using Amazon Kinesis
Streams and AWS Lambda.
Amazon Kinesis Deep Dive
Two Main Processing Patterns
Stream processing (real time)
• Real-time response to events in data streams
Examples:
• Proactively detect hardware errors in device logs
• Notify when inventory drops below a threshold
• Fraud detection
Micro-batching (near real time)
• Near real-time operations on small batches of events in data streams
Examples:
• Aggregate and archive events
• Monitor performance SLAs
Amazon Kinesis
Streams
• For Technical Developers
• Build your own custom
applications that process
or analyze streaming
data
Amazon Kinesis
Firehose
• For all developers, data
scientists
• Easily load massive
volumes of streaming data
into S3, Amazon Redshift
and Amazon Elastisearch
Amazon Kinesis
Analytics
• For all developers, data
scientists
• Easily analyze data
streams using standard
SQL queries
• Coming Soon NEW!
Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver and process streams on AWS
Data
Sources
App.4
[Machine
Learning]
AWSEndpoint
App.1
[Aggregate &
De-Duplicate]
Data
Sources
Data
Sources
Data
Sources
App.2
[Metric
Extraction]
Amazon S3
Amazon Redshift
App.3
[Sliding
Window
Analysis]
Availability
Zone
Shard 1
Shard 2
Shard N
Availability
Zone
Availability
Zone
Amazon Kinesis Streams
Managed service for real-time streaming
AWS Lambda
Amazon EMR
Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon
Redshift and Amazon Elasticsearch
Capture and submit
streaming data to Firehose
Analyze streaming data using your
favorite BI tools
Firehose loads streaming data
continuously into S3, Amazon
Redshift and Amazon Elasticsearch
Amazon Kinesis Firehose vs. Amazon Kinesis
Streams
Amazon Kinesis Streams is for use cases that require custom
processing, per incoming record, with sub-1 second processing
latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero
administration, ability to use existing analytics tools based on
Amazon S3, Amazon Redshift and Amazon Elasticsearch, and a
data latency of 60 seconds or higher.
Amazon Kinesis Analytics
• Interact with streaming data in real-time using SQL
• Build fully managed and elastic stream processing applications that
process data for real-time visualizations and alarms
Connect to Kinesis
streams,
Firehose delivery streams
Run standard SQL queries
against data streams
Kinesis Analytics can send processed
data to analytics tools so you can
create alerts and respond in real-time
Streaming Data Ingestion
Problem: Count top 10
hashtags on Twitter and
display
twitter-trends.com
AWS Elastic
Beanstalk
twitter-trends.com
twitter-trends.com website
twitter-trends.com
The solution: streaming Map/Reduce
My top-10
My top-10
My top-10
Global top-10
twitter-trends.com
Core concepts
My top-10
My top-10
My top-10
Global top-10
Data record
Stream
Partition key
Shard
Worker
Shard: 14 17 18 21 23
Data record
Sequence number
twitter-trends.com
How this relates to Amazon Kinesis
Amazon
Kinesis
Amazon
Kinesis application
Core concepts recapped
Data record - Tweet
Stream - All tweets (the Twitter Firehose)
Partition key - Twitter topic (every tweet record belongs to exactly one of these)
Shard - All the data records belonging to a set of Twitter topics that will get grouped
together
Sequence number - Each data record gets one assigned when first ingested
Worker - Processes the records of a shard in sequence number order
twitter-trends.com
Using the Amazon Kinesis API directly
A
m
a
z
o
n
K
i
n
e
s
i
s
iterator = getShardIterator(shardId, LATEST);
while (true) {
[records, iterator] =
getNextRecords(iterator, maxRecsToReturn);
process(records);
}
process(records): {
for (record in records) {
updateLocalTop10(record);
}
if (timeToDoOutput()) {
writeLocalTop10ToDDB();
}
}
while (true) {
localTop10Lists =
scanDDBTable();
updateGlobalTop10List(
localTop10Lists);
sleep(10);
}
A
m
a
z
o
n
K
i
n
i
s
i
s
twitter-trends.com
Challenges with using the Amazon Kinesis API
directly
Kinesis
application
Manual creation of workers and
assignment to shards
How many workers
per EC2 instance?
How many EC2 instances?
A
m
a
z
o
n
K
i
n
e
s
i
s
twitter-trends.com
Using the Amazon Kinesis Client Library
Kinesis
application
Shard mgmt
table
A
m
a
z
o
n
K
i
n
e
s
i
s
twitter-trends.com
Elastic scale and load balancing
Shard mgmt
table
Auto scaling
Group
Amazon
CloudWatch
A
m
a
z
o
n
K
i
n
e
s
i
s
twitter-trends.com
Shard management
Shard mgmt
table
Auto scaling
Group
Amazon
CloudWatch
Amazon
SNS
AWS
Console
• Streams are made of shards
• Each shard ingests up to 1MB/sec, and
1000 records/sec
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours by
default; storage can be extended for
up to 7 days
• Scale Kinesis streams using scaling util
• Replay data inside of 24-hour window
Sample Code for Scaling Shards
java -cp
KinesisScalingUtils.jar-complete.jar
-Dstream-name=MyStream
-Dscaling-action=scaleUp
-Dcount=10
-Dregion=eu-west-1 ScalingClient
Options:
• stream-name - The name of the stream to be scaled
• scaling-action - The action to be taken to scale. Must be one of "scaleUp”, "scaleDown"
or “resize”
• count - Number of shards by which to absolutely scale up or down, or resize
See https://github.com/awslabs/amazon-kinesis-scaling-utils
A
m
a
z
o
n
K
i
n
e
s
i
s
twitter-trends.com
The challenges of fault tolerance
X
X
X Shard mgmt
table
A
m
a
z
o
n
K
i
n
e
s
i
s
twitter-trends.com
Fault tolerance support in KCL
Shard mgmt
table
X
Availability Zone
1
Availability Zone
3
Checkpoint, replay design pattern
Amazon Kinesis
1417182123
Shard-i
235810
Shard
ID
Lock Seq
num
Shard-i
Host A
Host B
Shard
ID
Local top-10
shard-i
0
10
18
X2
3
5
8
10
14
17
18
21
23
0
310
Host AHost B
{#Obama: 10235, #Seahawks: 9835, …}{#Obama: 10235, #Congress: 9910, …}
1023
14
17
18
21
23
Amazon Kinesis Producer Library
• Writes to one or more Amazon Kinesis streams with automatic,
configurable retry mechanism
• Collects records and uses PutRecords to write multiple records to
multiple shards per request
• Aggregates user records to increase payload size and improve
throughput
• Integrates seamlessly with KCL to de-aggregate batched records
• Submits Amazon CloudWatch metrics on your behalf to provide
visibility into producer performance
Putting Data into Amazon Kinesis
Amazon Kinesis Agent – (supports pre-processing)
• http://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html
Pre-batch before Puts for better efficiency
• Consider Flume, Fluentd as collectors/agents
• See https://github.com/awslabs/aws-fluent-plugin-kinesis
Make a tweak to your existing logging
• log4j appender option
• See https://github.com/awslabs/kinesis-log4j-appender
Managed Buffer
• Care about a reliable, scalable
way to capture data
• Defer all other aggregation to
consumer
• Generate Random Partition
Keys
• Ensure a high cardinality for
Partition Keys with respect to
shards, to spray evenly across
available shards
Putting Data into Amazon Kinesis Streams
Streaming Map-Reduce
• Streaming Map-Reduce: leverage
partition keys as a natural way to
aggregate data
• For e.g. Partition Keys per billing
customer, per DeviceId, per stock
symbol
• Design partition keys to scale
• Be aware of “hot partition keys or
shards ”
Determine your partition key strategy
Amazon EMR
Amazon
Kinesis
StreamsStreaming Input
Tumbling/Fixed
Window
Aggregation
Periodic Output
Amazon Redshift
COPY from
Amazon EMR
Common Integration Pattern with Amazon EMR
Tumbling Window Reporting
Apache Spark and Amazon Kinesis Streams
Apache Spark is an in-memory analytics cluster using
RDD for fast processing
Spark Streaming can read directly from an Amazon
Kinesis stream
Amazon software license linking – Add ASL dependency
to SBT/MAVEN project, artifactId = spark-
streaming-kinesis-asl_2.10
KinesisUtils.createStream(‘twitter-stream’)
.filter(_.getText.contains(”Open-Source"))
.countByWindow(Seconds(5))
Example: Counting tweets on a sliding window
Using Spark Streaming with Amazon Kinesis
Streams
1. Use Spark 1.6+ with EMRFS consistent view option – if you
use Amazon S3 as storage for Spark checkpoint
2. Amazon DynamoDB table name – make sure there is only one
instance of the application running with Spark Streaming
3. Enable Spark-based checkpoints
4. Number of Amazon Kinesis receivers is multiple of executors so
they are load-balanced
5. Total processing time is less than the batch interval
6. Number of executors is the same as number of cores per
executor
7. Spark Streaming uses default of 1 sec with KCL
Amazon Kinesis Streams with AWS Lambda
Conclusion
• Amazon Kinesis offers: managed service to build applications, streaming
data ingestion, and continuous processing
• Ingest aggregate data using Amazon Producer Library
• Process data using Amazon Connector Library and open source connectors
• Determine your partition key strategy
• Try out Amazon Kinesis at http://aws.amazon.com/kinesis/
• Technical documentation
• Amazon Kinesis Agent
• Amazon Kinesis Streams and Spark Streaming
• Amazon Kinesis Producer Library Best Practice
• Amazon Kinesis Firehose and AWS Lambda
• Building Near Real-Time Discovery Platform with Amazon Kinesis
• Public case studies
• Glu mobile – Real-Time Analytics
• Hearst Publishing – Clickstream Analytics
• How Sonos Leverages Amazon Kinesis
• Nordstorm Online Stylist
Reference
3:00 PM Break
3:15 PM Getting started with Amazon Machine Learning
4:15 PM Building your first big data application on AWS
Next…
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting Started with
Amazon Machine Learning
Build smart applications by finding patterns
and predictions in your data
@alexbcbr
Alex Coqueiro
Sr. Manager Solutions Architect, Public Sector Team
I want to see an example !!!
http://calgary.abcgov.com.br
Machine learning and smart applications
Machine learning is the technology that
automatically finds patterns in your data and
uses them to make predictions for new data
points as they become available
Machine learning and smart applications
Machine learning is the technology that
automatically finds patterns in your data and
uses them to make predictions for new data
points as they become available
Your data + machine learning = smart applications
Smart applications by example
Based on what you
know about the user:
Will they use your
product?
Smart applications by example
Based on what you
know about the user:
Will they use your
product?
Based on what you
know about an order:
Is this order
fraudulent?
Smart applications by example
Based on what you
know about the user:
Will they use your
product?
Based on what you
know about an order:
Is this order
fraudulent?
Based on what you know
about a news article:
What other articles are
interesting?
Challenges to Building Smart Applications Today
Expertise Technology Operationalization
Limited supply of
data scientists
Many choices, few
mainstays
Complex and error-
prone data workflows
Expensive to hire
or outsource
Difficult to use and
scale
Custom platforms and
APIs
Many moving pieces
lead to custom
solutions every time
Reinventing the model
lifecycle management
wheel
Amazon Machine Learning
Amazon Machine Learning
Easy to use, managed machine learning
service built for developers
Robust, powerful machine learning
technology based on Amazon’s internal
systems
Create models using your data already
stored in the AWS cloud
Deploy models to production in seconds
Three Supported Types of Predictions
Binary Classification
Predict the answer to a Yes/No question
Multi-class classification
Predict the correct category from a list
Regression
Predict the value of a numeric variable
How Do I Get started Using
Amazon Machine Learning?
Build
model
Evaluate and
optimize
Retrieve
predictions
1 2 3
Building smart applications with Amazon ML
Train
model
Evaluate and
optimize
Retrieve
predictions
1 2 3
Building smart applications with Amazon ML
- Create a Datasource object pointing to your data
- Explore and understand your data
- Transform data and train your model
Create a Datasource object
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ds = ml.create_data_source_from_s3(
data_source_id = ’my_datasource',
data_spec= {
'DataLocationS3':'s3://bucket/input/',
'DataSchemaLocationS3':'s3://bucket/input/.schema'},
compute_statistics = True)
Explore and understand your data
Train your model
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_ml_model(
ml_model_id=’my_model',
ml_model_type='REGRESSION',
training_data_source_id='my_datasource')
Train
model
Evaluate and
optimize
Retrieve
predictions
1 2 3
Building smart applications with Amazon ML
- Understand model quality
- Adjust model interpretation
Explore model quality
Fine-tune model interpretation
Fine-tune model interpretation
Train
model
Evaluate and
optimize
Retrieve
predictions
1 2 3
Building smart applications with Amazon ML
- Batch predictions
- Real-time predictions
Batch predictions
Asynchronous, large-volume prediction generation
Request through service console or API
Best for applications that deal with batches of data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_batch_prediction(
batch_prediction_id = 'my_batch_prediction’
batch_prediction_data_source_id = ’my_datasource’
ml_model_id = ’my_model',
output_uri = 's3://examplebucket/output/’)
Real-time predictions
Synchronous, low-latency, high-throughput prediction generation
Request through service API or server or mobile SDKs
Best for interaction applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
ml_model_id=’my_model',
predict_endpoint=’example_endpoint’,
record={’key1':’value1’, ’key2':’value2’})
{
'Prediction': {
'predictedValue': 13.284348,
'details': {
'Algorithm': 'SGD',
'PredictiveModelType': 'REGRESSION’
}
}
}
Thank You!
http://aws.amazon.com/machine-learning
https://github.com/awslabs/machine-learning-samples
https://blogs.aws.amazon.com/bigdata/post/Tx1Z7AR9QTXIWA1/Readmissi
on-Prediction-Through-Patient-Risk-Stratification-Using-Amazon-Machine
@alexbcbr
Alex Coqueiro
Sr. Manager Solutions Architect, Public Sector Team
Learning Model
Machine
Learning
Supervised Unsupervised
Predictive Model
Classification,
Regression
Descritive Models
Clusterization,
Dimension Reduction
Agenda
• Why did we build Amazon Machine Learning?
• What is Amazon Machine Learning?
• How do I get started using Amazon Machine Learning?
• Q&A
Pay-as-you-go and inexpensive
Data analysis, model training, and
evaluation: $0.42/instance hour
Batch predictions: $0.10/1000
Real-time predictions: $0.10/1000
+ hourly capacity reservation charge
Why Did We Build Amazon Machine Learning?
Three types of data-driven development
Retrospective
analysis and
reporting
Amazon Redshift
Amazon RDS
Amazon S3
Amazon EMR
Three types of data-driven development
Retrospective
analysis and
reporting
Here-and-now
real-time processing
and dashboards
Amazon Kinesis
Amazon EC2
AWS Lambda
Amazon Redshift,
Amazon RDS
Amazon S3
Amazon EMR
Three types of data-driven development
Retrospective
analysis and
reporting
Here-and-now
real-time processing
and dashboards
Predictions
to enable smart
applications
Amazon Kinesis
Amazon EC2
AWS Lambda
Amazon Redshift,
Amazon RDS
Amazon S3
Amazon EMR
Architecture Patterns for Smart
Applications
Batch predictions with Amazon EMR
Query for predictions with
Amazon ML batch API
Process data with
Amazon EMR
Raw data in
Amazon S3
Aggregated data
in Amazon S3
Predictions
in Amazon S3 Your application
Batch predictions with Amazon Redshift
Structured data
In Amazon Redshift
Load predictions into
Amazon Redshift
-or-
Read prediction results
directly from Amazon S3
Predictions
in Amazon S3
Query for predictions with
Amazon ML batch API
Your application
Real-time predictions for interactive applications
Your application
Query for predictions with
Amazon ML real-time API
Adding predictions to an existing data flow
Your application
Amazon
DynamoDB
+
Trigger event with AWS Lambda
+
Query for predictions with
Amazon ML real-time API
Get Started Quickly
Create, access, and manage all
Amazon ML entities through the AWS
Management Console
Easily learn to build a model with the
tutorial dataset provided
Add prediction capabilities to your iOS
and Android applications with AWS
Mobile SDK
Use Amazon ML APIs, CLIs, or SDKs
4:15 PM Building your first big data application on AWS
Next…
Building Your First Big Data
Application on AWS
Dario Rivera
Solutions Architect
Your First Big Data Application on AWS
PROCESS
STORE
ANALYZE & VISUALIZE
COLLECT
Your First Big Data Application on AWS
PROCESS:
Amazon EMR with Spark & Hive
STORE
ANALYZE & VISUALIZE:
Amazon Redshift and Amazon QuickSight
COLLECT:
Amazon Kinesis Firehose
http://aws.amazon.com/big-data/use-cases/
A Modern Take on the Classic Data Warehouse
Setting up the environment
Data Storage with Amazon S3
Download all the CLI steps: http://bit.ly/aws-big-data-steps
Create an Amazon S3 bucket to store the data collected
with Amazon Kinesis Firehose
aws s3 mb s3://YOUR-S3-BUCKET-NAME
Access Control with IAM
Create an IAM role to allow Firehose to write to the S3 bucket
firehose-policy.json:
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Principal": {"Service": "firehose.amazonaws.com"},
"Action": "sts:AssumeRole"
}
}
Access Control with IAM
Create an IAM role to allow Firehose to write to the S3 bucket
s3-rw-policy.json:
{ "Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::YOUR-S3-BUCKET-NAME",
"arn:aws:s3:::YOUR-S3-BUCKET-NAME/*"
]
} }
Access Control with IAM
Create an IAM role to allow Firehose to write to the S3 bucket
aws iam create-role --role-name firehose-demo 
--assume-role-policy-document file://firehose-policy.json
Copy the value in “Arn” in the output,
e.g., arn:aws:iam::123456789:role/firehose-demo
aws iam put-role-policy --role-name firehose-demo 
--policy-name firehose-s3-rw 
--policy-document file://s3-rw-policy.json
Data Collection with Amazon Kinesis Firehose
Create a Firehose stream for incoming log data
aws firehose create-delivery-stream 
--delivery-stream-name demo-firehose-stream 
--s3-destination-configuration 
RoleARN=YOUR-FIREHOSE-ARN,
BucketARN="arn:aws:s3:::YOUR-S3-BUCKET-NAME",
Prefix=firehose/,
BufferingHints={IntervalInSeconds=60},
CompressionFormat=GZIP
Data Processing with Amazon EMR
Launch an Amazon EMR cluster with Spark and Hive
aws emr create-cluster 
--name "demo" 
--release-label emr-4.5.0 
--instance-type m3.xlarge 
--instance-count 2 
--ec2-attributes KeyName=YOUR-AWS-SSH-KEY 
--use-default-roles 
--applications Name=Hive Name=Spark Name=Zeppelin-Sandbox
Record your ClusterId from the output.
Data Analysis with Amazon Redshift
Create a single-node Amazon Redshift data warehouse:
aws redshift create-cluster 
--cluster-identifier demo 
--db-name demo 
--node-type dc1.large 
--cluster-type single-node 
--master-username master 
--master-user-password YOUR-REDSHIFT-PASSWORD 
--publicly-accessible 
--port 8192
Collect
Weblogs – Common Log Format (CLF)
75.35.230.210 - - [20/Jul/2009:22:22:42 -0700]
"GET /images/pigtrihawk.jpg HTTP/1.1" 200 29236
"http://www.swivel.com/graphs/show/1163466"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11)
Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)"
Writing into Amazon Kinesis Firehose
Download the demo weblog: http://bit.ly/aws-big-data
Open Python and run the following code to import the log into the stream:
import boto3
iam = boto3.client('iam')
firehose = boto3.client('firehose')
with open('weblog', 'r') as f:
for line in f:
firehose.put_record(
DeliveryStreamName='demo-firehose-stream',
Record={'Data': line}
)
print 'Record added'
Process
Apache Spark
• Fast, general purpose engine
for large-scale data processing
• Write applications quickly in
Java, Scala, or Python
• Combine SQL, streaming, and
complex analytics
Spark SQL
Spark's module for working with structured data using SQL
Run unmodified Hive queries on existing data
Apache Zeppelin
• Web-based notebook for interactive analytics
• Multiple language backend
• Apache Spark integration
• Data visualization
• Collaboration
https://zeppelin.incubator.apache.org/
View the Output Files in Amazon S3
After about 1 minute, you should see files in your S3
bucket:
aws s3 ls s3://YOUR-S3-BUCKET-NAME/firehose/ --recursive
Connect to Your EMR Cluster and Zeppelin
aws emr describe-cluster --cluster-id YOUR-EMR-CLUSTER-ID
Copy the MasterPublicDnsName. Use port forwarding so you can
access Zeppelin at http://localhost:18890 on your local machine.
ssh -i PATH-TO-YOUR-SSH-KEY -L 18890:localhost:8890 
hadoop@YOUR-EMR-DNS-NAME
Open Zeppelin with your local web browser and create a new “Note”:
http://localhost:18890
Exploring the Data in Amazon S3 using Spark
Download the Zeppelin notebook: http://bit.ly/aws-big-data-zeppelin
// Load all the files from S3 into a RDD
val accessLogLines = sc.textFile("s3://YOUR-S3-BUCKET-NAME/firehose/*/*/*/*")
// Count the lines
accessLogLines.count
// Print one line as a string
accessLogLines.first
// delimited by space so split them into fields
var accessLogFields = accessLogLines.map(_.split(" ").map(_.trim))
// Print the fields of a line
accessLogFields.first
Combine Fields: “A, B, C”  “A B C”
var accessLogColumns = accessLogFields
.map( arrayOfFields => { var temp1 =""; for (field <- arrayOfFields) yield {
var temp2 = ""
if (temp1.replaceAll("[",""").startsWith(""") && !temp1.endsWith("""))
temp1 = temp1 + " " + field.replaceAll("[|]",""")
else temp1 = field.replaceAll("[|]",""")
temp2 = temp1
if (temp1.endsWith(""")) temp1 = ""
temp2
}})
.map( fields => fields.filter(field => (field.startsWith(""") &&
field.endsWith(""")) || !field.startsWith(""") ))
.map(fields => fields.map(_.replaceAll(""","")))
Create a Data Frame and Transform the Data
import java.sql.Timestamp
import java.net.URL
case class accessLogs(
ipAddress: String,
requestTime: Timestamp,
requestMethod: String,
requestPath: String,
requestProtocol: String,
responseCode: String,
responseSize: String,
referrerHost: String,
userAgent: String
)
Create a Data Frame and Transform the Data
val accessLogsDF = accessLogColumns.map(line => {
var ipAddress = line(0)
var requestTime = new Timestamp(new
java.text.SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z").parse(line(3)).getTime())
var requestString = line(4).split(" ").map(_.trim())
var requestMethod = if (line(4).toString() != "-") requestString(0) else ""
var requestPath = if (line(4).toString() != "-") requestString(1) else ""
var requestProtocol = if (line(4).toString() != "-") requestString(2) else ""
var responseCode = line(5).replaceAll("-","")
var responseSize = line(6).replaceAll("-","")
var referrerHost = line(7)
var userAgent = line(8)
accessLogs(ipAddress, requestTime, requestMethod, requestPath,
requestProtocol,responseCode, responseSize, referrerHost, userAgent)
}).toDF()
Create an External Table Backed by Amazon S3
%sql
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
request_method String,
request_path String,
request_protocol String,
response_code String,
response_size String,
referrer_host String,
user_agent String
)
PARTITIONED BY (year STRING,month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed'
Configure Hive Partitioning and Compression
// set up Hive's "dynamic partitioning”
%sql
SET hive.exec.dynamic.partition=true
// compress output files on Amazon S3 using Gzip
%sql
SET hive.exec.compress.output=true
%sql
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
%sql
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
%sql
SET hive.exec.dynamic.partition.mode=nonstrict;
Write Output to Amazon S3
import org.apache.spark.sql.SaveMode
accessLogsDF
.withColumn("year", year(accessLogsDF("requestTime")))
.withColumn("month", month(accessLogsDF("requestTime")))
.withColumn("day", dayofmonth(accessLogsDF("requestTime")))
.write
.partitionBy("year","month","day")
.mode(SaveMode.Overwrite)
.insertInto("access_logs")
Query the Data Using Spark SQL
// Check the count of records
%sql
select count(*) from access_log_processed
// Fetch the first 10 records
%sql
select * from access_log_processed limit 10
View the Output Files in Amazon S3
Leave Zeppelin and go back to the console…
List the partition prefixes and output files:
aws s3 ls s3://YOUR-S3-BUCKET-NAME/access-log-processed/ 
--recursive
Analyze
Connect to Amazon Redshift
Using the PostgreSQL CLI
psql -h YOUR-REDSHIFT-ENDPOINT 
-p 8192 -U master demo
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers
or native Amazon Redshift support
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
Create an Amazon Redshift Table to Hold Your Data
CREATE TABLE accesslogs
(
host_address varchar(512),
request_time timestamp,
request_method varchar(5),
request_path varchar(1024),
request_protocol varchar(10),
response_code Int,
response_size Int,
referrer_host varchar(1024),
user_agent varchar(512)
)
DISTKEY(host_address)
SORTKEY(request_time);
Loading Data into Amazon Redshift
“COPY” command loads files in parallel from
Amazon S3:
COPY accesslogs
FROM 's3://YOUR-S3-BUCKET-NAME/access-log-processed'
CREDENTIALS
'aws_iam_role=arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/ROLE-NAME'
DELIMITER 't'
MAXERROR 0
GZIP;
Amazon Redshift Test Queries
-- find distribution of response codes over days
SELECT TRUNC(request_time), response_code, COUNT(1) FROM
accesslogs GROUP BY 1,2 ORDER BY 1,3 DESC;
-- find the 404 status codes
SELECT COUNT(1) FROM accessLogs WHERE response_code = 404;
-- show all requests for status as PAGE NOT FOUND
SELECT TOP 1 request_path,COUNT(1) FROM accesslogs WHERE
response_code = 404 GROUP BY 1 ORDER BY 2 DESC;
Visualize the Results
DEMO
Amazon QuickSight
Automating the Big Data Application
Amazon
Kinesis
Firehose
Amazon
EMR
Amazon
S3
Amazon
Redshift
Amazon
QuickSight
Amazon
S3
Event
Notification
Spark job
List of objects from Lambda
Write to Amazon Redshift using spark-redshift
Learn from big data experts
blogs.aws.amazon.com/bigdata
Thank you to our Sponsor

More Related Content

What's hot

AWS Summit 2013 | India - Opening Keynote, Dr. Werner Vogels
AWS Summit 2013 | India - Opening Keynote, Dr. Werner VogelsAWS Summit 2013 | India - Opening Keynote, Dr. Werner Vogels
AWS Summit 2013 | India - Opening Keynote, Dr. Werner VogelsAmazon Web Services
 
Microsof azure class 1- intro
Microsof azure   class 1- introMicrosof azure   class 1- intro
Microsof azure class 1- introMHMuhammadAli1
 
State of the Union: Database & Analytics
State of the Union: Database & AnalyticsState of the Union: Database & Analytics
State of the Union: Database & AnalyticsAmazon Web Services
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Amazon Web Services
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAmazon Web Services
 
Introduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web ServicesIntroduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web ServicesAmazon Web Services
 
CMS Hybrid Cloud Services Enablement
CMS Hybrid Cloud Services EnablementCMS Hybrid Cloud Services Enablement
CMS Hybrid Cloud Services EnablementSanjay Dhar
 
Comparison of cloud service providers
Comparison of cloud service providersComparison of cloud service providers
Comparison of cloud service providersVisishtaYJ
 
Driving Business Outcomes with a Modern Data Architecture - Level 100
Driving Business Outcomes with a Modern Data Architecture - Level 100Driving Business Outcomes with a Modern Data Architecture - Level 100
Driving Business Outcomes with a Modern Data Architecture - Level 100Amazon Web Services
 
Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用
Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用
Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用Amazon Web Services
 
AWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku Lepisto
AWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku LepistoAWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku Lepisto
AWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku LepistoAmazon Web Services Korea
 
Andy Jassy Illuminates Amazon Web Services
Andy Jassy Illuminates Amazon Web ServicesAndy Jassy Illuminates Amazon Web Services
Andy Jassy Illuminates Amazon Web ServicesMichael Skok
 
Partner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataPartner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataTreasure Data, Inc.
 
Introduction to Cloud Computing - (Eng Session)
Introduction to Cloud Computing - (Eng Session)Introduction to Cloud Computing - (Eng Session)
Introduction to Cloud Computing - (Eng Session)Amazon Web Services
 
AWS Summit Berlin 2013 - Keynote Werner Vogels
AWS Summit Berlin 2013 - Keynote Werner VogelsAWS Summit Berlin 2013 - Keynote Werner Vogels
AWS Summit Berlin 2013 - Keynote Werner VogelsAWS Germany
 
2016 summits - future of enterprise it
2016 summits - future of enterprise it2016 summits - future of enterprise it
2016 summits - future of enterprise itAmazon Web Services
 
Cloud computing overview
Cloud computing overviewCloud computing overview
Cloud computing overviewdaklug
 

What's hot (20)

AWS Summit 2013 | India - Opening Keynote, Dr. Werner Vogels
AWS Summit 2013 | India - Opening Keynote, Dr. Werner VogelsAWS Summit 2013 | India - Opening Keynote, Dr. Werner Vogels
AWS Summit 2013 | India - Opening Keynote, Dr. Werner Vogels
 
Microsof azure class 1- intro
Microsof azure   class 1- introMicrosof azure   class 1- intro
Microsof azure class 1- intro
 
State of the Union: Database & Analytics
State of the Union: Database & AnalyticsState of the Union: Database & Analytics
State of the Union: Database & Analytics
 
Dr. Werner Vogels Keynote
Dr. Werner Vogels KeynoteDr. Werner Vogels Keynote
Dr. Werner Vogels Keynote
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
 
Introduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web ServicesIntroduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web Services
 
CMS Hybrid Cloud Services Enablement
CMS Hybrid Cloud Services EnablementCMS Hybrid Cloud Services Enablement
CMS Hybrid Cloud Services Enablement
 
Comparison of cloud service providers
Comparison of cloud service providersComparison of cloud service providers
Comparison of cloud service providers
 
Driving Business Outcomes with a Modern Data Architecture - Level 100
Driving Business Outcomes with a Modern Data Architecture - Level 100Driving Business Outcomes with a Modern Data Architecture - Level 100
Driving Business Outcomes with a Modern Data Architecture - Level 100
 
Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用
Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用
Track 2 Session 5_ 利用 SageMaker 深度學習容器化在廣告推播之應用
 
AWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku Lepisto
AWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku LepistoAWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku Lepisto
AWS Enterprise Summit - 엔터프라이즈에서의 AWS 클라우드 활용 - Markku Lepisto
 
Andy Jassy Illuminates Amazon Web Services
Andy Jassy Illuminates Amazon Web ServicesAndy Jassy Illuminates Amazon Web Services
Andy Jassy Illuminates Amazon Web Services
 
Partner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_dataPartner webinar presentation aws pebble_treasure_data
Partner webinar presentation aws pebble_treasure_data
 
Introduction to Cloud Computing - (Eng Session)
Introduction to Cloud Computing - (Eng Session)Introduction to Cloud Computing - (Eng Session)
Introduction to Cloud Computing - (Eng Session)
 
AWS Summit Berlin 2013 - Keynote Werner Vogels
AWS Summit Berlin 2013 - Keynote Werner VogelsAWS Summit Berlin 2013 - Keynote Werner Vogels
AWS Summit Berlin 2013 - Keynote Werner Vogels
 
2016 summits - future of enterprise it
2016 summits - future of enterprise it2016 summits - future of enterprise it
2016 summits - future of enterprise it
 
AWS Architecting In The Cloud
AWS Architecting In The CloudAWS Architecting In The Cloud
AWS Architecting In The Cloud
 
Cloud computing overview
Cloud computing overviewCloud computing overview
Cloud computing overview
 

Viewers also liked

AWS Partner ConneXions Taiwan - Q3 2016 Technology Update
AWS Partner ConneXions Taiwan - Q3 2016 Technology UpdateAWS Partner ConneXions Taiwan - Q3 2016 Technology Update
AWS Partner ConneXions Taiwan - Q3 2016 Technology UpdateAmazon Web Services
 
Customer Sharing: Trend Micro - Trend Micro's DevOps Practices
Customer Sharing: Trend Micro - Trend Micro's DevOps Practices Customer Sharing: Trend Micro - Trend Micro's DevOps Practices
Customer Sharing: Trend Micro - Trend Micro's DevOps Practices Amazon Web Services
 
Hackproof Your Cloud: Responding to 2016 Threats
Hackproof Your Cloud: Responding to 2016 ThreatsHackproof Your Cloud: Responding to 2016 Threats
Hackproof Your Cloud: Responding to 2016 ThreatsAmazon Web Services
 
Digital Workloads on Amazon Web Services
Digital Workloads on Amazon Web ServicesDigital Workloads on Amazon Web Services
Digital Workloads on Amazon Web ServicesAmazon Web Services
 
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer ToolsDevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer ToolsAmazon Web Services
 
Maximizing Business Value as You Migrate to AWS
Maximizing Business Value as You Migrate to AWSMaximizing Business Value as You Migrate to AWS
Maximizing Business Value as You Migrate to AWSAmazon Web Services
 
Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...
Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...
Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...Amazon Web Services
 
This One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You ThousandsThis One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You ThousandsAmazon Web Services
 
another day, another billion packets
another day, another billion packetsanother day, another billion packets
another day, another billion packetsAmazon Web Services
 
Barracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWSBarracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWSAmazon Web Services
 
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...Amazon Web Services
 
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...Amazon Web Services
 
Workshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSWorkshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSAmazon Web Services
 
AWS Mobile Hub - Building Mobile Apps with AWS
AWS Mobile Hub - Building Mobile Apps with AWSAWS Mobile Hub - Building Mobile Apps with AWS
AWS Mobile Hub - Building Mobile Apps with AWSAmazon Web Services
 
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar SeriesAmazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Customer Case Study: Achieving PCI Compliance in AWS
Customer Case Study: Achieving PCI Compliance in AWSCustomer Case Study: Achieving PCI Compliance in AWS
Customer Case Study: Achieving PCI Compliance in AWSAmazon Web Services
 
AWS Welcome to re:Invent recap - 20161214
AWS Welcome to re:Invent recap - 20161214AWS Welcome to re:Invent recap - 20161214
AWS Welcome to re:Invent recap - 20161214Amazon Web Services
 
Secure Content Delivery Using Amazon CloudFront and AWS WAF
Secure Content Delivery Using Amazon CloudFront and AWS WAFSecure Content Delivery Using Amazon CloudFront and AWS WAF
Secure Content Delivery Using Amazon CloudFront and AWS WAFAmazon Web Services
 
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016Amazon Web Services
 

Viewers also liked (20)

AWS Partner ConneXions Taiwan - Q3 2016 Technology Update
AWS Partner ConneXions Taiwan - Q3 2016 Technology UpdateAWS Partner ConneXions Taiwan - Q3 2016 Technology Update
AWS Partner ConneXions Taiwan - Q3 2016 Technology Update
 
Customer Sharing: Trend Micro - Trend Micro's DevOps Practices
Customer Sharing: Trend Micro - Trend Micro's DevOps Practices Customer Sharing: Trend Micro - Trend Micro's DevOps Practices
Customer Sharing: Trend Micro - Trend Micro's DevOps Practices
 
Hackproof Your Cloud: Responding to 2016 Threats
Hackproof Your Cloud: Responding to 2016 ThreatsHackproof Your Cloud: Responding to 2016 Threats
Hackproof Your Cloud: Responding to 2016 Threats
 
Digital Workloads on Amazon Web Services
Digital Workloads on Amazon Web ServicesDigital Workloads on Amazon Web Services
Digital Workloads on Amazon Web Services
 
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer ToolsDevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools
 
Maximizing Business Value as You Migrate to AWS
Maximizing Business Value as You Migrate to AWSMaximizing Business Value as You Migrate to AWS
Maximizing Business Value as You Migrate to AWS
 
Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...
Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...
Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with M...
 
This One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You ThousandsThis One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You Thousands
 
another day, another billion packets
another day, another billion packetsanother day, another billion packets
another day, another billion packets
 
AWSome Day Leeds
AWSome Day Leeds AWSome Day Leeds
AWSome Day Leeds
 
Barracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWSBarracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWS
 
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
 
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
 
Workshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSWorkshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWS
 
AWS Mobile Hub - Building Mobile Apps with AWS
AWS Mobile Hub - Building Mobile Apps with AWSAWS Mobile Hub - Building Mobile Apps with AWS
AWS Mobile Hub - Building Mobile Apps with AWS
 
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar SeriesAmazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
 
Customer Case Study: Achieving PCI Compliance in AWS
Customer Case Study: Achieving PCI Compliance in AWSCustomer Case Study: Achieving PCI Compliance in AWS
Customer Case Study: Achieving PCI Compliance in AWS
 
AWS Welcome to re:Invent recap - 20161214
AWS Welcome to re:Invent recap - 20161214AWS Welcome to re:Invent recap - 20161214
AWS Welcome to re:Invent recap - 20161214
 
Secure Content Delivery Using Amazon CloudFront and AWS WAF
Secure Content Delivery Using Amazon CloudFront and AWS WAFSecure Content Delivery Using Amazon CloudFront and AWS WAF
Secure Content Delivery Using Amazon CloudFront and AWS WAF
 
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
 

Similar to Big Data Solutions Day - Calgary

Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analyticsAmazon Web Services
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewAmazon Web Services
 
¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?Software Guru
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...Amazon Web Services
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWSAmazon Web Services
 
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Web Services
 
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...Amazon Web Services
 
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...Amazon Web Services
 
AWS Summit Atlanta Keynote
AWS Summit Atlanta KeynoteAWS Summit Atlanta Keynote
AWS Summit Atlanta KeynoteKristana Kane
 
Microsoft Azure
Microsoft AzureMicrosoft Azure
Microsoft AzureDavid Chou
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
Introduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web ServicesIntroduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web ServicesAmazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Amazon Web Services
 
Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢
Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢
Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢Amazon Web Services
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSAmazon Web Services
 

Similar to Big Data Solutions Day - Calgary (20)

AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days
 
Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analytics
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution Overview
 
¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?¿Quién es Amazon Web Services?
¿Quién es Amazon Web Services?
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
Understanding AWS Managed Databases and Analytic Services - AWS Innovate Otta...
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017
 
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
How Citrix Uses AWS Marketplace Solutions to Accelerate Analytic Workloads on...
 
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
MSC203_How Citrix Uses AWS Marketplace Solutions To Accelerate Analytic Workl...
 
AWS Summit Atlanta Keynote
AWS Summit Atlanta KeynoteAWS Summit Atlanta Keynote
AWS Summit Atlanta Keynote
 
Microsoft Azure
Microsoft AzureMicrosoft Azure
Microsoft Azure
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Introduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web ServicesIntroduction to Cloud Computing with Amazon Web Services
Introduction to Cloud Computing with Amazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 
Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢
Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢
Track 3 Session 6_打造應用專屬資料庫 (Purpose-built) 與了解託管服務優勢
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 

Recently uploaded (20)

A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 

Big Data Solutions Day - Calgary

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Shawn Gandhi Head of Solutions Architecture, AWS Canada August 16, 2016 Hello, Canada @shawnagram
  • 2. ~10 Years Young Amazon S3 launched: March 14th 2006
  • 3. Our New Canadian Office • 120 Bremner Blvd, 26th Floor
  • 4. Who Are We Exactly? Customer Account Manager Solutions Architect Tech Account Manager Pro Services Training & Cert Partner Team Biz Dev
  • 6. AWS Global Infrastructure 12 Regions 32 Availability Zones 54 Edge Locations
  • 7. AZ AZ AZ AZ AZ What is a Region? • Each datacenter has a purpose built network
  • 8. AZ AZ AZ AZ AZ What is a Region? • Metro-area DWDM links between AZs • AZs <2ms apart & usually <1ms • Each datacenter has a purpose built network
  • 9. Big Data Solutions Day Introductory Session Antoine Genereux Solutions Architect, AWS
  • 10. 12:30 PM Introductory Session 2:00 PM Best Practices on Real-time Streaming Analytics 3:00 PM Break 3:15 PM Getting started with Amazon Machine Learning 4:15 PM Building your first big data application on AWS Today
  • 11. Thank you to our Sponsor
  • 12. 1,000,000+ Active Customers Per Month AWS In 2016: 58% YOY Growth Q2 2015 TO Q2 20166 $11B+ Revenue run rate Of 14 Others, Combined
  • 13. Infrastructure Intelligence Harness data generated from your systems and infrastructure Advanced Analytics Anticipate future behaviors and conduct what-if analysis Largest Ecosystem of ISVs Solutions vetted by the AWS Partner Competency Program Data Integration Move, synchronize, cleanse, and manage data Data Analysis & Visualization Turn data into actionable insight and enhance decision making
  • 14. Largest Ecosystem of Integrators Qualified consulting partners with AWS big data expertise North America Europe, Middle East and Africa Asia Pacific and Japan
  • 15. http://aws.analytic.business Big Data Analytics in AWS Marketplace Thousands of products pre-integrated with the AWS Cloud. 290 specific to big data Analysis & Visualization Solutions are pre-configured and ready to run on AWS Advanced Analytics Deploy when you need it, 1-Click launch in multiple regions around the world Metered pricing by the hour. Pay only for what you use. Volume licensing available Data Integration Extract valuable information from your historical and current data Predict future business performance; location, text, social, sentiment analysis Extract, migrate, or prepare and clean your data for accurate analysis
  • 16. Proven Customer Success The vast majority of big data use cases deployed in the cloud today run on AWS.
  • 17. The Evolution of Big Data Processing Descriptive Real-time Predictive Dashboards; Traditional query & reporting Clickstream analysis; Ad bidding; streaming data Inventory forecasting; Fraud detection; Recommendation engines What happened before What’s happening now Probability of “x” happening
  • 18. Volume Velocity Variety Big Data is Breaking Traditional IT Infrastructure - Too many tools to setup, manage & scale - Unsustainable costs - Difficult & undifferentiated heavy lifting just to get started - New & expensive skills - Bigger responsibility (sensitive data)
  • 19. Big Data on AWS 1. Broadest & Deepest Functionality 2. Computational Power that is Second to None 3. Petabyte-scale Analytics Made Affordable 4. Easy & Powerful Tools for Migrating Data 5. Security You Can Trust
  • 20. 1. Broadest & Deepest Functionality Big Data Storage Data Warehousing Real-time Streaming Distributed Analytics (Hadoop, Spark, Presto) NoSQL Databases Business Intelligence Relational Databases Internet of Things (IoT) Machine Learning Server-less Compute Elasticsearch
  • 21. Core Components for Big Data Workloads
  • 22. Big Data Storage for Virtually All AWS Services Store anything Object storage Scalable 99.999999999% durability Extremely low cost Amazon S3
  • 23. Real-time processing High throughput; elastic Easy to use Integration with S3, EMR, Redshift, DynamoDB Amazon Kinesis Real-time Streaming Data
  • 24. Amazon Kinesis Streams • For Technical Developers • Build your own custom applications that process or analyze streaming data Amazon Kinesis Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into S3, Amazon Redshift and Amazon Elasticsearch Amazon Kinesis Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries • Now Available Amazon Kinesis: Streaming Data Made Easy Services make it easy to capture, deliver and process streams on AWS
  • 25. Managed Hadoop framework Spark, Presto, Hbase, Tez, Hive, etc. Cost effective; Easy to use On-demand and spot pricing HDFS & S3 File systems Amazon EMR Distributed Analytics
  • 26. NoSQL Database Seamless scalability Zero administration Single digit millisecond latency Amazon DynamoDB Fast & Flexible NoSQL Database Service
  • 27. Relational data warehouse Massively parallel; Petabyte scale Fully managed HDD and SSD Platforms $1,000/TB/Year; start at $0.25/hour Amazon Redshift Fully Managed Petabyte-scale Data Warehouse
  • 28. Amazon Redshift has security built-in SSL to secure data in transit Encryption to secure data at rest • AES-256; hardware accelerated • All blocks on disks and in Amazon S3 encrypted • HSM Support No direct access to compute nodes Audit logging, AWS CloudTrail, AWS KMS integration Amazon VPC support SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA 10 GigE (HPC) Ingestion Backup Restore Customer VPC Internal VPC JDBC/ODBC
  • 29. Easy to use, managed machine learning service built for developers Robust, powerful machine learning technology based on Amazon’s internal systems Create models using your data already stored in the AWS cloud Deploy models to production in seconds Amazon Machine Learning Machine Learning
  • 31. Sample Use Cases Fraud detection Detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, … Personalization Recommending content, predictive content loading, improving user experience, … Targeted marketing Matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, … Content classification Categorizing documents, matching hiring managers and resumes, … Churn prediction Finding customers who are likely to stop using the service, free-tier upgrade targeting, … Customer support Predictive routing of customer emails, social media listening, …
  • 32. 2. Computational Power that is Second to None Over 2x computational instance types as any other cloud platform to address a wide range of big data use cases. Compute Optimized (C3, C4) Memory Optimized (R3, X1) GPU Instances (G2) Storage Optimized (I2, D2) Plus general purpose (M4), low cost burstable (T2), and dedicated instances
  • 33. Intel® Processor Technologies Intel® AVX – Dramatically increases performance for highly parallel HPC workloads such as life science engineering, data mining, financial analysis, media processing Intel® AES-NI – Enhances security with new encryption instructions that reduce the performance penalty associated with encrypting/decrypting data Intel® Turbo Boost Technology – Increases computing power with performance that adapts to spikes in workloads Intel Transactional Synchronization (TSX) Extensions – Enables execution of transactions that are independent to accelerate throughput P state & C state control – provides granular performance tuning for cores and sleep states to improve overall application performance
  • 34. 3. Affordable Petabyte-scale Analytics AWS helps customers maximize the value of Big Data investments while reducing overall IT costs Secure, Highly Durable storage $28.16 / TB / month Data Archiving $7.16 / TB / month Real-time streaming data load $0.035 / GB 10-node Spark Cluster $0.15 / hr Petabyte-scale Data Warehouse $0.25 / hr Amazon Glacier Amazon S3 Amazon RedshiftAmazon EMRAmazon Kinesis
  • 35. NASDAQ migrated to Amazon Redshift, achieving cost savings of 57% compared to legacy system Volume Avg 7B rows per day. Peak of 20B rows Performance Queries running faster. Loads finish before 11PM Scale Loading more data than ever stored by legacy DWH
  • 36. FINRA handles approximately 75 billion market events every day to build a holistic picture of trading in the U.S. Deter misconduct by enforcing the rules Detect and prevent wrongdoing in the U.S. markets Discipline those who break the rules
  • 37. 4. Easy, Powerful Tools for Migrating Data AWS provides the broadest range of tools for easy, fast, secure data movement to and from the AWS cloud. AWS Direct Connect ISV Connectors Amazon Kinesis Firehose Amazon S3 Transfer Acceleration AWS Storage Gateway Database Migration Service AWS Snowball
  • 38. What is Snowball? Petabyte scale data transport E-ink shipping label Ruggedized case “8.5G Impact” All data encrypted end-to-end 50TB & 80TB 10G network Rain & dust resistant Tamper-resistant case & electronics
  • 39. Start your first migration in 10 minutes or less Keep your apps running during the migration Replicate within, to, or from Amazon EC2 or Amazon RDS Move data to the same or different database engine AWS Database Migration Service
  • 40. 5. Security & Compliance You Can Trust • AWS manages 1800+ security controls so you don’t have to • You benefit from an environment built for the most security sensitive organizations • You get to define the right security controls for your workload sensitivity • You always have full ownership and control of your data
  • 41. Broadest Services to Secure Applications NETWORKING IDENTITY ENCRYPTION COMPLIANCE Virtual Private Cloud Web Application Firewall IAM Active Directory Integration SAML Federation KMS Cloud HSM Server-side Encryption Encryption SDK Service Catalog CloudTrail Config Config Rules Inspector
  • 42. Key AWS Certifications and Assurance Programs
  • 44. Amazon Web Services Time to Answer (Latency) Throughput collect store process / analyze consume / visualize data answers Time to Answer (Latency) Throughput Cost Simplify Big Data Processing
  • 45. Architectural Principles • Decoupled “data bus” • Data → Store → Process → Answers • Use the right tool for the job • Data structure, latency, throughput, access patterns • Use Lambda architecture ideas • Immutable (append-only) log, batch/speed/serving layer • Leverage AWS managed services • No/low admin • Big data ≠ big cost
  • 46. Storage Compute • Decoupled “data bus” • Completely autonomous components • Independently scale storage & compute resources on-demand Collect → Store → Process | Analyze → Consume Decoupling Storage and Compute Resources
  • 47. Multi-Stage Decoupled “Data Bus” Multiple stages Storage decouples multiple processing stages Store Process Store ProcessData process store
  • 48. Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm FastSlowFast BatchMessageInteractiveStreamML SearchSQLNoSQLCacheFileQueueStream Amazon EC2 Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENTS FILES MESSAGES STREAMS Amazon QuickSight Apps & Services Analysis&visualizationNotebooksIDEAPI LoggingIoTApplicationsTransportMessaging ETL
  • 49. Multiple Stream Processing Applications Can Read from Amazon Kinesis Amazon Kinesis AWS Lambda Amazon DynamoDB Amazon Kinesis S3 Connector process store Amazon S3
  • 50. Which stream storage should I use? Amazon Kinesis DynamoDB Streams Amazon SQS Amazon SNS Kafka Managed Yes Yes Yes No Ordering Yes Yes No Yes Delivery at-least-once exactly-once at-least-once at-least-once Lifetime 7 days 24 hours 14 days Configurable Replication 3 AZ 3 AZ 3 AZ Configurable Throughput No Limit No Limit No Limit ~ Nodes Parallel Clients Yes Yes No (SQS) Yes MapReduce Yes Yes No Yes Record size 1MB 400KB 256KB Configurable Cost Low Higher(table cost) Low-Medium Low (+admin)
  • 51. Amazon EMR Analysis Frameworks Can Read from Multiple Data Stores Amazon Kinesis AWS Lambda Amazon S3 Data Amazon DynamoDB AnswerSpark Streaming Amazon Kinesis S3 Connector process store Spark SQL
  • 52. Amazon EMR Real-time Analytics Apache Kafka KCL AWS Lambda Spark Streaming Apache Storm Amazon SNS Amazon ML Notifications Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES Alert App state Real-time prediction KPI process store DynamoDB Streams Amazon Kinesis Data stream
  • 53. What Stream Processing Technology Should I Use? Spark Streaming Apache Storm Amazon Kinesis Client Library AWS Lambda Amazon EMR (Hive, Pig) Scale / Throughput ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes Batch or Real- time Real-time Real-time Real-time Real-time Batch Manageability Yes (Amazon EMR) Do it yourself Amazon EC2 + Auto Scaling AWS managed Yes (Amazon EMR) Fault Tolerance Single AZ Configurable Multi-AZ Multi-AZ Single AZ Programming languages Java, Python, Scala Any language via Thrift Java, via MultiLangDaemon ( .Net, Python, Ruby, Node.js) Node.js, Java Hive, Pig, Streaming languages Query Latency High (Lower is better)
  • 54. Case Study: Clickstream Analysis Hearst Corporation monitors trending content for over 250 digital properties worldwide and processes more than 30TB of data per day, using an architecture that includes Amazon Kinesis and Spark running on Amazon EMR. Store → Process | Analyze → Answers
  • 55. Interactive & Batch Analytics Amazon S3 Amazon EMR Hive Pig Spark Amazon ML process store Consume Amazon Redshift Amazon EMR Presto Spark Batch Interactive Batch prediction Real-time prediction Amazon Kinesis Firehose
  • 56. What Data Processing Technology Should I Use? Amazon Redshift Impala Presto Spark Hive Query Latency Low Low Low Low Medium (Tez) – High (MapReduce) Durability High High High High High Data Volume 2 PB Max ~Nodes ~Nodes ~Nodes ~Nodes Managed Yes Yes (EMR) Yes (EMR) Yes (EMR) Yes (EMR) Storage Native HDFS / S3A* HDFS / S3 HDFS / S3 HDFS / S3 SQL Compatibility High Medium High Low (SparkSQL) Medium (HQL) Query Latency High (Low is better) Medium
  • 57. Batch layer Amazon Kinesis Data stream process store Amazon Kinesis S3 Connector A p p l i c a t i o n s Amazon Redshift Amazon EMR Presto Hive Pig Spark answer Amazon S3 Speed layer answer Serving layer Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES answer Amazon ML KCL AWS Lambda Storm Lambda Architecture Spark Streaming on Amazon EMR
  • 58. Cache SQL Request Rate High Low Cost/GB High Low Latency Low High Data Volume Low High GlacierStructure NoSQL Hot Data Warm Data Cold Data Low High Search
  • 59. What Data Store Should I Use? Amazon ElastiCache Amazon DynamoDB Amazon Aurora Amazon Elasticsearch Amazon EMR (HDFS) Amazon S3 Amazon Glacier Average latency ms ms ms, sec ms,sec sec,min,hrs ms,sec,min (~ size) hrs Data volume GB GB–TBs (no limit) GB–TB (64 TB Max) GB–TB GB–PB (~nodes) MB–PB (no limit) GB–PB (no limit) Item size B-KB KB (400 KB max) KB (64 KB) KB (1 MB max) MB-GB KB-GB (5 TB max) GB (40 TB max) Request rate High - Very High Very High (no limit) High High Low – Very High Low – Very High (no limit) Very Low Storage cost GB/month $$ ¢¢ ¢¢ ¢¢ ¢ ¢ ¢/10 Durability Low - Moderate Very High Very High High High Very High Very High Hot Data Warm Data Cold Data
  • 60. Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  • 61. Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? https://aws.amazon.com/calculator
  • 62. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB?
  • 63. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month Scenario 1300 2,048 1,483 777,600,000 Scenario 2300 32,768 23,730 777,600,000 Amazon S3 Amazon DynamoDB use use
  • 64. Case Study: On-demand Big Data Analytics Redfin uses Amazon EMR with spot instances – dynamically spinning up & down Apache Hadoop clusters – to perform large data transformations and deliver data to internal and external customers. Store → Process → Store → Analyze → Answers
  • 65. Architectural Principles Decoupled “data bus” • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Use Lambda architecture ideas • Immutable (append-only) log, batch/speed/serving layer Leverage AWS managed services • Scalable/elastic, available, reliable, secure, no/low admin Big data ≠ big cost
  • 66. Building a Big Data Application
  • 67. Building a Big Data Application web clients mobile clients DBMS corporate data center Getting Started
  • 68. Building a Big Data Application web clients mobile clients DBMS Amazon Redshift Amazon QuickSight AWS cloudcorporate data center Adding a data warehouse
  • 69. Building a Big Data Application web clients mobile clients DBMS Raw data Amazon Redshift Staging Data Amazon QuickSight AWS cloud Bringing in Log Data corporate data center
  • 70. Building a Big Data Application web clients mobile clients DBMS Raw data Amazon Redshift Staging Data Orc/Parquet (Query optimized) Amazon QuickSight AWS cloud Extending your DW to S3 corporate data center
  • 71. Building a Big Data Application web clients mobile clients DBMS Raw data Amazon Redshift Staging Data Orc/Parquet (Query optimized) Amazon QuickSight KinesisStreams AWS cloud Adding a real-time layer corporate data center
  • 72. Building a Big Data Application web clients mobile clients DBMS Raw data Amazon Redshift Staging Data Orc/Parquet (Query optimized) Amazon QuickSight KinesisStreams AWS cloud Adding predictive analytics corporate data center
  • 73. Building a Big Data Application web clients mobile clients DBMS Raw data Amazon Redshift Staging Data Orc/Parquet (Query optimized) Amazon QuickSight KinesisStreams AWS cloud Adding encryption at rest with AWS KMS corporate data center AWSKMS
  • 74. Building a Big Data Application web clients mobile clients DBMS Raw data Amazon Redshift Staging Data Orc/Parquet (Query optimized) Amazon QuickSight KinesisStreams AWS cloud AWSKMS VPC subnet SSL/TLS SSL/TLS Protecting Data in Transit & Adding Network Isolation corporate data center
  • 75. The Evolution of Big Data Processing Descriptive Real-time Predictive Dashboards; Traditional query & reporting Clickstream analysis; Ad bidding; streaming data Inventory forecasting; Fraud detection; Recommendation engines What happened before What’s happening now Probability of “x” happening Amazon Kinesis Amazon EC2 AWS Lambda Amazon DynamoDB Amazon EMR Amazon Redshift Amazon QuickSight Amazon EMR Amazon Machine Learning Amazon EMR
  • 76. Learning Big Data on AWS aws.amazon.com/big-data Big Data Quest Learn at your own pace and practice working with AWS services for big data on qwikLABS. (3 Hours | Online) Big Data on AWS How to use AWS services to process data with Hadoop & create big data environments (3 Days | Classroom ) Big Data Technology Fundamentals FREE! Overview of AWS big data solutions for architects or data scientists new to big data. (3 Hours | Online) AWS Courses Self-paced Online Labs
  • 77. 2:00 PM Best Practices on Real-time Streaming Analytics 3:00 PM Break 3:15 PM Getting started with Amazon Machine Learning 4:15 PM Building your first big data application on AWS Next…
  • 78. © 2016 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Real-time Streaming Data on AWS Shawn Gandhi Head of Solutions Architecture AWS Canada @shawnagram
  • 79. { "payerId": "Joe", "productCode": "AmazonS3", "clientProductCode": "AmazonS3", "usageType": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" } Metering Record 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 Common Log Entry <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"] Syslog Entry “SeattlePublicWater/Kinesis/123/Realtime” – 412309129140 MQTT Record <R,AMZN,T,G,R1> NASDAQ OMX Record
  • 80. Generated data Available for analysis Data volume Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011 IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  • 82.
  • 83.
  • 84. Big data: best served fresh Big data Hourly server logs: were your systems misbehaving one hour ago Weekly / Monthly Bill: what you spent this billing cycle Daily customer-preferences report from your website’s click stream: what deal or ad to try next time Daily fraud reports: was there fraud yesterday Real-time big data Amazon CloudWatch metrics: what went wrong now Real-time spending alerts/caps: prevent overspending now Real-time analysis: what to offer the current customer now Real-time detection: block fraudulent use now
  • 85. Streaming Data Scenarios Across Verticals Scenarios/ Verticals Accelerated Ingest- Transform-Load Continuous Metrics Generation Responsive Data Analysis Digital Ad Tech/Marketing Publisher, bidder data aggregation Advertising metrics like coverage, yield, and conversion User engagement with ads, optimized bid/buy engines IoT Sensor, device telemetry data ingestion Operational metrics and dashboards Device operational intelligence and alerts Gaming Online data aggregation, e.g., top 10 players Massively multiplayer online game (MMOG) live dashboard Leader board generation, player-skill match Consumer Online Clickstream analytics Metrics like impressions and page views Recommendation engines, proactive care
  • 86. Customer Use Cases Sonos runs near real-time streaming analytics on device data logs from their connected hi-fi audio equipment. Analyzing 30TB+ clickstream data enabling real-time insights for Publishers. Glu Mobile collects billions of gaming events data points from millions of user devices in real-time every single day. Nordstorm Nordstorm recommendation team built online stylist using Amazon Kinesis Streams and AWS Lambda.
  • 88.
  • 89. Two Main Processing Patterns Stream processing (real time) • Real-time response to events in data streams Examples: • Proactively detect hardware errors in device logs • Notify when inventory drops below a threshold • Fraud detection Micro-batching (near real time) • Near real-time operations on small batches of events in data streams Examples: • Aggregate and archive events • Monitor performance SLAs
  • 90. Amazon Kinesis Streams • For Technical Developers • Build your own custom applications that process or analyze streaming data Amazon Kinesis Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into S3, Amazon Redshift and Amazon Elastisearch Amazon Kinesis Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries • Coming Soon NEW! Amazon Kinesis: Streaming Data Made Easy Services make it easy to capture, deliver and process streams on AWS
  • 91. Data Sources App.4 [Machine Learning] AWSEndpoint App.1 [Aggregate & De-Duplicate] Data Sources Data Sources Data Sources App.2 [Metric Extraction] Amazon S3 Amazon Redshift App.3 [Sliding Window Analysis] Availability Zone Shard 1 Shard 2 Shard N Availability Zone Availability Zone Amazon Kinesis Streams Managed service for real-time streaming AWS Lambda Amazon EMR
  • 92. Amazon Kinesis Firehose Load massive volumes of streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch Capture and submit streaming data to Firehose Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into S3, Amazon Redshift and Amazon Elasticsearch
  • 93. Amazon Kinesis Firehose vs. Amazon Kinesis Streams Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks. Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift and Amazon Elasticsearch, and a data latency of 60 seconds or higher.
  • 94. Amazon Kinesis Analytics • Interact with streaming data in real-time using SQL • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms Connect to Kinesis streams, Firehose delivery streams Run standard SQL queries against data streams Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real-time
  • 96. Problem: Count top 10 hashtags on Twitter and display
  • 98. twitter-trends.com The solution: streaming Map/Reduce My top-10 My top-10 My top-10 Global top-10
  • 99. twitter-trends.com Core concepts My top-10 My top-10 My top-10 Global top-10 Data record Stream Partition key Shard Worker Shard: 14 17 18 21 23 Data record Sequence number
  • 100. twitter-trends.com How this relates to Amazon Kinesis Amazon Kinesis Amazon Kinesis application
  • 101. Core concepts recapped Data record - Tweet Stream - All tweets (the Twitter Firehose) Partition key - Twitter topic (every tweet record belongs to exactly one of these) Shard - All the data records belonging to a set of Twitter topics that will get grouped together Sequence number - Each data record gets one assigned when first ingested Worker - Processes the records of a shard in sequence number order
  • 102. twitter-trends.com Using the Amazon Kinesis API directly A m a z o n K i n e s i s iterator = getShardIterator(shardId, LATEST); while (true) { [records, iterator] = getNextRecords(iterator, maxRecsToReturn); process(records); } process(records): { for (record in records) { updateLocalTop10(record); } if (timeToDoOutput()) { writeLocalTop10ToDDB(); } } while (true) { localTop10Lists = scanDDBTable(); updateGlobalTop10List( localTop10Lists); sleep(10); }
  • 103. A m a z o n K i n i s i s twitter-trends.com Challenges with using the Amazon Kinesis API directly Kinesis application Manual creation of workers and assignment to shards How many workers per EC2 instance? How many EC2 instances?
  • 104. A m a z o n K i n e s i s twitter-trends.com Using the Amazon Kinesis Client Library Kinesis application Shard mgmt table
  • 105. A m a z o n K i n e s i s twitter-trends.com Elastic scale and load balancing Shard mgmt table Auto scaling Group Amazon CloudWatch
  • 106. A m a z o n K i n e s i s twitter-trends.com Shard management Shard mgmt table Auto scaling Group Amazon CloudWatch Amazon SNS AWS Console
  • 107. • Streams are made of shards • Each shard ingests up to 1MB/sec, and 1000 records/sec • Each shard emits up to 2 MB/sec • All data is stored for 24 hours by default; storage can be extended for up to 7 days • Scale Kinesis streams using scaling util • Replay data inside of 24-hour window
  • 108. Sample Code for Scaling Shards java -cp KinesisScalingUtils.jar-complete.jar -Dstream-name=MyStream -Dscaling-action=scaleUp -Dcount=10 -Dregion=eu-west-1 ScalingClient Options: • stream-name - The name of the stream to be scaled • scaling-action - The action to be taken to scale. Must be one of "scaleUp”, "scaleDown" or “resize” • count - Number of shards by which to absolutely scale up or down, or resize See https://github.com/awslabs/amazon-kinesis-scaling-utils
  • 109. A m a z o n K i n e s i s twitter-trends.com The challenges of fault tolerance X X X Shard mgmt table
  • 110. A m a z o n K i n e s i s twitter-trends.com Fault tolerance support in KCL Shard mgmt table X Availability Zone 1 Availability Zone 3
  • 111. Checkpoint, replay design pattern Amazon Kinesis 1417182123 Shard-i 235810 Shard ID Lock Seq num Shard-i Host A Host B Shard ID Local top-10 shard-i 0 10 18 X2 3 5 8 10 14 17 18 21 23 0 310 Host AHost B {#Obama: 10235, #Seahawks: 9835, …}{#Obama: 10235, #Congress: 9910, …} 1023 14 17 18 21 23
  • 112. Amazon Kinesis Producer Library • Writes to one or more Amazon Kinesis streams with automatic, configurable retry mechanism • Collects records and uses PutRecords to write multiple records to multiple shards per request • Aggregates user records to increase payload size and improve throughput • Integrates seamlessly with KCL to de-aggregate batched records • Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance
  • 113. Putting Data into Amazon Kinesis Amazon Kinesis Agent – (supports pre-processing) • http://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html Pre-batch before Puts for better efficiency • Consider Flume, Fluentd as collectors/agents • See https://github.com/awslabs/aws-fluent-plugin-kinesis Make a tweak to your existing logging • log4j appender option • See https://github.com/awslabs/kinesis-log4j-appender
  • 114. Managed Buffer • Care about a reliable, scalable way to capture data • Defer all other aggregation to consumer • Generate Random Partition Keys • Ensure a high cardinality for Partition Keys with respect to shards, to spray evenly across available shards Putting Data into Amazon Kinesis Streams Streaming Map-Reduce • Streaming Map-Reduce: leverage partition keys as a natural way to aggregate data • For e.g. Partition Keys per billing customer, per DeviceId, per stock symbol • Design partition keys to scale • Be aware of “hot partition keys or shards ” Determine your partition key strategy
  • 115. Amazon EMR Amazon Kinesis StreamsStreaming Input Tumbling/Fixed Window Aggregation Periodic Output Amazon Redshift COPY from Amazon EMR Common Integration Pattern with Amazon EMR Tumbling Window Reporting
  • 116. Apache Spark and Amazon Kinesis Streams Apache Spark is an in-memory analytics cluster using RDD for fast processing Spark Streaming can read directly from an Amazon Kinesis stream Amazon software license linking – Add ASL dependency to SBT/MAVEN project, artifactId = spark- streaming-kinesis-asl_2.10 KinesisUtils.createStream(‘twitter-stream’) .filter(_.getText.contains(”Open-Source")) .countByWindow(Seconds(5)) Example: Counting tweets on a sliding window
  • 117. Using Spark Streaming with Amazon Kinesis Streams 1. Use Spark 1.6+ with EMRFS consistent view option – if you use Amazon S3 as storage for Spark checkpoint 2. Amazon DynamoDB table name – make sure there is only one instance of the application running with Spark Streaming 3. Enable Spark-based checkpoints 4. Number of Amazon Kinesis receivers is multiple of executors so they are load-balanced 5. Total processing time is less than the batch interval 6. Number of executors is the same as number of cores per executor 7. Spark Streaming uses default of 1 sec with KCL
  • 118. Amazon Kinesis Streams with AWS Lambda
  • 119. Conclusion • Amazon Kinesis offers: managed service to build applications, streaming data ingestion, and continuous processing • Ingest aggregate data using Amazon Producer Library • Process data using Amazon Connector Library and open source connectors • Determine your partition key strategy • Try out Amazon Kinesis at http://aws.amazon.com/kinesis/
  • 120. • Technical documentation • Amazon Kinesis Agent • Amazon Kinesis Streams and Spark Streaming • Amazon Kinesis Producer Library Best Practice • Amazon Kinesis Firehose and AWS Lambda • Building Near Real-Time Discovery Platform with Amazon Kinesis • Public case studies • Glu mobile – Real-Time Analytics • Hearst Publishing – Clickstream Analytics • How Sonos Leverages Amazon Kinesis • Nordstorm Online Stylist Reference
  • 121. 3:00 PM Break 3:15 PM Getting started with Amazon Machine Learning 4:15 PM Building your first big data application on AWS Next…
  • 122. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting Started with Amazon Machine Learning Build smart applications by finding patterns and predictions in your data @alexbcbr Alex Coqueiro Sr. Manager Solutions Architect, Public Sector Team
  • 123. I want to see an example !!!
  • 125. Machine learning and smart applications Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available
  • 126. Machine learning and smart applications Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available Your data + machine learning = smart applications
  • 127. Smart applications by example Based on what you know about the user: Will they use your product?
  • 128. Smart applications by example Based on what you know about the user: Will they use your product? Based on what you know about an order: Is this order fraudulent?
  • 129. Smart applications by example Based on what you know about the user: Will they use your product? Based on what you know about an order: Is this order fraudulent? Based on what you know about a news article: What other articles are interesting?
  • 130. Challenges to Building Smart Applications Today Expertise Technology Operationalization Limited supply of data scientists Many choices, few mainstays Complex and error- prone data workflows Expensive to hire or outsource Difficult to use and scale Custom platforms and APIs Many moving pieces lead to custom solutions every time Reinventing the model lifecycle management wheel
  • 132. Amazon Machine Learning Easy to use, managed machine learning service built for developers Robust, powerful machine learning technology based on Amazon’s internal systems Create models using your data already stored in the AWS cloud Deploy models to production in seconds
  • 133. Three Supported Types of Predictions Binary Classification Predict the answer to a Yes/No question Multi-class classification Predict the correct category from a list Regression Predict the value of a numeric variable
  • 134. How Do I Get started Using Amazon Machine Learning?
  • 135. Build model Evaluate and optimize Retrieve predictions 1 2 3 Building smart applications with Amazon ML
  • 136. Train model Evaluate and optimize Retrieve predictions 1 2 3 Building smart applications with Amazon ML - Create a Datasource object pointing to your data - Explore and understand your data - Transform data and train your model
  • 137. Create a Datasource object >>> import boto >>> ml = boto.connect_machinelearning() >>> ds = ml.create_data_source_from_s3( data_source_id = ’my_datasource', data_spec= { 'DataLocationS3':'s3://bucket/input/', 'DataSchemaLocationS3':'s3://bucket/input/.schema'}, compute_statistics = True)
  • 139. Train your model >>> import boto >>> ml = boto.connect_machinelearning() >>> model = ml.create_ml_model( ml_model_id=’my_model', ml_model_type='REGRESSION', training_data_source_id='my_datasource')
  • 140. Train model Evaluate and optimize Retrieve predictions 1 2 3 Building smart applications with Amazon ML - Understand model quality - Adjust model interpretation
  • 144. Train model Evaluate and optimize Retrieve predictions 1 2 3 Building smart applications with Amazon ML - Batch predictions - Real-time predictions
  • 145. Batch predictions Asynchronous, large-volume prediction generation Request through service console or API Best for applications that deal with batches of data records >>> import boto >>> ml = boto.connect_machinelearning() >>> model = ml.create_batch_prediction( batch_prediction_id = 'my_batch_prediction’ batch_prediction_data_source_id = ’my_datasource’ ml_model_id = ’my_model', output_uri = 's3://examplebucket/output/’)
  • 146. Real-time predictions Synchronous, low-latency, high-throughput prediction generation Request through service API or server or mobile SDKs Best for interaction applications that deal with individual data records >>> import boto >>> ml = boto.connect_machinelearning() >>> ml.predict( ml_model_id=’my_model', predict_endpoint=’example_endpoint’, record={’key1':’value1’, ’key2':’value2’}) { 'Prediction': { 'predictedValue': 13.284348, 'details': { 'Algorithm': 'SGD', 'PredictiveModelType': 'REGRESSION’ } } }
  • 148. Learning Model Machine Learning Supervised Unsupervised Predictive Model Classification, Regression Descritive Models Clusterization, Dimension Reduction
  • 149. Agenda • Why did we build Amazon Machine Learning? • What is Amazon Machine Learning? • How do I get started using Amazon Machine Learning? • Q&A
  • 150. Pay-as-you-go and inexpensive Data analysis, model training, and evaluation: $0.42/instance hour Batch predictions: $0.10/1000 Real-time predictions: $0.10/1000 + hourly capacity reservation charge
  • 151. Why Did We Build Amazon Machine Learning?
  • 152. Three types of data-driven development Retrospective analysis and reporting Amazon Redshift Amazon RDS Amazon S3 Amazon EMR
  • 153. Three types of data-driven development Retrospective analysis and reporting Here-and-now real-time processing and dashboards Amazon Kinesis Amazon EC2 AWS Lambda Amazon Redshift, Amazon RDS Amazon S3 Amazon EMR
  • 154. Three types of data-driven development Retrospective analysis and reporting Here-and-now real-time processing and dashboards Predictions to enable smart applications Amazon Kinesis Amazon EC2 AWS Lambda Amazon Redshift, Amazon RDS Amazon S3 Amazon EMR
  • 155. Architecture Patterns for Smart Applications
  • 156. Batch predictions with Amazon EMR Query for predictions with Amazon ML batch API Process data with Amazon EMR Raw data in Amazon S3 Aggregated data in Amazon S3 Predictions in Amazon S3 Your application
  • 157. Batch predictions with Amazon Redshift Structured data In Amazon Redshift Load predictions into Amazon Redshift -or- Read prediction results directly from Amazon S3 Predictions in Amazon S3 Query for predictions with Amazon ML batch API Your application
  • 158. Real-time predictions for interactive applications Your application Query for predictions with Amazon ML real-time API
  • 159. Adding predictions to an existing data flow Your application Amazon DynamoDB + Trigger event with AWS Lambda + Query for predictions with Amazon ML real-time API
  • 160. Get Started Quickly Create, access, and manage all Amazon ML entities through the AWS Management Console Easily learn to build a model with the tutorial dataset provided Add prediction capabilities to your iOS and Android applications with AWS Mobile SDK Use Amazon ML APIs, CLIs, or SDKs
  • 161. 4:15 PM Building your first big data application on AWS Next…
  • 162. Building Your First Big Data Application on AWS Dario Rivera Solutions Architect
  • 163. Your First Big Data Application on AWS PROCESS STORE ANALYZE & VISUALIZE COLLECT
  • 164. Your First Big Data Application on AWS PROCESS: Amazon EMR with Spark & Hive STORE ANALYZE & VISUALIZE: Amazon Redshift and Amazon QuickSight COLLECT: Amazon Kinesis Firehose
  • 166. Setting up the environment
  • 167. Data Storage with Amazon S3 Download all the CLI steps: http://bit.ly/aws-big-data-steps Create an Amazon S3 bucket to store the data collected with Amazon Kinesis Firehose aws s3 mb s3://YOUR-S3-BUCKET-NAME
  • 168. Access Control with IAM Create an IAM role to allow Firehose to write to the S3 bucket firehose-policy.json: { "Version": "2012-10-17", "Statement": { "Effect": "Allow", "Principal": {"Service": "firehose.amazonaws.com"}, "Action": "sts:AssumeRole" } }
  • 169. Access Control with IAM Create an IAM role to allow Firehose to write to the S3 bucket s3-rw-policy.json: { "Version": "2012-10-17", "Statement": { "Effect": "Allow", "Action": "s3:*", "Resource": [ "arn:aws:s3:::YOUR-S3-BUCKET-NAME", "arn:aws:s3:::YOUR-S3-BUCKET-NAME/*" ] } }
  • 170. Access Control with IAM Create an IAM role to allow Firehose to write to the S3 bucket aws iam create-role --role-name firehose-demo --assume-role-policy-document file://firehose-policy.json Copy the value in “Arn” in the output, e.g., arn:aws:iam::123456789:role/firehose-demo aws iam put-role-policy --role-name firehose-demo --policy-name firehose-s3-rw --policy-document file://s3-rw-policy.json
  • 171. Data Collection with Amazon Kinesis Firehose Create a Firehose stream for incoming log data aws firehose create-delivery-stream --delivery-stream-name demo-firehose-stream --s3-destination-configuration RoleARN=YOUR-FIREHOSE-ARN, BucketARN="arn:aws:s3:::YOUR-S3-BUCKET-NAME", Prefix=firehose/, BufferingHints={IntervalInSeconds=60}, CompressionFormat=GZIP
  • 172. Data Processing with Amazon EMR Launch an Amazon EMR cluster with Spark and Hive aws emr create-cluster --name "demo" --release-label emr-4.5.0 --instance-type m3.xlarge --instance-count 2 --ec2-attributes KeyName=YOUR-AWS-SSH-KEY --use-default-roles --applications Name=Hive Name=Spark Name=Zeppelin-Sandbox Record your ClusterId from the output.
  • 173. Data Analysis with Amazon Redshift Create a single-node Amazon Redshift data warehouse: aws redshift create-cluster --cluster-identifier demo --db-name demo --node-type dc1.large --cluster-type single-node --master-username master --master-user-password YOUR-REDSHIFT-PASSWORD --publicly-accessible --port 8192
  • 175. Weblogs – Common Log Format (CLF) 75.35.230.210 - - [20/Jul/2009:22:22:42 -0700] "GET /images/pigtrihawk.jpg HTTP/1.1" 200 29236 "http://www.swivel.com/graphs/show/1163466" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)"
  • 176. Writing into Amazon Kinesis Firehose Download the demo weblog: http://bit.ly/aws-big-data Open Python and run the following code to import the log into the stream: import boto3 iam = boto3.client('iam') firehose = boto3.client('firehose') with open('weblog', 'r') as f: for line in f: firehose.put_record( DeliveryStreamName='demo-firehose-stream', Record={'Data': line} ) print 'Record added'
  • 178. Apache Spark • Fast, general purpose engine for large-scale data processing • Write applications quickly in Java, Scala, or Python • Combine SQL, streaming, and complex analytics
  • 179. Spark SQL Spark's module for working with structured data using SQL Run unmodified Hive queries on existing data
  • 180. Apache Zeppelin • Web-based notebook for interactive analytics • Multiple language backend • Apache Spark integration • Data visualization • Collaboration https://zeppelin.incubator.apache.org/
  • 181. View the Output Files in Amazon S3 After about 1 minute, you should see files in your S3 bucket: aws s3 ls s3://YOUR-S3-BUCKET-NAME/firehose/ --recursive
  • 182. Connect to Your EMR Cluster and Zeppelin aws emr describe-cluster --cluster-id YOUR-EMR-CLUSTER-ID Copy the MasterPublicDnsName. Use port forwarding so you can access Zeppelin at http://localhost:18890 on your local machine. ssh -i PATH-TO-YOUR-SSH-KEY -L 18890:localhost:8890 hadoop@YOUR-EMR-DNS-NAME Open Zeppelin with your local web browser and create a new “Note”: http://localhost:18890
  • 183. Exploring the Data in Amazon S3 using Spark Download the Zeppelin notebook: http://bit.ly/aws-big-data-zeppelin // Load all the files from S3 into a RDD val accessLogLines = sc.textFile("s3://YOUR-S3-BUCKET-NAME/firehose/*/*/*/*") // Count the lines accessLogLines.count // Print one line as a string accessLogLines.first // delimited by space so split them into fields var accessLogFields = accessLogLines.map(_.split(" ").map(_.trim)) // Print the fields of a line accessLogFields.first
  • 184. Combine Fields: “A, B, C”  “A B C” var accessLogColumns = accessLogFields .map( arrayOfFields => { var temp1 =""; for (field <- arrayOfFields) yield { var temp2 = "" if (temp1.replaceAll("[",""").startsWith(""") && !temp1.endsWith(""")) temp1 = temp1 + " " + field.replaceAll("[|]",""") else temp1 = field.replaceAll("[|]",""") temp2 = temp1 if (temp1.endsWith(""")) temp1 = "" temp2 }}) .map( fields => fields.filter(field => (field.startsWith(""") && field.endsWith(""")) || !field.startsWith(""") )) .map(fields => fields.map(_.replaceAll(""","")))
  • 185. Create a Data Frame and Transform the Data import java.sql.Timestamp import java.net.URL case class accessLogs( ipAddress: String, requestTime: Timestamp, requestMethod: String, requestPath: String, requestProtocol: String, responseCode: String, responseSize: String, referrerHost: String, userAgent: String )
  • 186. Create a Data Frame and Transform the Data val accessLogsDF = accessLogColumns.map(line => { var ipAddress = line(0) var requestTime = new Timestamp(new java.text.SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z").parse(line(3)).getTime()) var requestString = line(4).split(" ").map(_.trim()) var requestMethod = if (line(4).toString() != "-") requestString(0) else "" var requestPath = if (line(4).toString() != "-") requestString(1) else "" var requestProtocol = if (line(4).toString() != "-") requestString(2) else "" var responseCode = line(5).replaceAll("-","") var responseSize = line(6).replaceAll("-","") var referrerHost = line(7) var userAgent = line(8) accessLogs(ipAddress, requestTime, requestMethod, requestPath, requestProtocol,responseCode, responseSize, referrerHost, userAgent) }).toDF()
  • 187. Create an External Table Backed by Amazon S3 %sql CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, request_method String, request_path String, request_protocol String, response_code String, response_size String, referrer_host String, user_agent String ) PARTITIONED BY (year STRING,month STRING, day STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed'
  • 188. Configure Hive Partitioning and Compression // set up Hive's "dynamic partitioning” %sql SET hive.exec.dynamic.partition=true // compress output files on Amazon S3 using Gzip %sql SET hive.exec.compress.output=true %sql SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec %sql SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec %sql SET hive.exec.dynamic.partition.mode=nonstrict;
  • 189. Write Output to Amazon S3 import org.apache.spark.sql.SaveMode accessLogsDF .withColumn("year", year(accessLogsDF("requestTime"))) .withColumn("month", month(accessLogsDF("requestTime"))) .withColumn("day", dayofmonth(accessLogsDF("requestTime"))) .write .partitionBy("year","month","day") .mode(SaveMode.Overwrite) .insertInto("access_logs")
  • 190. Query the Data Using Spark SQL // Check the count of records %sql select count(*) from access_log_processed // Fetch the first 10 records %sql select * from access_log_processed limit 10
  • 191. View the Output Files in Amazon S3 Leave Zeppelin and go back to the console… List the partition prefixes and output files: aws s3 ls s3://YOUR-S3-BUCKET-NAME/access-log-processed/ --recursive
  • 193. Connect to Amazon Redshift Using the PostgreSQL CLI psql -h YOUR-REDSHIFT-ENDPOINT -p 8192 -U master demo Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support • Aginity Workbench for Amazon Redshift • SQL Workbench/J
  • 194. Create an Amazon Redshift Table to Hold Your Data CREATE TABLE accesslogs ( host_address varchar(512), request_time timestamp, request_method varchar(5), request_path varchar(1024), request_protocol varchar(10), response_code Int, response_size Int, referrer_host varchar(1024), user_agent varchar(512) ) DISTKEY(host_address) SORTKEY(request_time);
  • 195. Loading Data into Amazon Redshift “COPY” command loads files in parallel from Amazon S3: COPY accesslogs FROM 's3://YOUR-S3-BUCKET-NAME/access-log-processed' CREDENTIALS 'aws_iam_role=arn:aws:iam::YOUR-AWS-ACCOUNT-ID:role/ROLE-NAME' DELIMITER 't' MAXERROR 0 GZIP;
  • 196. Amazon Redshift Test Queries -- find distribution of response codes over days SELECT TRUNC(request_time), response_code, COUNT(1) FROM accesslogs GROUP BY 1,2 ORDER BY 1,3 DESC; -- find the 404 status codes SELECT COUNT(1) FROM accessLogs WHERE response_code = 404; -- show all requests for status as PAGE NOT FOUND SELECT TOP 1 request_path,COUNT(1) FROM accesslogs WHERE response_code = 404 GROUP BY 1 ORDER BY 2 DESC;
  • 198. Automating the Big Data Application Amazon Kinesis Firehose Amazon EMR Amazon S3 Amazon Redshift Amazon QuickSight Amazon S3 Event Notification Spark job List of objects from Lambda Write to Amazon Redshift using spark-redshift
  • 199. Learn from big data experts blogs.aws.amazon.com/bigdata
  • 200. Thank you to our Sponsor