Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS that you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types that enable high-performance analytics.
AWS Canberra WWPS Summit 2013 - Big Data with AWS
1. 2013 AWS WWPS Summit
Canberra, Australia
Big Data with AWS
Glenn Gore
Sr Manager, AWS
2. 2013 AWS WWPS Summit,
Canberra – May 23
Overview
• The Big Data Challenge
• Big Data tools and what we can do with them
• Packetloop – Big Data Security Analytics
• Intel technology for big data
3. An engineer’s definition
When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them
7. (chart: data volume – generated data vs. data available for analysis)
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
8. Amazon Web Services helps remove constraints
9. Remove constraints = more experimentation
More experimentation = more innovation
More innovation = competitive edge
10. Big Data tools: Elastic MapReduce and Redshift
11. EMR is Hadoop in the Cloud
12. What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
Easy to provision and scale
No upfront costs, pay as you go
High performance at a low price
Open and flexible with support for popular BI tools
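Bulk loading into Redshift is typically done with the SQL COPY command reading from S3. As a hedged sketch of what that looks like (the table name, bucket path and credentials below are placeholders of my own, not from the slides), a small helper can assemble the statement:

```python
def build_copy_statement(table, s3_path, access_key, secret_key, delimiter="|"):
    """Assemble a Redshift COPY statement that bulk-loads delimited data
    from S3. All argument values are caller-supplied placeholders; the
    CREDENTIALS clause shown is the 2013-era key-pair auth option."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS 'aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}' "
        f"DELIMITER '{delimiter}';"
    )

# Hypothetical table and bucket, for illustration only:
stmt = build_copy_statement("water_obs", "s3://my-bucket/clean/", "AKIA...", "SECRET")
```

You would run the resulting statement through any PostgreSQL-compatible client connected to the cluster, which is what "support for popular BI tools" rests on.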
13. Big Data tools: Elastic MapReduce and Redshift
14. How does EMR work?
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
(You can also store everything in HDFS instead)
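The launch step above can be sketched with boto, the era-appropriate Python SDK. Since actually starting a cluster requires AWS credentials, this sketch only builds the keyword arguments you would hand to boto's `EmrConnection.run_jobflow()`; every concrete value (cluster name, instance type, node count) is an illustrative assumption, not something from the slides:

```python
def emr_jobflow_kwargs(name, num_instances, instance_type, keep_alive=False):
    """Build keyword arguments for an EMR job flow launch.

    With credentials configured you would pass these to boto's
    EmrConnection.run_jobflow(**kwargs). All values are illustrative.
    """
    return {
        "name": name,
        "master_instance_type": instance_type,
        "slave_instance_type": instance_type,
        "num_instances": num_instances,
        # keep_alive=False -> ad-hoc cluster that terminates after its steps;
        # keep_alive=True  -> persistent "alive" cluster
        "keep_alive": keep_alive,
        "enable_debugging": False,
    }

kwargs = emr_jobflow_kwargs("demo-cluster", 10, "m1.large")
```

The `keep_alive` flag is the programmatic face of the ad-hoc vs. "alive" cluster patterns discussed on the following slides.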
15. What can you run on EMR… (diagram: an EMR cluster reading from and writing to S3)
16. Resize Nodes
You can easily add and remove nodes (diagram: EMR cluster resizing, with data in S3)
17. Resize Nodes with Spot Instances
Cost without Spot
10 node cluster running for 14 hours
Cost = 1.2 * 10 * 14 = $168
19. Resize Nodes with Spot Instances
Cost without Spot:
10 node cluster running for 14 hours
Cost = 1.2 * 10 * 14 = $168
Add 10 nodes on Spot:
20 node cluster running for 7 hours
Cost = 1.2 * 10 * 7 = $84 (on-demand)
     + 0.6 * 10 * 7 = $42 (spot)
     = $126 total
25% reduction in price
50% reduction in time
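The arithmetic above can be checked directly. A minimal sketch, using the slide's per-node-hour rates of $1.20 on-demand and $0.60 spot:

```python
def cluster_cost(on_demand_nodes, spot_nodes, hours,
                 on_demand_rate=1.2, spot_rate=0.6):
    """Total cluster cost: (on-demand + spot) node-hours at their rates."""
    return (on_demand_nodes * on_demand_rate + spot_nodes * spot_rate) * hours

baseline = cluster_cost(10, 0, 14)   # 10 on-demand nodes, 14 hours -> $168
with_spot = cluster_cost(10, 10, 7)  # 20 nodes (half spot), 7 hours -> $126

price_cut = 1 - with_spot / baseline  # 0.25 -> 25% cheaper
time_cut = 1 - 7 / 14                 # 0.50 -> 50% faster
```

Doubling the cluster with spot capacity halves the runtime, so the on-demand half bills for half as long and the saving compounds.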
20. Ad-Hoc Clusters – What are they? (1)
When processing is complete, you can terminate the cluster (and stop paying)
21. Ad-Hoc Clusters – When to use (1)
Not using HDFS
Not using the cluster 24/7
Transient jobs
22. “Alive” Clusters – What are they? (2)
If you run your jobs 24/7, you can also run a persistent cluster and use Reserved Instance (RI) pricing models to save costs
23. “Alive” Clusters – When? (2)
Frequently running jobs
Dependencies on map-reduce-map outputs
24. S3 instead of HDFS (3)
• S3 provides 99.999999999% durability
• Elastic
• Version control against failure
• Run multiple clusters with a single source of truth
• Quick recovery from failure
• Continuously resize clusters
25. S3 and HDFS (4)
Load data from S3 into HDFS using S3DistCp
Benefits of HDFS
Master copy of the data in S3
Get all the benefits of S3
26. Big Data tools: Elastic MapReduce and Redshift
27. Reporting Data-warehouse (1)
(diagram: OLTP systems on an RDBMS and ERP applications feed Redshift, which serves Reporting and BI)
28. Live Archive for (Structured) Big Data (2)
(diagram: web apps handle OLTP on DynamoDB, with data archived into Redshift for Reporting and BI)
29. Cloud ETL for Big Data (3)
(diagram: data in S3 is transformed with Elastic MapReduce and loaded into Redshift for Reporting and BI)
30. Tool comparison:

|                   | Streaming | Hive         | Pig       | DynamoDB       | Redshift |
|-------------------|-----------|--------------|-----------|----------------|----------|
| Unstructured Data | ✓         |              | ✓         |                |          |
| Structured Data   |           | ✓            | ✓         | ✓              | ✓        |
| Language Support  | Any*      | HQL          | Pig Latin | Client         | SQL      |
| SQL               |           | ✓ (SQL-like) |           |                | ✓        |
| Volume            | Unlimited | Unlimited    | Unlimited | Relatively Low | 1.6 PB   |
| Latency           | Medium    | Medium       | Medium    | Ultra Low      | Low      |
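The "Streaming" column refers to Hadoop Streaming, which lets any executable (hence "Any*" language support) act as mapper and reducer over stdin/stdout. A minimal word-count sketch; the function names and local simulation are mine, and in a real job each function would read `sys.stdin` line by line:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce phase: sum counts per word.
    Hadoop's shuffle sorts mapper output by key before the reducer runs."""
    pairs = (line.rsplit("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Locally simulate map -> shuffle/sort -> reduce:
mapped = sorted(mapper(["big data big"]))
counts = dict(line.split("\t") for line in reducer(mapped))
```

On EMR the same two scripts would be registered as the `-mapper` and `-reducer` of a streaming step, with S3 paths as input and output.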
34. Project Background
• “Australia is the driest inhabited continent on
Earth, yet is among the world’s highest
consumers of water.” - CSIRO: Water overview
35. National Water Initiative
• A shared agreement by State Governments to increase the
efficiency of Australia’s water use. Under this initiative, State
Governments have made commitments to:
I. Prepare water plans with provisions for the environment
II. Deal with over-allocated or stressed water systems
III. Introduce registers of water rights and standards for water accounting
IV. Expand the trade of water
V. Improve pricing for water storage and delivery
VI. Meet and manage urban water demands
http://www.nationalwatermarket.gov.au/rules-restrictions/national-rules.html
36. Water Data in South Australia (SA)
• In SA, the Department of Environment, Water and Natural Resources (DEWNR) collects water-related data from various sources
• The data is stored in multiple systems
• Hydstra (Legacy Foxpro DB)
• SQL Server Data Warehouse
• This data is currently supplied to the Bureau of Meteorology (BOM) for its analytics applications and to other agencies
37. Current Process at DEWNR
(diagram: field sensors and other data sources send raw data into Hydstra (FoxPro DB) and a SQL Server data warehouse / data mart; outputs feed a GIS application, WDTF exports and analysis)
38. Water Data Transfer Format (WDTF)
• DEWNR and BOM exchange data generated from the current process in Water Data Transfer Format (WDTF)
• WDTF is a national XML standard for exchanging water information
39. Current Limitations
• The current architecture relies on multiple systems running on legacy software, i.e., Hydstra (FoxPro DB)
• This leads to increased costs and inefficiency in
service delivery
• Current architecture does not fully utilise WDTF as
the universal data format standard
40. Objectives
• DEWNR wants to use data in WDTF format to
generate analytical data similar to BOM for public
consumption (Open Data: Open Technology
Foundation is a facilitator for SA Gov.)
• To reduce system operation costs by migrating from on-premises systems to the cloud
42. Data Pipeline
Raw files (on premise, zipped WDTF) → copy and unzip → raw WDTF files in S3 → parse → clean data in S3 and Redshift (CSV/JSON) → query → analyzed observation data in S3 (CSV/JSON) → Open Data web site (dashboard)
The stages are orchestrated by AWS Data Pipeline
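The "parse" stage flattens WDTF XML into CSV/JSON records. The real WDTF schema is far richer; the element and attribute names below are a deliberately simplified stand-in of my own, just to show the shape of the transformation:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, illustrative stand-in for a WDTF observation document
# (the actual WDTF national standard defines a much richer schema).
WDTF_LIKE = """<observations site="A4260001">
  <obs time="2013-05-23T10:00:00" value="1.42"/>
  <obs time="2013-05-23T11:00:00" value="1.39"/>
</observations>"""

def wdtf_to_csv(xml_text):
    """Flatten observation elements into CSV rows: site,time,value."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["site", "time", "value"])
    for obs in root.iter("obs"):
        writer.writerow([root.get("site"), obs.get("time"), obs.get("value")])
    return out.getvalue()

rows = wdtf_to_csv(WDTF_LIKE).splitlines()
```

The resulting flat file is what Redshift's COPY command and the dashboard's query stage can consume directly.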
44. Future Roadmap
• Water data from entire Australia
• Open access to water data
• Real-time water data analysis
45. Summary
• Benefit of cloud for water data management
• Streamlined data management process
• Open data hosting
• Cost effective
• Project progress
• Data migration onto cloud
• Real-time data analysis
• ☐ Open access to water data