Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS that you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types that enable high-performance analytics.
AWS Canberra WWPS Summit 2013 - Big Data with AWS
1. 2013 AWS WWPS Summit
Canberra, Australia
Big Data with AWS
Glenn Gore
Sr Manager, AWS
2. 2013 AWS WWPS Summit,
Canberra – May 23
Overview
• The Big Data Challenge
• Big Data tools and what we can do with them
• Packetloop – Big Data Security Analytics
• Intel technology for big data
3. An engineer’s definition
When your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them
7. (chart: data volume – generated data vs. data available for analysis)
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
8. Amazon Web Services helps remove constraints
9. Remove constraints = more experimentation
More experimentation = more innovation
More innovation = competitive edge
10. Big Data tools: Elastic MapReduce and Redshift
11. EMR is Hadoop in the Cloud
12. What is Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud
Easy to provision and scale
No upfront costs, pay as you go
High performance at a low price
Open and flexible with support for popular BI tools
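Bulk loading into Redshift is typically done with the SQL COPY command reading from S3. As a hedged sketch of what that looks like (the table name, bucket path and credentials below are placeholders of my own, not from the slides), a small helper can assemble the statement:

```python
def build_copy_statement(table, s3_path, access_key, secret_key, delimiter="|"):
    """Assemble a Redshift COPY statement that bulk-loads delimited data
    from S3. All argument values are caller-supplied placeholders; the
    CREDENTIALS clause shown is the 2013-era key-pair auth option."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS 'aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}' "
        f"DELIMITER '{delimiter}';"
    )

# Hypothetical table and bucket, for illustration only:
stmt = build_copy_statement("water_obs", "s3://my-bucket/clean/", "AKIA...", "SECRET")
```

You would run the resulting statement through any PostgreSQL-compatible client connected to the cluster, which is what "support for popular BI tools" rests on.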
13. Big Data tools: Elastic MapReduce and Redshift
14. How does EMR work?
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
(You can also store everything in HDFS instead)
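The launch step above can be sketched with boto, the era-appropriate Python SDK. Since actually starting a cluster requires AWS credentials, this sketch only builds the keyword arguments you would hand to boto's `EmrConnection.run_jobflow()`; every concrete value (cluster name, instance type, node count) is an illustrative assumption, not something from the slides:

```python
def emr_jobflow_kwargs(name, num_instances, instance_type, keep_alive=False):
    """Build keyword arguments for an EMR job flow launch.

    With credentials configured you would pass these to boto's
    EmrConnection.run_jobflow(**kwargs). All values are illustrative.
    """
    return {
        "name": name,
        "master_instance_type": instance_type,
        "slave_instance_type": instance_type,
        "num_instances": num_instances,
        # keep_alive=False -> ad-hoc cluster that terminates after its steps;
        # keep_alive=True  -> persistent "alive" cluster
        "keep_alive": keep_alive,
        "enable_debugging": False,
    }

kwargs = emr_jobflow_kwargs("demo-cluster", 10, "m1.large")
```

The `keep_alive` flag is the programmatic face of the ad-hoc vs. "alive" cluster patterns discussed on the following slides.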
15. What can you run on EMR… (diagram: an EMR cluster reading from and writing to S3)
16. Resize Nodes
You can easily add and remove nodes (diagram: EMR cluster resizing, with data in S3)
17. Resize Nodes with Spot Instances
Cost without Spot
10 node cluster running for 14 hours
Cost = 1.2 * 10 * 14 = $168
19. Resize Nodes with Spot Instances
Cost without Spot:
10 node cluster running for 14 hours
Cost = 1.2 * 10 * 14 = $168
Add 10 nodes on Spot:
20 node cluster running for 7 hours
Cost = 1.2 * 10 * 7 = $84 (on-demand)
     + 0.6 * 10 * 7 = $42 (spot)
     = $126 total
25% reduction in price
50% reduction in time
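The arithmetic above can be checked directly. A minimal sketch, using the slide's per-node-hour rates of $1.20 on-demand and $0.60 spot:

```python
def cluster_cost(on_demand_nodes, spot_nodes, hours,
                 on_demand_rate=1.2, spot_rate=0.6):
    """Total cluster cost: (on-demand + spot) node-hours at their rates."""
    return (on_demand_nodes * on_demand_rate + spot_nodes * spot_rate) * hours

baseline = cluster_cost(10, 0, 14)   # 10 on-demand nodes, 14 hours -> $168
with_spot = cluster_cost(10, 10, 7)  # 20 nodes (half spot), 7 hours -> $126

price_cut = 1 - with_spot / baseline  # 0.25 -> 25% cheaper
time_cut = 1 - 7 / 14                 # 0.50 -> 50% faster
```

Doubling the cluster with spot capacity halves the runtime, so the on-demand half bills for half as long and the saving compounds.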
20. Ad-Hoc Clusters – What are they? (1)
When processing is complete, you can terminate the cluster (and stop paying)
21. Ad-Hoc Clusters – When to use (1)
Not using HDFS
Not using the cluster 24/7
Transient jobs
22. “Alive” Clusters – What are they? (2)
If you run your jobs 24/7, you can also run a persistent cluster and use Reserved Instance (RI) pricing models to save costs
23. “Alive” Clusters – When? (2)
Frequently running jobs
Dependencies on map-reduce-map outputs
24. S3 instead of HDFS (3)
• S3 provides 99.999999999% durability
• Elastic
• Version control against failure
• Run multiple clusters with a single source of truth
• Quick recovery from failure
• Continuously resize clusters
25. S3 and HDFS (4)
Load data from S3 into HDFS using S3DistCp
Benefits of HDFS
Master copy of the data in S3
Get all the benefits of S3
26. Big Data tools: Elastic MapReduce and Redshift
27. Reporting Data-warehouse (1)
(diagram: OLTP systems on an RDBMS and ERP applications feed Redshift, which serves Reporting and BI)
28. Live Archive for (Structured) Big Data (2)
(diagram: web apps handle OLTP on DynamoDB, with data archived into Redshift for Reporting and BI)
29. Cloud ETL for Big Data (3)
(diagram: data in S3 is transformed with Elastic MapReduce and loaded into Redshift for Reporting and BI)
30. Tool comparison:

|                   | Streaming | Hive         | Pig       | DynamoDB       | Redshift |
|-------------------|-----------|--------------|-----------|----------------|----------|
| Unstructured Data | ✓         |              | ✓         |                |          |
| Structured Data   |           | ✓            | ✓         | ✓              | ✓        |
| Language Support  | Any*      | HQL          | Pig Latin | Client         | SQL      |
| SQL               |           | ✓ (SQL-like) |           |                | ✓        |
| Volume            | Unlimited | Unlimited    | Unlimited | Relatively Low | 1.6 PB   |
| Latency           | Medium    | Medium       | Medium    | Ultra Low      | Low      |
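The "Streaming" column refers to Hadoop Streaming, which lets any executable (hence "Any*" language support) act as mapper and reducer over stdin/stdout. A minimal word-count sketch; the function names and local simulation are mine, and in a real job each function would read `sys.stdin` line by line:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce phase: sum counts per word.
    Hadoop's shuffle sorts mapper output by key before the reducer runs."""
    pairs = (line.rsplit("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Locally simulate map -> shuffle/sort -> reduce:
mapped = sorted(mapper(["big data big"]))
counts = dict(line.split("\t") for line in reducer(mapped))
```

On EMR the same two scripts would be registered as the `-mapper` and `-reducer` of a streaming step, with S3 paths as input and output.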
34. Project Background
• “Australia is the driest inhabited continent on
Earth, yet is among the world’s highest
consumers of water.” - CSIRO: Water overview
35. National Water Initiative
• A shared agreement by State Governments to increase the
efficiency of Australia’s water use. Under this initiative, State
Governments have made commitments to:
I. Prepare water plans with provisions for the environment
II. Deal with over-allocated or stressed water systems
III. Introduce registers of water rights and standards for water accounting
IV. Expand the trade of water
V. Improve pricing for water storage and delivery
VI. Meet and manage urban water demands
http://www.nationalwatermarket.gov.au/rules-restrictions/national-rules.html
36. Water Data in South Australia (SA)
• In SA, the Department of Environment, Water and Natural Resources (DEWNR) collects water-related data from various sources
• The data is stored in multiple systems
• Hydstra (Legacy Foxpro DB)
• SQL Server Data Warehouse
• This data is currently supplied to the Bureau of Meteorology (BOM) for its analytics applications and to other agencies
37. Current Process at DEWNR
(diagram: field sensors and other data sources send raw data into Hydstra (FoxPro DB) and a SQL Server data warehouse / data mart; outputs feed a GIS application, WDTF exports and analysis)
38. Water Data Transfer Format (WDTF)
• DEWNR and BOM exchange data generated from the current process in Water Data Transfer Format (WDTF)
• WDTF is a national XML standard for exchanging water information
39. Current Limitations
• The current architecture relies on multiple systems running on legacy software, i.e., Hydstra (FoxPro DB)
• This leads to increased costs and inefficiency in
service delivery
• Current architecture does not fully utilise WDTF as
the universal data format standard
40. Objectives
• DEWNR wants to use data in WDTF format to
generate analytical data similar to BOM for public
consumption (Open Data: Open Technology
Foundation is a facilitator for SA Gov.)
• To reduce system operation costs by migrating from on-premises systems to the cloud
42. Data Pipeline
Raw files (on premise, zipped WDTF) → copy and unzip → raw WDTF files in S3 → parse → clean data in S3 and Redshift (CSV/JSON) → query → analyzed observation data in S3 (CSV/JSON) → Open Data web site (dashboard)
The stages are orchestrated by AWS Data Pipeline
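The "parse" stage flattens WDTF XML into CSV/JSON records. The real WDTF schema is far richer; the element and attribute names below are a deliberately simplified stand-in of my own, just to show the shape of the transformation:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, illustrative stand-in for a WDTF observation document
# (the actual WDTF national standard defines a much richer schema).
WDTF_LIKE = """<observations site="A4260001">
  <obs time="2013-05-23T10:00:00" value="1.42"/>
  <obs time="2013-05-23T11:00:00" value="1.39"/>
</observations>"""

def wdtf_to_csv(xml_text):
    """Flatten observation elements into CSV rows: site,time,value."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["site", "time", "value"])
    for obs in root.iter("obs"):
        writer.writerow([root.get("site"), obs.get("time"), obs.get("value")])
    return out.getvalue()

rows = wdtf_to_csv(WDTF_LIKE).splitlines()
```

The resulting flat file is what Redshift's COPY command and the dashboard's query stage can consume directly.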
44. Future Roadmap
• Water data from entire Australia
• Open access to water data
• Real-time water data analysis
45. Summary
• Benefit of cloud for water data management
• Streamlined data management process
• Open data hosting
• Cost effective
• Project progress
• Data migration onto cloud
• Real-time data analysis
• ☐ Open access to water data