AWS Canberra WWPS Summit 2013 - Big Data with AWS
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.

AWS Canberra WWPS Summit 2013 - Big Data with AWS: Presentation Transcript

  • 2013 AWS WWPS Summit, Canberra, Australia. Big Data with AWS. Glenn Gore, Sr Manager, AWS
  • 2013 AWS WWPS Summit, Canberra – May 23. Overview: the Big Data challenge; Big Data tools and what we can do with them; Packetloop – Big Data security analytics; Intel technology on big data
  • An engineer's definition: when your data sets become so large that you have to start innovating how to collect, store, organize, analyze and share them
  • The data pipeline: Generation → Collection & storage → Analytics & computation → Collaboration & sharing. Generation and collection have become lower cost and higher throughput, while analytics, computation and sharing remain highly constrained
  • [Chart] Generated data vs. data available for analysis, by data volume. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  • Amazon Web Services helps remove constraints
  • Remove constraints = more experimentation. More experimentation = more innovation. More innovation = competitive edge
  • Big Data tools: Elastic MapReduce and Redshift
  • EMR is Hadoop in the Cloud
  • What is Amazon Redshift? A fast, powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud. Easy to provision and scale; no upfront costs, pay as you go; high performance at a low price; open and flexible, with support for popular BI tools
  • Big Data tools: Elastic MapReduce and Redshift
  • How does EMR work? Put the data into S3. Choose the Hadoop distribution, the number and types of nodes, custom configs, Hive/Pig/etc. Launch the cluster using the EMR console, CLI, SDK, or APIs. Get the output from S3. You can also store everything in HDFS
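The launch step above can be sketched with the boto3 SDK (which post-dates this 2013 talk); a minimal sketch in which the bucket names, release label and instance types are hypothetical placeholders:

```python
# Sketch: assemble the request for launching an EMR cluster that reads
# input from S3 and writes output back to S3. All names (bucket, release
# label, instance types) are illustrative, not the presenter's setup.

def build_emr_request(input_path, output_path, node_count=10):
    """Build the parameters for boto3's emr.run_job_flow call."""
    return {
        "Name": "big-data-demo",
        "ReleaseLabel": "emr-6.15.0",  # hypothetical release
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # ad-hoc cluster: terminate (and stop paying) when the steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "wordcount",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hadoop-streaming",
                         "-input", input_path,
                         "-output", output_path],
            },
        }],
    }

request = build_emr_request("s3://my-bucket/input", "s3://my-bucket/output")
# To actually launch (requires AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceCount"])  # 10
```

The request is built as plain data so it can be inspected before anything is launched; the commented-out `run_job_flow` call is the only part that touches AWS.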
  • [Diagram] What can you run on EMR… (S3 feeding an EMR cluster)
  • Resize nodes: you can easily add and remove nodes in a running EMR cluster
  • Resize nodes with Spot Instances. Without Spot: a 10-node cluster runs for 14 hours; cost = 1.2 × 10 × 14 = $168. Add 10 nodes on Spot: a 20-node cluster runs for 7 hours; cost = (1.2 × 10 × 7) + (0.6 × 10 × 7) = $84 + $42 = $126 total. That is a 25% reduction in price and a 50% reduction in time
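The spot-pricing arithmetic from the slides can be checked directly (the $1.20 on-demand and $0.60 spot hourly rates are the example figures from the deck):

```python
# Reproduce the spot-pricing arithmetic from the slides.
ON_DEMAND, SPOT = 1.2, 0.6  # example hourly rates per node, in dollars

cost_without_spot = ON_DEMAND * 10 * 14  # 10 nodes for 14 hours
on_demand_part = ON_DEMAND * 10 * 7      # original 10 nodes, now 7 hours
spot_part = SPOT * 10 * 7                # 10 extra Spot nodes, 7 hours
cost_with_spot = on_demand_part + spot_part

print(cost_without_spot)                       # 168.0
print(cost_with_spot)                          # 126.0
print(1 - cost_with_spot / cost_without_spot)  # 0.25 -> 25% cheaper
print(1 - 7 / 14)                              # 0.5  -> 50% faster
```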
  • Pattern 1, ad-hoc clusters – what are they? When processing is complete, you can terminate the cluster (and stop paying)
  • Pattern 1, ad-hoc clusters – when to use: not using HDFS; not using the cluster 24/7; transient jobs
  • Pattern 2, "alive" clusters – what are they? If you run your jobs 24×7, you can run a persistent cluster and use Reserved Instance pricing models to save costs
  • Pattern 2, "alive" clusters – when? Frequently running jobs; dependencies on map-reduce-map outputs
  • Pattern 3, S3 instead of HDFS: S3 provides 99.999999999% durability; it is elastic; versioning protects against failure; you can run multiple clusters against a single source of truth, recover quickly from failure, and continuously resize clusters
  • Pattern 4, S3 and HDFS together: load data from S3 into HDFS using S3DistCp. The master copy of the data stays in S3, so you get the benefits of HDFS for processing and all the benefits of S3 for durability
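The S3DistCp load can be expressed as just another EMR step; a sketch, assuming the modern `command-runner.jar` invocation and hypothetical bucket/paths:

```python
# Sketch: an EMR step that copies the master copy of the data from S3
# into the cluster's HDFS using S3DistCp. Paths are placeholders.
s3distcp_step = {
    "Name": "copy-input-to-hdfs",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["s3-dist-cp",
                 "--src", "s3://my-bucket/input/",
                 "--dest", "hdfs:///data/input/"],
    },
}
print(s3distcp_step["HadoopJarStep"]["Args"][0])  # s3-dist-cp
```

Appending this step before the processing steps gives the pattern above: S3 remains the source of truth, HDFS holds the working copy.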
  • Big Data tools: Elastic MapReduce and Redshift
  • Pattern 1 – Reporting data warehouse: OLTP/ERP data in an RDBMS is loaded into Redshift, which serves reporting and BI
  • Pattern 2 – Live archive for (structured) big data: web apps write to DynamoDB; data is archived into Redshift for reporting and BI
  • Pattern 3 – Cloud ETL for big data: data in S3 is transformed with Elastic MapReduce and loaded into Redshift for reporting and BI
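The final load in pattern 3 is typically a Redshift COPY from S3. A minimal sketch that only assembles the SQL (the table name, bucket and IAM role are hypothetical placeholders):

```python
# Sketch: build the Redshift COPY statement used to load EMR output
# from S3. All identifiers are illustrative placeholders.
def build_copy_sql(table, s3_path, iam_role):
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )

sql = build_copy_sql("events", "s3://my-bucket/etl-output/",
                     "arn:aws:iam::123456789012:role/redshift-load")
print(sql)
```

Running the statement against a cluster is a separate step; COPY pulls the files in parallel from S3, which is what makes it the standard load path for this pattern.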
  • Tool comparison:

                       Streaming   Hive       Pig        DynamoDB        Redshift
    Unstructured data  ✓                      ✓
    Structured data                ✓          ✓          ✓               ✓
    Language support   Any*        HQL        Pig Latin  Client          SQL
    SQL                                                                  ✓
    SQL-like                       ✓
    Volume             Unlimited   Unlimited  Unlimited  Relatively low  1.6 PB
    Latency            Medium      Medium     Medium     Ultra low      Low
  • Remove constraints across the whole chain: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
  • South Australia Water Data Management on AWS. Carnegie Mellon University: Dr. Murlikrishna Viswanathan, Srinivasan Vembuli, Rikio Chiba, Romeo Luka
  • Agenda: 1. Project Background; 2. Water Management in South Australia; 3. Water Data on Cloud (Case in SA); 4. Future Roadmap
  • Project Background: "Australia is the driest inhabited continent on Earth, yet is among the world's highest consumers of water." – CSIRO: Water overview
  • National Water Initiative: a shared agreement by State Governments to increase the efficiency of Australia's water use. Under this initiative, State Governments have made commitments to: (i) prepare water plans with provisions for the environment; (ii) deal with over-allocated or stressed water systems; (iii) introduce registers of water rights and standards for water accounting; (iv) expand the trade of water; (v) improve pricing for water storage and delivery; (vi) meet and manage urban water demands. http://www.nationalwatermarket.gov.au/rules-restrictions/national-rules.html
  • Water Data in South Australia (SA): the Department of Environment, Water and Natural Resources (DEWNR) collects water-related data from various sources. The data is stored in multiple systems: Hydstra (a legacy FoxPro DB) and a SQL Server data warehouse. This data is currently supplied to the Bureau of Meteorology (BOM) for its analytics applications, and to other agencies
  • [Diagram] Current process at DEWNR: field sensors and other sources produce raw data, stored in Hydstra (FoxPro DB) and SQL Server; outputs flow to a GIS application, WDTF, analysis, and a data mart
  • Water Data Transfer Format (WDTF): DEWNR and BOM exchange data generated by the current process in WDTF, a national XML standard for exchanging water information
  • Current limitations: the architecture relies on multiple systems running on legacy software (e.g. Hydstra on FoxPro), which increases costs and makes service delivery inefficient, and it does not fully utilise WDTF as the universal data format standard
  • Objectives: DEWNR wants to use data in WDTF format to generate analytical data similar to BOM's for public consumption (Open Data; the Open Technology Foundation is a facilitator for the SA Government), and to reduce system operation costs by migrating from an on-premise system to the cloud
  • Cloud-based Water Data Management & Analytics
  • Data Pipeline: zipped WDTF raw files on premise are copied and unzipped into S3 (raw files), parsed into clean CSV/JSON data (S3, Redshift), then queried to produce analysed CSV/JSON observation data (S3) that feeds an open-data web site (dashboard)
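The parse stage of the pipeline can be sketched with Python's standard XML library; the `observation`/`site`/`value` element names below are invented stand-ins, since the real WDTF schema is not shown in the deck:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sketch of the "parse" stage: turn a WDTF-style XML document into CSV
# rows. The element names are hypothetical, not the actual WDTF schema.
SAMPLE = """<observations>
  <observation><site>A4260001</site><time>2013-05-23T00:00</time><value>1.72</value></observation>
  <observation><site>A4260001</site><time>2013-05-23T01:00</time><value>1.69</value></observation>
</observations>"""

def wdtf_to_csv(xml_text):
    """Flatten observation elements into CSV with a header row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["site", "time", "value"])
    for obs in root.iter("observation"):
        writer.writerow([obs.findtext("site"),
                         obs.findtext("time"),
                         obs.findtext("value")])
    return out.getvalue()

print(wdtf_to_csv(SAMPLE).splitlines()[1])  # A4260001,2013-05-23T00:00,1.72
```

In the actual pipeline this function would read the unzipped files from S3 and write the clean CSV back to S3 (or load it into Redshift) rather than working on an in-memory string.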
  • Data Analysis
  • Future roadmap: water data from the whole of Australia; open access to water data; real-time water data analysis
  • Summary. Benefits of cloud for water data management: streamlined data management process; open data hosting; cost effective. Project progress: ☑ data migration onto cloud; ☑ real-time data analysis; ☐ open access to water data
  • 2013 AWS WWPS Summit, Canberra, Australia