
AWS Webcast - Managing Big Data in the AWS Cloud_20140924


This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.


  1. 1. Managing Big Data in the AWS Cloud Siva Raghupathy Principal Solutions Architect Amazon Web Services
  2. 2. Agenda • Big data challenges • AWS big data portfolio • Architectural considerations • Customer success stories • Resources to help you get started • Q&A
  3. 3. Data Volume, Velocity, & Variety • 4.4 zettabytes (ZB) of data exists in the digital universe today – 1 ZB = 1 billion terabytes • 450 billion transactions per day by 2020 • More unstructured data than structured data (chart: data growth from gigabytes in 1990 to zettabytes by 2020)
  4. 4. Big Data • Hourly server logs: how your systems were misbehaving an hour ago • Weekly / monthly bill: what you spent this past billing cycle • Daily customer-preferences report from your website’s click stream: tells you what deal or ad to try next time • Daily fraud reports: tell you if there was fraud yesterday Real-time Big Data • Real-time metrics: what just went wrong now • Real-time spending alerts/caps: guaranteeing you can’t overspend • Real-time analysis: tells you what to offer the current customer now • Real-time detection: blocks fraudulent use now Big Data: Best Served Fresh
  5. 5. Data Analysis Gap (chart: generated data vs. data available for analysis, 1990–2020; the gap widens) Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  6. 6. Big Data Potentially massive datasets Iterative, experimental style of data manipulation and analysis Frequently not a steady-state workload; peaks and valleys Time to results is key Hard to configure/manage AWS Cloud Massive, virtually unlimited capacity Iterative, experimental style of infrastructure deployment/usage At its most efficient with highly variable workloads Parallel compute clusters from a single data source Managed services
  7. 7. AWS Big Data Portfolio Collect / Ingest Kinesis Store Process / Analyze Visualize / Report EMR EC2 Redshift Data Pipeline S3 DynamoDB Glacier RDS Import Export Direct Connect Amazon SQS
  8. 8. Ingest: The act of collecting and storing data
  9. 9. Why Data Ingest Tools? • Data ingest tools convert random streams of data into a smaller set of sequential streams – Sequential streams are easier to process – Easier to scale – Easier to persist (diagram: many producers feed Kafka or Kinesis, which feeds multiple processing consumers)
  10. 10. Data Ingest Tools • Facebook Scribe → Data collectors • Apache Kafka → Data collectors • Apache Flume → Data movement and transformation • Amazon Kinesis → Data collectors
  11. 11. Real-time processing of streaming data High throughput Elastic Easy to use Connectors for EMR, S3, Redshift, DynamoDB Amazon Kinesis
  12. 12. Amazon Kinesis Architecture AZ AZ AZ Durable, highly consistent storage replicates data across three data centers (availability zones) Amazon Web Services Aggregate and archive to S3 Millions of sources producing 100s of terabytes per hour Front End Authentication Authorization Ordered stream of events supports multiple readers Real-time dashboards and alarms Machine learning algorithms or sliding window analytics Aggregate analysis in Hadoop or a data warehouse Inexpensive: $0.028 per million puts
  13. 13. Kinesis Stream: Managed ability to capture and store data • Streams are made of Shards • Each Shard ingests data up to 1 MB/sec, and up to 1000 TPS • Each Shard emits up to 2 MB/sec • All data is stored for 24 hours • Scale Kinesis streams by adding or removing Shards • Replay data inside the 24-hour window
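The per-shard limits above make capacity planning simple arithmetic. A minimal sketch (the function is illustrative, not part of any AWS SDK; the 1 MB/s and 1,000 records/s ingest and 2 MB/s egress figures are the ones quoted on the slide):

```python
import math

def shards_needed(write_mb_per_sec: float,
                  writes_per_sec: float,
                  read_mb_per_sec: float) -> int:
    """Estimate the number of Kinesis shards a workload needs, using the
    per-shard limits on the slide: 1 MB/s and 1000 records/s in, 2 MB/s out."""
    by_write_bw = math.ceil(write_mb_per_sec / 1.0)   # ingest bandwidth
    by_write_tps = math.ceil(writes_per_sec / 1000.0) # ingest record rate
    by_read_bw = math.ceil(read_mb_per_sec / 2.0)     # egress bandwidth
    return max(by_write_bw, by_write_tps, by_read_bw, 1)

# e.g. 5 MB/s in at 3,500 records/s, with 8 MB/s of aggregate reads
print(shards_needed(5, 3500, 8))  # 5
```

The stream is then scaled by splitting or merging shards as these inputs change.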
  14. 14. Simple Put interface to store data in Kinesis • Producers use a PUT call to store data in a Stream • PutRecord {Data, PartitionKey, StreamName} • A Partition Key is supplied by producer and used to distribute the PUTs across Shards • Kinesis MD5 hashes supplied partition key over the hash key range of a Shard • A unique Sequence # is returned to the Producer upon a successful PUT call
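The MD5-over-hash-key-range behavior described above can be illustrated in a few lines. This is a toy model that assumes evenly split shard ranges; it is not the service's actual implementation (real shards carry explicit hash key ranges you can read via DescribeStream):

```python
import hashlib

MAX_HASH = 2**128  # the Kinesis hash key space is 128-bit (MD5)

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Mimic how a partition key is mapped onto shards: MD5 the key,
    then find which evenly sized hash-key range the digest falls into."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = MAX_HASH // num_shards
    return min(h // range_size, num_shards - 1)

# The same key always lands on the same shard, which preserves per-key ordering
print(shard_for_key("user-42", 4))
```

This is why a low-cardinality or skewed partition key produces hot shards: all its records hash into the same range.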
  15. 15. Building Kinesis Processing Apps: Kinesis Client Library Client library for fault-tolerant, at-least-once, continuous processing o Java client library, source available on GitHub o Build & deploy app with KCL on your EC2 instance(s) o KCL is an intermediary between your application & stream – Automatically starts a Kinesis Worker for each shard – Simplifies reading by abstracting individual shards – Increase / decrease Workers as # of shards changes – Checkpoints to keep track of a Worker’s location in the stream; restarts Workers if they fail o Integrates with Auto Scaling groups to redistribute workers to new instances
  16. 16. Sending & Reading Data from Kinesis Streams Sending Reading HTTP Post AWS SDK LOG4J Flume Fluentd Get* APIs Kinesis Client Library + Connector Library Apache Storm Amazon Elastic MapReduce Write Read
  17. 17. AWS Partners for Data Load and Transformation Hparser, Big Data Edition Flume, Sqoop
  18. 18. Storage
  19. 19. Storage
    • Structured – Simple Query: NoSQL (Amazon DynamoDB); Cache (Amazon ElastiCache: Memcached, Redis)
    • Structured – Complex Query: SQL (Amazon RDS); Data Warehouse (Amazon Redshift); Search (Amazon CloudSearch)
    • Unstructured – No Query: Cloud Storage (Amazon S3, Amazon Glacier)
    • Unstructured – Custom Query: Hadoop/HDFS (Amazon Elastic MapReduce)
    Axes: data structure complexity vs. query structure complexity
  20. 20. Store anything Object storage Scalable Designed for 99.999999999% durability Amazon S3
  21. 21. Why is Amazon S3 good for Big Data? • No limit on the number of Objects • Object size up to 5TB • Central data storage for all systems • High bandwidth • 99.999999999% durability • Versioning, Lifecycle Policies • Glacier Integration
  22. 22. Amazon S3 Best Practices • Use random hash prefix for keys • Ensure a random access pattern • Use Amazon CloudFront for high throughput GETs and PUTs • Leverage the high durability, high throughput design of Amazon S3 for backup and as a common storage sink • Durable sink between data services • Supports de-coupling and asynchronous delivery • Consider RRS for lower cost, lower durability storage of derivatives or copies • Consider parallel threads and multipart upload for faster writes • Consider parallel threads and range get for faster reads
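The "random hash prefix" practice from the first bullet can be implemented with a deterministic hash of the natural key, which keeps keys resolvable while spreading them across S3's index partitions. A sketch (the helper name and the 4-character prefix length are my choices):

```python
import hashlib

def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash prefix to an S3 key so that
    sequential keys (dates, counters) spread across index partitions."""
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}/{key}"

# Date-ordered log keys would otherwise all share one hot prefix
print(hashed_key("logs/2014/09/24/host01.gz"))
```

Because the prefix is a hash of the key itself, readers can recompute it and still address objects directly.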
  23. 23. Aggregate All Data in S3 Surrounded by a collection of the right tools EMR Kinesis Data Pipeline Redshift DynamoDB RDS Cassandra Storm Spark Streaming Amazon S3 Amazon S3
  24. 24. Fully-managed NoSQL database service Built on solid-state drives (SSDs) Consistent low latency performance Any throughput rate No storage limits Amazon DynamoDB
  25. 25. DynamoDB Concepts: tables, items, attributes; schema-less (schema is defined per attribute)
  26. 26. DynamoDB: Access and Query Model • Two primary key options • Hash key: Key lookups: “Give me the status for user abc” • Composite key (Hash with Range): “Give me all the status updates for user ‘abc’ that occurred within the past 24 hours” • Support for multiple data types – String, number, binary… or sets of strings, numbers, or binaries • Supports both strong and eventual consistency – Choose your consistency level when you make the API call – Different parts of your app can make different choices • Global Secondary Indexes
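A composite-key query like the "status updates for user ‘abc’ in the past 24 hours" example takes the following shape in the low-level DynamoDB Query API. The table and attribute names here are hypothetical; only the request-field names follow the API:

```python
def status_updates_query(user_id: str, since_epoch: int) -> dict:
    """Build a DynamoDB Query request for a hash + range (composite) key.
    'StatusUpdates', 'UserId', and 'CreatedAt' are made-up names."""
    return {
        "TableName": "StatusUpdates",
        "KeyConditionExpression": "UserId = :u AND CreatedAt >= :t",
        "ExpressionAttributeValues": {
            ":u": {"S": user_id},           # hash key: string type
            ":t": {"N": str(since_epoch)},  # range key: number type
        },
        "ConsistentRead": True,  # consistency is chosen per API call
    }

req = status_updates_query("abc", 1411516800)
print(req["KeyConditionExpression"])
```

Note how the consistency level travels with the request, matching the slide's point that different parts of an app can make different choices.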
  27. 27. DynamoDB: High Availability and Durability
  28. 28. What does DynamoDB handle for me? • Scaling without down-time • Automatic sharding • Security inspections, patches, upgrades • Automatic hardware failover • Multi-AZ replication • Hardware configuration designed specifically for DynamoDB • Performance tuning …and a lot more
  29. 29. Amazon DynamoDB Best Practices • Keep item size small • Store metadata in Amazon DynamoDB and blobs in Amazon S3 • Use a table with a hash key for extremely high scale • Use hash-range key to model – 1:N relationships – Multi-tenancy • Avoid hot keys and hot partitions • Use table per day, week, month etc. for storing time series data • Use conditional updates
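The "table per day, week, month" suggestion for time-series data is just a naming convention plus a retention policy: expire old data by dropping whole tables instead of deleting items one by one. A sketch with an assumed naming scheme:

```python
from datetime import date

def table_for_day(prefix: str, d: date) -> str:
    """Name a DynamoDB table per day for time-series data; old tables can
    then be archived or dropped wholesale. The naming scheme is illustrative."""
    return f"{prefix}_{d:%Y_%m_%d}"

print(table_for_day("events", date(2014, 9, 24)))  # events_2014_09_24
```

Writers target today's table; readers fan out over the tables covering their query window.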
  30. 30. Relational Databases Fully managed; zero admin MySQL, PostgreSQL, Oracle & SQL Server Amazon RDS
  31. 31. Process and Analyze
  32. 32. Processing Frameworks • Batch Processing – Take large amount (>100TB) of cold data and ask questions – Takes hours to get answers back • Stream Processing (real-time) – Take small amount of hot data and ask questions – Takes short amount of time to get your answer back
  33. 33. Processing Frameworks • Batch Processing – Amazon EMR (Hadoop) – Amazon Redshift • Stream Processing – Spark Streaming – Storm
  34. 34. Columnar data warehouse ANSI SQL compatible Massively parallel Petabyte scale Fully-managed Very cost-effective Amazon Redshift
  35. 35. Amazon Redshift architecture • Leader Node – SQL endpoint – Stores metadata – Coordinates query execution • Compute Nodes – Local, columnar storage – Execute queries in parallel – Load, backup, restore via Amazon S3 – Parallel load from Amazon DynamoDB • Hardware optimized for data processing • Two hardware platforms – DW1: HDD; scale from 2TB to 1.6PB – DW2: SSD; scale from 160GB to 256TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  36. 36. Amazon Redshift Best Practices • Use COPY command to load large data sets from Amazon S3, Amazon DynamoDB, Amazon EMR/EC2/Unix/Linux hosts – Split your data into multiple files – Use GZIP or LZOP compression – Use manifest file • Choose proper sort key – Range or equality on WHERE clause • Choose proper distribution key – Join column, foreign key or largest dimension, group by column
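A COPY manifest is a small JSON document listing the input files explicitly, so a load neither misses late-arriving files nor picks up strays. A sketch of building one (the bucket and key names are placeholders):

```python
import json

def copy_manifest(s3_uris, mandatory=True):
    """Build a Redshift COPY manifest listing the input files, per the
    slide's advice to split data into multiple files and use a manifest.
    The document is an 'entries' list of {'url', 'mandatory'} objects."""
    return json.dumps(
        {"entries": [{"url": u, "mandatory": mandatory} for u in s3_uris]},
        indent=2,
    )

print(copy_manifest([
    "s3://my-bucket/data/part-000.gz",   # hypothetical bucket/keys
    "s3://my-bucket/data/part-001.gz",
]))
```

The manifest itself is uploaded to S3 and referenced from the COPY statement; `mandatory: true` makes the load fail loudly if a listed file is missing.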
  37. 37. Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  38. 38. How Does EMR Work? 1. Put the data into S3 2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase 3. Launch the cluster using the EMR console, CLI, SDK, or APIs 4. Get the output from S3
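Steps 2 and 3 above boil down to one API call. A hedged sketch of the cluster definition in the shape of the EMR RunJobFlow request; the field values (instance types, bucket, applications) are placeholder choices, so check the EMR API reference before relying on the exact field set:

```python
def cluster_request(name: str, core_nodes: int) -> dict:
    """Assemble an EMR cluster definition (RunJobFlow-style request shape).
    All concrete values below are illustrative assumptions."""
    return {
        "Name": name,
        "LogUri": "s3://my-bucket/emr-logs/",  # hypothetical log bucket
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 1 + core_nodes,   # one master + core nodes
        },
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
    }

req = cluster_request("nightly-clickstream", core_nodes=9)
print(req["Instances"]["InstanceCount"])  # 10
```

Because input and output live in S3 (steps 1 and 4), the cluster itself is disposable: resize it, run several in parallel against the same data, or shut it down when the job ends.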
  39. 39. How Does EMR Work? You can easily resize the cluster and launch parallel clusters using the same data
  40. 40. How Does EMR Work? Use Spot nodes to save time and money
  41. 41. The Hadoop Ecosystem works inside of EMR
  42. 42. Amazon EMR Best Practices • Balance transient vs persistent clusters to get the best TCO • Leverage Amazon S3 integration – Consistent View for EMRFS • Use Compression (LZO is a good pick) • Avoid small files (< 100MB; s3distcp can help!) • Size cluster to suit each job • Use EC2 Spot Instances
  43. 43. Amazon EMR Nodes and Size • Tuning cluster size can be more efficient than tuning Hadoop code • Use m1 and c1 family for functional testing • Use m3 and c3 xlarge and larger nodes for production workloads • Use cc2/c3 for memory and CPU intensive jobs • hs1, hi1, i2 instances for HDFS workloads • Prefer a smaller cluster of larger nodes
  44. 44. Partners – Analytics (Scientific, algorithmic, predictive, etc)
  45. 45. Visualize
  46. 46. Partners - BI & Data Visualization
  47. 47. Putting All The AWS Data Tools Together & Architectural Considerations
  48. 48. One tool to rule them all
  49. 49. Data Characteristics: Hot, Warm, Cold
                  Hot        Warm       Cold
    Volume        MB–GB      GB–TB      PB
    Item size     B–KB       KB–MB      KB–TB
    Latency       ms         ms, sec    min, hrs
    Durability    Low–High   High       Very High
    Request rate  Very High  High       Low
    Cost/GB       $$-$       $-¢¢       ¢
  50. 50. Service comparison
    Service             Average latency       Data volume         Item size          Request rate              Cost ($/GB/month)  Durability
    ElastiCache         ms                    GB                  B-KB               Very High                 $$                 Low - Moderate
    Amazon DynamoDB     ms                    GB-TB (no limit)    B-KB (64 KB max)   Very High                 ¢¢                 Very High
    Amazon RDS          ms, sec               GB-TB (3 TB max)    KB (~row size)     High                      ¢¢                 High
    Amazon CloudSearch  ms, sec               GB-TB               KB (1 MB max)      High                      $                  High
    Amazon Redshift     sec, min              TB-PB (1.6 PB max)  KB (64 KB max)     Low                       ¢                  High
    Amazon EMR (Hive)   sec, min, hrs         GB-PB (~nodes)      KB-MB              Low                       ¢                  High
    Amazon S3           ms, sec, min (~size)  GB-PB (no limit)    KB-GB (5 TB max)   Low-Very High (no limit)  ¢                  Very High
    Amazon Glacier      hrs                   GB-PB (no limit)    GB (40 TB max)     Very Low (no limit)       ¢                  Very High
  51. 51. Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  52. 52. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month DynamoDB or S3? 300 2,048 1,483 777,600,000
  53. 53. Scenario 1 (300 writes/sec, 2,048-byte objects, 1,483 GB/month, 777,600,000 objects/month): use Amazon DynamoDB. Scenario 2 (300 writes/sec, 32,768-byte objects, 23,730 GB/month, 777,600,000 objects/month): use Amazon S3.
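The scenario numbers above can be reproduced with back-of-the-envelope arithmetic (assuming a 30-day month and binary gigabytes):

```python
def monthly_objects(writes_per_sec: float, days: int = 30) -> int:
    """Objects written per month at a steady request rate."""
    return int(writes_per_sec * 86_400 * days)

def monthly_gb(writes_per_sec: float, object_bytes: int, days: int = 30) -> float:
    """Total data volume per month, in GiB."""
    return monthly_objects(writes_per_sec, days) * object_bytes / 2**30

print(monthly_objects(300))           # 777600000 objects/month
print(round(monthly_gb(300, 2048)))   # ~1483 GB (scenario 1)
print(round(monthly_gb(300, 32768)))  # ~23730 GB (scenario 2)
```

The crossover is driven by object size: at 2 KB the per-request overhead of S3 dominates and DynamoDB is cheaper; at 32 KB the storage volume dominates and S3 wins.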
  54. 54. Lambda Architecture
  55. 55. Putting it all together De-coupled architecture • Multi-tier data processing architecture • Ingest & Store de-coupled from Processing • Ingest tools write to multiple data stores • Processing frameworks (Hadoop, Spark, etc.) read from data stores • Consumers can decide which data store to read from depending on their data processing requirement
  56. 56. Hot Data Temperature Cold Spark Streaming / Storm Redshift Impala Spark EMR/ Hadoop Redshift EMR/ Hadoop Spark Kinesis/ Kafka Data NoSQL / DynamoDB / Hadoop HDFS S3 Low Latency High Answers
  57. 57. Customer Use Cases
  58. 58. Automatic spelling corrections Autocomplete Search Recommendations
  59. 59. A look at how it works Data Analyzed Using EMR: Months of user history Common misspellings Weste Winstin Westa Whenstin Automatic spelling corrections
  60. 60. Yelp web site log data goes into Amazon S3 Months of user search data Search terms Misspellings Final click throughs Amazon S3
  61. 61. Amazon Elastic MapReduce spins up a 200 node Hadoop cluster Hadoop Cluster Amazon S3 Amazon EMR
  62. 62. All 200 nodes of the cluster simultaneously look for common misspellings Hadoop Cluster Amazon S3 Amazon EMR Westen Wistin Westan
  63. 63. A map of common misspellings and suggested corrections are loaded back into Amazon S3. Hadoop Cluster Amazon S3 Amazon EMR Westen Wistin Westan
  64. 64. Then the cluster is shut down Yelp only pays for the time they used it Hadoop Cluster Amazon S3 Amazon EMR
  65. 65. Each of Yelp’s 80 Engineers Can Do This Whenever They Have a Big Data Problem: Yelp spins up over 250 Hadoop clusters per week in EMR.
  66. 66. Data Innovation Meets Action at Scale at NASDAQ OMX • NASDAQ’s technology powers more than 70 marketplaces in 50 countries • NASDAQ’s global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds • NASDAQ owns & operates 26 markets, including 3 clearinghouses & 5 central securities depositories • More than 5,500 structured products are tied to NASDAQ’s global indexes with a notional value of at least $1 trillion • NASDAQ powers 1 in 10 of the world’s securities transactions
  67. 67. NASDAQ’s Big Data Challenge • Archiving Market Data – A classic “Big Data” problem • Power Surveillance and Business Intelligence/Analytics • Minimize Cost – Not only infrastructure, but development/IT labor costs too • Empower the business for self-service
  68. 68. SIP Total Monthly Message Volumes: OPRA, UQDF and CQS – Market Data Is Big Data
    Charts courtesy of the Financial Information Forum. Redistribution without permission from FIF prohibited, email: fifinfo@fif.com
    NASDAQ Exchange Daily Peak Messages

    Total Monthly Message Volume (UQDF and CQS)
    Date      UQDF           CQS             Combined Average Daily Volume
    Aug-12    2,317,804,321  8,241,554,280   459,102,548
    Sep-12    1,948,330,199  7,452,279,225   494,768,917
    Oct-12    1,016,336,632  7,452,279,225   403,267,422
    Nov-12    2,148,867,295  9,552,313,807   557,199,100
    Dec-12    2,017,355,401  8,052,399,165   503,487,728
    Jan-13    2,099,233,536  7,474,101,082   455,873,077
    Feb-13    1,969,123,978  7,531,093,813   500,011,463
    Mar-13    2,010,832,630  7,896,498,260   495,366,545
    Apr-13    2,447,109,450  9,805,224,566   556,924,273
    May-13    2,400,946,680  9,430,865,048   537,809,624
    Jun-13    2,601,863,331  11,062,086,463  683,197,490
    Jul-13    2,142,134,920  8,266,215,553   473,106,840
    Aug-13    2,188,338,764  9,079,813,726   512,188,750

    Total Monthly Message Volume (OPRA)
    Date      OPRA             Average Daily Volume
    Aug-12    80,600,107,361   3,504,352,494
    Sep-12    77,303,404,427   4,068,600,233
    Oct-12    98,407,788,187   4,686,085,152
    Nov-12    104,739,265,089  4,987,584,052
    Dec-12    81,363,853,339   4,068,192,667
    Jan-13    82,227,243,377   3,915,583,018
    Feb-13    87,207,025,489   4,589,843,447
    Mar-13    93,573,969,245   4,678,698,462
    Apr-13    123,865,614,055  5,630,255,184
    May-13    134,587,099,561  6,117,595,435
    Jun-13    162,771,803,250  8,138,590,163
    Jul-13    120,920,111,089  5,496,368,686
    Aug-13    136,237,441,349  6,192,610,970

    Annual change: OPRA +69%, CQS +10%, UQDF −6%
  69. 69. NASDAQ’s Legacy Solution • On-premises MPP DB – Relatively expensive, finite storage – Required periodic additional expenses to add more storage – Ongoing IT (administrative) human costs • Legacy BI tool – Requires developer involvement for new data sources, reports, dashboards, etc.
  70. 70. New Solution: Amazon Redshift • Cost Effective – Redshift is 43% of the cost of legacy • Assuming equal storage capacities – Doesn’t include IT ongoing costs! • Performance – Outperforms NASDAQ’s legacy BI/DB solution – Insert 550K rows/second on a 2 node 8XL cluster • Elastic – NASDAQ can add additional capacity on demand, easy to grow their cluster
  71. 71. New Solution: Pentaho BI/ETL • Amazon Redshift partner – http://aws.amazon.com/redshift/partn ers/pentaho/ • Self Service – Tools empower BI users to integrate new data sources, create their own analytics, dashboards, and reports without requiring development involvement • Cost effective
  72. 72. Net Result • New solution is cheaper, faster, and offers capabilities that NASDAQ didn’t have before – Empowers NASDAQ’s business users to explore data like they never could before – Reduces IT and development as bottlenecks – Margin improvement (expense reduction and supports business decisions to grow revenue)
  73. 73. NEXT STEPS
  74. 74. AWS is here to help Solution Architects Professional Services Premium Support AWS Partner Network (APN)
  75. 75. aws.amazon.com/partners/competencies/big-data Partner with an AWS Big Data expert
  76. 76. Big Data Case Studies Learn from other AWS customers aws.amazon.com/solutions/case-studies/big-data
  77. 77. AWS Marketplace AWS Online Software Store aws.amazon.com/marketplace Shop the big data category
  78. 78. AWS Public Data Sets Free access to big data sets aws.amazon.com/publicdatasets
  79. 79. AWS Grants Program AWS in Education aws.amazon.com/grants
  80. 80. AWS Big Data Test Drives APN Partner-provided labs aws.amazon.com/testdrive/bigdata
  81. 81. AWS Training & Events Webinars, Bootcamps, and Self-Paced Labs aws.amazon.com/events https://aws.amazon.com/training
  82. 82. Big Data on AWS Course on Big Data aws.amazon.com/training/course-descriptions/bigdata
  83. 83. reinvent.awsevents.com
  84. 84. aws.amazon.com/big-data
  85. 85. sivar@amazon.com Thank You!

Editor's Notes

  • Organized the deck so that the partner slide in each section closes that section.
  • 2 x 2 Matrix
    Structured
    Level of query (from none to complex)

    Draw down the slide
  • Transition Statement – RDBMS is still a viable and important component in Big Data Architecture

    Traditional SQL Database

    Fully managed which means zero admin

    Most popular flavors

    Binary compatible
  • Generally come in two major types
    Batch
    Streaming
  • Examples
  • Needs a transition statement – Looking at AWS Portfolio in context of Processing ….

    Columnar data warehouse Massively parallel (MPP)
    Petabyte scale Fully managed $1,000/TB/Year (with Heavy RI)
  • Leader node
    Compute Node
    Hardware optimized
    Two different hardware platforms (SSD and HDD)
    Parallel Load
    API (of course)
  • Copy
    Split files into 1 to 2 GB compressed
    Use manifest file
    Sort keys
    Distribution keys
    System has option to make educated guess
  • Regular Hadoop/HDFS
    Support for popular add-ons
    Fully managed and easy to use
    On demand and SPOT pricing
    Integrated with other AWS services
    S3
    DDB
    Kinesis

    Bootstrap capabilities have most flexibility at the layer above core Hadoop/HDFS
  • Popular pattern
    1-Customer puts data into S3
    2-Make some decisions about what to run (type, number and other technologies to install)
    3-Use CLI, SDK, Console or API to launch
    4-Output is sent to S3

    Call out S3 integration as an important innovation and addition

  • Time to resize is going to be a combination of EC2/AMI boot time + the bootstrap options.
  • Call out that the nodes that are added to a running cluster that are SPOT must be task nodes (details)

    Additional nodes to a running cluster that are SPOT

    S3DistCp to load/unload from HDFS

    Shut down the cluster (stop being charged, except for the data kept in S3)
  • Core Hadoop is:
    Map Reduce – Computational Model
    HDFS – Hadoop Distributed File System
    Additional Tools have entered the eco system
    Tools to help get data into Hadoop
    Tools to connect to Relational Systems
    Monitoring
    Machine Learning
    This slide is a small slice
  • EMRFS
    all of your files will be processed as intended when you run a chained series of MapReduce jobs. This is not a replacement file system. Instead, it extends the existing file system with mechanisms that are designed to detect and react to inconsistencies. The detection and recovery process includes a retry mechanism. After it has reached a configurable limit on the number of retries (to allow S3 to return what EMRFS expects in the consistent view), it will either (your choice) raise an exception or log the issue and continue.
    The EMRFS consistent view creates and uses metadata in an Amazon DynamodB table to maintain a consistent view of your S3 objects. This table tracks certain operations but does not hold any of your data. The information in the table is used to confirm that the results returned from an S3 LIST operation are as expected, thereby allowing EMRFS to check list consistency and read-after-write consistency.
    Compression
    Always Compress Data Files On Amazon S3
    Reduces Bandwidth Between Amazon S3 and Amazon EMR
    Speeds Up Your Job
    Compress Mappers and Reducer Output
    Advise Compressing all files for an instance for a day

  • Do not use smaller nodes for production workloads unless you’re 100% sure you know what you’re doing. The majority of jobs I’ve seen require more CPU and memory than the smaller instances have to offer, and most of the time this causes job failures if the cluster is not fine-tuned. Instead of spending time fine-tuning small nodes, get a larger node and run your workload with peace of mind. Anything m1.xlarge and larger is a good candidate: m1.xlarge, c1.xlarge, m2.4xlarge, and all cluster compute instances are good choices.
  • To summarize the review of the AWS Big Data Portfolio
    There’s no single tool that can do every job needed
  • Emphasize that this is an “aid” for the design process used to compare options.

    In my role as an SA it helps to have a heuristic tool to think about the requirements
    Is the data HOT, Warm or cold

    As a designer – by asking various questions can slot the data into one of these buckets

    Less of a rule and more of a guideline


  • This material in customer’s own words
    http://www.youtube.com/watch?v=j7uZGgSxJGM&t=3m0s
  • Access to AWS expertise
    Solution Architects and security experts
    Architecture reviews and best practices
    Case studies and sample architectures
    http://aws.amazon.com/solutions/case-studies/
    http://aws.amazon.com/architecture/
    Training and Premium Support
    Four support tiers, including Enterprise: http://aws.amazon.com/premiumsupport/
    Trusted Advisor service for cost optimization
    AWS Professional Services
    Domain and product experts

  • APN Competency Program (to Customers)
    What: The APN Competency Program is designed to provide AWS Customers with top quality APN Partners who have demonstrated technical proficiency and proven success in specialized solutions areas. Partners who’ve attained an APN Competency offer a variety of services, software, and solutions on the AWS Cloud.
    Big Data specific:
      Who: Big Data Competency Partners help customers evaluate and use the tools, techniques, and technologies of working with data productively, at any scale. Learn more about Big Data Competency Partners: aws.amazon.com/partners/competencies/big-data/
    Announcement – Big Data Competency for APN Technology Partners
      Technology Partners for launch: Microstrategy, SAP, Informatica, SnapLogic, Tableau Software, Attunity, MapR, Sumo Logic, Splunk
      Find out more information on these partners on the Big Data Competency Partners page (aws.amazon.com/partners/competencies/big-data)
    Advise customers to take advantage of these partners’ software and solutions in Big Data on AWS – they’ve been qualified by Partner Teams and Service Teams, have Marketplace software, built Test Drives, etc.

    APN Competency Program (to Partners)
    What: The APN Competency program is designed to highlight APN Partners who have demonstrated technical proficiency and proven customer success in specialized solution areas. Attaining an APN Competency allows partners to differentiate themselves to customers by showcasing expertise in a specific solution area.
    Banner attached – learn more about APN Competencies
    Learn more about the program: APN Competency Program (aws.amazon.com/partners/competencies)
    Learn more about APN Competency Partners:
      SAP (aws.amazon.com/partners/competencies/sap)
      Oracle (aws.amazon.com/partners/competencies/oracle)
      Big Data (aws.amazon.com/partners/competencies/big-data)
      MSP (aws.amazon.com/partners/competencies/msp)
      Microsoft (aws.amazon.com/partners/competencies/microsoft)
    Announcement – Big Data Competency for APN Technology Partners
      Technology Partners for launch: Microstrategy, SAP, Informatica, SnapLogic, Tableau Software, Attunity, MapR, Sumo Logic, Splunk
      Find out more information on these partners on the Big Data Competency Partners page (aws.amazon.com/partners/competencies/big-data)
  • Life technologies
    LinkedIn
    DropCam
    ICRAR
    CDC
    Channel4
    Yelp
    Nokia
  • AWS Marketplace is the AWS Online Software Store
    Customer can find, research, buy software including a wide variety of big data options and software to help you manage your databases

    With AWS Marketplace, the simple hourly pricing of most products aligns with EC2 usage model

    You can find, purchase and 1-Click launch in minutes, making deployment easy

    Marketplace billing integrated into your AWS account

    1300+ product listings across 25 categories

    Description: Attunity CloudBeam for Amazon Redshift (Express) enables organizations to simplify, automate, and accelerate bulk data loading from database sources (Oracle, Microsoft SQL Server, and MySQL) to Amazon Redshift. Attunity CloudBeam allows your team to avoid the heavy lifting of manually extracting data, transferring via API/script, chopping, staging, and importing.
  • We will provide researchers and professors of accredited schools and universities with free access to AWS to accelerate science and discovery.

    With AWS in Education, educators, academic researchers, and students can apply to obtain free usage credits to tap into the on-demand infrastructure of the Amazon Web Services cloud to teach advanced courses, tackle research endeavors, and explore new projects – tasks that previously would have required expensive up-front and ongoing investments in infrastructure.
  • Microstrategy
    Splunk
    QlikView
    EMR
    Pig
    MongoDB
    Oracle BI, OBIEE 11g
    SAP Hana
    Yellowfin BI
  • Speaker Notes:

    We have just released “Big Data on AWS”, a new technical training course for individuals who are responsible for implementing big data environments, namely Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects. This course is designed to teach technical end users how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive. We also cover how to create big data environments, work with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.

    Upcoming classes include:

    Audience
    Individuals responsible for implementing big data environments: Data Scientists, Data Analysts, and Enterprise Big Data Solution Architects
    Objectives
    Understand the architecture of an Amazon EMR cluster
    Choose appropriate AWS data storage options for use with Amazon EMR
    Know your options for ingesting, transferring, and compressing data for use with Amazon EMR
    Use common programming frameworks for Amazon EMR including Hive, Pig, and Streaming
    Work with Amazon Redshift and Spark/Shark to implement big data solutions
    Leverage big data visualization software
    Choose appropriate security and cost management options for Amazon EMR
    Understand the benefits of using Amazon Kinesis for big data
    Prerequisites
    Basic familiarity with big data technologies, including Apache Hadoop and HDFS
    Knowledge of big data technologies such as Pig, Hive, and MapReduce helpful, but not required
    Working knowledge of core AWS services and public cloud implementation
    AWS Essentials course completion or equivalent experience
    Basic understanding of data warehousing, relational database systems, and database design
    Format
    Instructor-Led & Hands-on Labs
    Duration
    3 days
    Details
    aws.amazon.com/training/course-descriptions/bigdata/



    Big Data on AWS
    Big Data on AWS introduces you to cloud-based big data solutions and Amazon Elastic MapReduce (EMR), the AWS big data platform. In this course, we show you how to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Pig and Hive. We also teach you how to create big data environments, work with Amazon DynamoDB and Amazon Redshift, understand the benefits of Amazon Kinesis, and leverage best practices to design big data environments for security and cost-effectiveness.
    Intended Audience
    This course is intended for:
    Partners and customers responsible for implementing big data environments, including: Data Scientists
    Data Analysts
    Enterprise, Big Data Solution Architects
    Prerequisites
    We recommend that attendees of this course have:
    Basic familiarity with big data technologies, including Apache Hadoop and HDFS.
    Knowledge of big data technologies such as Pig, Hive, and MapReduce is helpful but not required
    Working knowledge of core AWS services and public cloud implementation.
    Students should complete the AWS Essentials course or have equivalent experience: http://aws.amazon.com/training/course-descriptions/essentials/
    Basic understanding of data warehousing, relational database systems, and database design
    Delivery Method
    Instructor-Led Training (ILT)
    Hands-on Labs on AWS
    Hands-On Activity
    This course allows you to test new skills and apply knowledge to your working environment through a variety of practical exercises.
    Duration
    3 days
    Course Outline
    Day 1
    Overview of Big Data and Apache Hadoop
    Benefits of Amazon EMR
    Amazon EMR Architecture
    Using Amazon EMR
    Launching and Using an Amazon EMR Cluster
    High-Level Apache Hadoop Programming Frameworks
    Using Hive for Advertising Analytics
    Day 2
    Other Apache Hadoop Programming Frameworks
    Using Streaming for Life Sciences Analytics
    Overview: Spark and Shark for In-Memory Analytics
    Using Spark and Shark for In-Memory Analytics
    Managing Amazon EMR Costs
    Overview of Amazon EMR Security
    Exploring Amazon EMR Security
    Data Ingestion, Transfer, and Compression
    Day 3
    Using Amazon Kinesis for Real-Time Big Data Processing
    AWS Data Storage Options
    Using DynamoDB with Amazon EMR
    Overview: Amazon Redshift and Big Data
    Using Amazon Redshift for Big Data
    Visualizing and Orchestrating Big Data
    Using Tableau Desktop or Jaspersoft BI to Visualize Big Data
    By the end of this course, you will be able to:
    Understand Apache Hadoop in the context of Amazon EMR
    Understand the architecture of an Amazon EMR cluster
    Launch an Amazon EMR cluster using an appropriate Amazon Machine Image and Amazon EC2 instance types
    Choose appropriate AWS data storage options for use with Amazon EMR
    Know your options for ingesting, transferring, and compressing data for use with Amazon EMR
    Use common programming frameworks available for Amazon EMR including Hive, Pig, and Streaming
    Work with Amazon Redshift to implement a big data solution
    Leverage big data visualization software
    Choose appropriate security options for Amazon EMR and your data
    Perform in-memory data analysis with Spark and Shark on Amazon EMR
    Choose appropriate options to manage your Amazon EMR environment cost-effectively
    Understand the benefits of using Amazon Kinesis for big data

  • Sign Up
    Big Data & HPC track with over 20 sessions
    Link to reinvent
    20 + sessions on big data and high performance computing


  • Again mention survey.