
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)

Join us for this general session where AWS big data experts present an in-depth look at the current state of big data. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data announcements, as we kick off the Big Data re:Source Mini Con.

Published in: Technology

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. BDM205: Big Data Mini Con State of the Union. Roger Barga, AWS. November 29, 2016
  2. 2. What is Big Data? When your data sets become so large and complex that you have to start innovating around how to collect, store, process, analyze, and share them.
  3. 3. Big Data services on AWS. Stages: Collect, Store, Process & Analyze. Services shown: Amazon EMR, Amazon EC2, Amazon Glacier, Amazon S3, AWS Import/Export, AWS Direct Connect, Amazon Kinesis, Amazon Machine Learning, Amazon Redshift, Amazon DynamoDB, Amazon Kinesis Analytics, Amazon QuickSight, AWS Database Migration Service, AWS Data Pipeline, Amazon RDS, Aurora, Amazon Elasticsearch Service.
  4. 4. Store anything Object storage Highly scalable 99.999999999% durability Amazon S3 Collection and storage
  5. 5. Petabyte-scale data transfer service that uses Amazon-provided storage devices for transport. Copy up to 80TB of data from an on-premises file system to the Snowball through a 10Gbps network interface. All data is encrypted with 256-bit encryption. AWS Import/Export Snowball. Collection and storage. E-ink shipping label; ruggedized case (“8.5G Impact”); 50TB & 80TB; 10G network
  6. 6. Relational data warehouse Massively parallel; Petabyte scale Fully managed HDD and SSD Platforms $1,000/TB/Year; start at $0.25/hour Amazon Redshift Structured data processing
  7. 7. Hadoop as a service Spark, Presto, Flink, HBase, Hive, etc. Easy to use; fully managed On-demand and Spot pricing HDFS & S3 file systems Amazon EMR Semi-structured / unstructured data processing
  8. 8. Distributed search and analytics engine Managed service using Elasticsearch and Kibana Fully managed - zero admin Highly available and reliable Tightly integrated with other AWS services Amazon Elasticsearch Service Semi-structured / unstructured data processing
  9. 9. Serverless compute service that runs your code in response to events. Extend AWS services with user-defined custom logic. Pay only for the requests served and compute time required - billing in increments of 100 milliseconds AWS Lambda Serverless event processing
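The 100-millisecond billing increment mentioned on the Lambda slide can be illustrated with a small calculation. This is a minimal sketch: the function name `billed_duration_ms` is hypothetical, and it models only the rounding rule stated on the slide, not actual prices.

```python
import math

BILLING_INCREMENT_MS = 100  # increment stated on the slide


def billed_duration_ms(actual_ms: float) -> int:
    """Round a positive invocation run time up to the next 100 ms increment."""
    return math.ceil(actual_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


for run_time in (1, 100, 101, 250):
    print(run_time, "ms of run time is billed as", billed_duration_ms(run_time), "ms")
```

Under this rule, short functions pay for at least one full increment, so trimming a function from 101 ms to 99 ms would halve its billed duration.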
  11. 11. Streams: Build your own custom application to process streaming data using the Amazon Kinesis Client Library. Connectors to S3, DynamoDB, Lambda, Amazon Redshift, Elasticsearch, Storm spout,… Firehose: Load massive volumes of streaming data into S3, Amazon Redshift, Elasticsearch. Inline processing using Lambda and a library of ready-to-use templates. Analytics: Analyze streaming data using standard SQL, no servers to manage, elastically scale, pay as you go. Amazon Kinesis Streaming data processing
  12. 12. Fast, powered by SPICE, automatically scales. Explore, analyze, share insights with anyone. 1/10th the cost of traditional BI solutions. Broad connectivity with AWS data services, on-premises data, files and business applications. Amazon QuickSight Visualize and explore Amazon RDS Amazon S3 Amazon Redshift
  13. 13. Putting it together Scale
  14. 14. Scale as your data and business grows The volume, variety, and velocity at which data is being generated are leaving organizations with new questions to answer, such as:
  15. 15. Store and analyze all your data, structured and unstructured, from all of your sources, in one centralized location at low cost. Quickly ingest data without needing to force it into a pre-defined schema, enabling ad-hoc analysis by applying schemas on read, not write. Separating your storage and compute allows you to scale each component as required and to attach multiple data processing and analytics services to the same data set. Scale: S3 Data Lake
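Schema on read, as described on the data lake slide, can be sketched in a few lines: raw records are stored untouched, and each consumer applies its own schema when it reads them. The record contents, field names, and the `read_with_schema` helper below are hypothetical and purely illustrative; they are not taken from the talk.

```python
import json

# Raw events land in the lake as-is; no schema is enforced at write time.
# (Hypothetical sample records.)
RAW_LINES = [
    '{"user": "a1", "amount": "12.50", "device": "iOS"}',
    '{"user": "b2", "amount": "3", "country": "UK"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading raw JSON lines.

    Fields missing from a record become None; fields outside the schema
    are ignored.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        })
    return rows


# Two consumers read the same stored data with different schemas.
billing_view = read_with_schema(RAW_LINES, {"user": str, "amount": float})
device_view = read_with_schema(RAW_LINES, {"user": str, "device": str})
print(billing_view)
print(device_view)
```

Because the stored lines never change, a new consumer can add its own view later without rewriting or reloading the data.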
  16. 16. Implementing a Data Lake on AWS Elasticsearch
  17. 17. Starting small is powerful, when you can scale up fast. Scaling up your analytics systems (With AWS vs. Traditional IT*):
     Get a new BI server: 20 minutes vs. 3 months
     Upgrade your analytics server to the newest Intel processors and add 16GB memory: 10 minutes vs. 2 months
     Add 500TB of storage: instant vs. 2 months
     Grow a DWH cluster from 8GB to 1PB: 1 hour vs. 8 months
     Build a 1024-node Hadoop cluster: 30 minutes vs. unlikely
     Roll out multi-region production environment: hours vs. months
     * Actual provisioning times in a well-organized IT division
  18. 18. Netflix: Using Amazon S3 as the fabric of our big data ecosystem Tuesday, Nov. 29 5:30pm – 6:30pm Mirage, St. Croix B
  19. 19. Putting it together Cost
  21. 21. Putting it together: cost. How much would it cost to process the Twitter fire hose? S3: $0.025/GB-Mo. Redshift: starts at $0.25/hour. EC2: starts at $0.02/hour. Glacier: $0.007/GB-Mo. Kinesis: $0.015/hour per shard (1MB/s in; 2MB/s out); $0.014/million PUTs
  22. 22. Cost: 500MM tweets/day = ~5,800 tweets/sec. At 2k/tweet, that is ~12MB/sec (~1TB/day). Kinesis pricing: $0.015/hour per shard, $0.014/million PUTs. Amazon Kinesis cost is $0.47/hour. Amazon Redshift cost is $0.850/hour (for a 2TB node). S3 cost is $1.02/hour (no compression). Total: $2.34/hour (on demand)
  23. 23. Use only the services you need. Scale only the services you need. Pay for only what you use. Discounts through Reserved Instances, instance types including Spot, and upfront commitments. Cost
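The Kinesis figure on the cost slide can be reproduced from the listed prices. The slide does not show its working, so the sketch below makes two labeled assumptions: "2k" is read as 2,000 bytes, and each tweet is sent as one PUT. The Redshift and S3 hourly figures are taken directly from the slide rather than derived.

```python
import math

TWEETS_PER_DAY = 500_000_000
TWEET_BYTES = 2_000                     # assumption: "2k/tweet" read as 2,000 bytes
SHARD_INGEST_BYTES_PER_SEC = 1_000_000  # 1MB/s in per shard, from the slide
SHARD_PRICE_PER_HOUR = 0.015
PUT_PRICE_PER_MILLION = 0.014

tweets_per_sec = TWEETS_PER_DAY / 86_400
ingest_bytes_per_sec = tweets_per_sec * TWEET_BYTES

# Enough shards to absorb the ingest rate at 1MB/s each.
shards = math.ceil(ingest_bytes_per_sec / SHARD_INGEST_BYTES_PER_SEC)

shard_cost = shards * SHARD_PRICE_PER_HOUR
# Assumption: one PUT per tweet.
put_cost = tweets_per_sec * 3_600 / 1_000_000 * PUT_PRICE_PER_MILLION
kinesis_cost = shard_cost + put_cost

REDSHIFT_COST = 0.85  # $/hour, taken directly from the slide
S3_COST = 1.02        # $/hour, taken directly from the slide
total = kinesis_cost + REDSHIFT_COST + S3_COST

print(f"{tweets_per_sec:,.0f} tweets/sec need {shards} shards")
print(f"Kinesis ${kinesis_cost:.2f}/hour, total ${total:.2f}/hour")
```

Most of the Kinesis cost in this sketch comes from PUT charges rather than shard hours, which suggests that batching several tweets into each PUT would be the first place to look for savings.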
  24. 24. Putting it together Scale and security
  25. 25. Putting it together: scale and security. FINRA: Monitor and enforce trading regulations. FINRA handles approximately 75 billion market events every day to build a holistic picture of trading in the U.S. Hundreds of surveillance algorithms run against massive amounts of data. FINRA mission: deter misconduct by enforcing the rules; detect and prevent wrongdoing in U.S. markets; discipline those who break the rules. Scale brings unique challenges: market volumes are volatile and increasing; exchanges are dynamically evolving; regulatory rules are created and enhanced; new securities products are introduced; market manipulators innovate.
  26. 26. Petabytes of data are generated on premises, brought to AWS, and stored in an S3 data lake. Thousands of analytical queries are performed on EMR and Redshift. Over 400 analytics packages. Stringent security requirements are met by leveraging VPC, VPN, encryption at rest and in transit, AWS CloudTrail, and database auditing. Platform that adapts to market dynamics (diagram labels): Flexible Interactive Queries, Predefined Queries, Surveillance Analytics, Data Management, Data Movement, Data Registration, Version Management, Amazon S3, Web Applications, Analysts; Regulators, Amazon EMR, Amazon Redshift.
  27. 27. Store an exabyte of data or more in S3 Analyze GB to PB using standard tools Encryption of all data at each step Auditability of all APIs and retrievals Control egress and ingress points using VPCs Scale and security FINRA: Building a Secure Data Science Platform on AWS Tuesday, Nov. 29 4:00pm – 5:00pm Mirage, St. Croix B
  28. 28. Putting it together Agility and actionable insights
  29. 29. Actionable insights Demonstration http://amzn.to/bigdata Access from a mobile device…
  30. 30. What item most interests you this week? What item will be the most difficult to explain to your significant other when you return home? What will give you the biggest headache this week? New Amazon Web Services Blackjack Networking with Peers re:Play Party
  31. 31. What item most interests you this week? What are your colleagues most interested in hearing about when you return next week? What will give you the biggest headache this week? New Amazon Web Services Blackjack Networking with Peers re:Play Party
  33. 33. Demo architecture (diagram labels): Amazon Cognito, Kinesis Ingestion Stream, Kinesis Analytics, Kinesis Aggregate Stream, Lambda Function, DynamoDB Table, S3 Bucket (HTML, JavaScript). Data flows: Raw Device and Quadrant Data; Aggregated Data. Query shown: SELECT ROWTIME, userId, COUNT(*) FROM STREAM GROUP BY userId, FLOOR(ROWTIME to SECOND)
  34. 34. The demo application
     -- Output stream: one row of counts per one-second window.
     CREATE OR REPLACE STREAM DESTINATION_SQL_STREAM (
         UNIQUE_USER_COUNT INT,
         ANDROID_COUNT INT, IOS_COUNT INT, WINDOWS_PHONE_COUNT INT, OTHER_OS_COUNT INT,
         QUADRANT_A_COUNT INT, QUADRANT_B_COUNT INT, QUADRANT_C_COUNT INT, QUADRANT_D_COUNT INT,
         WINDOW_TIME TIMESTAMP);

     -- Intermediate stream: de-duplicated user rows per second.
     CREATE OR REPLACE STREAM DISTINCT_USER_STREAM (
         COGNITO_ID VARCHAR(64), DEVICE VARCHAR(32), OS VARCHAR(32),
         QUADRANT char(1), DT TIMESTAMP);

     -- Pump 1: copy distinct rows from the source stream, truncating the row time to the second.
     CREATE OR REPLACE PUMP "DISTINCT_USER_PUMP" AS
       INSERT INTO "DISTINCT_USER_STREAM"
       SELECT STREAM DISTINCT "cognitoId", "device", "os", "quadrant",
              FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO SECOND)
       FROM "SOURCE_SQL_STREAM_001";

     -- Pump 2: conditional counts per operating system and quadrant for each one-second window.
     CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
       INSERT INTO "DESTINATION_SQL_STREAM"
       SELECT STREAM
         COUNT("DISTINCT_USER_STREAM".COGNITO_ID) AS UNIQUE_USER_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'Android' THEN COGNITO_ID ELSE null END)) AS ANDROID_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'iOS' THEN COGNITO_ID ELSE null END)) AS IOS_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'Windows Phone' THEN COGNITO_ID ELSE null END)) AS WINDOWS_PHONE_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".OS = 'other' THEN COGNITO_ID ELSE null END)) AS OTHER_OS_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'A' THEN COGNITO_ID ELSE null END)) AS QUADRANT_A_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'B' THEN COGNITO_ID ELSE null END)) AS QUADRANT_B_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'C' THEN COGNITO_ID ELSE null END)) AS QUADRANT_C_COUNT,
         COUNT((CASE WHEN "DISTINCT_USER_STREAM".QUADRANT = 'D' THEN COGNITO_ID ELSE null END)) AS QUADRANT_D_COUNT,
         ROWTIME
       FROM "DISTINCT_USER_STREAM"
       GROUP BY FLOOR("DISTINCT_USER_STREAM".ROWTIME TO SECOND);
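The two-step logic of the demo SQL (de-duplicate rows per second, then count per one-second window) can be simulated offline to check what the pumps should emit. This is a rough sketch in plain Python, not Kinesis Analytics code: the sample events and the `aggregate_per_second` helper are hypothetical, the `device` column is left out for brevity, and event times stand in for ROWTIME.

```python
from collections import defaultdict

# Hypothetical sample events: (seconds, cognito id, os, quadrant)
EVENTS = [
    (100.1, "u1", "Android", "A"),
    (100.4, "u1", "Android", "A"),  # duplicate within the same second
    (100.7, "u2", "iOS", "B"),
    (101.2, "u1", "Android", "C"),
    (101.9, "u3", "other", "C"),
]


def aggregate_per_second(events):
    """Mimic the two pumps: de-duplicate rows per second, then count per window."""
    # Step 1 (like DISTINCT_USER_PUMP): keep distinct rows, with the
    # timestamp truncated to whole seconds.
    distinct_rows = {
        (int(ts), uid, os_name, quad) for ts, uid, os_name, quad in events
    }

    # Step 2 (like OUTPUT_PUMP): conditional counts per one-second window.
    windows = defaultdict(lambda: defaultdict(int))
    for second, _uid, os_name, quad in distinct_rows:
        counts = windows[second]
        counts["unique_user_count"] += 1
        counts[f"os_{os_name}"] += 1
        counts[f"quadrant_{quad}"] += 1
    return {second: dict(counts) for second, counts in windows.items()}


result = aggregate_per_second(EVENTS)
for second in sorted(result):
    print(second, result[second])
```

As in the SQL, a user who reports two different quadrants within the same second appears as two distinct rows, so the "unique user" count is unique per row rather than strictly per user.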
  35. 35. Agility & actionable insights. Big data does not mean just batch: it can be streamed in; it can be processed in real time; it can be used to respond quickly to requests and actionable events and to generate business value. You can mix and match: on-premises and cloud; custom development and managed services.
  36. 36. Putting it together Choice and selection
  37. 37. 1-click deployment to launch, in multiple regions around the world Pay-as-you-go pricing with no long term contracts required 2,000+ product listings to browse, test, and buy software; 290 specific to big data. Advanced Analytics Database and Data Enablement Business Intelligence Putting it together: choice and selection AWS Marketplace: Software store with simplified procurement
  38. 38. Largest ecosystem of ISVs & integrators Tens of thousands of consulting and technology partners
  39. 39. We have a retail mindset Use our managed big data services Build or bring your own Or access thousands in our marketplace Each customer decides for themselves Choice & selection
  40. 40. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Richard T. Freeman, Ph.D., Lead Data Engineer and Architect, JustGiving November 29, 2016 JustGiving: Event-Driven Data Platform BDM205
  41. 41. We are A tech-for-good platform for events-based fundraising, charities, and crowdfunding “Ensure no good cause goes unfunded” • The #1 platform for online social giving in the world • Peaks in traffic: Ice bucket, natural disasters • Raised $4.2bn in donations • 28.5m users • 196 countries • 27,000 good causes • GiveGraph • 91 million nodes • 0.53 billion relationships
  42. 42. Fundraising page
  43. 43. Our requirements • Limitation in existing SQL Server data warehouse • Long-running and complex queries for data scientists • New data sources: API, clickstream, unstructured, log, behavioral data, etc. • Easy to add data sources and pipelines • Reduce time spent on data preparation and experiments Machine learning Graph processing Natural language processing Stream processing Data ingestion Data preparation Automated Pipelines Insight Predictions Measure Recommendations Data-driven
  44. 44. Event-driven data platform at JustGiving [1 of 2] • JustGiving developed an in-house analytics and data science platform in AWS called RAVEN. • Reporting, Analytics, Visualization, Experimental, Networks • Uses event-driven and serverless pipelines rather than workflows or DAGs • Messaging, queues, pub/sub patterns • Separate storage from compute • Supports scalable, event-driven • ETL / ELT • Machine learning • Natural language processing • Graph processing • Allows users to consume raw tables, data blocks, metrics, KPIs, insight, reports, etc.
  45. 45. Event-driven data platform at JustGiving [2 of 2]
  46. 46. Serverless streaming analytics and persist stream
  47. 47. The outcome • Ingest full clickstream • Near real-time streaming analytics • Persist streams to Amazon S3 and Amazon Redshift Amazon Kinesis • AWS managed services • Event-driven and serverless • Scale out and automate complex queries • Improved productivity • Data-driven: Measure, insight, predict, recommend RAVEN platform: scalable event-driven data platform in AWS
  48. 48. Thank you! “Ensure no good cause goes unfunded” Contact: https://linkedin.com/in/drfreeman BDM303 - JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing: Tuesday, 2:30 PM - 3:30 PM; Wednesday, 3:30 PM - 4:30 PM [repeat]
  49. 49. Proven customer success The vast majority of big data use cases deployed in the cloud today run on AWS.
  50. 50. Big Data Mini Con sessions. Rooms: Mirage, Bermuda A; Mirage, St. Croix B; Mirage, Event Center B; Mirage, Barbados A.
     1:00 PM: Beeswax: Building a Real-Time Streaming Data Platform on AWS; Big Data Architectural Patterns and Best Practices on AWS; Deep Dive: Amazon EMR Best Practices & Design Patterns; Workshop: Building Your First Big Data Application with AWS
     2:30 PM: JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing; Best Practices for Apache Spark on Amazon EMR; Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS
     4:00 PM: Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics; FINRA: Building a Secure Data Science Platform on AWS; Best Practices for Data Warehousing with Amazon Redshift; Workshop: Building Your First Big Data Application with AWS
     5:30 PM: Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana; Netflix: Using Amazon S3 as the fabric of our big data ecosystem; Visualizing Big Data Insights with Amazon QuickSight
     Plus, repeats for many sessions throughout the week!
  51. 51. Get started with Big Data on AWS: aws.amazon.com/big-data
     Self-paced Online Labs: Big Data Quest. Learn at your own pace and practice working with AWS services for big data on QwikLABS. (3 Hours | Online) qwiklabs.com/quests/1
     AWS Courses: Big Data on AWS. How to use AWS services to process data with Hadoop & create big data environments. (3 Days | Classroom) aws.amazon.com/training/course-descriptions/bigdata/
     AWS Courses: Big Data Technology Fundamentals (FREE!). Overview of AWS big data solutions for architects or data scientists new to big data. (3 Hours | Online)
  52. 52. Remember to complete your evaluations!
  53. 53. Thank you!
