
Building your First Big Data Application on AWS

Find out more about how to build your Big Data Applications on AWS


  1. Building your first Big Data Application on AWS. Jarkko Hirvonen, Solutions Architect, AWS. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Data is being produced continuously: mobile apps, web clickstream, application logs, metering records, IoT sensors, smart buildings. Example log line: [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
  3. The gap between the data being generated and the data available for analysis keeps widening (chart: data volume, 1990-2020. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares").
  4. The big data pipeline: Ingest/Collect → Store → Process/Analyze → Consume/Visualize, turning data into answers and insights. Start here with a business case.
  5. AWS services across the pipeline. Collect: Amazon Kinesis Firehose, Amazon Kinesis Streams, AWS Direct Connect, Amazon Snowball, AWS Database Migration Service, AWS IoT. Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB. Process/Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, Amazon CloudSearch, Amazon Elasticsearch Service, AWS Data Pipeline, AWS Glue, Amazon Athena, Amazon Kinesis Analytics, Amazon QuickSight.
  6. Building a Big Data Application: build a data warehouse with Amazon Redshift. Web and mobile clients write to a DBMS in the corporate data center; Amazon Redshift sits in the AWS Cloud.
  7. Amazon Redshift: structured data processing. • Petabyte-scale, relational, MPP data warehousing • Fully managed, with SSD and HDD platforms • Built-in end-to-end security, including customer-managed keys • Fault-tolerant: automatically recovers from disk and node failures • Data automatically backed up to Amazon S3, with cross-region backup capability for global disaster recovery • Over 140 new features added since launch • $1,000/TB/year; start at $0.25/hour. Provision in minutes; scale from 160 GB to 2 PB of compressed data with just a few clicks
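A first load into Redshift typically pulls flat files straight from S3 with the COPY command. A minimal sketch in Python, assuming a hypothetical `clickstream_events` table, bucket prefix, and IAM role (none of these names come from the deck):

```python
# Sketch: composing a Redshift COPY statement that bulk-loads gzipped
# CSV files from S3. Table, bucket prefix, and IAM role are hypothetical.

def build_copy_statement(table, s3_prefix, iam_role):
    """Return a COPY command that loads gzipped CSV files from S3."""
    return (
        f"COPY {table} "
        f"FROM 's3://{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP TIMEFORMAT 'auto';"
    )

sql = build_copy_statement(
    "clickstream_events",                     # hypothetical target table
    "my-data-lake/raw/clickstream/2016/11/",  # hypothetical bucket/prefix
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
```

The statement would then be executed against the cluster over a regular PostgreSQL-compatible connection; COPY parallelizes the load across slices, which is why it is preferred over row-by-row INSERTs.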
  8. How do you get your (big) data into AWS?
  9. Building a Big Data Application: migrate your data to AWS with AWS Database Migration Service, AWS Direct Connect, or AWS Snowball. Web and mobile clients write to a DBMS in the corporate data center; Amazon Redshift sits in the AWS Cloud.
  10. AWS Database Migration Service: start your first migration in 10 minutes or less; keep your apps running during the migration; migrate to databases running on Amazon EC2, Amazon RDS, or Amazon Redshift.
  11. AWS Snowball: PB-scale data transport. 50 TB and 80 TB capacities, 10G network, all data encrypted end-to-end, ruggedized case ("8.5G impact"), rain- and dust-resistant, tamper-resistant case and electronics, E-ink shipping label.
  12. Your CEO doesn't want to look at raw SQL query output.
  13. Building a Big Data Application: visualize your data with Amazon QuickSight. Data flows from the corporate data center (via AWS Database Migration Service, AWS Direct Connect, or AWS Import/Export & Snowball) into Amazon Redshift in the AWS Cloud, with Amazon QuickSight on top.
  14. Amazon QuickSight: business intelligence. • Fast and cloud-powered • Easy to use, no infrastructure to manage • Scales to hundreds of thousands of users • Quick calculations with SPICE • 1/10th the cost of legacy BI software
  15. What if your data isn't structured? What if you don't need all the raw data? What if you need to combine multiple data sets?
  16. AWS Lambda: serverless event processing. • Serverless compute service that runs your code in response to events • Extend AWS services with user-defined custom logic • Write custom code in Node.js, Python, or Java • Pay only for the requests served and the compute time required, billed in 100-millisecond increments
  17. Building a Big Data Application: event-driven data transformations with AWS Lambda. Raw data lands in Amazon S3, Lambda transforms it, and structured data is written back to Amazon S3, feeding Amazon Redshift and Amazon QuickSight in the AWS Cloud.
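The event-driven pattern above can be sketched as a Lambda handler plus a pure transform. This is an illustrative sketch, not code from the deck: the JSON field names are hypothetical, and the actual S3 read/write (via boto3) is only indicated in a comment so the logic stays locally testable.

```python
import json

def to_csv_row(raw: str) -> str:
    """Flatten one raw JSON event into a comma-separated row
    for the structured zone. Field names are hypothetical."""
    event = json.loads(raw)
    return ",".join([event["user_id"], event["action"], event["timestamp"]])

def handler(event, context):
    """Lambda entry point, triggered by an S3 ObjectCreated event.
    Returns the keys of the objects that were created."""
    keys = [r["s3"]["object"]["key"] for r in event["Records"]]
    # In a real function you would fetch each object with boto3,
    # run to_csv_row over its records, and write the CSV back to
    # the structured-data prefix in S3.
    return keys
```

Keeping the transformation in a pure function, separate from the handler, makes it trivial to unit-test without any AWS infrastructure.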
  18. How will this work at scale? What if the data processing exceeds the timeout?
  19. Amazon EMR: semi-structured/unstructured data processing. • Hadoop, Hive, Presto, Spark, Tez, Impala, etc. • Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink • New applications added within 30 days of their open source release • Fully managed, Auto Scaling clusters with support for on-demand and spot pricing • Support for HDFS and S3 file systems, enabling separated compute and storage; multiple clusters can run against the same data in S3 • Support for end-to-end encryption, IAM/VPC, S3 client-side encryption with customer-managed keys and AWS KMS; HIPAA-eligible
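When the transformation outgrows Lambda's timeout, the same job can run as a Spark step on EMR. Below is a sketch of the step definition you would hand to EMR's AddJobFlowSteps API (for example through boto3's `emr` client); the step name, script path, and bucket prefixes are hypothetical, and no AWS call is made here:

```python
# Sketch: an EMR step definition that runs a Spark job via
# command-runner.jar. All names and S3 paths are hypothetical.

def spark_step(name, script_s3_path, input_path, output_path):
    """Build the dict expected in the Steps list of AddJobFlowSteps."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_s3_path, input_path, output_path,
            ],
        },
    }

step = spark_step(
    "transform-clickstream",
    "s3://my-data-lake/jobs/transform.py",
    "s3://my-data-lake/raw/clickstream/",
    "s3://my-data-lake/structured/clickstream/",
)
# Submission (not executed here):
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```

Because input and output live in S3 rather than HDFS, the cluster can be sized to the job and terminated afterwards without losing data.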
  20. Building a Big Data Application: transform and explore your data at scale with Amazon EMR. Raw data in Amazon S3 is transformed by Amazon EMR into structured data in Amazon S3, feeding Amazon Redshift and Amazon QuickSight in the AWS Cloud.
  21. What about ad hoc queries when you are exploring new data?
  22. Amazon Athena: serverless query processing. • Serverless query service for data in S3, using standard SQL with no infrastructure to manage • No data loading required; query directly from Amazon S3 • Standard ANSI SQL, with support for joins, JSON, and window functions • Multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet • Pay per query, based only on the data scanned; if you compress your data, you pay less and your queries run faster
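An ad hoc Athena query and its pay-per-scan arithmetic can be sketched in a few lines. The table and column names are hypothetical; the $5-per-TB default reflects Athena's launch pricing and is labeled as an assumption in the code:

```python
# Sketch: an ad-hoc ANSI SQL query over raw data in S3, plus the
# pay-per-bytes-scanned cost arithmetic. Table/column names are
# hypothetical; $5/TB was Athena's price per TB scanned at launch.

def top_actions_query(year, month):
    """Query a partitioned table, pruning by year/month partitions."""
    return (
        "SELECT action, COUNT(*) AS events "
        "FROM clickstream_raw "
        f"WHERE year = {year} AND month = {month} "
        "GROUP BY action ORDER BY events DESC LIMIT 10;"
    )

def scan_cost_usd(bytes_scanned, usd_per_tb=5.0):
    """Athena bills per byte scanned, so compression and columnar
    formats (ORC, Parquet) directly reduce both cost and latency."""
    return bytes_scanned / 1024**4 * usd_per_tb

sql = top_actions_query(2016, 11)
cost = scan_cost_usd(250 * 1024**3)  # 250 GB scanned ≈ $1.22
```

The partition filter in the WHERE clause is what keeps the scanned-bytes figure, and hence the bill, small.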
  23. Building a Big Data Application: extend your data warehouse to S3 with Amazon Athena. Raw data and staging data live in Amazon S3, processed by Amazon EMR and queried by Amazon Athena alongside Amazon Redshift, with Amazon QuickSight for visualization.
  24. A Data Lake on AWS. Data Ingestion (get your data into S3 quickly and securely): Snowball, Database Migration Service, Kinesis Firehose, Direct Connect. Central Storage (secure, cost-effective storage in Amazon S3): S3. Catalog & Search (access and search metadata): DynamoDB, Elasticsearch. Access & User Interface (give your users easy and secure access): API Gateway, Identity & Access Management, Cognito. Processing & Analytics (use predictive and prescriptive analytics to gain better understanding): QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis Analytics, RDS. Protect & Secure (use entitlements to ensure data is secure and users' identities are verified): Security Token Service, CloudWatch, CloudTrail, Key Management Service.
  25. Martin Buberl, Director of Engineering at Trustpilot. mbl@trustpilot.com | @martinbuberl
  26. Trustpilot at a glance. "Trustpilot is an online review platform to help people choose services and products with confidence and to help companies harness the power of reviews." 30 million reviews in total; 1 million new reviews each month; 1.5 billion page impressions each month; 15 million emails sent each month.
  27. Data at Trustpilot. Everything we build must be tracked and measured: 100 GB of log files each day; 3.5 million tracking events each day. We're extremely data-driven: data always wins.
  28. Traditional data warehousing didn't work anymore. Some of the issues we encountered: teams were stepping on each other's toes; no clear source of truth; data was difficult to discover for insights; poor (or no) data governance; we couldn't "just" store data; storage was expensive.
  29. Data Lake to the rescue. "A Data Lake is a central repository to store massive amounts of data in its natural format." Some of the benefits: teams can implement compute jobs (ETL/MR) independently; a clear source of truth and easier discovery of data; a clear path to implementing data governance (e.g. security, privacy); just store it (schema-on-read); storage is cheap (separation of compute and storage).
  30. How we built a Data Lake. Components: Ingestion; Central Storage; Processing & Analytics; Access & User Interface; Catalog & Search.
  31. Ingestion: quick ingestion of raw data; support for any type of data, whether unstructured, semi-structured (JSON, XML), or structured (CSV, columnar); no need to force data into a pre-defined schema; batch and stream support.
  32. Central Storage on S3: high availability (system uptime); high durability (data redundancy); store massive amounts of data; cheap (starts at $0.023 per GB). S3 event triggers: Lambda, or SQS/SNS.
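A central store only stays queryable if its keys follow a consistent layout. One common convention, sketched here with hypothetical names, is Hive-style date partitioning (`year=/month=/day=`), which lets EMR and Athena prune partitions instead of scanning the whole bucket:

```python
from datetime import datetime, timezone

# Sketch: a date-partitioned S3 key scheme for the central store.
# Dataset name and filename are hypothetical.

def partitioned_key(dataset, event_time, filename):
    """Build a Hive-style partitioned object key from an event time."""
    return (
        f"{dataset}/year={event_time.year:04d}"
        f"/month={event_time.month:02d}"
        f"/day={event_time.day:02d}/{filename}"
    )

key = partitioned_key(
    "clickstream",
    datetime(2016, 11, 30, tzinfo=timezone.utc),
    "events-0001.json.gz",
)
# → "clickstream/year=2016/month=11/day=30/events-0001.json.gz"
```

An ingestion Lambda (fired by an S3 event trigger or fed from SQS/SNS) would compute keys like this when copying raw objects into the lake.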
  33. Catalog & Search: avoid the "data swamp"; discovery of data; metadata storage.
  34. Access & User Interface: ingestion via upload; access to the data catalog and metadata; a Data Lake API. AWS Data Lake Solution: goo.gl/8k1MXq
  35. Processing: ETL with AWS Batch; Amazon EMR (Spark & Hive); Amazon Machine Learning. Analytics: third-party analytics tools (e.g. Chartio); Amazon Athena.
  36. How the Data Lake helped us: got our data sane again; data is easier to discover; teams can move faster; analytics are much faster; cost savings. Lessons learned: S3 event triggers plus Lambdas rock; metadata is fuzzy and hard to get right.
  37. Thank you ;) Martin Buberl, Director of Engineering at Trustpilot. mbl@trustpilot.com | @martinbuberl
  38. A Data Lake on AWS (architecture recap of slide 24).
  39. Recommended next sessions: 13:15 - Getting Started with Amazon QuickSight; 14:00 - Big Data Architectural Patterns and Best Practices.
  40. Thank you! jarkkoh@amazon.com
