
Building Your First Big Data Application on AWS (ABD317)


Want to ramp up your knowledge of AWS big data web services and launch your first big data application on the cloud? We walk you through simplifying big data processing as a data bus comprising ingest, store, process, and visualize. You build a big data application using AWS managed services, including Amazon Athena, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. Along the way, we review architecture design patterns for big data applications and give you access to a take-home lab so that you can rebuild and customize the application yourself. You should bring your own laptop and have some familiarity with AWS services to get the most from this session.



  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building Your First Big Data Application on AWS - ABD317. Ben Snively, Specialist Solutions Architect, AWS; Ryan Nienhuis, Sr. PM, Amazon Kinesis; Radhika Ravirala, EMR Solutions Architect, AWS; Dario Rivera, Specialist Solutions Architect, AWS; Allan MacInnis, Kinesis Solutions Architect, AWS; Chris Marshall, Solutions Architect, AWS. November 2017. AWS re:Invent
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  3. 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is qwikLABS? • Provides access to AWS services for this workshop • No need to provide a credit card • Automatically deleted when you’re finished
  4. 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sign in and start the lab Once the lab is started, you will see a Lab setup progress bar. It takes about 10 minutes for the lab to be set up
  5. 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Navigating qwikLABS • Student Resources: Scripts for your labs • Open Console : Opens AWS Management Console • Addl Connection Details: Links to different Interfaces
  6. 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Everything you need for the lab Open the AWS Management Console, login and verify the following AWS resources are created: • One Amazon Kinesis Analytics application • One Kinesis Analytics preprocessing AWS Lambda function • Two Amazon Kinesis Firehose delivery streams • One Amazon EMR Cluster • One Amazon Redshift Cluster Sign up (later) for: • Amazon QuickSight
  7. 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis: Amazon Kinesis Streams (build custom applications that process and analyze streaming data); Amazon Kinesis Analytics (easily process and analyze streaming data with standard SQL); Amazon Kinesis Firehose (easily load streaming data into AWS)
  8. 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis Streams • Easy administration and low cost • Build real time applications with framework of choice • Secure, durable storage
  9. 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis Firehose • Zero administration and seamless elasticity • Direct-to-data store integration • Serverless, continuous data transformations Amazon S3 Amazon Redshift
  10. 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis - Firehose vs. Streams Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks. Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, and a data latency of 60 seconds or higher.
  11. 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1 Collect logs using a Kinesis Firehose delivery stream
  12. 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  13. 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Collect logs with a Kinesis Firehose delivery stream Time: 5 minutes We are going to: A. Write to a Firehose delivery stream - Simulate writing transformed Apache Web Logs to a Firehose delivery stream that is configured to deliver data into an S3 bucket. There are many different libraries that can be used to write data to a Firehose delivery stream. One popular option is called the Amazon Kinesis Agent.
  14. 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Collect logs with a Kinesis Firehose delivery stream Amazon Kinesis Agent • Standalone Java application to collect and send data to Firehose • Continuously monitors set of files • Handles file rotation, check-pointing and retry on failures • Emits Amazon CloudWatch metrics • Pre-process records parsed from monitored files
  15. 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Collect logs with a Kinesis Firehose delivery stream For example, the agent can transform an Apache Web Log to JSON. From: 125.166.52.103 - - [08/Mar/2017:17:06:44 -08:00] "GET /explore" 200 2503 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.2; Trident/5.0)" To: {"HOST" : "125.166.52.103", "IDENT" : null, "AUTHUSER" : null, "DATETIME" : "08/Mar/2017:17:06:44 -08:00", "REQUEST" : "GET /explore", "RESPONSE" : 200, "BYTES" : 2503, "REFERRER" : null, "AGENT" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.2; Trident/5.0)"}
  16. 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Collect logs with a Kinesis Firehose delivery stream So that we don't have to install or set up software on your machine, we are going to use a utility called the Kinesis Data Generator to simulate the Amazon Kinesis Agent. The Kinesis Data Generator can populate a Firehose delivery stream using a template and is simple to set up. Let's get started!
  17. 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG Qwiklabs has already created and set up the Kinesis Firehose delivery stream for us. All we have to do is start writing data to it. 1. Go to the Kinesis Data Generator (KDG) Help Section at http://tinyurl.com/kinesispublisher, which will redirect to https://s3.amazonaws.com/kinesis-data-producer-test/help.html
  18. 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 2. Click “Create Amazon Cognito User with AWS CloudFormation” This link will take you to a service called AWS CloudFormation. AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion. We use AWS CloudFormation to create the necessary user credentials for you to use the Kinesis Data Generator.
  19. 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 3. On the next screen, click next. (Here we are using a template stored in an Amazon S3 bucket to create the necessary credentials).
  20. 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 4. Specify a user name and password (and remember them!), and then click next. This user name and password will be used to sign in to the Kinesis Data Generator. The password must be at least 6 alphanumeric characters and contain at least one number.
  21. 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG We use a service called Amazon Cognito to create these credentials. Amazon Cognito lets you easily add user sign-up and sign-in to your mobile and web apps. With Amazon Cognito, you also have the options to authenticate users through social identity providers such as Facebook, Twitter, or Amazon, with SAML identity solutions, or by using your own identity system.
  22. 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 5. The next screen has some additional options for the CloudFormation stack which are not needed. Click next.
  23. 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 6. This screen is a review screen so you can verify you have selected the correct options. When you are ready, check the “acknowledge” button and click create.
  24. 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 7. You are taken to a screen that shows the stack creation process. (You may have to refresh the page). In approximately one minute, the stack will complete and you will see “CREATE_COMPLETE” under status. Once this occurs, a) select the template, b) select the outputs tab, c) click the link to navigate to your very own Kinesis Data Generator (hosted on Amazon S3).
  25. 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 8. Log in using your user name and password 9. Select the region "us-west-2" (Oregon) 10. Select the delivery stream with name (qls-<somerandomnumber>-FirehoseDeliveryStream-<somerandomnumber>-11111111111) 11. Specify a data rate (Records per second). Please choose a number less than 10.
  26. 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG We have chosen a number less than 10 because everyone is on the same WiFi and we want to be sure we don't use all the bandwidth. Additionally, if you are not plugged in, you may run into battery issues at a higher rate. 12. For the Record Template, use the Apache Combined Log Template found in the Student Resources section. The template should look like the following:
  27. 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 1A: Working with the KDG 13. Click “Send Data to Kinesis”
  28. 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review: Monitoring Your Delivery Stream Go to the Amazon CloudWatch Metrics Console and search “IncomingRecords”. Select this metric for your Firehose delivery stream and choose a 1 Minute SUM. What are the most important metrics to monitor?
  29. 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2 Real-time data processing using Kinesis Analytics
  30. 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  31. 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Kinesis Analytics • Powerful real time applications • Easy to use, fully managed • Automatic elasticity
  32. 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kinesis Analytics Applications: connect to a streaming source, easily write SQL code to process streaming data, and continuously deliver SQL results
  33. 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Process Data using Kinesis Analytics Time: 20 minutes We are going to: • Write a SQL query to compute an aggregate metric for an interesting statistic on the incoming data • Write a SQL query using an anomaly detection function
  34. 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2A: Start Amazon Kinesis Analytics App • Navigate to the Kinesis dashboard • Click on the Kinesis Analytics Application
  35. 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2A: Start Kinesis App Click on “Go to SQL editor”. In the next screen, click on “Yes, start application”
  36. 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. View Sample Records in Kinesis App • Review sample records delivered to the source stream (SOURCE_SQL_STREAM_001)
  37. 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kinesis App Metadata Note that Amazon Kinesis adds metadata to each record, as shown in the formatted record sample: • The ROWTIME represents the time the application read the record, and is a special column used for time-series analytics. This is also known as the processing time. • The APPROXIMATE_ARRIVAL_TIME is the time the delivery stream received the record. This is also known as the ingest time.
  38. 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Kinesis Analytics: How a streaming app works Writing SQL over streaming data using Kinesis Analytics follows a two-part model: 1. Create an in-application stream for storing intermediate SQL results. An in-application stream is like a SQL table, but is continuously updated. 2. Create a PUMP, which will continuously read FROM one in-application stream and INSERT INTO a target in-application stream. In this application, SOURCE_SQL_STREAM_001 is the source stream, and OUTPUT_PUMP and AGGREGATE_PUMP insert into DESTINATION_STREAM (part 1) and AGGREGATE_STREAM (part 2), whose results are sent to Amazon Redshift. A minimal sketch of this pattern follows below.
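To make the two-part model concrete, here is a minimal sketch of an in-application stream plus a pump. It is illustrative only: the column names ("host", "request") are placeholder assumptions, and the actual statements for this lab come from the Kinesis Analytics SQL file in Student Resources.

    -- Part 1: an in-application stream that holds intermediate SQL results
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        "host"    VARCHAR(64),
        "request" VARCHAR(256)
    );

    -- Part 2: a pump that continuously reads FROM the source stream
    -- and INSERTs INTO the target in-application stream
    CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
        INSERT INTO "DESTINATION_SQL_STREAM"
        SELECT STREAM "host", "request"
        FROM "SOURCE_SQL_STREAM_001";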
  39. 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2B: Calculate an aggregate metric Calculate a count using a tumbling window and a GROUP BY clause. A tumbling window is similar to a periodic report, where you specify your query and a time range, and results are emitted at the end of a range. (EX: COUNT number of items by key for 10 seconds)
  40. 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2C: Calculate an aggregate metric • Tumbling: fixed size and non-overlapping; use the FLOOR() or STEP() function (coming soon) in a GROUP BY statement • Sliding: fixed size and overlapping; row boundaries are determined when new rows enter the window; use the standard OVER and WINDOW clause (e.g., COUNT(col) OVER (RANGE INTERVAL '5' MINUTE)) • Custom: not fixed size and overlapping; row boundaries are determined by conditions; implementations vary, but typically require two steps (step 1: identify boundaries; step 2: perform the computation)
  41. 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2C: Calculate an aggregate metric The window is defined by the following clause in the SELECT statement. Note that the ROWTIME column is implicitly included in every stream query and represents the processing time of the application. This is known as a tumbling window. Tumbling windows are always included in a GROUP BY clause and use a STEP function. The STEP function takes an interval to produce the periodic reports; you can also use the SQL FLOOR() function to achieve the same thing. STEP(source_sql_stream_001.ROWTIME BY INTERVAL '10' SECOND) A sketch of a complete tumbling-window query follows below.
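Putting the pieces together, a tumbling-window aggregate might look like the sketch below. The column name "response" is an assumption for illustration; the authoritative query for this activity is in the Kinesis Analytics SQL file.

    -- Count records per response code over a 10-second tumbling window
    CREATE OR REPLACE STREAM "AGGREGATE_STREAM" (
        "response"      INTEGER,
        "request_count" INTEGER
    );

    CREATE OR REPLACE PUMP "AGGREGATE_PUMP" AS
        INSERT INTO "AGGREGATE_STREAM"
        SELECT STREAM "response", COUNT(*) AS "request_count"
        FROM "SOURCE_SQL_STREAM_001"
        -- results are emitted once per 10-second window
        GROUP BY "response",
                 STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);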
  42. 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2B: Calculate an aggregate metric Create an aggregate_stream using SQL command found in the Kinesis Analytics SQL file located in the Student Resources section of your lab. Copy and paste the SQL in the SQL editor underneath the “DESTINATION_SQL_STREAM” DDL. Click “Save and run SQL”
  43. 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2C: Anomaly Detection • Kinesis Analytics includes advanced algorithms that are extensions to the SQL language. These include approximate count distinct (HyperLogLog), approximate top-K (space saving), and anomaly detection (random cut forest). • The random cut forest algorithm detects anomalies in real time on multi-dimensional data sets. You pass the algorithm any number of numeric fields, and it produces an anomaly score for your stream data. Higher scores are more anomalous. • The minimum anomaly score is 0 and the maximum is log2(s), where s is the subsample size parameter passed to random cut forest (the third parameter). You will have to try the algorithm on your data to get a feel for the anomaly score, as the score is data-dependent. A sketch of such a query follows below.
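The sketch below shows the general shape of a RANDOM_CUT_FOREST query. The numeric columns ("response", "bytes") and the tuning parameters are illustrative assumptions; the authoritative SQL for this activity is in the Kinesis Analytics SQL file.

    -- Score each record for anomalies using two numeric fields
    CREATE OR REPLACE STREAM "ANOMALY_STREAM" (
        "response"      INTEGER,
        "bytes"         INTEGER,
        "ANOMALY_SCORE" DOUBLE
    );

    CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
        INSERT INTO "ANOMALY_STREAM"
        SELECT STREAM "response", "bytes", "ANOMALY_SCORE"
        FROM TABLE(
            RANDOM_CUT_FOREST(
                CURSOR(SELECT STREAM "response", "bytes" FROM "SOURCE_SQL_STREAM_001"),
                100,    -- number of trees
                256,    -- subsample size (maximum score is log2 of this value)
                100000, -- time decay
                1       -- shingle size
            )
        );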
  44. 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 2C: Anomaly Detection Create an anomaly_stream using the SQL command found in the Kinesis Analytics SQL file located in the Student Resources section of your lab. Append the SQL in your SQL editor. Click “Save and run SQL”
  45. 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review – In-Application SQL streams Your application has multiple in-application SQL streams, including DESTINATION_SQL_STREAM and ANOMALY_STREAM. These in-application streams are like SQL tables but are continuously updated. What else is unique about an in-application stream aside from its continuous nature?
  46. 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3 Deliver streaming results to Amazon Redshift
  47. 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  48. 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3: Deliver data to Amazon Redshift using Kinesis Firehose Time: 5 minutes We are going to: A. Connect to Amazon Redshift cluster and create a table to hold web logs data B. Update Kinesis Analytics application to send data to Amazon Redshift, via the Firehose delivery stream.
  49. 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3A: Connect to Amazon Redshift You can connect with pgweb • Already installed and configured for the Redshift cluster • Just navigate to pgweb and start interacting Note: From the qwikLABS console, open the pgWeb link in a new window. Or, use any JDBC/ODBC/libpq client: • Aginity Workbench for Amazon Redshift • SQL Workbench/J • DBeaver • DataGrip If you use one of these SQL clients, the username/password is in your qwikLABS console (scroll to the bottom). The endpoint is there too, and the database is called logs.
  50. 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3B: Create table in Amazon Redshift Create a table named weblogs to capture the incoming data from the Firehose delivery stream (a rough sketch of the DDL follows below). Note: You can download the Amazon Redshift SQL code from the qwikLabs Student Resources section (click on Open Redshift SQL)
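The exact DDL ships in the Redshift SQL file; as a rough sketch only (column names and sizes are assumptions based on the Apache combined log fields shown earlier), it resembles:

    CREATE TABLE weblogs (
        host     VARCHAR(64),
        ident    VARCHAR(32),
        authuser VARCHAR(32),
        datetime TIMESTAMP,
        request  VARCHAR(2048),
        response INTEGER,
        bytes    INTEGER,
        referrer VARCHAR(2048),
        agent    VARCHAR(2048)
    );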
  51. 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3C: Deliver Data to Amazon Redshift using Firehose Update Kinesis Analytics application to send data to Firehose delivery stream. Firehose delivers the streaming data to Amazon Redshift. 1. Go to the Kinesis Analytics console 2. Choose the Amazon Redshift delivery stream as destination and click on the edit button (see the pencil icon in the figure below)
  52. 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3C: Deliver Data to Amazon Redshift using Firehose 4. Validate your destination 1. Choose the Firehose "qls-xxxxxxx-RedshiftDeliveryStream-xxxxxxxx" delivery stream. 2. Keep the default for "Connect in-application stream" 3. Choose CSV as the "Output format" 4. Select "Choose from IAM roles that Kinesis Analytics can assume" 5. Click "Save and continue" 5. It will take about 1 – 2 minutes for everything to be updated and for data to start appearing in Amazon Redshift.
  53. 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 3C: Deliver Data to Amazon Redshift using Firehose
  54. 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review: Amazon Redshift Test Queries Find the distribution of response codes over days (copy the SQL from the Redshift SQL file). Count the number of 404 response codes.
  55. 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review: Amazon Redshift Test Queries Show all request paths with status "PAGE NOT FOUND" (sketches of these queries follow below).
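The lab's Redshift SQL file contains the exact queries; the sketches below are hedged equivalents, assuming the weblogs table sketched earlier with datetime, response, and request columns.

    -- Distribution of response codes over days
    SELECT TRUNC(datetime) AS day, response, COUNT(*) AS request_count
    FROM weblogs
    GROUP BY 1, 2
    ORDER BY 1, 2;

    -- Count the number of 404 response codes
    SELECT COUNT(*) FROM weblogs WHERE response = 404;

    -- Show all request paths that returned "PAGE NOT FOUND" (404)
    SELECT DISTINCT request FROM weblogs WHERE response = 404;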
  56. 56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Extract, Transform and Load (ETL) with AWS Glue
  57. 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fully managed ETL (extract, transform, and load) service AWS Glue • Categorize your data, clean it, enrich it and move it reliably between various data stores • Once catalogued, your data is immediately searchable and query-able across your data silos • Simple, flexible and cost-effective • Serverless; runs on a fully managed, scale-out Spark environment
  58. 58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue Components Data Catalog • Discover and organize your data in various databases, data warehouses, and data lakes Job Authoring • Focus on writing transformations • Generate code through a wizard, or write your own code Job Execution • Runs jobs in Spark containers – automatic scaling based on SLA • Glue is serverless – only pay for the resources you consume
  59. 59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue: How it works
  60. 60. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4: Transform Web logs to Parquet using AWS Glue
  61. 61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  62. 62. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity: Catalog and perform ETL on Weblogs Time: 40 minutes We are going to: A. Discover and catalog the web log data deposited into the S3 bucket using AWS Glue Crawler B. Transform Web logs to Parquet format with AWS Glue ETL Job Authoring tool
  63. 63. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Discover dataset with AWS Glue We use the AWS Glue crawler to extract data and metadata. From the AWS Management Console, select the AWS Glue service, then click on "Get Started" on the next screen.
  64. 64. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Add crawler using AWS Glue Select Crawlers section on the left and click on Add crawler
  65. 65. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: ETL with AWS Glue Specify a name for the crawler. Click on folder icon to choose the data store on S3. Click Next
  66. 66. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: ETL with AWS Glue Provide the S3 path location where the weblogs are deposited (navigate to the S3 path that contains the word "logs" and select the "raw" folder). Click Select
  67. 67. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: ETL with AWS Glue Click Next on the next screen to skip adding another data store
  68. 68. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: ETL with AWS Glue In the IAM Role section, select "Create an IAM role", append "default" to the IAM role name, and click "Next"
  69. 69. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Add crawler in AWS Glue Choose Run on demand, to run the crawler now and click Next
  70. 70. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Add crawler with AWS Glue On the next screen, click on Add database to add a database (use weblogdb for the database name)
  71. 71. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Add crawler with AWS Glue Review and click Finish to create a crawler
  72. 72. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Add crawler with AWS Glue Click on Run it now link to run the crawler
  73. 73. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Add crawler with AWS Glue Crawler shows a Ready status when it is finished running.
  74. 74. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Table creation in AWS Glue Observe that the crawler has created a table for your dataset. The crawler automatically classified the dataset as the Apache combined log format. Click the table to take a look at its properties
  75. 75. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Table creation in AWS Glue Glue used the GrokSerDe (Serializer/Deserializer) to correctly interpret the web logs. You can click on the View partitions link to look at the partitions in the dataset
  76. 76. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4A: Create ETL Job in AWS Glue With the dataset cataloged and the table created, we are now ready to convert the web logs from the Apache combined log format to a more optimal Parquet format for querying. Click on Add job to begin creating the ETL job
  77. 77. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4B: ETL Job in AWS Glue • In the Job properties window, specify the job name • In this example, we will create a new ETL script • Glue automatically chooses the script file name and the path where the script will be persisted • For the "Temporary Directory", specify an S3 temporary directory in your lab account (use the s3://<stack>-logs-<account>-us-west-2/ bucket and append a folder named temp) • The path should look like: s3://<stack>-logs-<account>-us-west-2/temp/
  78. 78. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4B: ETL Job in AWS Glue • DO NOT CLICK NEXT YET • Copy the temporary path to a text editor, modify the path as follows, and save it in a file (we will need this path for storing the Parquet files): s3://<stack>-logs-<account>-us-west-2/weblogs/processed/parquet For example: s3://qls-108881-f75177905b7b5b0e-logs-XXXXXXXX3638-us-west-2/weblogs/processed/parquet
  79. 79. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4B: ETL Job in AWS Glue • Expand the "Script libraries and job parameters" section, and increase the DPUs to 20 • Let's pass a job parameter with the S3 path where the Parquet files will be deposited • Specify the following values for Key and Value • Key: --parquet_path (notice the 2 hyphens at the beginning) • Value: s3://<stack>-logs-<account>-us-west-2/weblogs/processed/parquet/ • Note: the Value is the S3 path we saved in the previous section
  80. 80. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4B: ETL Job in AWS Glue Click Next in the following screen and click Finish to complete the job creation
  81. 81. Activity 4B: ETL Job in Glue • Close Script Editor tips window (if it appears) • In the Glue Script Editor, copy the ETL code by clicking on the “Open Glue ETL Code” link in Student Resources • Ensure that the database name (db_name) and table name reflect the database and table name created by the Glue Crawler
  82. 82. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4B: ETL Job in AWS Glue Click Save and then the Run job button to execute your ETL. Click on "Save and run job" in the next window.
  83. 83. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 4B: ETL Job in AWS Glue Click Run job to continue. This might take a few minutes. When the job finishes, the web logs will have been transformed to Parquet format
  84. 84. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interactive Querying with Amazon Athena
  85. 85. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena Interactive Query Service • Query directly from Amazon S3 • Use ANSI SQL • Serverless • Multiple data formats • Cost effective
  86. 86. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Familiar Technologies Under the Covers Presto • Used for SQL queries • In-memory distributed query engine • ANSI-SQL compatible with extensions Hive • Used for DDL functionality • Complex data types • Multitude of formats • Supports data partitioning
  87. 87. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Comparing performance and cost savings for compression and columnar format. Data stored as text files: 1 TB on Amazon S3, 236-second query run time, 1.15 TB data scanned, $5.75 cost. Data stored in Apache Parquet format*: 130 GB on Amazon S3, 6.78-second query run time, 2.51 GB data scanned, $0.013 cost. Savings/speedup: 87% less storage with Parquet, 34x faster, 99% less data scanned, 99.7% cost savings. (*compressed using Snappy compression) https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
  88. 88. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5 Interactive Querying with Amazon Athena
  89. 89. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  90. 90. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity: Interactive querying with Amazon Athena Time: 15 minutes We are going to: A. Create a table over the processed weblogs in S3. These are the Parquet files created by the AWS Glue ETL job in the previous section B. Run interactive queries on the Parquet-formatted weblogs
  91. 91. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5A: Setup Amazon Athena 1. From the AWS Management Console, search for Athena and click on the service
  92. 92. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5A: Setup Amazon Athena 2. Select Amazon Athena from the Analytics section and click on Get Started on the next page
  93. 93. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5A: Setup Amazon Athena 3. Dismiss the window for running the Athena tutorial. 4. Dismiss any other tutorial window
  94. 94. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5A: Setup Amazon Athena 5. We are now ready to create a table in Athena. But before we do that, we need to get the S3 bucket location where AWS Glue job delivered parquet files. Click on Services, then choose S3 from the Storage Section.
  95. 95. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5A: Set Up Amazon Athena 6. Locate the bucket that looks like 'qls-<stackname>-logs-#####-us-west-2'. Navigate to the parquet folder. Copy the name of the bucket into a text editor
  96. 96. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5A: Setup Amazon Athena 7. Go back to the Athena console 8. Let's create a table in Athena on the Parquet dataset created by AWS Glue. 9. In the Athena console, choose 'weblogdb' from the database dropdown 10. Enter the SQL command (found by clicking on the "Open Athena SQL" link in the Student Resources section of qwiklabs) to create a table; a sketch of what this DDL looks like follows below. 11. Make sure to replace <your-parquet-path> with the bucket location for the Parquet files you copied in the previous step. (Screenshot in the next slide)
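For orientation, the DDL you paste in resembles the sketch below. The table name, column list, and types here are assumptions for illustration; the authoritative statement is in the "Open Athena SQL" file.

    CREATE EXTERNAL TABLE IF NOT EXISTS weblogdb.weblogs_parquet (
        host     STRING,
        datetime STRING,
        request  STRING,
        response INT,
        bytes    INT,
        agent    STRING
    )
    STORED AS PARQUET
    LOCATION 's3://<your-parquet-path>/';  -- the Parquet location copied earlier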
  97. 97. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5B: Working with Amazon Athena
  98. 98. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5B: Working with Amazon Athena 8. The SQL DDL in the previous step creates a table in Athena based on the parquet files we created with the Glue ETL Job 9. Select weblogdb from the database section and click on the three stacked dots icon to sample a few rows of the S3 data
  99. 99. Activity 5C: Interactive Querying with Amazon Athena • Run interactive queries (copy SQL queries from “Athena SQL” in Student Resources) and see the results on the console
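The queries themselves come from the "Athena SQL" file in Student Resources; the examples below are illustrative sketches of the kind of interactive queries you can run, assuming the hypothetical table sketched above.

    -- Top 10 most requested paths
    SELECT request, COUNT(*) AS hits
    FROM weblogdb.weblogs_parquet
    GROUP BY request
    ORDER BY hits DESC
    LIMIT 10;

    -- Distribution of response codes
    SELECT response, COUNT(*) AS total
    FROM weblogdb.weblogs_parquet
    GROUP BY response
    ORDER BY total DESC;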
  100. 100. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 5C: Interactive Querying with Athena 2. Optionally, you can save the results of a query to CSV by choosing the file icon on the Results pane. 3. You can also view the results of previous queries or queries that may take some time to complete. Choose History then either search for your query or choose View or Download to view or download the results of previous completed queries. This also displays the status of queries that are currently running.
  101. 101. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review: Amazon Athena Interactive Queries Query results are also stored in Amazon S3 in a bucket called aws-athena-query-results-ACCOUNTID-REGION. Where can you change the default location in the console?
  102. 102. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data processing with Amazon EMR
  103. 103. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR release: Storage: S3 (EMRFS), HDFS. Cluster resource management: YARN. Processing engines: batch (MapReduce), interactive (Tez), in-memory (Spark), streaming (Flink). Applications: Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop, HBase/Phoenix, Presto, Hue (SQL interface/metastore management), Zeppelin (interactive notebook), Ganglia (monitoring), HiveServer2/Spark Thriftserver (JDBC/ODBC). All of this runs on the Amazon EMR service.
  104. 104. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. On-cluster UIs: manage applications; notebooks; SQL editor, workflow designer, and metastore browser; design and execute queries and workloads. And more using bootstrap actions!
  105. 105. The Hadoop ecosystem can run in Amazon EMR
  106. 106. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Easy to use Spot Instances: On-demand for core nodes (standard Amazon EC2 pricing for on-demand capacity; meet SLA at predictable cost) and Spot Instances for task nodes (up to 90% off Amazon EC2 on-demand pricing; exceed SLA at lower cost)
  107. 107. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3 as your persistent data store • Separate compute and storage • Resize and shut down Amazon EMR clusters with no data loss • Point multiple Amazon EMR clusters at the same data in Amazon S3
  108. 108. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. EMRFS makes it easier to leverage S3 Better performance and error handling options Transparent to applications – Use “s3://” Consistent view • For consistent list and read-after-write for new puts Support for Amazon S3 server-side and client-side encryption Faster listing using EMRFS metadata
  109. 109. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark • Fast, general-purpose engine for large- scale data processing • Write applications quickly in Java, Scala, or Python • Combine SQL, streaming, and complex analytics
  110. 110. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Zeppelin • Web-based notebook for interactive analytics • Multiple language back end • Apache Spark integration • Data visualization • Collaboration https://zeppelin.apache.org/
  111. 111. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 6 Interactive analysis using Amazon EMR
  112. 112. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  113. 113. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 6: Process and Query data with Amazon EMR Time: 20 minutes We are going to: A. Use a Zeppelin Notebook to interact with Amazon EMR Cluster B. Process the data in Amazon S3 using Apache Spark C. Query the data processed in the earlier stage and create simple charts
  114. 114. Activity 6A: Open the Zeppelin interface 1. Copy the Zeppelin endpoint from the Student Resources section in qwiklabs 2. Click on the "Open Zeppelin Notebook" link in the Student Resources section to open the Zeppelin link in a new window. 3. Download the file (or copy and save it to a file with a .json extension) 4. Import the notebook using the Import Note link in the Zeppelin interface
  115. 115. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 6A: Open the Zeppelin interface • Use s3://<stack>-logs-<account>-us-west-2/processed/parquet bucket where the processed parquet files were deposited. • Run the paragraph
  116. 116. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 6B: Run the notebook Enter the S3 bucket name where the parquet files are stored. The bucket name begins with <stack>-*-logs-#####-region Execute Step 1 • Enter bucket name (<stack>-*-logs-##########-us-west-2) Execute Step 2 • Create a Dataframe with the parquet files from the Glue ETL job Execute Step 3 • Sample a few rows
  117. 117. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 6B: Run the notebook Execute Step 4 to process the data • Notice how the 'AGENT' field contains the 'BROWSER' at the beginning of the column value. Let's extract the browser from it. • Create a UDF that will extract the browser part and add it to the DataFrame • Print the new DataFrame
  118. 118. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 6B: Run the notebook Execute Step 6 • Register the data frame as a temporary table • Now you can run SQL queries on the temporary tables. Execute the next 3 steps and observe the charts created • What did you learn about the dataset?
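Once the DataFrame is registered as a temporary table, a Zeppelin %sql paragraph can query it directly. The sketch below assumes the temporary table is named weblogs and that the notebook added a browser column; the actual names come from the imported notebook.

    %sql
    -- Requests per browser, from the temporary table registered in Step 6
    SELECT browser, COUNT(*) AS request_count
    FROM weblogs
    GROUP BY browser
    ORDER BY request_count DESC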
  119. 119. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review: Interactive analysis using Amazon EMR You just learned how to process and query data using Amazon EMR with Apache Spark. Amazon EMR has many other frameworks available for you to use • Hive, Presto, Flink, Pig, MapReduce • Hue, Oozie, HBase
  120. 120. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Optional Exercise: Data Visualization with Amazon QuickSight
  121. 121. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon QuickSight Fast, Easy Interactive Analytics for Anyone, Everywhere Ease of use targeted at business users. Blazing fast performance powered by SPICE. Broad connectivity with AWS data services, on-premises data, files and business applications. Cloud-native solution that scales automatically. 1/10th the cost of traditional BI solutions. Create, share and collaborate with anyone in your organization, on the web or on mobile.
  122. 122. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Connect, SPICE, Analyze Amazon QuickSight allows you to connect to data from a wide variety of AWS, third party, and on-premises sources and import it to SPICE or query directly. Users can then easily explore, analyze, and share their insights with anyone. Amazon RDS Amazon S3 Amazon Redshift
  123. 123. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7 Visualize results in Amazon QuickSight
  124. 124. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your Application Architecture: Amazon Kinesis Producer UI (generate web logs); Amazon Kinesis Firehose (collect web logs and deliver to S3); Amazon Kinesis Analytics (process and compute aggregate web log metrics); Amazon Kinesis Firehose (deliver processed web logs to Amazon Redshift); Amazon S3 bucket (raw web logs from Firehose); Amazon EMR (interactive analysis of web logs); Amazon Redshift (run SQL queries on processed web logs); Amazon QuickSight (visualize web logs to discover insights); AWS Glue (extract metadata, create tables, and transform web logs from CSV to Parquet); Amazon S3 bucket (transformed web logs from AWS Glue); Amazon Athena (interactive querying of web logs)
  125. 125. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7: Visualization with Amazon QuickSight We are going to: A. Register for an Amazon QuickSight account B. Connect to the Amazon Redshift cluster C. Create visualizations for analysis to answer questions like: • What are the most common HTTP requests, and how successful (response code of 200) are they? • Which are the most requested URIs?
  126. 126. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7A: Amazon QuickSight Registration • Go to AWS Console, click on QuickSight from the Analytics section. • Click on Signup in the next window • Make sure the subscription type is Standard and click Continue on the next screen
  127. 127. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7A: Amazon QuickSight Registration • On the Subscription Type page, enter the account name (see note below) • Enter your email address • Select US West region • Check the S3 (all buckets) box • Note: Amazon QuickSight Account name is the AWS account number on the qwikLabs console
  128. 128. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7A: Amazon QuickSight Registration • If a pop box to choose S3 buckets appears, click Select buckets • Click on Go To Amazon Quicksight • Dismiss welcome screen
  129. 129. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7B: Connect to data source • Click on Manage Data and then select New Dataset to create a new data set in Amazon QuickSight • Choose Redshift (Auto-discovered) as the data source. Amazon QuickSight auto-discovers databases associated with your AWS account (the Amazon Redshift database in this case)
  130. 130. Activity 7B: Connect to Amazon Redshift Note: Use "dbadmin" as the username. You can get the Amazon Redshift database password from qwikLABS by navigating to the "Connection details" section (see below)
  131. 131. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7C: Choose your weblogs Amazon Redshift table
  132. 132. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7D: Ingest data into SPICE SPICE is Amazon QuickSight's in-memory optimized calculation engine, designed specifically for fast, interactive data visualization. You can improve the performance of database data sets by importing the data into SPICE instead of using a direct query to the database
  133. 133. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Activity 7E: Creating your first analysis What are the most requested http request types and their corresponding response codes for this site? Simply select request, response and let AUTOGRAPH create the optimal visualization
  134. 134. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Review – Creating your Analysis Exercise: Add a visual to show which URIs are the most requested.
  135. 135. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Please don't forget to fill out your evaluations. THANK YOU!
