AWS September Webinar Series - Building Your First Big Data Application on AWS

The Big Data ecosystem is moving so fast that it is nearly impossible to keep pace. Meanwhile, demand for strong analytics and data management skills continues to grow. So, how can you get up to speed?

Join us for this webinar where we will help you get ramped up on how to use Amazon’s Big Data web services. In just 50 minutes, we will build a Big Data application using Amazon Elastic MapReduce and other AWS Big Data Services. In addition, we will review best practices and architecture design patterns for Big Data. Attending re:Invent? One more reason not to miss this webinar, as it will help you get ready for some of our Big Data deep dives!

Learning Objectives:

Learn about key AWS Big Data services including Amazon S3, Amazon EMR, Amazon Kinesis, and Amazon Redshift
Learn about Big Data architectural patterns
Learn how to ingest data into Amazon S3
Learn how to start an Amazon EMR cluster
Get ready for the Big Data deep dives at re:Invent

Who Should Attend:

Architects and developers interested in starting a Big Data initiative

  1. Building your first Big Data application on AWS. Rahul Bhartia, Ecosystem Solution Architect, 15th September 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Big Data ecosystem on AWS: data flows from collection to answers through Collect, Store, Process, and Analyze stages. Data collection and storage: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora). Event processing: AWS Lambda, KCL apps. Data processing: Amazon EMR. Data analysis: Amazon Redshift, Amazon Machine Learning.
  3. Your first Big Data application on AWS?
  4. Big Data ecosystem on AWS - Collect
  5. Big Data ecosystem on AWS - Process
  6. Big Data ecosystem on AWS - Analyze (SQL)
  7. Setup
  8. Resources: 1. AWS Command Line Interface (aws-cli) configured. 2. Amazon Kinesis stream with a single shard. 3. Amazon S3 bucket to hold the files. 4. Amazon EMR cluster (two nodes) with Spark and Hive. 5. Amazon Redshift data warehouse cluster (single node).
  9. Amazon Kinesis. Create an Amazon Kinesis stream to hold incoming data:

    aws kinesis create-stream --stream-name AccessLogStream --shard-count 1
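A useful follow-up check, not shown on the slide: the stream must be ACTIVE before you send data to it. For example:

    aws kinesis describe-stream --stream-name AccessLogStream --query 'StreamDescription.StreamStatus'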
  10. Amazon S3
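The slide itself is a screenshot. As a sketch, creating the bucket with aws-cli would look like this (YOUR-S3-BUCKET is the same placeholder used later in the deck):

    aws s3 mb s3://YOUR-S3-BUCKET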
  11. Amazon EMR
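Again a sketch rather than the slide's exact command: launching the two-node cluster with the Spark and Hive applications listed on the Resources slide. The release label, instance type, and key name are assumptions:

    aws emr create-cluster --name demo \
      --release-label emr-4.0.0 \
      --applications Name=Hive Name=Spark \
      --instance-type m3.xlarge --instance-count 2 \
      --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
      --use-default-roles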
  12. Amazon Redshift
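A sketch of creating the single-node data warehouse cluster from the Resources slide; the node type, database name, and credentials are placeholders and assumptions:

    aws redshift create-cluster --cluster-identifier demo \
      --db-name demo --node-type dc1.large --cluster-type single-node \
      --master-username master --master-user-password YOUR-REDSHIFT-PASSWORD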
  13. Your first Big Data application on AWS: 1. COLLECT: stream data into Kinesis with Log4J. 2. PROCESS: process data with EMR using Spark and Hive. 3. ANALYZE: analyze data in Redshift using SQL. (STORE: Amazon S3 sits between the stages.)
  14. 1. Collect
  15. Amazon Kinesis Log4J Appender
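The slide shows configuration for the Kinesis Log4J Appender (the awslabs kinesis-log4j-appender project). A sketch of a log4j.properties based on that project's published sample; exact property names should be checked against the version you use:

    log4j.logger.KinesisLogger=INFO, KINESIS
    log4j.additivity.KinesisLogger=false
    log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
    log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
    log4j.appender.KINESIS.layout.ConversionPattern=%m
    log4j.appender.KINESIS.streamName=AccessLogStream
    log4j.appender.KINESIS.encoding=UTF-8
    log4j.appender.KINESIS.maxRetries=3
    log4j.appender.KINESIS.bufferSize=1000
    log4j.appender.KINESIS.threadCount=20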
  16. Log file format
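The format is the Apache combined log format, which is what the regex on slide 26 parses. A representative, made-up line, with the fields host, identity, user, request_time, request, status, size, referrer, and agent:

    66.249.67.3 - - [20/Jul/2009:20:12:22 -0700] "GET /gallery/main.php HTTP/1.1" 200 40819 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"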
  17. Spark: a fast and general engine for large-scale data processing. Write applications quickly in Java, Scala, or Python. Combine SQL, streaming, and complex analytics.
  18. Using Spark on EMR
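The slide is a terminal screenshot. In outline, you SSH to the cluster's master node and start the interactive shell (the key file and host name are placeholders):

    ssh -i YOUR-AWS-SSH-KEY.pem hadoop@YOUR-EMR-MASTER-PUBLIC-DNS
    spark-shell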
  19. Amazon Kinesis and Spark Streaming (architecture diagram): a producer writes to Amazon Kinesis; a Spark Streaming application on Amazon EMR uses the KCL to read from Kinesis, checkpointing to DynamoDB, and writes its output to Amazon S3.
  20. Spark Streaming: reading from Kinesis (see the combined sketch after slide 21)
  21. Spark Streaming: writing to Amazon S3
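Slides 20 and 21 show this code as screenshots. A minimal Scala sketch of the same idea, assuming the Spark 1.4-era KinesisUtils.createStream signature from spark-streaming-kinesis-asl; the stream name, endpoint, and S3 prefix follow the placeholders used elsewhere in the deck:

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    // one micro-batch every 10 seconds; sc is the shell's SparkContext
    val ssc = new StreamingContext(sc, Seconds(10))

    // read raw records from the Kinesis stream; the receiver uses the KCL,
    // which checkpoints its position in DynamoDB
    val lines = KinesisUtils.createStream(
      ssc, "AccessLogStream", "https://kinesis.us-east-1.amazonaws.com",
      Seconds(10), InitialPositionInStream.TRIM_HORIZON,
      StorageLevel.MEMORY_AND_DISK_2)

    // decode each record and write one set of files per batch to Amazon S3
    lines.map(record => new String(record, "UTF-8"))
         .saveAsTextFiles("s3://YOUR-S3-BUCKET/access-log-raw")

    ssc.start()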
  22. View the output files in Amazon S3
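For example, with a generic listing command rather than the slide's screenshot:

    aws s3 ls s3://YOUR-S3-BUCKET/ --recursive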
  23. 2. Process
  24. Amazon EMR's Hive: adapts SQL-like (HiveQL) queries to run on Hadoop. Schema on read: map a table onto the input data. Access data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis. Query complex input formats using a SerDe. Transform data with user-defined functions (UDFs).
  25. Using Hive on Amazon EMR
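As with Spark, the slide shows a terminal session; the shape of it is an SSH connection to the master node followed by the Hive CLI:

    ssh -i YOUR-AWS-SSH-KEY.pem hadoop@YOUR-EMR-MASTER-PUBLIC-DNS
    hive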
  26. Create a table that points to your Amazon S3 bucket (the regex escapes below are reconstructed to the standard Apache-log RegexSerDe pattern; the page extraction had dropped the backslashes):

    CREATE EXTERNAL TABLE access_log_raw (
      host STRING, identity STRING, user STRING, request_time STRING,
      request STRING, status STRING, size STRING, referrer STRING, agent STRING
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
    )
    LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

    MSCK REPAIR TABLE access_log_raw;
  27. Process data using Hive. We will transform the data returned by the query before writing it to our Amazon S3-stored external Hive table. Hive user-defined functions (UDFs) used for the text transformations: from_unixtime, unix_timestamp, and hour. The "hour" value is important: it is what's used to split and organize the output files before writing to Amazon S3. These splits will let us load the data into Amazon Redshift more efficiently later in the lab using the parallel COPY command.
  28. Create an external Hive table in Amazon S3
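The slide's DDL is a screenshot. A sketch that is consistent with the INSERT on slide 30 (same six columns, partitioned by hour) and with the tab-delimited, GZIP-compressed output implied by the Redshift COPY on slide 39; the exact types and delimiter are assumptions:

    CREATE EXTERNAL TABLE access_log_processed (
      request_time STRING,
      host STRING,
      request STRING,
      status STRING,
      referrer STRING,
      agent STRING
    )
    PARTITIONED BY (hour INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';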
  29. Configure partition and compression
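A sketch of what this slide most likely sets, using standard Hive settings: dynamic partitioning on the computed hour column, and GZIP output to match the GZIP option in the Redshift COPY later:

    -- write into partitions derived from the SELECT, not fixed in advance
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- compress the output files so Redshift can load them with COPY ... GZIP
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;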
  30. Query Hive and write output to Amazon S3:

    -- convert the Apache log timestamp to a UNIX timestamp
    -- split files in Amazon S3 by the hour in the log lines
    INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
    SELECT
      from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
      host, request, status, referrer, agent,
      hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
    FROM access_log_raw;
  31. Viewing job status: http://127.0.0.1:9026
  32. View the output files in Amazon S3 (as on slide 22)
  33. Spark SQL: Spark's module for working with structured data using SQL. Run unmodified Hive queries on existing data.
  34. Using Spark-SQL on Amazon EMR
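On the master node, the Spark SQL shell is started much like the others:

    spark-sql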
  35. Query the data with Spark
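An illustrative query, not necessarily the slide's, that Spark SQL can run unmodified against the Hive table defined earlier:

    -- requests per HTTP status code across the processed logs
    SELECT status, COUNT(*) AS requests
    FROM access_log_processed
    GROUP BY status
    ORDER BY requests DESC;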
  36. 3. Analyze
  37. Connect to Amazon Redshift
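A sketch using psql (Redshift speaks the PostgreSQL protocol, and 5439 is its default port; the endpoint and database name are placeholders matching the cluster sketch on slide 12):

    psql -h YOUR-REDSHIFT-ENDPOINT.redshift.amazonaws.com -p 5439 -U master demo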
  38. Create an Amazon Redshift table to hold your data
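A sketch matching the six columns written by the Hive job (the hour partition lives in the S3 path, not in the files); the distribution and sort keys are illustrative choices, not taken from the slide:

    CREATE TABLE accesslogs (
      request_time TIMESTAMP,
      host VARCHAR(50),
      request VARCHAR(1024),
      status INT,
      referrer VARCHAR(1024),
      agent VARCHAR(1024)
    )
    DISTKEY (host)
    SORTKEY (request_time);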
  39. Loading data into Amazon Redshift. The COPY command loads files in parallel:

    COPY accesslogs
    FROM 's3://YOUR-S3-BUCKET/access-log-processed'
    CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
    DELIMITER '\t' IGNOREHEADER 0 MAXERROR 0 GZIP;
  40. Amazon Redshift test queries
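For instance, a query in the spirit of the finding on the next slide, counting 404s per requested path; illustrative rather than the slide's exact SQL:

    -- which missing resources are requested most often?
    SELECT TOP 10 request, COUNT(*) AS errors
    FROM accesslogs
    WHERE status = 404
    GROUP BY request
    ORDER BY errors DESC;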
  41. Your first Big Data application on AWS: a favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors.
  42. Try it yourself on the AWS Cloud… for around the same cost as a cup of coffee ($3.50).

    Service                 Est. Cost*
    Amazon Kinesis          $1.00
    Amazon S3 (free tier)   $0
    Amazon EMR              $0.44
    Amazon Redshift         $1.00
    Est. Total              $2.44

  *Estimated costs assume use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on the options selected, the size of the dataset, and usage.
  43. Thank you! AWS Big Data blog: blogs.aws.amazon.com/bigdata