
AWS APAC Webinar Week - Launching Your First Big Data Project on AWS

Want to get ramped up on how to use Amazon's big data services and launch your first big data application on AWS?

Join us on a journey as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3.

In this session, we review architecture design patterns for big data solutions on AWS and give you access to everything you need to rebuild and customize the application yourself.


  1. aws.amazon.com/webinars/apac/webinar-week | #AWSWebinarWeek
  2. Launching Your First Big Data Project on AWS. Russell Nash – AWS Solutions Architect
  3. The AWS big data pipeline: Collect → Process → Analyze, with storage at each stage, turning Data into Answers. Services shown: Amazon S3, Amazon Kinesis, Amazon DynamoDB, and Amazon RDS (Aurora) for data collection and storage; AWS Lambda, KCL apps, and Amazon EMR for data and event processing; Amazon Redshift and Amazon Machine Learning for data analysis.
  4. Your first big data application on AWS
  5. Collect → Process → Analyze → Store: Data becomes Answers
  6. Collect → Process → Analyze → Store: Data becomes Answers
  7. Collect → Process → Analyze → Store: Data becomes Answers (analysis via SQL)
  8. Set up the AWS CLI
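     The setup commands themselves are not in the transcript; a minimal sketch of installing and configuring the AWS CLI with the IAM credentials and region used throughout the lab (all values are placeholders):

       # install the AWS CLI if needed, then configure credentials and a default region
       pip install awscli
       aws configure
       # AWS Access Key ID:     YOUR-IAM-ACCESS-KEY
       # AWS Secret Access Key: YOUR-SECRET-KEY
       # Default region name:   YOUR-REGION
       # Default output format: json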
  9. Amazon Kinesis. Create a single-shard Amazon Kinesis stream for incoming log data:
     aws kinesis create-stream --stream-name AccessLogStream --shard-count 1
  10. Amazon S3. Create a bucket (YOUR-BUCKET-NAME) to hold the application's log data:
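     A sketch of creating the bucket with the AWS CLI (bucket name and region are placeholders):

       aws s3 mb s3://YOUR-BUCKET-NAME --region YOUR-REGION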
  11. Amazon EMR. Launch a 3-node Amazon EMR cluster with Spark and Hive (instance type m3.xlarge, SSH key YOUR-AWS-SSH-KEY):
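     The launch command is elided in the transcript; a representative AWS CLI invocation consistent with the slide (the cluster name, EMR release label, and use of default roles are assumptions):

       aws emr create-cluster \
         --name "BigDataApplication" \
         --release-label emr-4.2.0 \
         --applications Name=Hive Name=Spark \
         --instance-type m3.xlarge \
         --instance-count 3 \
         --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
         --use-default-roles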
  12. Amazon Redshift. Launch an Amazon Redshift cluster, choosing a master password (CHOOSE-A-REDSHIFT-PASSWORD):
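     The create command is elided; a sketch using the AWS CLI, with the node type, cluster identifier, database name, and master user name as assumptions:

       aws redshift create-cluster \
         --cluster-identifier demo \
         --db-name demo \
         --node-type dc1.large \
         --cluster-type single-node \
         --master-username master \
         --master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
         --publicly-accessible \
         --port 5439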
  13. Your first big data application on AWS: 1. COLLECT: stream data into Amazon Kinesis with Log4J; 2. PROCESS: process data with Amazon EMR using Spark & Hive; 3. ANALYZE: analyze data in Amazon Redshift using SQL. Amazon S3 is the STORE between each step.
  14. 1. Collect
  15. Amazon Kinesis Log4J Appender. In a separate terminal window on your local machine, download the Log4J appender: Then download and save the sample Apache log file:
  16. Amazon Kinesis Log4J Appender. Create a file called AwsCredentials.properties with credentials for an IAM user with permission to write to Amazon Kinesis:
     accessKey=YOUR-IAM-ACCESS-KEY
     secretKey=YOUR-SECRET-KEY
     Then start the Amazon Kinesis Log4J Appender:
  17. Log file format
  18. Spark • Fast, general-purpose engine for large-scale data processing • Write applications quickly in Java, Scala, or Python • Combine SQL, streaming, and complex analytics
  19. Amazon Kinesis and Spark Streaming: the Log4J appender writes log lines into Amazon Kinesis; Spark Streaming on Amazon EMR reads the stream using the Kinesis Client Library (which checkpoints its progress in Amazon DynamoDB) and writes the results to Amazon S3.
  20. Using Spark Streaming on Amazon EMR. SSH to the master node (key YOUR-AWS-SSH-KEY, host YOUR-EMR-HOSTNAME) with -o TCPKeepAlive=yes -o ServerAliveInterval=30 to keep the session alive. On your cluster, download the Amazon Kinesis client for Spark:
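     The full commands are not in the transcript; a sketch, assuming the master node is reached as the hadoop user and the Kinesis Client Library jar (version 1.6.0, as referenced on the next slide) is fetched from Maven Central:

       # connect to the EMR master node
       ssh -o TCPKeepAlive=yes -o ServerAliveInterval=30 \
           -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME

       # on the cluster: download the Amazon Kinesis Client Library jar
       wget https://repo1.maven.org/maven2/com/amazonaws/amazon-kinesis-client/1.6.0/amazon-kinesis-client-1.6.0.jar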
  21. Using Spark Streaming on Amazon EMR. Cut down on console noise: Start the Spark shell:
     spark-shell --jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar,amazon-kinesis-client-1.6.0.jar \
       --driver-java-options "-Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"
  22. Using Spark Streaming on Amazon EMR. /* import required libraries */
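     The import block is elided; a sketch of the imports needed by the code on slide 24 and the streaming sketches below (inferred from that code, not copied from the slide):

       // Spark Streaming and the Kinesis connector
       import org.apache.spark.storage.StorageLevel
       import org.apache.spark.streaming.{Seconds, StreamingContext}
       import org.apache.spark.streaming.kinesis.KinesisUtils

       // AWS SDK / Kinesis Client Library classes used on slide 24
       import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
       import com.amazonaws.services.kinesis.AmazonKinesisClient
       import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

       // used when naming time-based output prefixes in Amazon S3
       import java.util.Date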
  23. Using Spark Streaming on Amazon EMR. /* Set up the variables as needed */ YOUR-REGION YOUR-S3-BUCKET /* Reconfigure the spark-shell */
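     The variable definitions are elided; a minimal sketch consistent with the names used on slide 24 (streamName, endpointUrl, outputInterval) plus an output location; the region, bucket, and interval values are placeholders/assumptions:

       // the stream created earlier, and the Kinesis endpoint for its region
       val streamName = "AccessLogStream"
       val endpointUrl = "https://kinesis.YOUR-REGION.amazonaws.com"

       // batch interval for Spark Streaming (also used as the checkpoint interval on slide 24)
       val outputInterval = Seconds(60)

       // where the raw log lines will be written in Amazon S3
       val outputDir = "s3://YOUR-S3-BUCKET/access-log-raw"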
  24. Reading Amazon Kinesis with Spark Streaming:
     /* Set up the Amazon Kinesis client */
     val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
     kinesisClient.setEndpoint(endpointUrl)

     /* Determine the number of shards from the stream */
     val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size()

     /* Create one worker per Kinesis shard */
     val ssc = new StreamingContext(sc, outputInterval)
     val kinesisStreams = (0 until numShards).map { i =>
       KinesisUtils.createStream(ssc, streamName, endpointUrl, outputInterval,
         InitialPositionInStream.TRIM_HORIZON, StorageLevel.MEMORY_ONLY)
     }
  25. Writing to Amazon S3 with Spark Streaming. /* Merge the worker DStreams and translate the byte array to string */ /* Write each RDD to Amazon S3 */
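     The body of this slide is elided; one plausible shape for what the two comments describe, reusing ssc, kinesisStreams, and outputDir from the sketches above and writing Hive-style year=/month=/day=/hour=/min= prefixes so the msck repair on slide 30 can discover the partitions:

       /* Merge the worker DStreams and translate the byte array to string */
       val unionStreams = ssc.union(kinesisStreams)
       val accessLogs = unionStreams.map(byteArray => new String(byteArray))

       /* Write each batch to Amazon S3 under a time-based partition prefix */
       import java.text.SimpleDateFormat
       accessLogs.foreachRDD { (rdd, time) =>
         if (!rdd.isEmpty()) {
           val fmt = new SimpleDateFormat("'year='yyyy'/month='MM'/day='dd'/hour='HH'/min='mm")
           rdd.saveAsTextFile(outputDir + "/" + fmt.format(new Date(time.milliseconds)))
         }
       }

       /* Start the streaming job */
       ssc.start()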
  26. View the output files in Amazon S3 (YOUR-S3-BUCKET, organized by yyyy/mm/dd/HH):
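     One way to list the output, assuming the prefix from the sketches above:

       aws s3 ls s3://YOUR-S3-BUCKET/access-log-raw/ --recursive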
  27. 2. Process
  28. Spark SQL • Spark's module for working with structured data using SQL • Run unmodified Hive queries on existing data
  29. Using Spark SQL on Amazon EMR. SSH to the master node (key YOUR-AWS-SSH-KEY, host YOUR-EMR-HOSTNAME), then start the Spark SQL shell:
     spark-sql --driver-java-options "-Dlog4j.configuration=file:///etc/spark/conf/log4j.properties"
  30. Create a table that points to your Amazon S3 bucket:
     CREATE EXTERNAL TABLE access_log_raw(
       host STRING, identity STRING, user STRING, request_time STRING,
       request STRING, status STRING, size STRING, referrer STRING, agent STRING
     )
     PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
     WITH SERDEPROPERTIES (
       "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
     )
     LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

     msck repair table access_log_raw;
  31. Query the data with Spark SQL:
     -- return the first row in the stream
     -- return the count of all items in the stream
     -- find the top 10 hosts
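     The queries themselves are elided; sketches of what each comment describes, run against the access_log_raw table defined on slide 30:

       -- return the first row in the stream
       SELECT * FROM access_log_raw LIMIT 1;

       -- return the count of all items in the stream
       SELECT COUNT(1) FROM access_log_raw;

       -- find the top 10 hosts
       SELECT host, COUNT(1) AS request_count
       FROM access_log_raw
       GROUP BY host
       ORDER BY request_count DESC
       LIMIT 10;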
  32. Preparing the data for Amazon Redshift import • We will transform the data returned by the query before writing it to our Amazon S3-stored external Hive table • Hive user-defined functions (UDFs) used for the text transformations: from_unixtime, unix_timestamp, and hour • The “hour” value is important: it is used to split and organize the output files before writing to Amazon S3. These splits allow us to load the data into Amazon Redshift more efficiently later in the lab using the parallel “COPY” command.
  33. Create an external table in Amazon S3 (YOUR-S3-BUCKET):
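     The DDL is elided; a sketch of a processed-log table whose columns match the INSERT on slide 35; the tab delimiter matches the Redshift COPY on slide 40, while the exact types are assumptions:

       CREATE EXTERNAL TABLE access_log_processed (
         request_time STRING,
         host         STRING,
         request      STRING,
         status       STRING,
         referrer     STRING,
         agent        STRING
       )
       PARTITIONED BY (hour INT)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
       LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';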
  34. Configure partition and compression:
     -- set up Hive's "dynamic partitioning"
     -- this will split output files when writing to Amazon S3
     -- compress output files on Amazon S3 using Gzip
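     The SET statements are elided; the usual Hive settings for what the comments describe (a sketch, not necessarily the slide's exact list):

       -- set up Hive's "dynamic partitioning";
       -- this will split output files when writing to Amazon S3
       SET hive.exec.dynamic.partition.mode=nonstrict;
       SET hive.exec.dynamic.partition=true;

       -- compress output files on Amazon S3 using Gzip
       SET mapred.output.compress=true;
       SET hive.exec.compress.output=true;
       SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;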
  35. Write output to Amazon S3:
     -- convert the Apache log timestamp to a UNIX timestamp
     -- split files in Amazon S3 by the hour in the log lines
     INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
     SELECT
       from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
       host,
       request,
       status,
       referrer,
       agent,
       hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
     FROM access_log_raw;
  36. View the output files in Amazon S3 (YOUR-S3-BUCKET)
  37. 3. Analyze
  38. Connect to Amazon Redshift using the PostgreSQL CLI (against YOUR-REDSHIFT-ENDPOINT), or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support • Aginity Workbench for Amazon Redshift • SQL Workbench/J
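     The psql invocation is elided; a sketch assuming the database name and master user from the Redshift sketch on slide 12 and the default Redshift port:

       # using the PostgreSQL CLI
       psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master -d demo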
  39. Create an Amazon Redshift table to hold your data:
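     The DDL is elided; a sketch whose columns mirror the processed Hive table, with column types and the distribution/sort keys as assumptions (the table name accesslogs comes from the COPY command on slide 40):

       CREATE TABLE accesslogs (
         request_timestamp TIMESTAMP,
         host              VARCHAR(256),
         request           VARCHAR(2048),
         status            INT,
         referrer          VARCHAR(2048),
         agent             VARCHAR(2048)
       )
       DISTKEY (host)
       SORTKEY (request_timestamp);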
  40. Loading data into Amazon Redshift • The “COPY” command loads files in parallel:
     COPY accesslogs
     FROM 's3://YOUR-S3-BUCKET/access-log-processed'
     CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
     DELIMITER '\t'
     IGNOREHEADER 0
     MAXERROR 0
     GZIP;
  41. Amazon Redshift test queries:
     -- find the distribution of status codes over days
     -- find the 404 status codes
     -- show the requests that returned PAGE NOT FOUND
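     The queries are elided; sketches of what each comment describes, against the accesslogs table (assuming the request column holds the requested path):

       -- find the distribution of status codes over days
       SELECT TRUNC(request_timestamp) AS day, status, COUNT(1)
       FROM accesslogs
       GROUP BY 1, 2
       ORDER BY 1, 2;

       -- count the 404 status codes
       SELECT COUNT(1) FROM accesslogs WHERE status = 404;

       -- show the requests that returned PAGE NOT FOUND (404)
       SELECT TOP 10 request, COUNT(1) AS times
       FROM accesslogs
       WHERE status = 404
       GROUP BY request
       ORDER BY times DESC;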
  42. Your first big data application on AWS: a favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors
  43. Visualize the results • Client-side JavaScript example using Plottable, a library built on D3 • Hosted on Amazon S3 for pennies a month • An AWS Lambda function is used to query Amazon Redshift
  44. Try it yourself on the AWS cloud… around the same cost as a cup of coffee ($3.50).
     Service                 Est. Cost*
     Amazon Kinesis          $1.00
     Amazon S3 (free tier)   $0
     Amazon EMR              $0.44
     Amazon Redshift         $1.00
     Est. Total              $2.44
     *Estimated costs assume use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on the options selected, size of dataset, and usage.
  45. Learn from AWS big data experts: blogs.aws.amazon.com/bigdata
  46. Online Labs & Training: gain confidence and hands-on experience with AWS. Watch free instructional videos and explore self-paced labs. Instructor-Led Classes: learn how to design, deploy, and operate highly available, cost-effective, and secure applications on AWS in courses led by qualified AWS instructors. AWS Certification: validate your technical expertise with AWS and use practice exams to help you prepare for AWS Certification. More info at http://aws.amazon.com/training
