The Big Data ecosystem is moving so fast that it is nearly impossible to keep pace. Meanwhile, demand for strong analytical and data management skills continues to grow. So, how can you get up to speed?
Join us for this webinar, where we will help you get ramped up on how to use Amazon’s Big Data web services. In just 50 minutes, we will build a Big Data application using Amazon Elastic MapReduce (EMR) and other AWS Big Data services. In addition, we will review best practices and architecture design patterns for Big Data. Attending re:Invent? Here is one more reason not to miss this webinar: it will help you get ready for some of our Big Data deep dives!
Learning Objectives:
Learn about key AWS Big Data services including Amazon S3, Amazon EMR, Amazon Kinesis, and Amazon Redshift
Learn about Big Data architectural patterns
Learn how to ingest data into Amazon S3
Learn how to start an Amazon EMR cluster
Help those attending re:Invent get up to speed with Big Data services
Who Should Attend:
Architects and developers interested in starting a Big Data initiative
8. Resources
1. AWS Command Line Interface (aws-cli) configured
2. Amazon Kinesis stream with a single shard
3. Amazon S3 bucket to hold the files
4. Amazon EMR cluster (two nodes) with Spark and Hive
5. Amazon Redshift data warehouse cluster (single node)
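Most of these resources can be created from the aws-cli before the lab begins. A minimal sketch, assuming default IAM roles already exist; the bucket name, key pair, release label, instance types, and password are placeholders to adjust:

# create the S3 bucket that will hold the files
aws s3 mb s3://YOUR-S3-BUCKET

# launch a two-node EMR cluster with Spark and Hive
aws emr create-cluster \
    --name "BigDataWebinar" \
    --release-label emr-4.2.0 \
    --applications Name=Hive Name=Spark \
    --use-default-roles \
    --ec2-attributes KeyName=YOUR-KEY-PAIR \
    --instance-type m3.xlarge \
    --instance-count 2

# launch a single-node Redshift data warehouse cluster
aws redshift create-cluster \
    --cluster-identifier demo \
    --cluster-type single-node \
    --node-type dc1.large \
    --master-username master \
    --master-user-password YOUR-PASSWORD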
9. Amazon Kinesis
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
    --stream-name AccessLogStream \
    --shard-count 1
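The stream takes a few moments to provision; you can confirm that its status is ACTIVE before pushing data into it:

aws kinesis describe-stream \
    --stream-name AccessLogStream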
13. Your first Big Data application on AWS
1. COLLECT: Stream data into Amazon Kinesis with Log4J
2. PROCESS: Process data with Amazon EMR using Spark & Hive
3. ANALYZE: Analyze data in Amazon Redshift using SQL
(STORE: Amazon S3 holds the data between each step)
18. Spark
•Fast and general engine for large-scale data processing
•Write applications quickly in Java, Scala, or Python
•Combine SQL, streaming, and complex analytics
20. Amazon Kinesis and Spark Streaming
(Architecture diagram) Producer → Amazon Kinesis → Amazon EMR → Amazon S3. A Spark Streaming application running on Amazon EMR uses the Kinesis Client Library (KCL) to read from the Kinesis stream (the KCL tracks its checkpoints in Amazon DynamoDB) and writes the results to Amazon S3.
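Submitting such an application on the EMR master node might look like this (a sketch; the script name is hypothetical, and the spark-streaming-kinesis-asl version must match your cluster’s Spark build):

# run the streaming job on the EMR master node
# (kinesis_to_s3.py is a hypothetical script name; pin the
#  kinesis-asl package version to your Spark build)
spark-submit \
    --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.1 \
    kinesis_to_s3.py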
27. Amazon EMR’s Hive
Runs SQL-like (HiveQL) queries on Hadoop
Schema on read: maps a table onto the input data
Accesses data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis
Queries complex input formats using a SerDe
Transforms data with User Defined Functions (UDFs)
29. Create a table that points to your Amazon S3 bucket
CREATE EXTERNAL TABLE access_log_raw(
  host STRING, identity STRING,
  user STRING, request_time STRING,
  request STRING, status STRING,
  size STRING, referrer STRING,
  agent STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';
-- register the partitions that already exist in Amazon S3
MSCK REPAIR TABLE access_log_raw;
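To confirm that the partitions were picked up, you can list them:

SHOW PARTITIONS access_log_raw;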
31. Process data using Hive
We will transform the data that is returned by the query before writing it
to our Amazon S3-stored external Hive table
Hive User Defined Functions (UDF) in use for the text transformations:
from_unixtime, unix_timestamp and hour
The “hour” value is important: this is what’s used to split and organize
the output files before writing to Amazon S3. These splits will allow us
to more efficiently load the data into Amazon Redshift later in the lab
using the parallel “COPY” command
34. Query Hive and write output to Amazon S3
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
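-- dynamic-partition inserts like the one below may require nonstrict
-- mode, depending on cluster defaults (an assumption; this SET is not
-- shown in the original deck)
SET hive.exec.dynamic.partition.mode=nonstrict;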
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
from_unixtime(unix_timestamp(request_time,
'[dd/MMM/yyyy:HH:mm:ss Z]')),
host,
request,
status,
referrer,
agent,
hour(from_unixtime(unix_timestamp(request_time,
'[dd/MMM/yyyy:HH:mm:ss Z]'))) as hour
FROM access_log_raw;
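As noted above, these hourly splits let Amazon Redshift load the files in parallel. A minimal sketch of that later COPY step, assuming a target table named access_logs, a placeholder IAM role, and tab-delimited output files (none of these are defined in this excerpt):

-- load the processed logs from Amazon S3 into Amazon Redshift;
-- COPY reads the files under the prefix in parallel
COPY access_logs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS 'aws_iam_role=YOUR-REDSHIFT-ROLE-ARN'
DELIMITER '\t';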
50. Your first Big Data application on AWS
Adding a favicon would fix 398 of the 977 total PAGE NOT FOUND (404) errors
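Numbers like these could be produced in Amazon Redshift with queries along these lines (a sketch; the access_logs table and its request/status columns are assumptions carried over from the raw schema above):

-- total 404 errors
SELECT COUNT(*) FROM access_logs WHERE status = '404';

-- 404 errors that a favicon would eliminate
SELECT COUNT(*) FROM access_logs
WHERE status = '404' AND request LIKE '%favicon.ico%';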
51. Try it yourself on the AWS Cloud… …around the same cost as a cup of coffee ($3.50)

Service                  Est. Cost*
Amazon Kinesis           $1.00
Amazon S3 (free tier)    $0.00
Amazon EMR               $0.44
Amazon Redshift          $1.00
Est. Total               $2.44

*Estimated costs assume use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on the options selected, the size of the dataset, and usage.