(DEV309) Large-Scale Metrics Analysis in Ruby

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Alex Wood, AWS SDKs and Tools Team
October 2015
Data Processing from Scratch

Many Shapes, Sizes, and Sources

This Talk Is For Me, 2 Years Ago

What to expect from the session
• High-level overview
• Writing a log-processing job
• Log-processing automation
• Amazon Redshift ingestion
• Building reports
• Finer points and advanced techniques
• Conclusion

From web logs
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:55:36 "GET /admin.html" 200 2326
337.899.380.827 5bb3ee4186 osvaldohuels "IE6" 7/Oct/2015 13:55:41 "GET /products/141.html" 200 1214
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:55:51 "GET /" 200 4132
205.67.420.496 080c8f7a44 - "Safari" 7/Oct/2015 13:56:01 "GET /" 200 4123
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:56:14 "GET /products/23.html" 200 1315
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:11 "POST /admin.html" 204 34
10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:57:13 "GET /admin.html" 200 2312
510.514.49.310 9dae697a8e - "Chrome" 7/Oct/2015 13:57:29 "GET /" 200 4139

To digestible output
Reports:
Date        Request Count
2015-10-01  26,781
2015-10-02  26,864
2015-10-03  20,310
2015-10-04  14,409
2015-10-05  29,029
2015-10-06  26,545
2015-10-07  27,940
Ad hoc queries:
SELECT REQUEST,
       SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
  AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
  "VISITS" => "14505" }

Log-processing system
Logs in Amazon S3 -> Amazon Elastic MapReduce (EMR) -> Amazon Redshift -> Reports

Example S3 objects
Separate logs with prefixes:
log/2015-10-06/22h.log
log/2015-10-06/23h.log
log/2015-10-07/0h.log
log/2015-10-07/1h.log
log/2015-10-07/2h.log
log/2015-10-07/3h.log
log/2015-10-07/4h.log
log/2015-10-07/5h.log
EMR w/ input prefix:
"-input",
"s3://bucket/log/2015-10-07/"

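Because the date is part of the key prefix, both the object key for an hourly log and the `-input` value for a whole day can be computed from a timestamp. A minimal sketch, assuming the `log/YYYY-MM-DD/Hh.log` layout shown in the example objects; the helper names `log_key` and `input_prefix` are hypothetical, and `bucket` is the placeholder bucket name from the slide:

```ruby
# Hypothetical helpers; the key layout mirrors the example objects above.

# Key of the hourly log object that contains `time`, e.g. "log/2015-10-07/0h.log".
# The hour is not zero-padded, matching the example keys.
def log_key(time)
  "log/#{time.strftime('%Y-%m-%d')}/#{time.hour}h.log"
end

# Value for the streaming job's "-input" argument, covering one whole day.
def input_prefix(bucket, time)
  "s3://#{bucket}/log/#{time.strftime('%Y-%m-%d')}/"
end

now = Time.utc(2015, 10, 7, 0, 30)
puts log_key(now)
puts input_prefix("bucket", now)
```
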
Amazon Elastic MapReduce overview
[Diagram: master (job tracker) and workers (mappers, reducers)]

Streaming jobs
• Built-in streaming JAR
• Bring your own mapper
• Bring your own reducer
• Hadoop does orchestration

Mapper
• Input by line from STDIN
o Ruby ARGF
• Output to STDOUT
• Bottom line: Filter values
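
The mapper described above could look roughly like the following. This is a sketch, not the talk's sample code: the regular expression is an assumption inferred from the sample log lines, and the choice of output fields (the FACT table columns, tab-separated) is one plausible way to "filter values". The method names are hypothetical. The demo feeds a `StringIO`; in a real streaming job the last call would be `run_mapper(ARGF, $stdout)`.

```ruby
require "stringio"

# Assumed line layout, inferred from the sample web logs:
# ip session_id username "user agent" date time "request" status bytes
LOG_PATTERN = /\A(\S+) (\S+) (\S+) "([^"]*)" (\S+) (\S+) "([^"]*)" (\d{3}) (\d+)\s*\z/

# Returns a tab-separated record with only the fields we keep,
# or nil when the line does not look like a log entry.
def map_line(line)
  match = LOG_PATTERN.match(line)
  return nil unless match

  _ip, session_id, username, user_agent, date, _time, request, status, _bytes = match.captures
  [username, session_id, user_agent, date, request, status].join("\t")
end

# Reads input line by line and writes one record per valid line.
def run_mapper(input, output)
  input.each_line do |line|
    record = map_line(line.chomp)
    output.puts(record) if record
  end
end

sample = StringIO.new(<<~LOG)
  10.3.0.210 7c667f5dcd mckenzieheathcote "Firefox" 7/Oct/2015 13:55:36 "GET /admin.html" 200 2326
  this line is not a log entry
LOG
mapped = StringIO.new
run_mapper(sample, mapped)
puts mapped.string
```

Putting the username first makes it the key under Hadoop streaming's default convention, where the text before the first tab is the key.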

Reducer
• Sorted by Hadoop
• Mapper output line by line
o Again using STDIN
• Transform output
• Count duplicates
• Output to STDOUT
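
A reducer that counts duplicates might be sketched as follows. It assumes tab-separated mapper output in which the text before the first tab is the key (the Hadoop streaming default), and it only relies on lines with the same key arriving together, not on whole lines being sorted. The method name `run_reducer` is hypothetical; in a real job it would be called as `run_reducer(ARGF, $stdout)`.

```ruby
require "stringio"

# Counts identical records within each key group and appends the count
# as a final tab-separated column.
def run_reducer(input, output)
  current_key = nil
  counts = Hash.new(0)

  # Writes the finished key group and resets the tally.
  flush = lambda do
    counts.each { |record, count| output.puts("#{record}\t#{count}") }
    counts.clear
  end

  input.each_line do |line|
    record = line.chomp
    next if record.empty?

    key = record.split("\t", 2).first
    flush.call if key != current_key
    current_key = key
    counts[record] += 1
  end
  flush.call
end

sorted = StringIO.new("alice\tGET /\nalice\tGET /a\nalice\tGET /\nbob\tGET /\n")
reduced = StringIO.new
run_reducer(sorted, reduced)
puts reduced.string
```

Only one key group is held in memory at a time, so memory use depends on the largest group rather than on the whole input.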

Summary
• Streaming mappers and reducers are executable scripts.
• Hadoop manages streaming orchestration.
• Input comes through STDIN.
• Output sent to STDOUT.
• Can test locally:
• cat input.txt | ruby mapper.rb | sort | ruby reducer.rb > result.out

Concepts: Streaming step
• Mapper and reducer source files
• Input files
• Output destination
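
These three pieces map onto one entry of the `steps` list passed to `run_job_flow`. The sketch below builds such an entry as a plain hash. It assumes an EMR 4.x-style release, where `command-runner.jar` runs `hadoop-streaming`, and it assumes the scripts are executable with a Ruby shebang and that Ruby is available on the cluster nodes. The helper name and all S3 paths are hypothetical.

```ruby
# Builds one streaming step for Aws::EMR::Client#run_job_flow.
# Assumption: EMR 4.x-style release using command-runner.jar.
def streaming_step(name:, mapper:, reducer:, input:, output:)
  {
    name: name,
    action_on_failure: "TERMINATE_CLUSTER",
    hadoop_jar_step: {
      jar: "command-runner.jar",
      args: [
        "hadoop-streaming",
        "-files", "#{mapper},#{reducer}",   # mapper and reducer source files
        "-mapper", File.basename(mapper),
        "-reducer", File.basename(reducer),
        "-input", input,                    # input files (an S3 prefix)
        "-output", output                   # output destination
      ]
    }
  }
end

step = streaming_step(
  name: "Daily request counts",
  mapper: "s3://bucket/code/mapper.rb",     # placeholder paths
  reducer: "s3://bucket/code/reducer.rb",
  input: "s3://bucket/log/2015-10-07/",
  output: "s3://bucket/output-prefix/"
)
puts step[:hadoop_jar_step][:args].join(" ")
```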

Concepts: Instance configuration
• How many? How big?
• Master vs. worker
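
In the SDK, these choices become the `instances` section of the `run_job_flow` request. A minimal sketch, with placeholder instance types and counts that are assumptions rather than recommendations from the talk; the helper name is hypothetical:

```ruby
# Builds the :instances section for Aws::EMR::Client#run_job_flow.
# Instance types and the worker count are placeholders.
def instance_config(worker_count:, master_type: "m3.xlarge", worker_type: "m3.xlarge")
  {
    instance_groups: [
      { name: "Master",  instance_role: "MASTER", instance_type: master_type, instance_count: 1 },
      { name: "Workers", instance_role: "CORE",   instance_type: worker_type, instance_count: worker_count }
    ],
    # Shut the cluster down once its steps have finished.
    keep_job_flow_alive_when_no_steps: false
  }
end

config = instance_config(worker_count: 4)
config[:instance_groups].each do |group|
  puts "#{group[:instance_role]}: #{group[:instance_count]} x #{group[:instance_type]}"
end
```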

Console vs. SDK
• Console
• AWS SDK for Ruby:
  @client = Aws::EMR::Client.new
  @client.run_job_flow(opts)
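
The `opts` hash carries the cluster name, roles, instance configuration, and steps. The sketch below shows one possible shape. The release label and the default role names are assumptions about the account setup, and `FakeEmrClient` is a stand-in so the example runs without credentials or the SDK gem; with the SDK loaded you would pass `Aws::EMR::Client.new` instead.

```ruby
# Assembles run_job_flow options. Release label and role names are
# assumptions; adjust them to your account.
def job_flow_options(name:, log_uri:, instances:, steps:)
  {
    name: name,
    log_uri: log_uri,
    release_label: "emr-4.1.0",
    job_flow_role: "EMR_EC2_DefaultRole",
    service_role: "EMR_DefaultRole",
    instances: instances,
    steps: steps
  }
end

# Starts the cluster and returns its job flow ID.
def launch_cluster(client, opts)
  client.run_job_flow(opts).job_flow_id
end

# Stand-in for Aws::EMR::Client (hypothetical, for local runs only).
FakeResponse = Struct.new(:job_flow_id)
class FakeEmrClient
  attr_reader :last_request

  def run_job_flow(opts)
    @last_request = opts
    FakeResponse.new("j-EXAMPLE")
  end
end

client = FakeEmrClient.new
opts = job_flow_options(
  name: "Log processing",
  log_uri: "s3://bucket/emr-logs/",   # placeholder
  instances: { instance_groups: [] }, # filled in per job
  steps: []
)
puts launch_cluster(client, opts)
```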

End state
• Cluster A: Step 1, Step 2
• Cluster B: Step 3, Step 4, Step 5
• Cluster C: Step 6, Step 7

Summary
• AWS SDKs enable automation at scale.
• Getting started is simple.
• Separate common configuration from job-specific.
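
One way to separate common configuration from job-specific settings is to keep shared defaults in a single hash and merge per-job overrides into it. A minimal sketch; `COMMON_OPTIONS`, `deep_merge`, and all values are hypothetical:

```ruby
# Shared defaults for every cluster (placeholder values).
COMMON_OPTIONS = {
  log_uri: "s3://bucket/emr-logs/",
  instances: { keep_job_flow_alive_when_no_steps: false, ec2_key_name: "shared-key" }
}.freeze

# Recursively merges nested hashes; job-specific values win on conflict.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    old_value.is_a?(Hash) && new_value.is_a?(Hash) ? deep_merge(old_value, new_value) : new_value
  end
end

job_specific = {
  name: "Daily request counts",
  instances: { ec2_key_name: "reports-key" }
}
puts deep_merge(COMMON_OPTIONS, job_specific).inspect
```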

Key concepts
• Redshift ingestion uses a SQL COPY command.
• Input fields map one-to-one to table columns, separated by a delimiter.
o Fields must be in the same order as the table columns.
o The default delimiter is the pipe "|" character, but you can specify your own.
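
To make the ordering rule concrete, the sketch below serializes a record into one delimited line. The column order is taken from the FACT_DAILY_REQUESTS table in this deck; the helper name and the sample record are hypothetical. It refuses values that contain the delimiter, since such a line would shift every later column.

```ruby
# Column order must match the target table.
COLUMNS = %i[username session_id user_agent end_date request response_code request_count].freeze

# Serializes one record as a delimited line in table-column order.
def to_copy_line(record, delimiter: "|")
  fields = COLUMNS.map { |column| record.fetch(column).to_s }
  if fields.any? { |field| field.include?(delimiter) }
    raise ArgumentError, "field contains the delimiter #{delimiter.inspect}"
  end
  fields.join(delimiter)
end

record = {
  request: "GET /", username: "-", session_id: "9dae697a8e", user_agent: "Chrome",
  end_date: "7/Oct/2015", response_code: 200, request_count: 3
}
puts to_copy_line(record)
puts to_copy_line(record, delimiter: "\t")
```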

Our FACT Table
CREATE TABLE FACT_DAILY_REQUESTS(
USERNAME VARCHAR(30) NOT NULL DISTKEY,
SESSION_ID VARCHAR(10),
USER_AGENT VARCHAR(256) NOT NULL,
END_DATE DATE NOT NULL,
REQUEST VARCHAR(128) NOT NULL,
RESPONSE_CODE INTEGER NOT NULL,
REQUEST_COUNT INTEGER NOT NULL
)
INTERLEAVED SORTKEY(END_DATE,REQUEST,RESPONSE_CODE)

Copying from S3 to Redshift
COPY FACT_DAILY_REQUESTS
FROM 's3://bucket/output-prefix/part-'
DATEFORMAT AS 'DD/MON/YYYY'
DELIMITER '\t'
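
To automate the load, the COPY statement can be generated and sent over an ordinary PostgreSQL connection. A sketch under stated assumptions: the `authorization` argument stands for whatever authorization clause your cluster uses and is passed through as given; the connection is anything that responds to `exec(sql)`, as `PG::Connection` from the `pg` gem does; `RecordingConnection` is a stand-in so the example runs without a database. Inputs are interpolated as-is, so they must come from trusted configuration.

```ruby
# Builds a COPY statement for tab-delimited EMR output.
def copy_statement(table:, s3_prefix:, authorization:)
  <<~SQL
    COPY #{table}
    FROM '#{s3_prefix}'
    #{authorization}
    DATEFORMAT AS 'DD/MON/YYYY'
    DELIMITER '\\t'
  SQL
end

# Runs the COPY on any connection object that responds to exec(sql).
def load_into_redshift(connection, **options)
  connection.exec(copy_statement(**options))
end

# Stand-in connection (hypothetical) that records the SQL it receives.
class RecordingConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def exec(sql)
    @statements << sql
  end
end

connection = RecordingConnection.new
load_into_redshift(
  connection,
  table: "FACT_DAILY_REQUESTS",
  s3_prefix: "s3://bucket/output-prefix/part-",
  authorization: "IAM_ROLE '<role-arn>'" # placeholder, not a real value
)
puts connection.statements.first
```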

Summary
• Amazon Redshift is queried with SQL.
• You can reference an S3 source by prefix, as with EMR.
• If it is delimited, EMR's output is ready to load as-is.

Simple count
SELECT COUNT(DISTINCT USERNAME)
FROM FACT_DAILY_REQUESTS

Date-range queries
SELECT END_DATE, SUM(REQUEST_COUNT)
FROM FACT_DAILY_REQUESTS
WHERE END_DATE BETWEEN '2015-10-06' AND '2015-10-09'
GROUP BY END_DATE
ORDER BY END_DATE DESC

Advanced query – New user behavior
SELECT REQUEST, SUM(REQUEST_COUNT) AS TOTAL
FROM FACT_DAILY_REQUESTS f, DIM_USERS u
WHERE f.USERNAME = u.USERNAME
AND f.END_DATE BETWEEN '2015-10-01' AND '2015-10-07'
AND u.REGISTRATION_DATE >= '2015-10-01'
GROUP BY REQUEST
ORDER BY TOTAL DESC
LIMIT 10

Supports planned and ad hoc reports
Reports:
Date        Request Count
2015-10-01  26,781
2015-10-02  26,864
2015-10-03  20,310
2015-10-04  14,409
2015-10-05  29,029
2015-10-06  26,545
2015-10-07  27,940
Ad hoc queries:
SELECT REQUEST,
       SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
  AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
  "VISITS" => "14505" }

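Query results come back as rows of string values, as in the hash above, so a planned report mostly needs formatting. A minimal sketch that renders rows in the style of the report shown earlier; the column aliases `END_DATE` and `REQUEST_COUNT` and the helper names are assumptions:

```ruby
# Inserts thousands separators: "26781" -> "26,781".
def with_thousands(value)
  value.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Renders result rows (hashes with string values) as a plain-text report.
def format_report(rows)
  lines = [format("%-10s  %s", "Date", "Request Count")]
  rows.each do |row|
    lines << format("%-10s  %s", row["END_DATE"], with_thousands(row["REQUEST_COUNT"]))
  end
  lines.join("\n")
end

rows = [
  { "END_DATE" => "2015-10-01", "REQUEST_COUNT" => "26781" },
  { "END_DATE" => "2015-10-02", "REQUEST_COUNT" => "26864" }
]
puts format_report(rows)
```
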
Summary
• Programmatic reporting with SQL
• Query logic not tied to Redshift
• Columnar storage optimized for common DW queries
• Can use S3 to store reports
• Can take advantage of PostgreSQL features:
• Window functions
• Common table expressions

1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes.

Got 5,000,000,000,000,000 problems

What did we learn?
• Master instance selection matters
o jobtracker-heap-size
• Worker memory matters
o mapreduce.map.memory.mb
o mapreduce.reduce.memory.mb
o mapred.tasktracker.map.tasks.maximum
o mapred.tasktracker.reduce.tasks.maximum
• Elasticity is AWESOME!
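
On EMR 4.x-style releases, Hadoop settings such as the two `mapreduce.*.memory.mb` properties can be supplied through the `configurations` parameter of `run_job_flow`. A sketch that builds such an entry; the helper name is hypothetical and the memory values are placeholders, not tuning advice from the talk:

```ruby
# Builds a "mapred-site" configuration entry for run_job_flow's
# :configurations list. Values are placeholders.
def mapred_site_configuration(map_memory_mb:, reduce_memory_mb:)
  {
    classification: "mapred-site",
    properties: {
      "mapreduce.map.memory.mb" => map_memory_mb.to_s,
      "mapreduce.reduce.memory.mb" => reduce_memory_mb.to_s
    }
  }
end

memory_config = mapred_site_configuration(map_memory_mb: 2048, reduce_memory_mb: 4096)
puts memory_config.inspect
```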

Production lessons learned
• Repeated manual tasks == Evil
• Multiple sources of truth
• Understand storage ramifications of table design
• Automate validation

You don't have to do it yourself
• Related services
• AWS Data Pipeline
• Amazon Machine Learning
• Amazon Kinesis
• Amazon Simple Email Service
• Amazon Simple Notification Service
• AWS Marketplace

Now you can:
• Write a streaming Amazon Elastic MapReduce job.
• Automate cluster creation with the AWS SDK for Ruby.
• Format results and ingest into Amazon Redshift.
• Create useful reports from Amazon Redshift.
• Start thinking about scaling and production deployment.

Resources
• Sample Code
• https://github.com/awslabs/reinvent2015-dev309
• Amazon Elastic MapReduce documentation
• http://aws.amazon.com/documentation/elasticmapreduce/
• Amazon Redshift documentation
• http://aws.amazon.com/documentation/redshift/
• AWS SDK for Ruby documentation
• http://docs.aws.amazon.com/sdkforruby/api/index.html
• Twitter: @alexwwood

Remember to complete your evaluations!

Related sessions
• BDT305 - Amazon EMR Deep Dive and Best Practices
• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices
• DAT201 - Introduction to Amazon Redshift
