The need to crunch large amounts of data to extract useful statistics is increasingly common. Using services like Amazon Redshift and Amazon Elastic MapReduce, we will show how you can process log data to produce helpful reports and give your analysts the tools to find useful data. We will dive deep into these systems, building a usable example from scratch using the AWS SDK for Ruby.
6. What to expect from the session
• High-level overview
• Writing a log-processing job
• Log-processing automation
• Amazon Redshift ingestion
• Building reports
• Finer points and advanced techniques
• Conclusion
10. Reports:
Date        Request Count
2015-10-01  26,781
2015-10-02  26,864
2015-10-03  20,310
2015-10-04  14,409
2015-10-05  29,029
2015-10-06  26,545
2015-10-07  27,940
To digestible output
Ad hoc queries:
SELECT REQUEST,
SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
"VISITS" => "14505" }
43. Key concepts
• Redshift ingestion uses a SQL COPY command.
• Input fields map one-to-one to table columns and are separated by a delimiter.
o Fields must be in the same order as the table columns.
o The default delimiter is the pipe character ("|"), but you can specify your own.
44. Our FACT Table
CREATE TABLE FACT_DAILY_REQUESTS(
USERNAME VARCHAR(30) NOT NULL DISTKEY,
SESSION_ID VARCHAR(10),
USER_AGENT VARCHAR(256) NOT NULL,
END_DATE DATE NOT NULL,
REQUEST VARCHAR(128) NOT NULL,
RESPONSE_CODE INTEGER NOT NULL,
REQUEST_COUNT INTEGER NOT NULL
)
INTERLEAVED SORTKEY(END_DATE,REQUEST,RESPONSE_CODE)
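Because COPY maps fields to columns by position, each output line has to list its fields in the order of the CREATE TABLE statement. A minimal Ruby sketch of that formatting step follows. The method name `to_copy_line`, the record hash, and the choice of a tab delimiter are assumptions for illustration and are not taken from the session's code.

```ruby
require "date"

# Column order mirrors the FACT_DAILY_REQUESTS definition.
COLUMNS = %i[username session_id user_agent end_date
             request response_code request_count].freeze

# Assumed delimiter for this sketch; COPY defaults to "|" unless told otherwise.
DELIMITER = "\t"

# Turns one aggregated record (a Hash) into a single delimited line.
# Dates are written as DD/MON/YYYY, e.g. 07/Oct/2015.
def to_copy_line(record, delimiter: DELIMITER)
  fields = COLUMNS.map do |column|
    value = record.fetch(column)
    value = value.strftime("%d/%b/%Y") if value.is_a?(Date)
    value.to_s
  end

  # A delimiter or newline inside a field would shift every later column.
  if fields.any? { |f| f.include?(delimiter) || f.include?("\n") }
    raise ArgumentError, "field contains the delimiter or a newline"
  end

  fields.join(delimiter)
end

# Hypothetical record, for illustration only.
record = {
  username: "alice", session_id: "abc123", user_agent: "Mozilla/5.0",
  end_date: Date.new(2015, 10, 7), request: "GET /",
  response_code: 200, request_count: 42
}
puts to_copy_line(record)
```

Rejecting fields that contain the delimiter is one simple policy; escaping them is another, at the cost of extra COPY options.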
46. Copying from S3 to Redshift
COPY FACT_DAILY_REQUESTS
FROM 's3://bucket/output-prefix/part-'
DATEFORMAT AS 'DD/MON/YYYY'
DELIMITER '\t'
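When the load is automated, the COPY statement is usually assembled in code. The sketch below builds such a statement as a string in Ruby. The method name `build_copy_sql` and the role ARN are hypothetical, and the IAM_ROLE clause is an assumption: COPY from S3 needs an authorization clause, which the slide leaves out. A tab delimiter is written as '\t' in the statement.

```ruby
# Builds a COPY statement for delimited files under an S3 prefix.
# All arguments are assumed to be trusted configuration values,
# not user input, because they are interpolated directly.
def build_copy_sql(table:, s3_prefix:, iam_role:,
                   date_format: "DD/MON/YYYY", delimiter: "\\t")
  <<~SQL
    COPY #{table}
    FROM '#{s3_prefix}'
    IAM_ROLE '#{iam_role}'
    DATEFORMAT AS '#{date_format}'
    DELIMITER '#{delimiter}'
  SQL
end

copy_sql = build_copy_sql(
  table: "FACT_DAILY_REQUESTS",
  s3_prefix: "s3://bucket/output-prefix/part-",
  # Placeholder ARN; substitute a role that can read the bucket.
  iam_role: "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole"
)
puts copy_sql
```

The resulting string can be sent over any PostgreSQL-compatible connection to the cluster.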
53. Date-range queries
SELECT END_DATE, SUM(REQUEST_COUNT)
FROM FACT_DAILY_REQUESTS
WHERE END_DATE BETWEEN '2015-10-06' AND '2015-10-09'
GROUP BY END_DATE
ORDER BY END_DATE DESC
54. Advanced query – New user behavior
SELECT REQUEST, SUM(REQUEST_COUNT) AS TOTAL
FROM FACT_DAILY_REQUESTS f, DIM_USERS u
WHERE f.USERNAME = u.USERNAME
AND f.END_DATE BETWEEN '2015-10-01' AND '2015-10-07'
AND u.REGISTRATION_DATE >= '2015-10-01'
GROUP BY REQUEST
ORDER BY TOTAL DESC
LIMIT 10
55. Reports:
Date        Request Count
2015-10-01  26,781
2015-10-02  26,864
2015-10-03  20,310
2015-10-04  14,409
2015-10-05  29,029
2015-10-06  26,545
2015-10-07  27,940
Supports planned and ad hoc reports
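A planned report like the one above can be produced by formatting query results in Ruby. This is a rough sketch, assuming each result row arrives as a hash with string values (as in the ad hoc result on this slide) and that the summed column is aliased as REQUEST_COUNT. The method names are hypothetical.

```ruby
# Adds thousands separators to a non-negative integer: 26781 -> "26,781".
def format_count(number)
  number.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Renders rows as a fixed-width, two-column text report sorted by date.
def daily_report(rows)
  lines = ["Date        Request Count"]
  rows.sort_by { |row| row["END_DATE"] }.each do |row|
    count = format_count(Integer(row["REQUEST_COUNT"]))
    lines << format("%-10s  %13s", row["END_DATE"], count)
  end
  lines.join("\n")
end

# Sample rows using figures from the report on this slide.
rows = [
  { "END_DATE" => "2015-10-02", "REQUEST_COUNT" => "26864" },
  { "END_DATE" => "2015-10-01", "REQUEST_COUNT" => "26781" }
]
puts daily_report(rows)
```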
Ad hoc queries:
SELECT REQUEST,
SUM(REQUEST_COUNT) AS VISITS
FROM FACT_DAILY_REQUESTS
WHERE USERNAME != '-'
AND END_DATE = '2015-10-07'
GROUP BY REQUEST
ORDER BY VISITS DESC
LIMIT 1
{ "REQUEST" => "GET /",
"VISITS" => "14505" }
56. Summary
• Programmatic reporting with SQL
• Query logic not tied to Redshift
• Columnar storage optimized for common DW queries
• Can use S3 to store reports
• Can take advantage of PostgreSQL features:
• Window functions
• Common table expressions
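As an illustration of the last two bullets, the sketch below holds a query that combines a common table expression with a window function to compute a running total of daily requests. The query is an example written for this section, not one shown in the session. The small Ruby method mirrors what the window function computes, using the daily counts from the report slide.

```ruby
# CTE ("daily") plus a windowed SUM that accumulates totals by date.
# The explicit ROWS frame is included because Redshift expects one when
# an aggregate window function has an ORDER BY.
RUNNING_TOTAL_SQL = <<~SQL
  WITH daily AS (
    SELECT END_DATE, SUM(REQUEST_COUNT) AS DAILY_TOTAL
    FROM FACT_DAILY_REQUESTS
    WHERE END_DATE BETWEEN '2015-10-01' AND '2015-10-07'
    GROUP BY END_DATE
  )
  SELECT END_DATE,
         DAILY_TOTAL,
         SUM(DAILY_TOTAL) OVER (ORDER BY END_DATE ROWS UNBOUNDED PRECEDING) AS RUNNING_TOTAL
  FROM daily
  ORDER BY END_DATE
SQL

# Pure-Ruby equivalent of the RUNNING_TOTAL column, for checking results.
def running_totals(daily_totals)
  sum = 0
  daily_totals.map { |count| sum += count }
end

p running_totals([26_781, 26_864, 20_310, 14_409, 29_029, 26_545, 27_940])
# => [26781, 53645, 73955, 88364, 117393, 143938, 171878]
```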
64. What did we learn?
• Master instance selection matters
o jobtracker-heap-size
• Worker memory matters
o mapreduce.map.memory.mb
o mapreduce.reduce.memory.mb
o mapred.tasktracker.map.tasks.maximum
o mapred.tasktracker.reduce.tasks.maximum
• Elasticity is AWESOME!
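The worker memory settings can be kept in code alongside the cluster automation. The following is a sketch only: the memory values are hypothetical, the method name is made up for this example, and it assumes a release-label cluster where settings are passed as a "mapred-site" configuration classification (for instance through the `configurations` parameter of the Ruby SDK's `run_job_flow`). The older `mapred.tasktracker.*` settings are left out here.

```ruby
# Hypothetical values; the right numbers depend on the instance type
# and on how much memory each mapper and reducer actually needs.
MAP_MEMORY_MB    = 1536
REDUCE_MEMORY_MB = 3072

# Returns a configuration list with the two container memory settings.
# Property values are strings, as the configuration format expects.
def mapred_site_configuration(map_mb:, reduce_mb:)
  [
    {
      classification: "mapred-site",
      properties: {
        "mapreduce.map.memory.mb"    => map_mb.to_s,
        "mapreduce.reduce.memory.mb" => reduce_mb.to_s
      }
    }
  ]
end

configurations = mapred_site_configuration(
  map_mb: MAP_MEMORY_MB, reduce_mb: REDUCE_MEMORY_MB
)
p configurations
```

Keeping these values in one place makes it easier to change them per cluster size instead of editing them by hand.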
65. Production lessons learned
• Repeated manual tasks == Evil
• Multiple sources of truth
• Understand storage ramifications of table design
• Automate validation
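One way to automate validation is to check each output line against the table definition before it is loaded. The sketch below assumes tab-delimited lines in FACT_DAILY_REQUESTS column order with DD/MON/YYYY dates. The method name and error messages are hypothetical, and the checks are examples, not an exhaustive list.

```ruby
require "date"

EXPECTED_FIELDS = 7
# Zero-based field index => VARCHAR limit from FACT_DAILY_REQUESTS.
MAX_LENGTHS = { 0 => 30, 1 => 10, 2 => 256, 4 => 128 }.freeze
# Every column except SESSION_ID (index 1) is NOT NULL.
REQUIRED = [0, 2, 3, 4, 5, 6].freeze
INTEGER_FIELDS = [5, 6].freeze

# Returns a list of problems found in one line; empty means the line passed.
def validate_line(line, delimiter: "\t")
  fields = line.chomp.split(delimiter, -1) # -1 keeps trailing empty fields
  unless fields.size == EXPECTED_FIELDS
    return ["expected #{EXPECTED_FIELDS} fields, got #{fields.size}"]
  end

  errors = []
  REQUIRED.each { |i| errors << "field #{i} is empty" if fields[i].empty? }
  MAX_LENGTHS.each do |i, max|
    # VARCHAR lengths are measured in bytes, so compare bytesize.
    errors << "field #{i} exceeds #{max} bytes" if fields[i].bytesize > max
  end
  begin
    Date.strptime(fields[3], "%d/%b/%Y")
  rescue ArgumentError
    errors << "field 3 is not a DD/MON/YYYY date"
  end
  INTEGER_FIELDS.each do |i|
    errors << "field #{i} is not an integer" unless fields[i].match?(/\A\d+\z/)
  end
  errors
end

good = "alice\tabc123\tMozilla/5.0\t07/Oct/2015\tGET /\t200\t42"
p validate_line(good)
```

Running a check like this at the end of the job catches malformed rows before COPY rejects the load.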
67. You don't have to do it yourself
• Related services
• AWS Data Pipeline
• Amazon Machine Learning
• Amazon Kinesis
• Amazon Simple Email Service
• Amazon Simple Notification Service
• AWS Marketplace
69. Now you can:
• Write a streaming Amazon Elastic MapReduce job.
• Automate cluster creation with the AWS SDK for Ruby.
• Format results and ingest into Amazon Redshift.
• Create useful reports from Amazon Redshift.
• Start thinking about scaling and production deployment.
73. Related sessions
• BDT305 - Amazon EMR Deep Dive and Best Practices
• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices
• DAT201 - Introduction to Amazon Redshift