ARC 306: Lumberjacking on AWS
Cutting Through Logs to Find What Matters
Guy Ernest, Solutions Architecture
November 15, 20...
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013
AWS offers services that revolutionize the scale and cost for customers to extract information from large data sets, commonly called Big Data. This session analyzes Amazon CloudFront logs combined with additional structured data as a scenario for correlating log and transactional data. Successfully implementing this type of solution requires architects and developers to assemble a set of services with multiple decision points. The session provides a design and example of architecting and implementing the scenario using Amazon S3, AWS Data Pipeline, Amazon Elastic MapReduce, and Amazon Redshift. It explores loading, query performance, security, incremental updates, and design trade-off decisions.

  1. ARC 306: Lumberjacking on AWS. Cutting Through Logs to Find What Matters. Guy Ernest, Solutions Architecture. November 15, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Progress Is Not Evenly Distributed
                        1980              Today       Change
     Cost               $14,000,000/TB    $30/TB      450,000 ÷
     Capacity           100 MB            3 TB        30,000 ×
     Transfer rate      4 MB/s            200 MB/s    50 ×
  3. Solution: More Spindles (photo by Kheel Center, Cornell University)
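The gap between slides 2 and 3 can be worked through as arithmetic: capacity grew 30,000 times while transfer rate grew only 50 times, so reading a whole disk takes far longer than it used to. The sketch below uses only the slide's figures; decimal units (1 TB = 1,000,000 MB) and an even split of the data across spindles are assumptions made for illustration.

```python
# Figures from slide 2. Decimal units (1 TB = 1,000,000 MB) are an assumption.
DISK_1980 = {"capacity_mb": 100, "rate_mb_s": 4}
DISK_TODAY = {"capacity_mb": 3_000_000, "rate_mb_s": 200}


def full_scan_seconds(disk, spindles=1):
    """Seconds to read the disk's full capacity once, assuming the data is
    spread evenly over `spindles` drives that are read in parallel."""
    return disk["capacity_mb"] / (disk["rate_mb_s"] * spindles)


capacity_growth = DISK_TODAY["capacity_mb"] / DISK_1980["capacity_mb"]
rate_growth = DISK_TODAY["rate_mb_s"] / DISK_1980["rate_mb_s"]

scan_1980 = full_scan_seconds(DISK_1980)
scan_today = full_scan_seconds(DISK_TODAY)

# Number of spindles needed to bring today's scan time back to the 1980 figure.
spindles_needed = capacity_growth / rate_growth

print(f"capacity grew {capacity_growth:,.0f}x, transfer rate {rate_growth:,.0f}x")
print(f"full scan: {scan_1980:.0f} s in 1980, {scan_today:,.0f} s today")
print(f"spindles to match 1980 scan time: {spindles_needed:.0f}")
```

Under these assumptions a single modern disk needs hours for a full scan, which is the motivation for spreading log data over many disks and reading them in parallel.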
  4. Case Study – Foursquare
  5. The Challenge: “…Foursquare streams hundreds of millions of application logs each day. The company relies on analytics to report on its daily usage, evaluate new offerings, and perform long-term trend analysis—and with millions of new check-ins each day, the workload is only growing…”
  6. “Real” Project Requirements Example
     Areas: Cost Analysis, Marketing, Operations, Revenue
     • Data transfer: by date/time; by edge location; by date/time within an edge location; by top X URLs; by HTTP vs. HTTPS
     • Top URLs: as-is count; by content type; by edge location; by edge location and content type
     • Error rates: by top X URLs; by edge location; by edge location and content type
     • Top games: by revenue; by edge location and revenue
     • Top ads: that lead to a game purchase
     • Requests served: by edge location
     • Revenue: by edge location
     • Top games: by age; by income; by gender
  7. Viable Business (diagram labels: Revenues, # Users, Operation Costs, $ Money)
  8. Available Data Sources
     Metric                                               | Sources
     Data transfer by date/time                           | CloudFront logs
     Data transfer by edge location                       | CloudFront logs
     Data transfer by date/time within an edge location   | CloudFront logs
     Data transfer by top X URLs                          | CloudFront logs, web server logs
     Data transfer by HTTP vs. HTTPS                      | CloudFront logs
     Top URLs                                             | CloudFront logs, web server logs
     Top URLs by content type                             | CloudFront logs
     Top URLs by edge location                            | CloudFront logs
     Top URLs by edge location and content type           | CloudFront logs
     Error rates by top X URLs                            | CloudFront logs, web server logs
     Error rate by edge location                          | CloudFront logs
     Error rate by edge location and content type         | CloudFront logs
     Requests served by edge location                     | CloudFront logs
     Revenue by edge location                             | CloudFront logs, OrdersDB, app server logs
     Top games segmented by age                           | CloudFront logs, user profile
     Top games segmented by income                        | CloudFront logs, user profile
     Top games segmented by gender                        | CloudFront logs, user profile
     Top games by revenue                                 | CloudFront logs, OrdersDB
     Top games by edge location and revenue               | CloudFront logs, OrdersDB
     Top game revenue segmented by age                    | CloudFront logs, OrdersDB, user profile
  9. CloudFront Access Log Format
     #Version: 1.0
     #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
     2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
     2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL050LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
     2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlWR3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
     2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
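A minimal sketch of reading the log format on slide 9: it takes the column names from the `#Fields:` header and splits each record on tabs, which is how CloudFront access logs are delimited (the slide text shows them as spaces). The function name `parse_cloudfront_log` and the shortened URI stem in the sample record are illustrative assumptions, not part of the talk.

```python
def parse_cloudfront_log(lines):
    """Yield one dict per log record, keyed by the names on the '#Fields:' line."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Header names are separated by spaces.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            # Records are tab-separated.
            yield dict(zip(fields, line.split("\t")))


# One record modelled on slide 9; the URI stem is shortened for readability.
SAMPLE = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) "
    "cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query",
    "\t".join([
        "2012-05-25", "22:01:30", "AMS1", "4448", "94.212.249.78", "GET",
        "d1234567890213.cloudfront.net", "/games/Simutrans.exe", "200",
        "http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2",
        "Mozilla/5.0%20(compatible;%20MSIE%209.0)",
        "uid=100&oid=108625181",
    ]),
]

records = list(parse_cloudfront_log(SAMPLE))
for record in records:
    print(record["x-edge-location"], record["sc-status"], record["sc-bytes"])
```

Keying records by the header names rather than by position keeps the sketch working if the field list in the header changes.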
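  10. Sample Your Data with R
      library(ggplot2)  # required for ggplot(); not shown on the slide
      sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header = FALSE)
      sample_data <- sample_data[-1:-2, ]  # drop the #Version and #Fields rows
      View(sample_data)
      # V9 is the ninth field, sc-status
      m <- ggplot(sample_data, aes(x = factor(V9)))
      # geom_bar() counts a discrete x; current ggplot2 rejects geom_histogram() here
      m + geom_bar() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')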
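The same sampling step as slide 10 can be sketched without a plotting library: count the values in the ninth column (sc-status) and look at the counts on a log scale. The helper names `status_counts` and `fake_record`, and the generated rows, are hypothetical stand-ins for a real sample file.

```python
from collections import Counter
import math


def status_counts(lines, status_index=8):
    """Count the ninth tab-separated field (sc-status), skipping '#' header lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[line.rstrip("\n").split("\t")[status_index]] += 1
    return counts


def fake_record(status):
    """Build a hypothetical 12-field record; only the status column matters here."""
    cols = ["2012-05-25", "22:01:30", "AMS1", "4448", "94.212.249.78", "GET",
            "d1234567890213.cloudfront.net", "/example.exe", status, "-", "-", "-"]
    return "\t".join(cols)


sample = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) "
    "cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query",
    fake_record("200"), fake_record("200"), fake_record("404"), fake_record("200"),
]

counts = status_counts(sample)
for status, n in sorted(counts.items()):
    # The log scale keeps rare error codes visible next to a dominant 200.
    print(status, n, round(math.log10(n), 2))
```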
  11. Need a Lot of Memory?
  12. OpenRefine Running on an EC2 Instance
  13. [Diagram: Logs, Web, OLTP, CRM, and DB sources feed an ETL process into a data warehouse (OLAP) used by the analyst]
  14. Log Shipping (image: Swedish public domain photo taken in 1918)
  15. “Poor Man’s Log Shipping”
  16. Embedding Poor-man Invisible Pixel
      http://www.poor-mananalytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=enus&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-70197651&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analyticsarchitecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
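The pixel request on slide 16 carries its payload in the query string, so the access-log entry for `__track.gif` is the event record. The sketch below decodes a shortened version of that URL with the standard library; the function name `parse_pixel_request` is an assumption, and the interpretation of individual parameters is left to the reader because the slide does not define them.

```python
from urllib.parse import parse_qs, urlsplit


def parse_pixel_request(url):
    """Split a tracking-pixel URL into its path and a flat dict of parameters."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values and returns lists; keep the first value of each.
    query = parse_qs(parts.query, keep_blank_values=True)
    return parts.path, {name: values[0] for name, values in query.items()}


# A shortened form of the URL on slide 16 (only some parameters are kept).
url = (
    "http://www.poor-mananalytics.com/__track.gif"
    "?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com"
    "&utmcs=UTF-8&utmsr=1440x900&utmdt=%E8%B1%86%E7%93%A3"
)

path, params = parse_pixel_request(url)
print(path)
for name, value in params.items():
    print(f"{name} = {value}")
```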
  17. Open Source Frameworks: Fluentd, Flume, Scribe, Chukwa, …
      Fluentd ASCII Diagrams

        Input                   Output
      +--------------------------------------------+
      |                                            |
      |  Web Apps  ---+                 +--> File  |
      |               |                 |          |
      |               +-->           ---+          |
      |  /var/log  ------>  Fluentd  ------> Mail  |
      |               +-->           ---+          |
      |               |                 |          |
      |  Apache    ---+                 +--> S3    |
      |                                            |
      +--------------------------------------------+

      Web Server
      +---------+
      | Fluentd -------+
      +---------+      |
                       |
      Proxy Server     |
      +---------+      +--> +---------+
      | Fluentd ----------> | Fluentd |
      +---------+      +--> +---------+
                       |
      Database Server  |
      +---------+      |
      | Fluentd -------+
      +---------+
  18. Use Amazon Kinesis to Ship Your Logs (New)
  19. Aggregation with S3DistCp: aggregated, even-size, compressed
  20. S3DistCp on EMR Job Sample
      ./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
        --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
        --args '--src,s3://myawsbucket/cf,--dest,s3://myoutputbucket/aggregate,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'
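The `--groupBy` expression on slide 20 decides which small log files end up in the same aggregated output: files whose names yield the same captured group are combined. The sketch below only illustrates that grouping rule with Python's `re` module; it does not call S3DistCp, and the object keys are hypothetical names shaped like CloudFront log files.

```python
import re
from collections import defaultdict

# The --groupBy pattern from slide 20; the captured group is the date-hour stamp.
GROUP_BY = re.compile(r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")


def group_keys(keys):
    """Group object keys by the text captured by the groupBy pattern.

    Keys that do not match are left out of this sketch.
    """
    groups = defaultdict(list)
    for key in keys:
        match = GROUP_BY.fullmatch(key)
        if match:
            groups[match.group(1)].append(key)
    return dict(groups)


# Hypothetical object keys: two files for hour 22, one for hour 23, one unrelated.
keys = [
    "cf/XABCD12345678.2012-05-25-22.NEfbhLN3.gz",
    "cf/XABCD12345678.2012-05-25-22.aaaa1111.gz",
    "cf/XABCD12345678.2012-05-25-23.bbbb2222.gz",
    "cf/README.txt",
]

groups = group_keys(keys)
for stamp, members in sorted(groups.items()):
    print(stamp, len(members))
```

Grouping by hour turns many small per-edge files into fewer, larger inputs, which matches the "aggregated, even-size, compressed" goal on the previous slide.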
  21. Pig for Access Logs Analysis
      -- Load and Filter (cat / grep)
      RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
      LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';
      LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
          url,
          DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') AS dt,
          SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z'), 0, 10) AS day,
          …
          status,
          REGEX_EXTRACT(url, '^GET /([^?]+)', 1) AS action: chararray,
          REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt: chararray,
          REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc: chararray;
      -- Parse (awk)
      I1 = FILTER LOGS_BASE_F_W_PARAM BY action == 'clic' OR action == 'display';
      LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp, idc;
      -- Store (>)
      G1 = GROUP LOGS_SHORT BY (uuid, idc);
      STORE G1 INTO 's3://mybucket/sessions/';
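The three stages labelled on slide 21 (load and filter, parse, store) can be tried locally on a handful of lines before running Pig at scale. The sketch below is a rough Python analogue, not a translation of the script: `regex_extract` only approximates Pig's `REGEX_EXTRACT`, the request lines are made up, and grouping is by `idc` alone because the `uuid` field is elided on the slide.

```python
import re
from collections import defaultdict

TRACK = re.compile(r"^GET /__track.*$")


def regex_extract(text, pattern, group=1):
    """Rough analogue of Pig's REGEX_EXTRACT: the requested group, or None."""
    match = re.search(pattern, text)
    return match.group(group) if match else None


# Hypothetical request lines standing in for the url column.
requests = [
    "GET /__track.gif?idt=5.1.5&idc=5",
    "GET /index.html",
    "GET /__track.gif?idt=5.1.5&idc=7",
    "GET /__track.gif?idt=5.1.6&idc=5",
]

# Load and filter (cat / grep): keep only tracking-pixel requests.
filtered = [r for r in requests if TRACK.fullmatch(r)]

# Parse (awk): pull the named parameters out of each request.
parsed = [
    {
        "action": regex_extract(r, r"^GET /([^?]+)"),
        "idt": regex_extract(r, r"idt=([^&]+)"),
        "idc": regex_extract(r, r"idc=([^&]+)"),
    }
    for r in filtered
]

# Store (>): group rows by key; a real job would write each group to S3.
sessions = defaultdict(list)
for row in parsed:
    sessions[row["idc"]].append(row)

for idc, rows in sorted(sessions.items()):
    print(idc, len(rows))
```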
  22. Pig vs. Hive
      • Pig is geared toward sequentially transforming data
        – ETL
        – Shell in scale (from local mode to any scale)
      • Hive is for querying data
        – Data analysis / HQL
        – Some transformation, typically as a means to a goal, i.e., temporary tables
  23. Monitoring Pig: https://github.com/netflix/lipstick
  24. Another Monitoring Tool: https://github.com/twitter/ambrose
  25. Optimize Your EMR Cluster
  26. Monitor Your EMR Cluster
  27. Bootstrap Actions
      --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
  28. Management Console
  29. Customer Tools: gathering information about EMR jobs from multiple sources and presenting it in a textual and graphical view. github.com/Hi-Media/EmrMonitoring
  30. Completed Job View
  31. Spot Bidding Strategies: fewer interruptions, not paying more, most savings
  32. Jeff Bezos (early Amazon days)
  33. Data Sources, Value, Queries
  34. More Trends to Consider
      Transactional Processing   | Analytical Processing
      Transactional context      | Global context
      Latency                    | Throughput
      Indexed access             | Full table scans
      Random IO                  | Sequential IO
      Disk seek times            | Disk transfer rate
  35. COPY into Amazon Redshift
      create table cf_logs
      ( d date, t char(8), edge char(4), bytes int, cip varchar(15),
        verb char(3), distro varchar(MAX), object varchar(MAX), status int,
        Referer varchar(MAX), agent varchar(MAX), qs varchar(MAX) );
      copy cf_logs
      from 's3://big-data/logs/E123ABCDEF/'
      credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
      IGNOREHEADER 2 GZIP DELIMITER '\t' DATEFORMAT 'YYYY-MM-DD';
  36. COPY into Amazon Redshift with AWS Data Pipeline
  37. Time for Data Visualization (image: Charles Minard's flow map of Napoleon's March, 1869)
  38. Choose Your Favorite Visualization Tool: Tableau (Windows instance), R, Jaspersoft, QlikView, MicroStrategy, SiSense, …
  39. Snapshot before Delete
  40. Unload Data from Amazon Redshift
      unload ('select * from cf_logs where d between \'2013-11-03\' and \'2013-11-10\'')
      to 's3://mybucket/unload_cf_logs_week_46'
      credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
      delimiter as '\t' GZIP;
  41. Reference Architecture
  42. Partner Services: Loggly, Splunk, Stratalux (Logstash), …
      Loggly AWS Marketplace Page
  43. What Else Can You Do with Log Analysis?
  44. Finally, a Small Warning: Abraham Wald (1902-1950)
  45. [Figure labels: B, C, A]
  46. Would You Like to Know More?
      Further reading:
      http://aws.amazon.com/architecture
      http://aws.amazon.com/articles
      http://aws.typepad.com
      re:Invent sessions:
      DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS
      DAT305 - Getting Maximum Performance from Amazon Redshift
      BDT301 - Scaling your Analytics with Amazon Elastic MapReduce
  47. Please give us your feedback on this presentation (ARC306). As a thank you, we will select prize winners daily for completed surveys!