Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013

AWS offers services that revolutionize the scale and cost for customers to extract information from large data sets, commonly called Big Data. This session analyzes Amazon CloudFront logs combined with additional structured data as a scenario for correlating log and transactional data. Successfully implementing this type of solution requires architects and developers to assemble a set of services with multiple decision points. The session provides a design and example of architecting and implementing the scenario using Amazon S3, AWS Data Pipeline, Amazon Elastic MapReduce, and Amazon Redshift. It explores loading, query performance, security, incremental updates, and design trade-off decisions.


Transcript

  • 1. ARC306: Lumberjacking on AWS - Cutting Through Logs to Find What Matters. Guy Ernest, Solutions Architecture, November 15, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • 2. Progress Is Not Evenly Distributed. 1980: $14,000,000/TB, 100 MB capacity, 4 MB/s transfer. Today: $30/TB (about 450,000x cheaper), 3 TB (30,000x larger), 200 MB/s (50x faster).
  • 3. Solution: More Spindles (photo by Kheel Center, Cornell University)
  • 4. Case Study – Foursquare
  • 5. The Challenge “…Foursquare streams hundreds of millions of application logs each day. The company relies on analytics to report on its daily usage, evaluate new offerings, and perform long-term trend analysis—and with millions of new check-ins each day, the workload is only growing…”
  • 6. "Real" Project Requirements Example
    Cost analysis: data transfer by date/time, by edge location, by date/time within an edge location, by top X URLs, and by HTTP vs. HTTPS.
    Marketing: top URLs (as-is count, by content type, by edge location, by edge location and content type); top ads that lead to a game purchase; top games by age, by income, and by gender.
    Operations: error rates by top X URLs, by edge location, and by edge location and content type; requests served by edge location.
    Revenue: revenue by edge location; top games by revenue and by edge location and revenue.
  • 7. Viable Business (diagram relating revenues, operation costs, number of users, and money)
  • 8. Available Data Sources (metric: sources)
    Data transfer by date/time: CloudFront logs
    Data transfer by edge location: CloudFront logs
    Data transfer by date/time within an edge location: CloudFront logs
    Data transfer by top X URLs: CloudFront logs, web server logs
    Data transfer by HTTP vs. HTTPS: CloudFront logs
    Top URLs: CloudFront logs, web server logs
    Top URLs by content type: CloudFront logs
    Top URLs by edge location: CloudFront logs
    Top URLs by edge location and content type: CloudFront logs
    Error rates by top X URLs: CloudFront logs, web server logs
    Error rate by edge location: CloudFront logs
    Error rate by edge location and content type: CloudFront logs
    Requests served by edge location: CloudFront logs
    Revenue by edge location: CloudFront logs, OrdersDB, app server logs
    Top games segmented by age: CloudFront logs, user profile
    Top games segmented by income: CloudFront logs, user profile
    Top games segmented by gender: CloudFront logs, user profile
    Top games by revenue: CloudFront logs, OrdersDB
    Top games by edge location and revenue: CloudFront logs, OrdersDB
    Top game revenue segmented by age: CloudFront logs, OrdersDB, user profile
  • 9. CloudFront Access Log Format
    #Version: 1.0
    #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
    2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
    2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL050LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
    2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlWR3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
    2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
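    To make the record layout concrete, here is a minimal parsing sketch (not from the talk) that splits one tab-delimited record into named fields following the #Fields header above; the file name is a placeholder.
    # Sketch only: map one CloudFront access log record to named fields,
    # following the #Fields header shown above. The file name is hypothetical.
    FIELDS = [
        "date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method",
        "cs(Host)", "cs-uri-stem", "sc-status", "cs(Referer)",
        "cs(User-Agent)", "cs-uri-query",
    ]

    def parse_record(line):
        """Return a dict of field name -> value for one non-comment log line."""
        return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

    with open("E123ABCDEF.2012-05-25-22.log") as f:
        records = [parse_record(l) for l in f if not l.startswith("#")]

    # e.g. records[0]["x-edge-location"] == "AMS1" and records[0]["sc-status"] == "200"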
  • 10. Sample Your Data with R
    > library(ggplot2)
    > sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header=F)
    > sample_data <- sample_data[-1:-2,]
    > View(sample_data)
    > m <- ggplot(sample_data, aes(x = factor(V9)))
    > m + geom_histogram() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')
  • 11. Need a Lot of Memory?
  • 12. OpenRefine Running on an EC2 Instance
  • 13. (ETL diagram: logs, web, OLTP, and CRM data are extracted, transformed, and loaded into the data warehouse, where the analyst runs OLAP queries; the OLTP DB remains the transactional source.)
  • 14. Log Shipping (Swedish public domain photo taken in 1918)
  • 15. “Poor Man’s Log Shipping”
  • 16. Embedding the Poor Man's Invisible Pixel
    http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-70197651&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
  • 17. Open Source Frameworks: Fluentd, Flume, Scribe, Chukwa, …
    (Fluentd ASCII diagrams: inputs such as web apps, /var/log, and Apache are routed through Fluentd to outputs such as files, mail, and S3; Fluentd agents on the web, proxy, and database servers forward logs to a downstream Fluentd aggregator.)
  • 18. Use Amazon Kinesis to Ship Your Logs (new)
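    As a rough sketch of what this could look like (not shown in the talk), the snippet below reads a CloudFront log file and puts each record onto a Kinesis stream with the boto3 SDK; the stream name, region, and file path are hypothetical.
    # Sketch only: ship CloudFront log lines to a Kinesis stream.
    # Assumptions: boto3 is installed and a stream named "cf-log-stream" exists;
    # the file path below is a placeholder.
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def ship_log_file(path, stream_name="cf-log-stream"):
        """Send each log record to Kinesis, partitioned by edge location."""
        with open(path) as f:
            for line in f:
                if line.startswith("#"):              # skip the #Version/#Fields header
                    continue
                edge_location = line.split("\t")[2]   # x-edge-location, e.g. "AMS1"
                kinesis.put_record(
                    StreamName=stream_name,
                    Data=line.encode("utf-8"),
                    PartitionKey=edge_location,
                )

    ship_log_file("E123ABCDEF.2012-05-25-22.log")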
  • 19. Aggregation with S3DistCp: aggregated, evenly sized, compressed output
  • 20. S3DistCp on EMR Job Sample
    ./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
      --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --args '--src,s3://myawsbucket/cf,--dest,s3://myoutputbucket/aggregate,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'
  • 21. Pig for Access Logs Analysis
    -- Load and filter (cat / grep)
    RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray, …);
    LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';
    -- Parse (awk)
    LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE url,
        DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') as dt,
        SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z'), 0, 10) as day,
        status,
        REGEX_EXTRACT(url, '^GET /([^?]+)', 1) AS action:chararray,
        REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt:chararray,
        REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc:chararray;
    I1 = FILTER LOGS_BASE_F_W_PARAM by action == 'clic' or action == 'display';
    LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp, idc;
    -- Store (>)
    G1 = GROUP LOGS_SHORT BY (uuid, idc);
    store G1 into 's3://mybucket/sessions/';
  • 22. Pig vs. Hive
    • Pig is geared toward sequentially transforming data: ETL, a shell that scales (from local mode to any cluster size)
    • Hive is for querying data: data analysis with HQL, plus some transformation, typically as a means to an end (e.g., temporary tables)
  • 23. Monitoring Pig https://github.com/netflix/lipstick
  • 24. Another Monitoring Tool https://github.com/twitter/ambrose
  • 25. Optimize Your EMR Cluster
  • 26. Monitor Your EMR Cluster
  • 27. Bootstrap Actions --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
  • 28. Management Console
  • 29. Customer Tools: gathering information about EMR jobs from multiple sources and presenting it in textual and graphic views. github.com/Hi-Media/EmrMonitoring
  • 30. Completed Job View
  • 31. Spot Bidding Strategies: fewer interruptions, not paying more than on-demand, maximum savings
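    As one hedged illustration (not from the talk), adding spot capacity as a TASK instance group to a running EMR cluster could look like the boto3 sketch below; the job flow ID, instance type, and bid price are placeholders.
    # Sketch only: add a spot TASK group so interruptions cannot take out HDFS
    # data held by the CORE nodes. All identifiers and prices are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.add_instance_groups(
        JobFlowId="j-3GY8JC4179IOK",
        InstanceGroups=[
            {
                "Name": "spot-task-nodes",
                "InstanceRole": "TASK",
                "InstanceType": "m1.large",
                "InstanceCount": 4,
                "Market": "SPOT",
                "BidPrice": "0.08",   # example ceiling, not a recommendation
            }
        ],
    )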
  • 32. Jeff Bezos (early Amazon days)
  • 33. (Diagram: data sources and queries combine to produce value.)
  • 34. More Trends to Consider
    Transactional processing vs. analytical processing: transactional context vs. global context; latency vs. throughput; indexed access vs. full table scans; random IO vs. sequential IO; disk seek times vs. disk transfer rate.
  • 35. COPY into Amazon Redshift
    create table cf_logs (
      d date, t char(8), edge char(4), bytes int, cip varchar(15), verb char(3),
      distro varchar(MAX), object varchar(MAX), status int,
      referer varchar(MAX), agent varchar(MAX), qs varchar(MAX)
    );

    copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
    credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
    IGNOREHEADER 2 GZIP DELIMITER '\t' DATEFORMAT 'YYYY-MM-DD';
  • 36. COPY into Amazon Redshift with AWS Data Pipeline
  • 37. Charles Minard's flow map of Napoleon's March (1869) Time for Data Visualization
  • 38. Choose Your Favorite Visualization Tool Tableau (Windows instance) R Jaspersoft QlikView MicroStrategy SiSense …
  • 39. Snapshot before Delete
  • 40. Unload Data from Amazon Redshift
    unload ('select * from cf_logs where d between \'2013-11-03\' and \'2013-11-10\'')
    to 's3://mybucket/unload_cf_logs_week_46'
    credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
    delimiter as '\t' GZIP;
  • 41. Reference Architecture
  • 42. Partner Services: Loggly, Splunk, Stratalux (Logstash), … (screenshot of the Loggly AWS Marketplace page)
  • 43. What Else Can You Do with Log Analysis?
  • 44. Finally, a Small Warning Abraham Wald (1902-1950)
  • 45. (Image slide with points labeled A, B, and C.)
  • 46. Would You Like to Know More?
    Further reading: http://aws.amazon.com/architecture, http://aws.amazon.com/articles, http://aws.typepad.com
    re:Invent sessions: DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS; DAT305 - Getting Maximum Performance from Amazon Redshift; BDT301 - Scaling your Analytics with Amazon Elastic MapReduce
  • 47. Please give us your feedback on this presentation ARC306 As a thank you, we will select prize winners daily for completed surveys!