
Lumberjacking on AWS: Cutting Through Logs to Find What Matters (ARC306) | AWS re:Invent 2013



AWS offers services that revolutionize the scale and cost for customers to extract information from large data sets, commonly called Big Data. This session analyzes Amazon CloudFront logs combined with additional structured data as a scenario for correlating log and transactional data. Successfully implementing this type of solution requires architects and developers to assemble a set of services with multiple decision points. The session provides a design and example of architecting and implementing the scenario using Amazon S3, AWS Data Pipeline, Amazon Elastic MapReduce, and Amazon Redshift. It explores loading, query performance, security, incremental updates, and design trade-off decisions.

Published in: Technology


  1. 1. ARC 306: Lumberjacking on AWS Cutting Through Logs to Find What Matters Guy Ernest, Solutions Architecture November 15, 2013 © 2013, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc.
  2. 2. Progress Is Not Evenly Distributed. 1980 vs. today: cost per TB $14,000,000 → $30 (÷ 450,000); capacity 100 MB → 3 TB (× 30,000); throughput 4 MB/s → 200 MB/s (× 50)
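The figures on this slide imply that a full sequential read of one disk now takes far longer than it did in 1980, which is what motivates spreading work across more spindles. A rough worked example, assuming decimal units (1 TB = 1,000,000 MB) and purely sequential reads; the function name is illustrative:

```python
# Time to read an entire disk sequentially, using the slide's figures.
# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def full_scan_seconds(capacity_mb, throughput_mb_per_s):
    """Seconds needed to stream the whole disk once at the given rate."""
    return capacity_mb / throughput_mb_per_s


scan_1980 = full_scan_seconds(100, 4)                 # 100 MB at 4 MB/s
scan_today = full_scan_seconds(3 * MB_PER_TB, 200)    # 3 TB at 200 MB/s

print(scan_1980)               # 25.0 seconds
print(scan_today)              # 15000.0 seconds, a little over 4 hours
print(scan_today / scan_1980)  # 600.0
```

Capacity grew 30,000 times while throughput grew only 50 times, so the scan time grew by their ratio, 600.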
  3. 3. by Kheel Center, Cornell University Solution: More Spindles
  4. 4. Case Study – Foursquare
  5. 5. The Challenge “…Foursquare streams hundreds of millions of application logs each day. The company relies on analytics to report on its daily usage, evaluate new offerings, and perform long-term trend analysis—and with millions of new check-ins each day, the workload is only growing…”
  6. 6. “Real” Project Requirements Example. Categories: Cost Analysis, Marketing, Operations, Revenue
        Data transfer: by date/time; by edge location; by date/time within an edge location; by top X URLs; by HTTP vs. HTTPS
        Top URLs: as-is count; by content type; by edge location; by edge location and content type
        Error rates: by top X URLs; by edge location; by edge location and content type
        Top games: by revenue; by edge location and revenue
        Top ads: that lead to a game purchase
        Requests served: by edge location
        Revenue: by edge location
        Top games: by age; by income; by gender
  7. 7. Viable Business Revenues # Users Operation Costs $ Money
  8. 8. Available Data Sources
        Metric | Sources
        Data transfer by date/time | CloudFront logs
        Data transfer by edge location | CloudFront logs
        Data transfer by date/time within an edge location | CloudFront logs
        Data transfer by top X URLs | CloudFront logs, web server logs
        Data transfer by HTTP vs. HTTPS | CloudFront logs
        Top URLs | CloudFront logs, web server logs
        Top URLs by content type | CloudFront logs
        Top URLs by edge location | CloudFront logs
        Top URLs by edge location and content type | CloudFront logs
        Error rates by top X URLs | CloudFront logs, web server logs
        Error rate by edge location | CloudFront logs
        Error rate by edge location and content type | CloudFront logs
        Requests served by edge location | CloudFront logs
        Revenue by edge location | CloudFront logs, OrdersDB, app server logs
        Top games segmented by age | CloudFront logs, user profile
        Top games segmented by income | CloudFront logs, user profile
        Top games segmented by gender | CloudFront logs, user profile
        Top games by revenue | CloudFront logs, OrdersDB
        Top games by edge location and revenue | CloudFront logs, OrdersDB
        Top game revenue segmented by age | CloudFront logs, OrdersDB, user profile
  9. 9. CloudFront Access Log Format
        #Version: 1.0
        #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
        2012-05-25 22:01:30 AMS1 4448 GET /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfDdlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
        2012-05-25 22:01:30 AMS1 4952 GET /66IG584/CPCxY0P44BGb5ZOd3qSUrauL050LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 Mozilla/5.0%20(compatible;%20MSIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
        2012-05-25 22:01:30 AMS1 4556 GET /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlWR3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
        2012-05-25 22:01:30 AMS1 47172 GET /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
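The log format above can be read by taking column names from the `#Fields:` line and splitting each record on tabs. A minimal sketch in Python, assuming tab-delimited records; the three sample rows, the IP addresses, the host name, and the `JFK1` edge code are made up for illustration and are not taken from the slide:

```python
from collections import Counter

# Hypothetical sample in the slide's field layout (values are invented).
FIELDS_LINE = (
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) "
    "cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query"
)
ROWS = [
    ["2012-05-25", "22:01:30", "AMS1", "4448", "192.0.2.10", "GET",
     "d111111abcdef8.cloudfront.net", "/games/Simutrans.exe", "200", "-",
     "Mozilla/5.0", "uid=100&oid=108625181"],
    ["2012-05-25", "22:01:30", "AMS1", "4952", "192.0.2.11", "GET",
     "d111111abcdef8.cloudfront.net", "/games/AstroMenace.exe", "200", "-",
     "Opera/9.80", "uid=100&oid=108625184"],
    ["2012-05-25", "22:01:31", "JFK1", "1024", "192.0.2.12", "GET",
     "d111111abcdef8.cloudfront.net", "/games/missing.exe", "404", "-",
     "Opera/9.80", "-"],
]
SAMPLE = "\n".join(["#Version: 1.0", FIELDS_LINE] + ["\t".join(r) for r in ROWS])


def parse_cloudfront_log(text):
    """Return one dict per record, keyed by the names on the #Fields line."""
    fields = None
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            if fields is None:
                raise ValueError("record found before the #Fields header")
            records.append(dict(zip(fields, line.split("\t"))))
    return records


records = parse_cloudfront_log(SAMPLE)

# Two of the metrics from the requirements: error codes and bytes per edge.
status_counts = Counter(r["sc-status"] for r in records)
bytes_by_edge = Counter()
for r in records:
    bytes_by_edge[r["x-edge-location"]] += int(r["sc-bytes"])

print(dict(status_counts))  # {'200': 2, '404': 1}
print(dict(bytes_by_edge))  # {'AMS1': 9400, 'JFK1': 1024}
```

Reading the column names from the header rather than hard-coding positions keeps the parser working if the field list changes.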
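  10. 10. Sample Your Data with R
        library(ggplot2)
        # Read the tab-delimited log sample; the file has no column-name row
        sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header = FALSE)
        # Drop the two leading rows (#Version and #Fields)
        sample_data <- sample_data[-1:-2, ]
        View(sample_data)
        # V9 is the sc-status column; geom_bar() counts a discrete variable
        # (geom_histogram() rejects factors in current ggplot2 releases)
        m <- ggplot(sample_data, aes(x = factor(V9)))
        m + geom_bar() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')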
  11. 11. Need a Lot of Memory?
  12. 12. OpenRefine Running on an EC2 Instance
  14. 14. Swedish public domain photo taken in 1918 Log Shipping
  15. 15. “Poor Man’s Log Shipping”
  16. 16. Embedding Poor-man Invisible Pixel .com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=enus&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr =-&utmp=%2F&utmac=UA-70197651&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B ferral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analyticsarchitecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
  17. 17. Open Source Frameworks: Fluentd, Flume, Scribe, Chukwa, … Fluentd ASCII diagrams:

        Input                                      Output
        +--------------------------------------------+
        |                                            |
        |  Web Apps  ---+                 +--> File  |
        |               |                 |          |
        |               +-->           ---+          |
        |  /var/log  ------>  Fluentd  ------> Mail  |
        |               +-->           ---+          |
        |               |                 |          |
        |  Apache    ---+                 +--> S3    |
        |                                            |
        +--------------------------------------------+

        Web Server
        +---------+
        | Fluentd -------+
        +---------+      |
                         |
        Proxy Server     |
        +---------+      +--> +---------+
        | Fluentd ----------> | Fluentd |
        +---------+      +--> +---------+
                         |
        Database Server  |
        +---------+      |
        | Fluentd -------+
        +---------+
  18. 18. Use Amazon Kinesis to Ship Your Logs New
  19. 19. Aggregation with S3Distcp Aggregated Even-size Compressed
  20. 20. S3DistCp on EMR Job Sample
        ./elastic-mapreduce --jobflow j-3GY8JC4179IOK \
          --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
          --args '--src,s3://myawsbucket/cf,--dest,s3://myoutputbucket/aggregate,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128,--outputCodec,lzo,--deleteOnSuccess'
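The `--groupBy` pattern decides which input files are merged together: files whose names yield the same value for the capture group end up in the same aggregate. The sketch below applies the slide's pattern to a few hypothetical object keys to show the resulting grouping by hour. It only illustrates the regular expression; it does not call S3DistCp, and it simply ignores keys that do not match.

```python
import re
from collections import defaultdict

# The --groupBy pattern from the slide.
GROUP_BY = re.compile(r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")

# Hypothetical object keys; real CloudFront log names will differ.
keys = [
    "cf/XABCD12345678.2012-05-25-22.NEfbhLN3.gz",
    "cf/XABCD12345678.2012-05-25-22.a1B2c3D4.gz",
    "cf/XABCD12345678.2012-05-25-23.Zz9Yy8Xx.gz",
    "cf/README.txt",
]


def group_keys(pattern, names):
    """Map each capture-group value to the names that produced it."""
    groups = defaultdict(list)
    for name in names:
        match = pattern.fullmatch(name)  # the whole name must match
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


groups = group_keys(GROUP_BY, keys)
for hour, members in sorted(groups.items()):
    print(hour, len(members))
# 2012-05-25-22 2
# 2012-05-25-23 1
```

Because the capture group covers the date and hour, each aggregate here would hold one hour of logs, which `--targetSize` then splits into evenly sized files.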
  21. 21. Pig for Access Logs Analysis. Stages: Load and Filter (cat / grep), Parse (awk), Store (>)
        RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
        LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';
        LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
            url,
            DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') as dt,
            SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z'), 0, 10) as day,
            …
            status,
            REGEX_EXTRACT(url, '^GET /([^?]+)', 1) AS action: chararray,
            REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt: chararray,
            REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc: chararray;
        I1 = FILTER LOGS_BASE_F_W_PARAM by action == 'clic' or action == 'display';
        LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp, idc;
        G1 = GROUP LOGS_SHORT BY (uuid, idc);
        store G1 into 's3://mybucket/sessions/';
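The three stages of the Pig script (filter, parse with regular expressions, group) can be tried on a handful of rows before running at scale. The following is a simplified local sketch in Python, not a translation of the full script: it reuses the two regular expressions for `action` and `idc`, leaves out the `__track` prefilter and the elided fields, and runs on invented `(uuid, url)` pairs.

```python
import re
from collections import defaultdict

# Regular expressions copied from the REGEX_EXTRACT calls in the Pig script.
ACTION_RE = re.compile(r"^GET /([^?]+)")
IDC_RE = re.compile(r"idc=([^&]+)")

# Hypothetical input rows: (uuid, request line).
raw_log = [
    ("u1", "GET /clic?idt=7&idc=42"),
    ("u1", "GET /display?idt=7&idc=42"),
    ("u2", "GET /clic?idt=9&idc=43"),
    ("u2", "GET /favicon.ico"),
]


def extract(pattern, text):
    """Return the first capture group, or None when there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Parse: derive action and idc from each request line.
parsed = [
    {"uuid": uuid, "action": extract(ACTION_RE, url), "idc": extract(IDC_RE, url)}
    for uuid, url in raw_log
]

# Filter: keep only the two actions the script cares about.
kept = [row for row in parsed if row["action"] in ("clic", "display")]

# Group: collect actions per (uuid, idc), like GROUP ... BY (uuid, idc).
sessions = defaultdict(list)
for row in kept:
    sessions[(row["uuid"], row["idc"])].append(row["action"])

print(dict(sessions))
# {('u1', '42'): ['clic', 'display'], ('u2', '43'): ['clic']}
```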
  22. 22. Pig vs. Hive
        • Pig is geared toward sequentially transforming data
          – ETL
          – Shell in scale (from local mode to any scale)
        • Hive is for querying data
          – Data analysis / HQL
          – Some transformation, typically as a means to a goal, i.e., temporary tables
  23. 23. Monitoring Pig
  24. 24. Another Monitoring Tool
  25. 25. Optimize Your EMR Cluster
  26. 26. Monitor Your EMR Cluster
  27. 27. Bootstrap Actions --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
  28. 28. Management Console
  29. 29. Customer Tools: gathering information about EMR jobs from multiple sources and presenting it in a textual and graphical view
  30. 30. Completed Job View
  31. 31. Spot Bidding Strategies: Fewer Interruptions, Not Paying More, Most Savings
  32. 32. Jeff Bezos (early Amazon days)
  33. 33. Data Sources Value Queries
  34. 34. More Trends to Consider
        Transactional Processing | Analytical Processing
        Transactional context | Global context
        Latency | Throughput
        Indexed access | Full table scans
        Random IO | Sequential IO
        Disk seek times | Disk transfer rate
  35. 35. COPY into Amazon Redshift
        create table cf_logs (
          d date,
          t char(8),
          edge char(4),
          bytes int,
          cip varchar(15),
          verb char(3),
          distro varchar(MAX),
          object varchar(MAX),
          status int,
          Referer varchar(MAX),
          agent varchar(MAX),
          qs varchar(MAX)
        );
        copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
        credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
        IGNOREHEADER 2 GZIP DELIMITER '\t' DATEFORMAT 'YYYY-MM-DD';
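The fixed-width columns in this table (`t char(8)`, `edge char(4)`, `cip varchar(15)`, `verb char(3)`) only fit values of that length, so a record with a longer value, such as a `POST` request in `verb`, would not fit as declared. A small pre-load check on sample records can surface this before running COPY. This is a sketch on made-up rows; the helper name and the sample values are illustrative, and the widths are copied from the table definition above.

```python
# Declared widths of the fixed-size text columns in cf_logs.
WIDTHS = {"t": 8, "edge": 4, "cip": 15, "verb": 3}


def oversized_fields(row):
    """Return the column names whose value is longer than the declared width."""
    return sorted(
        name
        for name, limit in WIDTHS.items()
        if len(row[name].encode("utf-8")) > limit  # widths are counted in bytes
    )


# Hypothetical records: one GET request and the same request as a POST.
get_row = {"t": "22:01:30", "edge": "AMS1", "cip": "192.0.2.10", "verb": "GET"}
post_row = dict(get_row, verb="POST")

print(oversized_fields(get_row))   # []
print(oversized_fields(post_row))  # ['verb']
```

If the logs contain methods other than GET, widening `verb` in the table definition is one way to make such rows loadable.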
  36. 36. COPY into Amazon Redshift with AWS Data Pipeline
  37. 37. Charles Minard's flow map of Napoleon's March (1869) Time for Data Visualization
  38. 38. Choose Your Favorite Visualization Tool Tableau (Windows instance) R Jaspersoft QlikView MicroStrategy SiSense …
  39. 39. Snapshot before Delete
  40. 40. Unload Data from Amazon Redshift
        unload ('select * from cf_logs where d between ''2013-11-03'' and ''2013-11-10''')
        to 's3://mybucket/unload_cf_logs_week_46'
        credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
        delimiter as '\t' GZIP;
  41. 41. Reference Architecture
  42. 42. Partner Services Loggly Splunk Stratalux (Logstash) … Loggly AWS Marketplace Page
  43. 43. What Else Can You Do with Log Analysis?
  44. 44. Finally, a Small Warning Abraham Wald (1902-1950)
  45. 45. B C A
  46. 46. Would You Like to Know More? Further reading, re:Invent sessions:
        DAT205 - Amazon Redshift in Action: Enterprise, Big Data, and SaaS
        DAT305 - Getting Maximum Performance from Amazon Redshift
        BDT301 - Scaling your Analytics with Amazon Elastic MapReduce
  47. 47. Please give us your feedback on this presentation ARC306 As a thank you, we will select prize winners daily for completed surveys!