
Collecting Big Data with S3/CloudFront Logging



There are several ways of collecting big data, and one of the most promising is S3/CloudFront logging. It’s low cost and quick to implement. Let's dive in and see how to set up S3/CloudFront logging with your application.

Published in: Data & Analytics, Technology

  1. 1. COLLECTING BIG DATA WITH S3/CLOUDFRONT LOGGING Moty Michaely, VP R&D Xplenty Data Integration-as-a-Service
  2. 2. In our recent article, “Scale Your Data Collection on the Cloud Like a Champ”, we reviewed several ways of collecting big data, the most promising of which was S3/CloudFront logging. It’s low cost and quick to implement. Now we’d like to dig deeper and show how to set up S3/CloudFront logging with your application.
  3. 3. DEFINE APP DATA Sit back and think - which data would you like to collect? Which app events should be logged? These could be page visits, mouse clicks, logins, errors, etc. Some of them may include parameters such as the page visit URL. Write them all down. Be as thorough as possible so you don’t lose any precious data.
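
One way to write the events down is as a small catalog in code, so that later steps can check event calls against it. The sketch below is only an illustration: the event types come from the examples on this slide, while the `EVENTS` structure, the category and parameter names, and the `missingParams` helper are assumptions rather than part of the original setup.

```javascript
// Hypothetical event catalog: category -> action -> required parameters.
const EVENTS = {
  page: { visit: ['url'] },
  mouse: { click: ['id', 'url'] },
  user: { login: ['id'] },
  app: { error: ['message', 'url'] },
};

// Returns the required parameters that are missing from `params`,
// or throws if the event was never defined in the catalog.
function missingParams(category, action, params) {
  const required = EVENTS[category] && EVENTS[category][action];
  if (!required) {
    throw new Error(`Unknown event: ${category}/${action}`);
  }
  return required.filter((name) => !(name in params));
}

console.log(missingParams('mouse', 'click', { id: 'buy' })); // [ 'url' ]
```

Keeping the list in one place makes it easier to spot events that were defined but never wired up.
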
  4. 4. CREATE AN AWS ACCOUNT If you don’t already have an AWS (Amazon Web Services) account, you can sign up on the AWS website. Registration is free with the basic support package.
  5. 5. CREATE AN S3 BUCKET Go to the S3 dashboard and create a bucket for saving the logs. Note that the bucket name must be unique across all of Amazon S3 and adhere to DNS rules: 3-63 characters; only lowercase letters, numbers, periods, and hyphens; no underscores; and it shouldn't look like an IP address. Don’t turn on logging - we will do so via CloudFront. (See the screenshot on the next slide for a visual explanation)
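
The naming rules can be checked before you try a name in the dashboard. This is a minimal sketch, not an official validator: `isValidBucketName` is a hypothetical helper, and besides the rules listed on the slide it applies two more from Amazon's naming rules (a name starts and ends with a letter or number, and periods are never adjacent). Global uniqueness can only be checked by Amazon itself.

```javascript
// Checks a candidate bucket name against the S3 naming rules.
function isValidBucketName(name) {
  // Length: 3-63 characters.
  if (name.length < 3 || name.length > 63) return false;
  // Lowercase letters, numbers, periods, and hyphens only (so no underscores);
  // the first and last characters must be a letter or a number.
  if (!/^[a-z0-9][a-z0-9.-]*[a-z0-9]$/.test(name)) return false;
  // Must not be formatted like an IP address, e.g. 192.168.5.4.
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(name)) return false;
  // No two periods in a row.
  if (name.includes('..')) return false;
  return true;
}

console.log(isValidBucketName('my-app-logs')); // true
console.log(isValidBucketName('My_App_Logs')); // false
console.log(isValidBucketName('192.168.5.4')); // false
```
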
  7. 7. CREATE EVENT IMAGES Set up directories in the image bucket, for example /mouse, to organize events by category, and create 1x1 pixel images (see previous post) for all the events that you defined in the first step, e.g. click.png, login.png, error.png. Don’t worry about event parameters at the moment; we will deal with them shortly. Files uploaded to S3 are private by default, so make sure to change the file permissions to public. You may use tools such as CloudBerry Explorer or S3 Browser to do so and much more.
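
If you would rather generate the pixel than draw it, a 1x1 PNG is small enough to build by hand. The Node.js sketch below is one possible way to do it, using only the built-in `zlib` module. The function names are assumptions, and the pixel is a transparent RGBA one, which the slides do not specify.

```javascript
const zlib = require('zlib');

// CRC-32 as used by PNG chunks, computed bit by bit.
function crc32(buf) {
  let crc = 0xffffffff;
  for (const byte of buf) {
    crc ^= byte;
    for (let i = 0; i < 8; i++) {
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// A PNG chunk is: 4-byte length, 4-byte type, data, 4-byte CRC of type + data.
function chunk(type, data) {
  const length = Buffer.alloc(4);
  length.writeUInt32BE(data.length);
  const body = Buffer.concat([Buffer.from(type, 'ascii'), data]);
  const crc = Buffer.alloc(4);
  crc.writeUInt32BE(crc32(body));
  return Buffer.concat([length, body, crc]);
}

// Builds a 1x1 fully transparent PNG.
function onePixelPng() {
  const signature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
  const ihdr = Buffer.alloc(13); // compression, filter, interlace stay 0
  ihdr.writeUInt32BE(1, 0); // width
  ihdr.writeUInt32BE(1, 4); // height
  ihdr[8] = 8; // bit depth
  ihdr[9] = 6; // color type 6 = RGBA
  const scanline = Buffer.from([0, 0, 0, 0, 0]); // filter byte + one RGBA pixel
  return Buffer.concat([
    signature,
    chunk('IHDR', ihdr),
    chunk('IDAT', zlib.deflateSync(scanline)),
    chunk('IEND', Buffer.alloc(0)),
  ]);
}

console.log(`Generated ${onePixelPng().length} bytes`);
```

To save the result, pass the buffer to `fs.writeFileSync('click.png', onePixelPng())` and upload the same file under each event name.
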
  8. 8. CREATE EVENT IMAGES CONT. Set HTTP headers for all the images so that they will be cached by CloudFront, thus saving GET requests from CloudFront edge locations to S3. Go to the relevant bucket, check the image files on the left, click Actions at the top, choose Properties, and open the Metadata section. Add the following metadata line and click save: ▪ Cache-Control: max-age=31536000
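
The `max-age` value above is simply one year expressed in seconds. As a quick sanity check, and in case you prefer a different lifetime, the arithmetic can be sketched like this (`cacheControlForDays` is a hypothetical helper, not part of any AWS tool):

```javascript
// Builds a Cache-Control value for a cache lifetime given in days.
function cacheControlForDays(days) {
  const seconds = days * 24 * 60 * 60;
  return `max-age=${seconds}`;
}

console.log(cacheControlForDays(365)); // max-age=31536000
```
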
  10. 10. CREATE A CLOUDFRONT DISTRIBUTION Creating a CloudFront distribution costs extra, but it’s mandatory - it logs the query string, adds extra log info such as edge locations, and helps to deliver files via Amazon’s CDN to shorten load times. Access the CloudFront dashboard and create a web distribution for the image S3 bucket. Make sure that Use Origin Cache Headers is set under Object Caching (it’s the default setting).
  11. 11. CREATE A CLOUDFRONT DISTRIBUTION CONT. Note that the distribution gets a random domain name. It could take a while before it starts working because the DNS servers need to be updated to support it. You can also set a more friendly domain using the Alternate Domain Names (CNAMEs) option under Distribution Settings, though it requires configuring your DNS settings so that your domain points to CloudFront’s domain name. See Amazon’s documentation for more info.
  14. 14. TURN LOGGING ON Still in the CloudFront dashboard, check the distribution on the left, click Distribution Settings at the top, click Edit under the General tab, enable logging, and insert the bucket where you want to store the logs.
  17. 17. CODE A FUNCTION TO CALL EVENTS Time to get your hands dirty and write a method that registers events, or ask one of your app’s developers to do it for you. The code could be on the client side, server side, or both, depending on the architecture. The method should simply send an asynchronous HTTP GET request to the relevant image URL. If you need to send additional event parameters, use the query string (don’t forget URL encoding).
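
To make the URL format concrete, here is a small sketch of how an event URL with encoded query-string parameters could be assembled. The domain `logs.example.com` is a placeholder, not a real distribution, and `buildEventUrl` is a hypothetical helper. Substitute your own CloudFront domain or CNAME.

```javascript
// Placeholder base URL (assumption); use your distribution's domain or CNAME.
const BASE_URL = 'https://logs.example.com';

// Builds the image URL for an event, URL-encoding every parameter.
function buildEventUrl(category, action, params = {}) {
  const query = Object.entries(params)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&');
  const path = `${BASE_URL}/${category}/${action}.png`;
  return query ? `${path}?${query}` : path;
}

console.log(buildEventUrl('mouse', 'click', { id: 'buy now', url: 'https://app.example.com/cart?x=1' }));
// https://logs.example.com/mouse/click.png?id=buy%20now&url=https%3A%2F%2Fapp.example.com%2Fcart%3Fx%3D1
```

Encoding matters here because an unencoded `?` or `&` inside a parameter value would corrupt the query string that ends up in the log.
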
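  18. 18. EXAMPLE CODE TO CALL EVENTS $.CloudFrontLog = function (attr) { /* the empty string stands for your distribution's base URL, ending in '/' */ var url = '' + attr.category + '/' + attr.action + '.png', data = { id: attr.id, url: attr.url }; /* jQuery appends data to the URL as an encoded query string */ return $.get(url, data); };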
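
For the server-side case, the same call can be sketched without jQuery. This version assumes Node.js 18 or later, where `fetch` is built in. The base URL is a placeholder, and the injectable `fetchImpl` parameter is an addition for testing, not something the slides describe.

```javascript
// Placeholder base URL (assumption); use your distribution's domain or CNAME.
const LOG_BASE_URL = 'https://logs.example.com';

// Sends the event as an asynchronous GET request and resolves to the HTTP status.
// `fetchImpl` defaults to the built-in fetch but can be swapped out in tests.
async function cloudFrontLog(attr, fetchImpl = fetch) {
  // URLSearchParams takes care of URL-encoding the parameter values.
  const query = new URLSearchParams({ id: attr.id, url: attr.url });
  const target = `${LOG_BASE_URL}/${attr.category}/${attr.action}.png?${query}`;
  const response = await fetchImpl(target);
  return response.status;
}

// Example call (commented out so that nothing is sent to the placeholder domain):
// cloudFrontLog({ category: 'user', action: 'login', id: 'user-42', url: '/login' })
//   .catch(() => {}); // a failed log call should not break the request
```
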
  19. 19. CALL THE EVENTS Dig through your app’s code and add event calls using the method that you’ve just written. This will collect the data that you defined in step 1. Here’s a jQuery code sample for logging client-side button clicks: $('.btn').click(function(e) { var id = $(this).attr('id'); $.CloudFrontLog({ action: 'click', category: 'mouse', id: id, url: location.href }); });
  20. 20. TEST Use your staging environment to call events via the application and check that the logs are generated accordingly. Patience, young padawan: it may take an hour or so until Amazon writes them.
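
Once log files show up in the bucket, a short script helps confirm that your events are in them. CloudFront access logs are tab-separated text files, delivered gzip-compressed, with a `#Fields:` header line that names the columns. The sketch below assumes the file has already been decompressed. `parseCloudFrontLog` is a hypothetical helper, and the sample is made up and shows only a few of the real columns.

```javascript
// Parses the text of a decompressed CloudFront access log into objects,
// using the "#Fields:" header line to name the columns.
function parseCloudFrontLog(text) {
  let fields = [];
  const entries = [];
  for (const line of text.split('\n')) {
    if (line.startsWith('#Fields:')) {
      fields = line.slice('#Fields:'.length).trim().split(/\s+/);
    } else if (line.trim() !== '' && !line.startsWith('#')) {
      const values = line.split('\t');
      const entry = {};
      fields.forEach((name, i) => { entry[name] = values[i]; });
      entries.push(entry);
    }
  }
  return entries;
}

// Made-up sample for illustration; real logs contain many more columns.
const sample = [
  '#Version: 1.0',
  '#Fields: date time cs-method cs-uri-stem sc-status cs-uri-query',
  '2014-05-23\t01:13:11\tGET\t/mouse/click.png\t200\tid=buy&url=%2Fcart',
].join('\n');

const [firstEntry] = parseCloudFrontLog(sample);
console.log(firstEntry['cs-uri-stem'], firstEntry['cs-uri-query']);
// /mouse/click.png id=buy&url=%2Fcart
```

The `cs-uri-stem` column tells you which event fired and `cs-uri-query` carries its parameters, which is why the events were modeled as image paths plus a query string.
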
  21. 21. GO LIVE! Everything should be ready for you to collect big data like a champ - update the production environment and let the logging begin. Don't know what to do with the data? See how to analyze AWS logs in 15 minutes.