Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamSets Open Source Community

On a typical day we see hundreds of downloads of StreamSets Data Collector, our open source data integration tool. We used to wrangle our download logs using a combination of the AWS S3 command line, sed, grep, awk and other tools, all run from a shell script (on my laptop!) once a week. This was a classic example of a brittle, hard-to-maintain, custom data integration. One day it dawned on me: "This is crazy, we have a tool that can do all this!" In this session, I'll explain how I built a dataflow pipeline to stream content delivery network (CDN) logs from S3 to MySQL in real time, allowing us to gain valuable insights into our open source community. You'll also learn how we use the same techniques not only to gain insights into our community on Slack, but also to build tools to better serve them.

  1. Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamSets Open Source Community. Pat Patterson / Director of Evangelism, @metadaddy / pat@streamsets.com. © StreamSets, Inc. All rights reserved.
  2. Who Am I? Pat Patterson / pat@streamsets.com / @metadaddy. Past: Sun Microsystems, Salesforce. Present: Director of Evangelism, StreamSets. I run far 🏃♂️
  3. Who is StreamSets? Seasoned leadership team. Customer base from global 8000: 50%. Unique commercial downloaders: 2000+. Open source downloads worldwide: 3,000,000+. Broad connectivity: 50+. History of innovation. streamsets.com/about-us
  4. The StreamSets DataOps Platform (diagram label: Data Lake)
  5. A Swiss Army Knife for Data
  6. Objective: Parse Fastly CDN logs. Extract records relating to downloads. Gain insights: companies downloading the binaries, geographic reach, metrics for different binary artifacts.
  7. Before: Bash script to download S3 objects using AWS CLI tool. sed, grep, sort, uniq, awk, diff, xargs, curl. Complex, hard to maintain, slow, essentially ‘write-only’ code:
       cut -f 1 -d ' ' merge.log|sort|uniq > ips
       diff --new-line-format="" --unchanged-line-format="" ips allips > newips
       cat newips|xargs -L 1 -I% curl -s http://ipinfo.io/%/org|cut -f 2- -d ' '|sort|uniq>orgs && subl orgs
  8. Why??? Mission creep. Inertia. (Image: Nyah S / Pexels / Pexels License)
  9. Data Flow: Amazon S3 → StreamSets Data Collector → MySQL
  10. Let’s Get Started! Parse Fastly CDN log lines, send data to MySQL. Sample log line:
       <134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: 104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET /datacollector/latest/parcel/manifest.json 200 1295
  11. Simple, Right? Grok Patterns are designed for exactly this! Standard patterns for timestamps, HTTP verbs, filenames:
       <%{NUMBER:priority}>%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:cachenode} %{WORD:logname}[%{NUMBER:pid}]: %{IP:ip} "-" "-" %{DATESTAMP_FASTLY:datestamp} %{WORD:verb} %{PATH:file} %{NUMBER:code} %{SIZE_OR_NULL}
  12. First Cut
  13. But... What??? An HTTP request isn’t supposed to include the protocol like that! Fastly records whatever the client sends, no matter how dumb.
       Record1-Error SERVICE_ERROR_001 - Cannot parse record from message 'rawData': com.streamsets.pipeline.api.service.dataformats.DataParserException: LOG_PARSER_03 - Log line '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: 104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT GET https://archives.streamsets.com/datacollector/latest/parcel/STREAMSETS_DATACOLLECTOR-1.1.4-el6.parcel 404 0' does not conform to 'Grok Format
  14. Solution: Be Permissive with your Input
       <%{NUMBER:priority}>%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:cachenode} %{WORD:logname}[%{NUMBER:pid}]: %{IP:ip} "-" "-" %{DATESTAMP_FASTLY:datestamp} %{WORD:verb} %{NOTSPACE:file} %{NUMBER:code} %{SIZE_OR_NULL}
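
The permissive pattern on slide 14 can be approximated outside Data Collector with an ordinary regular expression. The sketch below is not the pipeline's Grok configuration: the slides do not show how DATESTAMP_FASTLY or SIZE_OR_NULL are defined, so the date and size sub-patterns here are assumptions, as are the names FASTLY_LINE and parse_line. The point it illustrates is that matching the requested file as "anything that is not whitespace" accepts both a plain path and a full URL.

```python
import re

# Rough regex counterpart of the slide-14 Grok pattern.
# Assumptions: the datestamp and size sub-patterns stand in for
# DATESTAMP_FASTLY and SIZE_OR_NULL, whose definitions are not in the slides.
FASTLY_LINE = re.compile(
    r'<(?P<priority>\d+)>'
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) '
    r'(?P<cachenode>\S+) '
    r'(?P<logname>\w+)\[(?P<pid>\d+)\]: '
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) '
    r'"-" "-" '
    r'(?P<datestamp>\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT) '
    r'(?P<verb>\w+) '
    r'(?P<file>\S+) '          # permissive: plays the role of %{NOTSPACE:file}
    r'(?P<code>\d{3}) '
    r'(?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = FASTLY_LINE.match(line)
    return match.groupdict() if match else None


# The sample line from slide 10 (plain path).
plain = parse_line(
    '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: '
    '104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT '
    'GET /datacollector/latest/parcel/manifest.json 200 1295'
)

# A line modeled on the failing record from slide 13 (full URL as the "file").
with_url = parse_line(
    '<134>2017-07-09T12:01:13Z cache-sjc3636 StreamSetsS3Bucket[60550]: '
    '104.155.191.102 "-" "-" Sun, 09 Jul 2017 12:01:12 GMT '
    'GET https://archives.streamsets.com/datacollector/latest/parcel/'
    'STREAMSETS_DATACOLLECTOR-1.1.4-el6.parcel 404 0'
)
```

Lines that do not match return None rather than raising, which loosely mirrors routing unparseable records to an error stream instead of stopping the pipeline.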
  15. First Lesson Learned: Even if you think you know the data schema - test with real data!
  16. Second Cut
  17. But Performance SUCKED!
  18. Solution: Duplicate the Data
       CREATE TABLE download ( id int(11) AUTO_INCREMENT, ip varchar(64), date datetime, file varchar(767), PRIMARY KEY (`id`), KEY `date_idx` (`date`), KEY `file_idx` (`file`) );
  19. Third Cut
  20. 30x Better Performance!
  21. Filtering Downloads
  22. Second Lesson Learned: Fit the data model to the data
  23. What’s Next? Lookup company details from IP via Kickfire API
  24. Fourth Cut
  25. But... Kickfire API is rate limited! "To deliver optimum performance to all of our API customers, KickFire balances transaction loads by using rate limits"
       com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per second org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
  26. Solution - Rate Limit
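
The slide shows the fix as pipeline configuration, which is not reproduced in this transcript. As a language-neutral illustration of the idea, here is a minimal client-side limiter that spaces calls at least 1/max_per_second apart. The class name MinIntervalLimiter and its injectable clock and sleep parameters are hypothetical, chosen for this sketch; it is not how Data Collector implements its rate limit.

```python
import time


class MinIntervalLimiter:
    """Space out calls so that at most max_per_second happen per second.

    clock and sleep are injectable so the limiter can be exercised
    without actually waiting (an assumption made for this sketch).
    """

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next call is allowed, then reserve the slot after it."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

A caller would invoke wait() immediately before each API request. Note that this only addresses a per-second limit; it does nothing about a quota counted over a longer period.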
  27. But... Kickfire API has a monthly call limit!
       com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per month org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
  28. Solution - Don’t Ask For Data We Already Have
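
In outline, "don't ask for data we already have" is a check-before-fetch lookup: consult the local store first and call the remote API only on a miss. The sketch below uses a plain dict as a stand-in for the database table and a caller-supplied fetch function as a stand-in for the Kickfire request; both, along with the name lookup_company, are assumptions for illustration, not the pipeline's actual stages.

```python
def lookup_company(ip, known, fetch):
    """Return the company for ip, calling fetch(ip) only when ip is not in known.

    known: dict mapping IP address -> company (stand-in for the database).
    fetch: callable performing the remote lookup (stand-in for the API call).
    """
    if ip not in known:
        known[ip] = fetch(ip)
    return known[ip]
```

Each distinct IP address then costs at most one API call, provided the store is visible to every lookup at the moment it is made.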
  29. Third Lesson Learned: Know your API’s non-functional constraints!
  30. Fifth Cut
  31. Leave to run for a few weeks... (Image © Itzuvit / Wikimedia Commons / CC-BY-SA-3.0)
  32. But... Kickfire’s monthly call limit strikes again!
       com.streamsets.pipeline.api.base.OnRecordErrorException: HTTP_01 - Error fetching resource. Status: 429 Reason: You have reached the maximum calls per month org.glassfish.jersey.message.internal.EntityInputStream@4cb3922b
  33. Root Cause: Seeing large numbers of downloads from the same few IP addresses. Data Collector has a microbatch architecture - database writes are committed at the end of the batch. New IP address isn’t visible in the database until the start of the next batch. Still making repeated requests to Kickfire for the same IP address!
  34. Solution - Deduplicate records on IP Address
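
Because writes only become visible once a batch is committed, a database check cannot catch repeats of the same new IP address inside a single batch. A minimal sketch of the remedy, assuming each record is a dict with an "ip" key and that deduplication is applied only to the records headed for the company lookup (the slides do not show which branch of the pipeline is deduplicated):

```python
def dedupe_batch_by_ip(batch):
    """Keep the first record for each IP address within one batch, in order."""
    seen = set()
    unique = []
    for record in batch:
        if record["ip"] not in seen:
            seen.add(record["ip"])
            unique.append(record)
    return unique
```

With this in front of the lookup, a batch containing many downloads from one new address triggers a single API request instead of one per record.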
  35. Fourth Lesson Learned: Data Collector operates batch-by-batch - design your pipelines accordingly!
  36. The Finished Article
  37. A Closer Look
  38. Ultimate Lesson Learned: "No plan survives first contact with the enemy" - Helmuth von Moltke the Elder, "On Strategy" (1871). (Image in the public domain)
  39. Ultimate Lesson Learned: or...
  40. Ultimate Lesson Learned: "Everybody has a plan until they get punched in the mouth" - Mike Tyson (1987). (Image © Abelito Roldan / Flickr / CC BY 2.0)
  41. September 3-5, 2019. Tue, Sep 3 - Training & Tutorials. Wed-Thu, Sep 4-5 - Keynote & Breakouts. Hilton Financial District (Tue|Wed|Thur)
  42. Questions?
  43. Thank you. Pat Patterson / Director of Evangelism, @metadaddy / pat@streamsets.com
