Engineering with Open Source - Hyonjee Joo

Engineering systems using open source solutions can be a powerful way to leverage existing technology. However, not all open source solutions are made or supported equally, and it’s important to choose what you use carefully. In this talk, we’ll walk through building a metrics system for a high performance data platform, taking a look at some of the important factors to consider when choosing what open source offerings to use.

Published in: Engineering

  1. 1. Engineering with Open Source: Building a High Performance Metrics System Using Open Source Software #GHC18 Hyonjee Joo | @twosigma
  2. 2. PAGE 2 | GRACE HOPPER CELEBRATION 2018 PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. Legal Disclaimer #GHC 18
  3. 3. Introduction
  4. 4. My Background • Graduated from Columbia University • B.S. in computer science and psychology • My 4th GHC – 1st time participating in OSD! • Currently a software engineer at Two Sigma in New York • What is Two Sigma? • Investment management firm that uses technology and lots of data to drive decisions
  6. 6. In this talk… • Walk through designing a metrics system for a high performance data platform • Using open source solutions every step of the way • (Diagram: a sequence of problems and solutions leading to the goal)
  9. 9. The Takeaways 1. Engineering a new system can involve less code than you think 2. Know the problem before you look for a solution 3. Be careful what you choose: not all open source tools are made (or supported) equally
  10. 10. Let’s set up the problem
  11. 11. Purpose of a Metrics System • We want to measure usage metrics because: • It’s important to know how our system is being used and by whom • If we know what people want to do, we can do a better job doing it • We can identify trends and anticipate how user needs may change
  12. 12. The Data Platform • High performance data platform • Up to 50,000 queries/sec • 1.85 GiB/sec per node • Example query: data = client.query(date_range=(20000101, 20180101), dataset="x", transformation="log")
  20. 20. Query Data • Query 1: { query_time: "20180209 09:00:05", user: "user1", dataset: "x", date_range: { begin: "20000101", end: "20180101" }, duration: 100, bytes: 350000000, query_param_1: 1.0, query_param_2: "log", ... } • Important for product planning: - What features and query parameters do people use? - How are queries distributed across data sets? - Who are our biggest users in terms of number of queries and bytes transferred? - How many distinct users do we have? - How has all of this changed over time?
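The product-planning questions above are all aggregations over per-query records. A minimal Python sketch of that idea follows; the sample records are made up for illustration and only reuse field names from the slide's "Query 1" example, and `summarize` is a hypothetical helper, not part of the system described in the talk.

```python
from collections import Counter, defaultdict

# Hypothetical sample records shaped like the slide's "Query 1" example.
# The values are invented for illustration only.
records = [
    {"user": "user1", "dataset": "x", "bytes": 350_000_000, "query_param_2": "log"},
    {"user": "user2", "dataset": "x", "bytes": 1_000, "query_param_2": "log"},
    {"user": "user1", "dataset": "y", "bytes": 2_000, "query_param_2": "none"},
]


def summarize(records):
    """Aggregate per-query records into the usage numbers the slide asks about."""
    bytes_per_user = defaultdict(int)
    for r in records:
        bytes_per_user[r["user"]] += r["bytes"]
    return {
        # How are queries distributed across data sets?
        "queries_per_dataset": dict(Counter(r["dataset"] for r in records)),
        # Who are our biggest users in terms of bytes transferred?
        "bytes_per_user": dict(bytes_per_user),
        # How many distinct users do we have?
        "distinct_users": len({r["user"] for r in records}),
        # What query parameters do people use?
        "param_usage": dict(Counter(r["query_param_2"] for r in records)),
    }


summary = summarize(records)
print(summary)
```

Keeping one record per query is what makes every one of these questions answerable after the fact, which is why the talk insists on query-level granularity.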
  24. 24. The Challenge of Query-Level Granularity • Queries: Query 1, Query 2, ..., Query n • (Chart: query rate over time) Bursts up to 50,000 queries/sec
  25. 25. The End Goal • High performance data platform: up to 50,000 queries/sec, 1.85 GiB/sec per node • We need more insight into who is using our data platform and how it’s being used. • Our goal: collect and analyze usage metrics with query-level granularity without impacting the performance and reliability of the data platform.
  26. 26. Let’s build the metrics system
  28. 28. Problem: what to do with metrics data? • Store it with flexible schema • Be able to analyze & visualize the data quickly • Open Source Offerings
  29. 29. How do we pick what OS offering to use? Checklist: • Does it have the right features and potential to solve your problem? • Is it internally available or supported? • Licensing? • Is it supported by an active OS community? • How many active developers? • When was the most recent commit/pull request? • Is it extensible? (e.g. plugins, patches) • Versioning? Backwards compatible changes?
  31. 31. Problem: what to do with metrics data? • Open Source Offerings, compared in a table with columns: Product | Allows flexible data schema? | Data analysis & visualization? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning?
  32. 32. Solution: Elasticsearch • Elasticsearch is an open source platform that can store event data for easy searching and analysis
  33. 33. Solution: Elasticsearch + Kibana • Elasticsearch is an open source platform that can store event data for easy searching and analysis • There are plugins like Kibana that make data analysis and visualization easy.
  34. 34. Destination for the data is Elasticsearch • We want to get our query data into “indexes” in Elasticsearch. • An index per day makes for easy searching and archiving across time. • (Diagram: Queries flow into daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
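One way to see why an index per day helps with searching and archiving: the index names for any time window can be computed from dates alone, and old indexes can be identified by name. The sketch below assumes the `metrics-YYYY-MM-DD` naming shown on the slide; `daily_indexes` and `is_older_than` are hypothetical helpers, and the 7-day retention window is an invented example.

```python
from datetime import date, datetime, timedelta


def daily_indexes(start: date, end: date, prefix: str = "metrics"):
    """List the daily index names covering start..end (inclusive)."""
    names = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        names.append(f"{prefix}-{day:%Y-%m-%d}")
    return names


def is_older_than(index_name: str, today: date, keep_days: int) -> bool:
    """True if the index's day is more than keep_days before today."""
    day_part = index_name.split("-", 1)[1]  # "metrics-2018-01-01" -> "2018-01-01"
    day = datetime.strptime(day_part, "%Y-%m-%d").date()
    return (today - day).days > keep_days


# Indexes to search for the first three days of 2018.
window = daily_indexes(date(2018, 1, 1), date(2018, 1, 3))
print(window)

# Archiving candidates under a hypothetical 7-day retention policy.
print(is_older_than("metrics-2018-01-01", date(2018, 1, 9), keep_days=7))
```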
  35. 35. Problem: Elasticsearch can’t handle throughput • Elasticsearch was not built to handle 50,000 msgs/sec • We don’t want Elasticsearch performance to hurt the performance of our data platform. • (Diagram: Queries, then ???, then daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  36. 36. Problem: Elasticsearch can’t handle throughput • Idea: use a buffer to handle the throughput bursts • (Diagram: Queries, input flow, Buffer, output flow, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
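The buffer idea can be illustrated with a small simulation: producers enqueue at whatever rate queries arrive, while the consumer drains at a bounded rate that the downstream store can sustain. This is only a sketch under assumed numbers (the arrival pattern and drain rate are invented and far smaller than 50,000 queries/sec); `simulate` is a hypothetical function, not code from the talk.

```python
from collections import deque


def simulate(arrivals, drain_per_tick):
    """Return (buffer depth after each tick, total messages written downstream)."""
    buffer = deque()
    depths, written = [], 0
    for n in arrivals:
        # Producer side: enqueue everything that arrived this tick.
        buffer.extend(range(n))
        # Consumer side: write at most drain_per_tick messages downstream.
        for _ in range(min(drain_per_tick, len(buffer))):
            buffer.popleft()
            written += 1
        depths.append(len(buffer))
    return depths, written


# A burst of 50 messages in the second tick, with a sink that handles 10 per tick.
depths, written = simulate([5, 50, 5, 0, 0, 0, 0], drain_per_tick=10)
print(depths)   # buffer depth rises during the burst, then drains back to zero
print(written)  # every message is eventually written
```

The buffer grows during the burst and shrinks afterwards, so the producer never has to wait on the slower sink as long as the buffer has capacity.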
  38. 38. Problem: Elasticsearch can’t handle throughput • Open Source Offerings, compared in a table with columns: Product | Can handle throughput bursts? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning? • Slide label: As-a-service
  39. 39. Solution: Kafka • Kafka is an open source streaming platform that allows you to produce data to & consume data from a Kafka topic. • It’s designed for high throughput, low latency. • (Diagram: Topic, a partitioned queue)
  41. 41. Kafka is more suitable for our throughput • Kafka can handle the high throughput bursts in data that Elasticsearch couldn’t. • (Diagram: Queries reach our data platform; many asynchronous Java client producers write round-robin to a Topic with Partition 0, Partition 1, Partition 2, Partition 3, ..., Partition n; Kafka sits between the input flow and the output flow)
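Round-robin production spreads messages evenly over the topic's partitions, which is what lets the write load scale with the partition count. The sketch below is a plain-Python illustration of that distribution, not the Kafka Java client's partitioner; `assign_round_robin` and the message names are hypothetical.

```python
from itertools import cycle


def assign_round_robin(messages, num_partitions):
    """Distribute messages over partitions in round-robin order."""
    partitions = [[] for _ in range(num_partitions)]
    for message, partition in zip(messages, cycle(range(num_partitions))):
        partitions[partition].append(message)
    return partitions


parts = assign_round_robin([f"query-{i}" for i in range(10)], num_partitions=4)
for number, contents in enumerate(parts):
    print(f"Partition {number}: {contents}")
```

With 10 messages and 4 partitions, no partition receives more than one message beyond any other, so no single partition becomes a hot spot.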
  42. 42. Kafka as a Buffer • We can use Kafka as an intermediary buffer to store our metrics before writing to Elasticsearch. • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  43. 43. Problem: How to get from Kafka to Elasticsearch? • We can use Kafka as an intermediary buffer to store our metrics before writing to Elasticsearch. • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, then ???, then daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  45. 45. Problem: How to get from Kafka to Elasticsearch? • Open Source Offerings, compared in a table with columns: Product | Easy reading from Kafka? | Easy writing to Elasticsearch? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning?
  46. 46. Solution: Logstash • Logstash is an open source data processing pipeline. It ingests, transforms, and “stashes” data. • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, Logstash, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  47. 47. Solution: Logstash • We can use Logstash to ingest data from Kafka, transform it as we’d like, and stash it in Elasticsearch. • (Diagram: same pipeline as the previous slide)
  48. 48. Logstash connects Kafka to Elasticsearch • input { kafka { bootstrap_servers => "kafka-server.host:9095" topic_id => "usage-topic" codec => "json" group_id => "consumer-group-a" } } filter { date { match => ["query_time", "UNIX_MS"] remove_field => ["query_time"] } } output { elasticsearch { index => "metrics-%{+YYYY-MM-dd}" hosts => ["metrics.elasticsearch.host:443"] } } • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, Logstash, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
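To make the three stages of that configuration concrete, here is a rough Python equivalent of what happens to one message: decode JSON, turn `query_time` (parsed as `UNIX_MS`, i.e. epoch milliseconds) into the event timestamp while dropping the original field, and pick the daily index from that timestamp. This is a sketch of the behaviour, not Logstash itself; `transform` is a hypothetical function, and it assumes `query_time` arrives as epoch milliseconds as the `UNIX_MS` pattern implies, even though the earlier example record shows a formatted string.

```python
import json
from datetime import datetime, timezone


def transform(raw: str):
    """Mimic the input codec, date filter, and index naming for one message."""
    event = json.loads(raw)                      # codec => "json"
    millis = event.pop("query_time")             # remove_field => ["query_time"]
    timestamp = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    event["@timestamp"] = timestamp.isoformat()  # date filter's default target field
    index = f"metrics-{timestamp:%Y-%m-%d}"      # index => "metrics-%{+YYYY-MM-dd}"
    return index, event


# 1518166805000 ms is 2018-02-09 09:00:05 UTC, matching the example query time.
index, event = transform('{"query_time": 1518166805000, "user": "user1", "dataset": "x"}')
print(index)
print(event)
```

Because the index name is derived from the event's own timestamp rather than the arrival time, late-arriving messages still land in the index for the day the query ran.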
  49. 49. Logstash can scale • Logstash can scale up to the number of partitions in the Kafka topic. • Scaling is as easy as starting more Logstash instances with the same configuration. • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, multiple Logstash instances, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
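The partition count is the ceiling because, within one consumer group, each partition is read by only one consumer at a time. The sketch below illustrates that with a simple modulo assignment; Kafka's real assignment strategies differ in detail, and `assign_partitions` is a hypothetical helper used only to show the effect.

```python
def assign_partitions(num_partitions: int, num_instances: int):
    """Give each partition to exactly one instance of the same consumer group."""
    assignment = {instance: [] for instance in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment


# Two instances share four partitions evenly.
print(assign_partitions(4, 2))

# Six instances on four partitions: two instances have nothing to read.
over_provisioned = assign_partitions(4, 6)
idle = [i for i, parts in over_provisioned.items() if not parts]
print(idle)
```

Instances beyond the partition count sit idle, so adding throughput past that point means adding partitions to the topic first.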
  50. 50. Problem: how to manage many Logstash instances? • With multiple Logstash instances, we need a way to manage them. • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, multiple Logstash instances marked ???, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  52. 52. Problem: how to manage many Logstash instances? • Open Source Offerings, compared in a table with columns: Product | Management of non-web services? | Minimal overhead? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning? • Products named in the table include TS Waiter
  53. 53. Solution: Marathon • Marathon is an open source container orchestration platform. It schedules, monitors, and restarts applications as needed. • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, Logstash instances, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  54. 54. Marathon keeps Logstash instances up and running • We use Marathon to manage our Logstash instances. • (Diagram: same pipeline as the previous slide)
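Marathon applications are described by a JSON app definition that states what to run and how many instances to keep alive. The sketch below builds such a definition in Python; the talk does not show the real one, so the application id, command, file name, resource sizes, and instance counts are all hypothetical. Only the field names (`id`, `cmd`, `cpus`, `mem`, `instances`) follow Marathon's app definition format.

```python
import json

# Hypothetical values throughout; the talk does not show the actual definition.
logstash_app = {
    "id": "/metrics/logstash",
    "cmd": "bin/logstash -f usage-pipeline.conf",
    "cpus": 1.0,
    "mem": 2048,
    "instances": 4,
}


def scaled(app: dict, instances: int) -> dict:
    """Return a copy of the app definition with a new instance count."""
    return {**app, "instances": instances}


# Scaling out is a one-field change to the same definition.
payload = json.dumps(scaled(logstash_app, 8), indent=2)
print(payload)
```

Because every instance runs the same command with the same pipeline configuration, the instance count is the only value that changes when scaling.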
  55. 55. That was a lot of new tech, let’s recap.
  56. 56. The Complete Metrics System • Data platform with up to 50,000 queries/sec • Kafka as a high throughput, low latency buffer • Logstash instances running on Marathon, ingesting Kafka data and stashing it in Elasticsearch • Elasticsearch indexes our metrics, and we can analyze them using Kibana! • (Diagram: Queries, our data platform, Kafka Partition 0 through Partition n, Logstash instances, daily indexes metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09)
  57. 57. But wait, why did we use open source solutions in the first place?
  58. 58. The Alternatives • Write your own code from scratch • Support burden is all on you • Use internal solutions if they exist • E.g. your company may choose not to develop custom solutions for process management if its business goals and strengths are more in the domain of data analysis and modeling.
  59. 59. Open Source Benefits • Use solutions that have been tested and developed by a community of contributors and users • Save developer time • I was able to design, deliver, & deploy this metrics system in under a month • Can give back to the OS community, e.g. if you find a bug or missing feature: • Report the issue • Make a contribution
  60. 60. Results
  61. 61. Metrics we’ve seen with our new system • (Chart: sum of bytes aggregated by dataset) • (Chart: % distribution of the ‘dataset’ query parameter; legend: Dataset, test_data_x)
  62. 62. Metrics we’ve seen with our new system • (Chart: number of unique users per dataset) • (Chart: number of unique users over time)
  63. 63. Metrics we’ve seen with our new system
  64. 64. Engineering Lessons Learned • Requirements shape the system you build • Problem-solution approach • Use open source tools when you can • Why reinvent the wheel? • Support from the open source community • BUT be mindful when choosing open source solutions • Sometimes it’s about orchestrating the pieces together • Configurations are important
  65. 65. Thanks for listening! Stop by the Two Sigma booth, visit https://opensource.twosigma.com/ or email me at Hyonjee.Joo@twosigma.com
