Engineering with Open Source - Hyonjee Joo

Two Sigma
Engineering with Open Source
BUILDING A HIGH PERFORMANCE METRICS SYSTEM
USING OPEN SOURCE SOFTWARE
#GHC18
Hyonjee Joo | @twosigma
PAGE 2 | GRACE HOPPER CELEBRATION 2018
PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY
This document is being distributed for informational and
educational purposes only and is not an offer to sell or the
solicitation of an offer to buy any securities or other
instruments. The information contained herein is not intended
to provide, and should not be relied upon for, investment
advice. The views expressed herein are not necessarily the
views of Two Sigma Investments, LP or any of its affiliates
(collectively, “Two Sigma”). Such views reflect the
assumptions of the author(s) of the document and are subject
to change without notice. The document may employ data
derived from third-party sources. No representation is made by
Two Sigma as to the accuracy of such information and the use
of such information in no way implies an endorsement of the
source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos
or other material used herein may be owned by entities other
than Two Sigma. If so, such copyrights and/or trademarks are
most likely owned by the entity that created the material and
are used purely for identification and comment as fair use
under international copyright and/or trademark laws. Use of
such image, copyright or trademark does not imply any
association with such organization (or endorsement of such
organization) by Two Sigma, nor vice versa.
Legal
Disclaimer
Introduction
My Background
• Graduated from Columbia University
• B.S. in computer science and psychology
• My 4th GHC – 1st time participating in OSD!
• Currently, a software engineer at Two Sigma in New York
• What is Two Sigma?
• Investment management firm that uses technology and lots of data to drive
decisions
In this talk…
• Walk through designing a metrics system for a high performance data platform
• Using open source solutions every step of the way
[Diagram: a chain of problem -> solution steps leading to the goal]
The Takeaways
1. Engineering a new system can involve less code than you think
2. Know the problem before you look for a solution
3. Be careful what you choose: not all open source tools are made (or supported) equally
Let’s set up the problem
Purpose of a Metrics System
• We want to measure usage metrics because:
  • It’s important to know how our system is being used and by whom
  • If we know what people want to do, we can do a better job doing it
  • We can identify trends and anticipate how user needs may change
The Data Platform
High performance data
platform
Up to 50,000 queries/sec
1.85 GiB/sec per node
Example query:
data = client.query(
    date_range=(20000101, 20180101),
    dataset="x",
    transformation="log")
Query Data
Query 1
{
  query_time: "20180209 09:00:05",
  user: "user1",
  dataset: "x",
  date_range: {
    begin: "20000101",
    end: "20180101"
  },
  duration: 100,
  bytes: 350000000,
  query_param_1: 1.0,
  query_param_2: "log",
  ...
}
Important for product planning
- What features and query parameters do
people use?
- How are queries distributed across data
sets?
- Who are our biggest users in terms of
number of queries and bytes transferred?
- How many distinct users do we have?
- How has all of this changed over time?
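The product-planning questions above boil down to simple aggregations over per-query records. Below is a minimal sketch, assuming records shaped like the "Query 1" example; the sample values (including the "sqrt" parameter and the second user and dataset) and the `summarize` helper are hypothetical and only illustrate the idea. In the system described in this talk, the aggregation is done by Elasticsearch and Kibana rather than by hand-written code.

```python
from collections import defaultdict

# Hypothetical sample records, shaped like the "Query 1" example.
records = [
    {"user": "user1", "dataset": "x", "bytes": 350000000, "query_param_2": "log"},
    {"user": "user2", "dataset": "x", "bytes": 150000000, "query_param_2": "log"},
    {"user": "user1", "dataset": "y", "bytes": 50000000, "query_param_2": "sqrt"},
]


def summarize(records):
    """Aggregate query counts per dataset, bytes per user, and distinct users."""
    queries_per_dataset = defaultdict(int)
    bytes_per_user = defaultdict(int)
    users = set()
    for record in records:
        queries_per_dataset[record["dataset"]] += 1
        bytes_per_user[record["user"]] += record["bytes"]
        users.add(record["user"])
    return {
        "queries_per_dataset": dict(queries_per_dataset),
        "bytes_per_user": dict(bytes_per_user),
        "distinct_users": len(users),
    }


summary = summarize(records)
print(summary)
```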
The Challenge of Query-Level Granularity
[Diagram: Query 1, Query 2, ..., Query n arriving one after another]
[Plot: query rate over time]
Bursts up to 50,000 queries/sec
The End Goal
[Diagram: the high performance data platform (up to 50,000 queries/sec, 1.85 GiB/sec per node) emitting a stream of queries]
We need more insight into who is using our data platform and how it’s being used.
Our goal: collect and analyze usage metrics with query-level granularity without impacting the performance and reliability of the data platform.
Let’s build the metrics system
Problem: what to do with metrics data?
• Store it with flexible schema
• Be able to analyze & visualize the data quickly
How do we pick what OS offering to use?
Checklist:
☐ Does it have the right features and potential to solve your problem?
☐ Is it internally available or supported?
☐ Licensing?
☐ Is it supported by an active OS community?
☐ How many active developers?
☐ When was the most recent commit/pull request?
☐ Is it extensible? (e.g. plugins, patches)
☐ Versioning? Backwards compatible changes?
---------------- Open Source Offerings ----------------
Product | Allows flexible data schema? | Data analysis & visualization? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning?
[Table rows: the candidate products and their checkmarks were shown as images]
Solution: Elasticsearch + Kibana
Elasticsearch is an open source platform that can store event data for easy searching and analysis.
Companion tools like Kibana make data analysis and visualization easy.
Destination for the data is Elasticsearch
We want to get our query data into “indexes” in Elasticsearch. An index per
day makes for easy searching and archiving across time.
metrics-2018-01-01
metrics-2018-01-02
...
metrics-2018-01-08
metrics-2018-01-09
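To see why an index per day makes searching and archiving across time easy, here is a small sketch. It assumes the `metrics-YYYY-MM-DD` naming shown above; the helper names `index_name` and `indexes_between` are hypothetical. Any date range maps directly to a list of index names to search, and anything older than a cutoff can be archived index by index.

```python
from datetime import date, timedelta
from typing import List


def index_name(day: date, prefix: str = "metrics") -> str:
    """Build the daily index name, e.g. metrics-2018-01-01."""
    return f"{prefix}-{day.isoformat()}"


def indexes_between(start: date, end: date) -> List[str]:
    """List the daily index names covering start..end, inclusive."""
    days = (end - start).days
    return [index_name(start + timedelta(days=i)) for i in range(days + 1)]


names = indexes_between(date(2018, 1, 1), date(2018, 1, 9))
print(names[0], "...", names[-1])
```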
Problem: Elasticsearch can’t handle throughput
Elasticsearch was not built to handle 50,000 msgs/sec.
We don’t want Elasticsearch performance to hurt the performance of our data platform.
[Diagram: queries -> ??? -> Elasticsearch daily indexes]
Idea: use a buffer to handle the throughput bursts
[Diagram: queries -> input flow -> (buffer) -> output flow -> Elasticsearch daily indexes]
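The buffer idea can be illustrated with a toy simulation. This is only a sketch under simplified assumptions: time advances in discrete ticks, the sink accepts a fixed number of messages per tick, and the numbers are scaled down and made up. The `simulate` function is hypothetical and not part of any tool mentioned in this talk.

```python
from collections import deque


def simulate(arrivals, drain_rate):
    """Return the buffer depth after each tick.

    arrivals: number of messages produced in each tick.
    drain_rate: maximum number of messages the sink accepts per tick.
    """
    buffer = deque()
    depths = []
    for count in arrivals:
        # Producers append without waiting for the sink.
        buffer.extend(range(count))
        # The sink drains at its own steady pace.
        for _ in range(min(drain_rate, len(buffer))):
            buffer.popleft()
        depths.append(len(buffer))
    return depths


# A burst of 50 messages arrives in the second tick; the sink takes 20 per tick.
depths = simulate(arrivals=[5, 50, 5, 0, 0, 0], drain_rate=20)
print(depths)
```

The burst is absorbed by the buffer and drained over the following ticks, so the producer side never has to slow down to the sink's pace.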
---------------- Open Source Offerings ----------------
Product | Can handle throughput bursts? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning?
[Table rows: candidate products shown as images, one of them annotated “As-a-service”]
Solution: Kafka
Kafka is an open source streaming platform that allows you to produce data to and consume data from a Kafka topic (a partitioned queue).
It’s designed for high throughput and low latency.
Kafka is more suitable for our throughput
Kafka can handle the high throughput bursts in data that Elasticsearch couldn’t.
[Diagram: queries from our data platform are sent by many asynchronous Java client producers, round-robin, to partitions 0 through n of a Kafka topic]
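The round-robin spreading of messages across partitions can be sketched as follows. This is a plain-Python illustration of the idea only; it does not use the Kafka client, and the function name and message labels are hypothetical. In a real producer the partitioning strategy is a client-side setting.

```python
from itertools import cycle


def assign_round_robin(messages, num_partitions):
    """Distribute messages over partitions in round-robin order."""
    partitions = [[] for _ in range(num_partitions)]
    targets = cycle(range(num_partitions))
    for message in messages:
        partitions[next(targets)].append(message)
    return partitions


partitions = assign_round_robin([f"query-{i}" for i in range(10)], num_partitions=4)
for number, contents in enumerate(partitions):
    print(f"Partition {number}: {contents}")
```

Because no single partition receives all of the traffic, the write load of a burst is shared evenly.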
Kafka as a Buffer
We can use Kafka as an intermediary buffer to store our metrics before writing to Elasticsearch.
[Diagram: our data platform -> queries -> Kafka partitions 0 through n -> Elasticsearch daily indexes (metrics-2018-01-01, ..., metrics-2018-01-09)]
Problem: How to get from Kafka to Elasticsearch?
[Diagram: our data platform -> queries -> Kafka partitions 0 through n -> ? ? ? -> Elasticsearch daily indexes]
---------------- Open Source Offerings ----------------
Product | Easy reading from Kafka? | Easy writing to Elasticsearch? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning?
[Table rows: the candidate products and their checkmarks were shown as images]
Solution: Logstash
Logstash is an open source data processing pipeline.
It ingests -> transforms -> and “stashes” data.
We can use Logstash to ingest data from Kafka, transform it as we’d like, and stash it in Elasticsearch.
[Diagram: our data platform -> queries -> Kafka partitions 0 through n -> Logstash -> Elasticsearch daily indexes]
Logstash connects Kafka to Elasticsearch
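# Read JSON-encoded usage events from the Kafka topic.
input {
  kafka {
    bootstrap_servers => "kafka-server.host:9095"
    topic_id => "usage-topic"
    codec => "json"
    group_id => "consumer-group-a"
  }
}
# Use the query time as the event timestamp, then drop the original field.
filter {
  date {
    match => ["query_time", "UNIX_MS"]
    remove_field => ["query_time"]
  }
}
# Write each event to the daily index that matches its timestamp.
output {
  elasticsearch {
    index => "metrics-%{+YYYY-MM-dd}"
    hosts => ["metrics.elasticsearch.host:443"]
  }
}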
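To make the three stages of this pipeline concrete, here is a rough Python equivalent of what happens to a single message. It is a sketch, not how Logstash is implemented. It assumes `query_time` arrives as epoch milliseconds (as the UNIX_MS pattern implies), and the `transform` function and the `_index` key are hypothetical names used only for illustration.

```python
import json
from datetime import datetime, timezone


def transform(raw_message: str) -> dict:
    """Mimic the input/filter/output stages for one Kafka message."""
    # input: decode the JSON payload (codec => "json")
    event = json.loads(raw_message)
    # filter: parse query_time as epoch milliseconds and drop the field
    millis = event.pop("query_time")
    timestamp = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    event["@timestamp"] = timestamp.isoformat()
    # output: choose the daily index from the event timestamp
    event["_index"] = f"metrics-{timestamp:%Y-%m-%d}"
    return event


doc = transform('{"query_time": 1518166805000, "user": "user1", "dataset": "x"}')
print(doc)
```

The date filter matters because the index name is derived from the event timestamp: events land in the index for the day the query ran, not the day Logstash happened to process them.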
Logstash can scale
We can scale up to the number of partitions in the Kafka topic: within one consumer group, each partition is read by a single instance.
It is as easy as starting more Logstash instances with the same configuration.
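The limit on scaling can be sketched with a toy partition assignment. This is a simplification: Kafka's actual assignment strategies differ in detail, and `assign_partitions` is a hypothetical helper. It only shows that each partition goes to one instance in the consumer group, so instances beyond the partition count have nothing to read.

```python
def assign_partitions(num_partitions, num_instances):
    """Give each partition to exactly one instance, spreading them evenly."""
    assignment = {instance: [] for instance in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment


# Two instances share four partitions.
four = assign_partitions(num_partitions=4, num_instances=2)
# Six instances on four partitions: two instances sit idle.
six = assign_partitions(num_partitions=4, num_instances=6)
idle = [instance for instance, parts in six.items() if not parts]
print(four, idle)
```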
Problem: how to manage many logstash instances?
With multiple Logstash instances, we need a way to manage them.
[Diagram: Kafka partitions 0 through n -> several Logstash instances (? ? ?) -> Elasticsearch daily indexes]
---------------- Open Source Offerings ----------------
Product | Management of non-web services? | Minimal overhead? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning?
[Table rows: candidate products shown as images; the one surviving text label is “TS Waiter”]
Solution: Marathon
Marathon is an open source container orchestration platform.
It schedules, monitors, and restarts applications as needed.
Marathon keeps Logstash instances up and running
We use Marathon to manage our Logstash instances.
[Diagram: Kafka partitions 0 through n -> Logstash instances managed by Marathon -> Elasticsearch daily indexes]
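As a rough idea of what handing Logstash to Marathon might look like, the sketch below builds an app definition as JSON. The field names (`id`, `cmd`, `cpus`, `mem`, `instances`) follow Marathon's app definition format, but every value here is an assumption: the app id, the config file name, and the resource sizes are hypothetical and not taken from the system described in this talk.

```python
import json


def logstash_app(instances: int) -> dict:
    """Build a Marathon-style app definition for a pool of Logstash instances."""
    return {
        "id": "/metrics/logstash",                       # hypothetical app id
        "cmd": "bin/logstash -f metrics-pipeline.conf",  # hypothetical config file
        "cpus": 1.0,                                     # assumed resource sizes
        "mem": 2048,
        "instances": instances,
    }


app = logstash_app(instances=4)
payload = json.dumps(app, indent=2)
print(payload)
```

Scaling then becomes a matter of changing the `instances` count, since every instance runs with the same configuration.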
That was a lot of new tech,
let’s recap.
The Complete Metrics System
[Diagram: our data platform -> queries -> Kafka partitions 0 through n -> Logstash instances -> Elasticsearch daily indexes (metrics-2018-01-01, ..., metrics-2018-01-09)]
• Data platform with up to 50,000 queries/sec
• Kafka as a high throughput, low latency buffer
• Logstash instances running on Marathon ingesting Kafka data and stashing it in Elasticsearch
• Elasticsearch indexes our metrics and we can analyze them using Kibana!
But wait, why did we use open
source solutions in the first
place?
The Alternatives
• Write your own code from scratch
• Support burden is all on you
• Use internal solutions if they exist
• E.g. your company may choose not to develop custom solutions for process
management if its business goals and strengths are more in the domain of data
analysis and modeling.
Open Source Benefits
• Use solutions that have been tested and developed by a community of contributors
and users
• Save developer time
• I was able to design, deliver, & deploy this metrics system in under a month
• Can give back to the OS community. E.g. if you find a bug or missing feature
• Report the issue
• Make a contribution
Results
Metrics we’ve seen with our new system
Sum of bytes aggregated by data set
% distribution of
‘dataset’ query
parameter
Dataset test_data_x
Number of unique users per dataset
Number of unique users over time
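Charts like these correspond to Elasticsearch aggregations. The sketch below builds one possible request body: a terms aggregation per dataset with a sum of bytes and a cardinality count of users. The field names come from the query record shown earlier; the aggregation names and the assumption that `dataset` and `user` are mapped as keyword fields are mine, and in practice Kibana builds requests like this for you.

```python
import json


def usage_by_dataset_query(top_n: int = 10) -> dict:
    """Build a search body that aggregates bytes and unique users per dataset."""
    return {
        "size": 0,  # return only aggregation results, no individual documents
        "aggs": {
            "by_dataset": {
                "terms": {"field": "dataset", "size": top_n},
                "aggs": {
                    "total_bytes": {"sum": {"field": "bytes"}},
                    "unique_users": {"cardinality": {"field": "user"}},
                },
            }
        },
    }


body = usage_by_dataset_query()
print(json.dumps(body, indent=2))
```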
Engineering Lessons Learned
• Requirements shape the system you build
• Problem-solution approach
• Use open source tools when you can
• Why reinvent the wheel?
• Support from the open source community
• BUT be mindful when choosing open source solutions
• Sometimes it’s about orchestrating the pieces together
• Configurations are important
Thanks for listening!
Stop by the Two Sigma booth
Visit https://opensource.twosigma.com/
Email me at Hyonjee.Joo@twosigma.com
Chinmay Patel2K views
Robert Murphy Driving Value from Smart Manufacturing by Rockwell Automation
Robert Murphy Driving Value from Smart ManufacturingRobert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart Manufacturing
BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics Dell Statisti... by Big Data Week
BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics  Dell Statisti...BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics  Dell Statisti...
BDW Chicago 2016 - John K. Thompson, GM for Advanced Analytics Dell Statisti...
Big Data Week91 views
How to make your data scientists happy by Hussain Sultan
How to make your data scientists happy How to make your data scientists happy
How to make your data scientists happy
Hussain Sultan159 views
Introduction to Google Cloud Platform for Big Data - Trusted Conf by In Marketing We Trust
Introduction to Google Cloud Platform for Big Data - Trusted ConfIntroduction to Google Cloud Platform for Big Data - Trusted Conf
Introduction to Google Cloud Platform for Big Data - Trusted Conf
Tiger graph 2021 corporate overview [read only] by ercan5
Tiger graph 2021 corporate overview [read only]Tiger graph 2021 corporate overview [read only]
Tiger graph 2021 corporate overview [read only]
ercan5333 views
Scaling Your Enterprise With Data Science by SuperFluid Labs
Scaling Your Enterprise With Data ScienceScaling Your Enterprise With Data Science
Scaling Your Enterprise With Data Science
SuperFluid Labs129 views
Edge Computing: Bringing the Internet Closer to You by Megan O'Keefe
Edge Computing: Bringing the Internet Closer to YouEdge Computing: Bringing the Internet Closer to You
Edge Computing: Bringing the Internet Closer to You
Megan O'Keefe2.3K views
Moneyball: Using Advanced Account Insights for Effective ABM Activation by Engagio
Moneyball: Using Advanced Account Insights for Effective ABM ActivationMoneyball: Using Advanced Account Insights for Effective ABM Activation
Moneyball: Using Advanced Account Insights for Effective ABM Activation
Engagio801 views
Big Data LDN 2018: THE NEXT WAVE: DATA, AI AND ANALYTICS IN 2019 AND BEYOND by Matt Stubbs
Big Data LDN 2018: THE NEXT WAVE: DATA, AI AND ANALYTICS IN 2019 AND BEYONDBig Data LDN 2018: THE NEXT WAVE: DATA, AI AND ANALYTICS IN 2019 AND BEYOND
Big Data LDN 2018: THE NEXT WAVE: DATA, AI AND ANALYTICS IN 2019 AND BEYOND
Matt Stubbs753 views
CHAT-GPT Prompts for Grant Writing, Fundraising, and Marketing.pdf by TechSoup
CHAT-GPT Prompts for Grant Writing, Fundraising, and Marketing.pdfCHAT-GPT Prompts for Grant Writing, Fundraising, and Marketing.pdf
CHAT-GPT Prompts for Grant Writing, Fundraising, and Marketing.pdf
TechSoup 707 views
How To Drive Exponential Growth Using Unconventional Data Sources by Chartio
How To Drive Exponential Growth Using Unconventional Data SourcesHow To Drive Exponential Growth Using Unconventional Data Sources
How To Drive Exponential Growth Using Unconventional Data Sources
Chartio599 views
Big Data & Analytics, Peter Jönsson by IBM Danmark
Big Data & Analytics, Peter JönssonBig Data & Analytics, Peter Jönsson
Big Data & Analytics, Peter Jönsson
IBM Danmark2.9K views

More from Two Sigma

The State of Open Data on School Bullying by
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School BullyingTwo Sigma
960 views39 slides
Halite @ Google Cloud Next 2018 by
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018Two Sigma
292 views25 slides
BeakerX - Tiezheng Li by
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng LiTwo Sigma
1.2K views21 slides
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson by
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonTwo Sigma
504 views48 slides
Waiter: An Open-Source Distributed Auto-Scaler by
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-ScalerTwo Sigma
564 views15 slides
The Language of Compression - Leif Walsh by
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif WalshTwo Sigma
197 views153 slides

More from Two Sigma(16)

The State of Open Data on School Bullying by Two Sigma
The State of Open Data on School BullyingThe State of Open Data on School Bullying
The State of Open Data on School Bullying
Two Sigma960 views
Halite @ Google Cloud Next 2018 by Two Sigma
Halite @ Google Cloud Next 2018Halite @ Google Cloud Next 2018
Halite @ Google Cloud Next 2018
Two Sigma292 views
BeakerX - Tiezheng Li by Two Sigma
BeakerX - Tiezheng LiBeakerX - Tiezheng Li
BeakerX - Tiezheng Li
Two Sigma1.2K views
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson by Two Sigma
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel HudsonBringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Bringing Linux back to the Server BIOS with LinuxBoot - Trammel Hudson
Two Sigma504 views
Waiter: An Open-Source Distributed Auto-Scaler by Two Sigma
Waiter: An Open-Source Distributed Auto-ScalerWaiter: An Open-Source Distributed Auto-Scaler
Waiter: An Open-Source Distributed Auto-Scaler
Two Sigma564 views
The Language of Compression - Leif Walsh by Two Sigma
The Language of Compression - Leif WalshThe Language of Compression - Leif Walsh
The Language of Compression - Leif Walsh
Two Sigma197 views
Identifying Emergent Behaviors in Complex Systems - Jane Adams by Two Sigma
Identifying Emergent Behaviors in Complex Systems - Jane AdamsIdentifying Emergent Behaviors in Complex Systems - Jane Adams
Identifying Emergent Behaviors in Complex Systems - Jane Adams
Two Sigma300 views
Algorithmic Data Science = Theory + Practice by Two Sigma
Algorithmic Data Science = Theory + PracticeAlgorithmic Data Science = Theory + Practice
Algorithmic Data Science = Theory + Practice
Two Sigma1.3K views
HUOHUA: A Distributed Time Series Analysis Framework For Spark by Two Sigma
HUOHUA: A Distributed Time Series Analysis Framework For SparkHUOHUA: A Distributed Time Series Analysis Framework For Spark
HUOHUA: A Distributed Time Series Analysis Framework For Spark
Two Sigma383 views
Improving Python and Spark Performance and Interoperability with Apache Arrow by Two Sigma
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
Two Sigma585 views
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix... by Two Sigma
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
Two Sigma1.1K views
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar... by Two Sigma
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Exploring the Urban – Rural Incarceration Divide: Drivers of Local Jail Incar...
Two Sigma983 views
Graph Summarization with Quality Guarantees by Two Sigma
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
Two Sigma991 views
Rademacher Averages: Theory and Practice by Two Sigma
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
Two Sigma619 views
Credit-Implied Volatility by Two Sigma
Credit-Implied VolatilityCredit-Implied Volatility
Credit-Implied Volatility
Two Sigma2.5K views
Principles of REST API Design by Two Sigma
Principles of REST API DesignPrinciples of REST API Design
Principles of REST API Design
Two Sigma1.6K views

Recently uploaded

Proposal Presentation.pptx by
Proposal Presentation.pptxProposal Presentation.pptx
Proposal Presentation.pptxkeytonallamon
29 views36 slides
zincalume water storage tank design.pdf by
zincalume water storage tank design.pdfzincalume water storage tank design.pdf
zincalume water storage tank design.pdf3D LABS
5 views1 slide
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,... by
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...AakashShakya12
72 views115 slides
Saikat Chakraborty Java Oracle Certificate.pdf by
Saikat Chakraborty Java Oracle Certificate.pdfSaikat Chakraborty Java Oracle Certificate.pdf
Saikat Chakraborty Java Oracle Certificate.pdfSaikatChakraborty787148
15 views1 slide
Design of machine elements-UNIT 3.pptx by
Design of machine elements-UNIT 3.pptxDesign of machine elements-UNIT 3.pptx
Design of machine elements-UNIT 3.pptxgopinathcreddy
32 views31 slides
DESIGN OF SPRINGS-UNIT4.pptx by
DESIGN OF SPRINGS-UNIT4.pptxDESIGN OF SPRINGS-UNIT4.pptx
DESIGN OF SPRINGS-UNIT4.pptxgopinathcreddy
19 views47 slides

Recently uploaded(20)

zincalume water storage tank design.pdf by 3D LABS
zincalume water storage tank design.pdfzincalume water storage tank design.pdf
zincalume water storage tank design.pdf
3D LABS5 views
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,... by AakashShakya12
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
AakashShakya1272 views
Design of machine elements-UNIT 3.pptx by gopinathcreddy
Design of machine elements-UNIT 3.pptxDesign of machine elements-UNIT 3.pptx
Design of machine elements-UNIT 3.pptx
gopinathcreddy32 views
MSA Website Slideshow (16).pdf by msaucla
MSA Website Slideshow (16).pdfMSA Website Slideshow (16).pdf
MSA Website Slideshow (16).pdf
msaucla68 views
Introduction to CAD-CAM.pptx by suyogpatil49
Introduction to CAD-CAM.pptxIntroduction to CAD-CAM.pptx
Introduction to CAD-CAM.pptx
suyogpatil495 views
_MAKRIADI-FOTEINI_diploma thesis.pptx by fotinimakriadi
_MAKRIADI-FOTEINI_diploma thesis.pptx_MAKRIADI-FOTEINI_diploma thesis.pptx
_MAKRIADI-FOTEINI_diploma thesis.pptx
fotinimakriadi8 views
NEW SUPPLIERS SUPPLIES (copie).pdf by georgesradjou
NEW SUPPLIERS SUPPLIES (copie).pdfNEW SUPPLIERS SUPPLIES (copie).pdf
NEW SUPPLIERS SUPPLIES (copie).pdf
georgesradjou15 views
Advances in micro milling: From tool fabrication to process outcomes by Shivendra Nandan
Advances in micro milling: From tool fabrication to process outcomesAdvances in micro milling: From tool fabrication to process outcomes
Advances in micro milling: From tool fabrication to process outcomes

Engineering with Open Source - Hyonjee Joo

  • 1. Engineering with Open Source: Building a High Performance Metrics System Using Open Source Software #GHC18 Hyonjee Joo | @twosigma
  • 2. PAGE 2 | GRACE HOPPER CELEBRATION 2018 PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the assumptions of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. Legal Disclaimer #GHC 18
  • 4. My Background • Graduated from Columbia University • B.S. in computer science and psychology • My 4th GHC – 1st time participating in OSD! • Currently a software engineer at Two Sigma in New York • What is Two Sigma? • Investment management firm that uses technology and lots of data to drive decisions
  • 6. In this talk… • Walk through designing a metrics system for a high performance data platform • Using open source solutions every step of the way • Diagram: a chain of problem and solution steps leading to the goal
  • 9. The Takeaways 1. Engineering a new system can involve less code than you think 2. Know the problem before you look for a solution 3. Be careful what you choose: not all open source tools are made (or supported) equally
  • 10. Let’s set up the problem
  • 11. Purpose of a Metrics System • We want to measure usage metrics because: • It’s important to know how our system is being used and by whom • If we know what people want to do, we can do a better job of doing it • We can identify trends and anticipate how user needs may change
  • 12. The Data Platform • High performance data platform • Up to 50,000 queries/sec • 1.85 GiB/sec per node • Example query: data = client.query(date_range=(20000101, 20180101), dataset="x", transformation="log")
  • 20. Query Data • Query 1: { query_time: "20180209 09:00:05", user: "user1", dataset: "x", date_range: { begin: "20000101", end: "20180101" }, duration: 100, bytes: 350000000, query_param_1: 1.0, query_param_2: "log", ... } • Important for product planning: - What features and query parameters do people use? - How are queries distributed across data sets? - Who are our biggest users in terms of number of queries and bytes transferred? - How many distinct users do we have? - How has all of this changed over time?
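The product-planning questions on this slide reduce to simple aggregations over the per-query records. A minimal Python sketch, assuming records shaped like the Query 1 example; the second and third records, the "sqrt" value, and the helper name `summarize` are made up for illustration and are not from the talk:

```python
from collections import Counter

# Hypothetical sample records shaped like the "Query 1" example.
# Only the first record mirrors the slide; the others are invented.
records = [
    {"user": "user1", "dataset": "x", "bytes": 350000000, "query_param_2": "log"},
    {"user": "user2", "dataset": "y", "bytes": 1000, "query_param_2": "log"},
    {"user": "user1", "dataset": "x", "bytes": 2000, "query_param_2": "sqrt"},
]


def summarize(records):
    """Aggregate per-query records into a few usage metrics."""
    queries_per_user = Counter(r["user"] for r in records)
    bytes_per_user = Counter()
    for r in records:
        bytes_per_user[r["user"]] += r["bytes"]
    return {
        "distinct_users": len(queries_per_user),
        "queries_per_user": dict(queries_per_user),
        "bytes_per_user": dict(bytes_per_user),
        "queries_per_dataset": dict(Counter(r["dataset"] for r in records)),
        "param_usage": dict(Counter(r["query_param_2"] for r in records)),
    }


summary = summarize(records)
```

At query-level granularity the same aggregations have to run over very large numbers of records, which is why the rest of the talk focuses on where the records are stored and how they get there.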
  • 24. The Challenge of Query-Level Granularity • Queries arrive as a stream: Query 1, Query 2, ..., Query n • Query rate over time: bursts up to 50,000 queries/sec
  • 25. The End Goal • High performance data platform: up to 50,000 queries/sec, 1.85 GiB/sec per node • We need more insight into who is using our data platform and how it’s being used. • Our goal: collect and analyze usage metrics with query-level granularity without impacting the performance and reliability of the data platform.
  • 26. Let’s build the metrics system
  • 28. Problem: what to do with metrics data? • Store it with a flexible schema • Be able to analyze & visualize the data quickly • Open Source Offerings: [product logos not captured]
  • 29. How do we pick which open source (OS) offering to use? Checklist: • Does it have the right features and potential to solve your problem? • Is it internally available or supported? Licensing? • Is it supported by an active OS community? How many active developers? When was the most recent commit/pull request? • Is it extensible? (e.g. plugins, patches) • Versioning? Backwards compatible changes?
  • 31. Problem: what to do with metrics data? • Open Source Offerings comparison table, columns: Product | Allows flexible data schema? | Data analysis & visualization? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning? [product rows not captured]
  • 33. Solution: Elasticsearch + Kibana • Elasticsearch is an open source platform that can store event data for easy searching and analysis. • There are companion tools, like Kibana, that make data analysis and visualization easy.
  • 34. Destination for the data is Elasticsearch • We want to get our query data into “indexes” in Elasticsearch. • An index per day makes for easy searching and archiving across time: metrics-2018-01-01, metrics-2018-01-02, ..., metrics-2018-01-08, metrics-2018-01-09
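The index-per-day naming can be sketched as a small function that maps a query timestamp to its daily index. This is only an illustration in Python: the helper name `index_for` is hypothetical, and it assumes the `query_time` string format shown in the Query 1 example.

```python
from datetime import datetime


def index_for(query_time: str, prefix: str = "metrics") -> str:
    """Map a query timestamp such as "20180209 09:00:05" to its daily index name."""
    ts = datetime.strptime(query_time, "%Y%m%d %H:%M:%S")
    return f"{prefix}-{ts:%Y-%m-%d}"


name = index_for("20180209 09:00:05")  # "metrics-2018-02-09"
```

Because the date is part of the index name, searching a time range means selecting a set of indexes, and archiving old data means dropping or moving whole indexes.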
  • 35. Problem: Elasticsearch can’t handle throughput • Elasticsearch was not built to handle 50,000 msgs/sec • We don’t want Elasticsearch performance to hurt the performance of our data platform. • Diagram: Queries -> ??? -> daily Elasticsearch indexes
  • 36. Problem: Elasticsearch can’t handle throughput • Idea: use a buffer to handle the throughput bursts • Diagram: Queries -> input flow -> (Buffer) -> output flow -> daily Elasticsearch indexes
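The buffer idea can be illustrated with a toy simulation: bursts go into the buffer at whatever rate they arrive, and the downstream store drains it at a steady rate it can sustain. A minimal Python sketch; the function name `simulate_buffer` and the drain rate of 20,000 msgs/sec are assumptions for illustration, not measurements from the talk.

```python
def simulate_buffer(arrivals_per_sec, drain_per_sec):
    """Return the number of buffered messages at the end of each second."""
    depth = 0
    depths = []
    for arrivals in arrivals_per_sec:
        depth += arrivals                   # producers write at burst speed
        depth -= min(drain_per_sec, depth)  # consumer drains at its own pace
        depths.append(depth)
    return depths


# One 50,000-message burst followed by quiet seconds and a small burst.
depths = simulate_buffer([50000, 0, 0, 1000, 0], drain_per_sec=20000)
```

The producer side never waits on the slower consumer; the backlog simply grows during the burst and empties over the following seconds.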
  • 38. Problem: Elasticsearch can’t handle throughput • Open Source Offerings comparison table, columns: Product | Can handle throughput bursts? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning? [product rows not captured, apart from the label “As-a-service”]
  • 39. Solution: Kafka • Kafka is an open source streaming platform that allows you to produce data to & consume data from a Kafka topic (a partitioned queue). • It’s designed for high throughput and low latency.
  • 41. Kafka is more suitable for our throughput • Kafka can handle the high throughput bursts in data that Elasticsearch couldn’t. • Diagram: queries from our data platform -> many asynchronous Java client producers -> round-robin across the topic’s partitions (Partition 0, Partition 1, Partition 2, Partition 3, ..., Partition n), with Kafka as the buffer between input flow and output flow
  • 42. Kafka as a Buffer • We can use Kafka as an intermediary buffer to store our metrics before writing to Elasticsearch. • Diagram: our data platform (queries) -> Kafka partitions 0 to n -> daily Elasticsearch indexes
  • 43. Problem: How to get from Kafka to Elasticsearch? • Diagram: the link between the Kafka partitions and the daily Elasticsearch indexes is marked “? ? ?”
  • 45. Problem: How to get from Kafka to Elasticsearch? • Open Source Offerings comparison table, columns: Product | Easy reading from Kafka? | Easy writing to Elasticsearch? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning? [product rows not captured]
  • 46. Solution: Logstash • Logstash is an open source data processing pipeline. It ingests, transforms, and “stashes” data.
  • 47. Solution: Logstash • We can use Logstash to ingest data from Kafka, transform it as we’d like, and stash it in Elasticsearch.
  • 48. Logstash connects Kafka to Elasticsearch
        input {
          kafka {
            bootstrap_servers => "kafka-server.host:9095"
            topic_id => "usage-topic"
            codec => "json"
            group_id => "consumer-group-a"
          }
        }
        filter {
          date {
            match => ["query_time", "UNIX_MS"]
            remove_field => ["query_time"]
          }
        }
        output {
          elasticsearch {
            index => "metrics-%{+YYYY-MM-dd}"
            hosts => ["metrics.elasticsearch.host:443"]
          }
        }
  • 49. Logstash can scale • Can scale up to the number of partitions in the Kafka topic. • Easy as starting more Logstash instances with the same configuration.
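The partition limit follows from how a Kafka consumer group works: each partition is read by at most one consumer in the group, so instances beyond the partition count have nothing to read. The Python sketch below uses a simplified round-robin split to show this; it is an illustration only, not Kafka's actual assignment algorithm, and `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(num_partitions, num_instances):
    """Spread partitions over consumer instances; each partition gets exactly one owner."""
    assignment = {instance: [] for instance in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment


two_instances = assign_partitions(num_partitions=4, num_instances=2)
six_instances = assign_partitions(num_partitions=4, num_instances=6)
idle = [i for i, parts in six_instances.items() if not parts]
```

With four partitions, two instances each read two partitions, while six instances leave two of them idle, so adding instances past the partition count adds no throughput.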
  • 50. Problem: how to manage many Logstash instances? • With multiple Logstash instances, we need a way to manage them.
  • 52. Problem: how to manage many Logstash instances? • Open Source Offerings comparison table, columns: Product | Management of non-web services? | Minimal overhead? | Internally available or accessible at Two Sigma? | OS community support? | Extensible? | Stable versioning? [product rows not captured, apart from the label “TS Waiter”]
  • 53. Solution: Marathon. Marathon is an open source container orchestration platform. It schedules, monitors, and restarts applications as needed.
  • 54. Marathon keeps Logstash instances up and running. We use Marathon to manage our Logstash instances.
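As a rough sketch of what handing Logstash to Marathon might look like, the snippet below builds a minimal Marathon app definition as JSON. The fields `id`, `cmd`, `instances`, `cpus`, and `mem` are standard app definition fields, but every value here (the app id, the command path, the resource sizes, the instance count) is a hypothetical placeholder rather than the setup used in the talk.

```python
import json


def logstash_app_definition(instances: int) -> str:
    """Build a minimal Marathon app definition for a Logstash pipeline."""
    app = {
        "id": "/metrics/logstash",                    # hypothetical app id
        "cmd": "bin/logstash -f metrics-pipeline.conf",  # hypothetical path
        "instances": instances,  # Marathon keeps this many copies running
        "cpus": 1.0,             # assumed resource sizes
        "mem": 2048,
    }
    return json.dumps(app, indent=2)


# One instance per Kafka partition, assuming a four-partition topic.
print(logstash_app_definition(instances=4))
```

A definition like this would typically be submitted to Marathon's REST API (POST to `/v2/apps`); scaling then amounts to changing `instances`, and Marathon restarts any instance that exits.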
  • 55. That was a lot of new tech, let’s recap. #GHC18
  • 56. The Complete Metrics System • Data platform with up to 50,000 queries/sec • Kafka as a high-throughput, low-latency buffer • Logstash instances running on Marathon, ingesting Kafka data and stashing it in Elasticsearch • Elasticsearch indexes our metrics, and we can analyze them using Kibana!
  • 57. But wait, why did we use open source solutions in the first place? #GHC18
  • 58. The Alternatives • Write your own code from scratch • Support burden is all on you • Use internal solutions if they exist • E.g., your company may choose not to develop custom solutions for process management if its business goals and strengths are more in the domain of data analysis and modeling.
  • 59. Open Source Benefits • Use solutions that have been tested and developed by a community of contributors and users • Save developer time • I was able to design, deliver, & deploy this metrics system in under a month • Can give back to the OS community, e.g. if you find a bug or missing feature • Report the issue • Make a contribution
  • 61. Metrics we’ve seen with our new system • Sum of bytes aggregated by data set • % distribution of ‘dataset’ query parameter [Charts; axis label: Dataset; example value: test_data_x]
  • 62. Metrics we’ve seen with our new system • Number of unique users per dataset • Number of unique users over time [Charts; axis label: Dataset]
  • 63. Metrics we’ve seen with our new system
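The metrics named on these slides (sum of bytes per dataset, unique users per dataset) are simple aggregations over the indexed usage events. The sketch below computes them in plain Python over a few made-up records to show what is being measured; in the system described, Elasticsearch and Kibana do this work. The field names `bytes` and `user` and all sample values except `test_data_x` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def bytes_by_dataset(events: Iterable[dict]) -> Dict[str, int]:
    """Sum the bytes served, grouped by dataset."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["dataset"]] += event["bytes"]
    return dict(totals)


def unique_users_by_dataset(events: Iterable[dict]) -> Dict[str, int]:
    """Count distinct users, grouped by dataset."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        users[event["dataset"]].add(event["user"])
    return {dataset: len(names) for dataset, names in users.items()}


# Made-up sample events for illustration only.
sample = [
    {"dataset": "test_data_x", "bytes": 100, "user": "alice"},
    {"dataset": "test_data_x", "bytes": 250, "user": "bob"},
    {"dataset": "test_data_x", "bytes": 50, "user": "alice"},
    {"dataset": "test_data_y", "bytes": 75, "user": "carol"},
]
print(bytes_by_dataset(sample))
print(unique_users_by_dataset(sample))
```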
  • 64. Engineering Lessons Learned • Requirements shape the system you build • Problem-solution approach • Use open source tools when you can • Why reinvent the wheel? • Support from the open source community • BUT be mindful when choosing open source solutions • Sometimes it’s about orchestrating the pieces together • Configurations are important
  • 65. Thanks for listening! Stop by the Two Sigma booth Visit https://opensource.twosigma.com/ Email me at Hyonjee.Joo@twosigma.com #GHC18

Editor's Notes

  1. Pulsar – incubation Scribe – decommissioned fluentd