[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructure utilization and efficiency at scale

How Twitter built a framework to improve
infrastructure resource utilization at scale

Hello!
I am @VinuCharanya

★ Sr. Systems Engineer @Twitter

★ Proud Member of @TwitterWomen, @WomenWhoCode

Agenda
1. History & Context
2. Chargeback @Twitter
3. Kite - Service Lifecycle Manager
4. Impact & Future Work

[Diagram: infrastructure stack]
CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
INFRASTRUCTURE SERVICES:
  STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
  COMPUTE: MESOS/AURORA, HADOOP
  DB/DW: MYSQL, VERTICA, POSTGRES
DEPLOY (Workflows)
INFRASTRUCTURE & DATACENTER MANAGEMENT

Number of Servers
MESOS/AURORA, HADOOP, MANHATTAN
67%

How to get visibility into resources used by individual jobs & datasets?
How to attribute resource consumption to teams/organization?
How do you incentivize the right behavior to improve efficiency of resource usage?

Chargeback @Twitter
Ability to meter allocation & utilization of resources
per service, per project, per engineering team
to improve visibility & enable accountability

Chargeback @Twitter
Features
• Supports diverse Infra Services
• Meters abstract resources at daily granularity
• Detailed Reports

Chargeback @Twitter
Support diverse Infrastructure and Platform Services
1. Resource Catalog: Consistent way to inventory infrastructure resources
• Resource Fluidity: Support primitive (CPU) and abstract resource (“Tweets / second”). Extend existing resource
2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability

RESOURCE CATALOG ENTITY MODEL
PROVIDER (1:N) INFRASTRUCTURE SERVICE (1:N) OFFERINGS (1:N) OFFER MEASURES (1:N) OFFER MEASURE COST

Example 1
PROVIDER: TWITTER DC / PUBLIC CLOUD
OFFERINGS: COMPUTE
OFFER MEASURES: CORE-DAYS
OFFER MEASURE COST: $X

Example 2
PROVIDER: TWITTER DC
OFFERINGS: STORAGE, PROCESSING CLUSTER, …
OFFER MEASURES (per offering): GB-RAM, FILE ACCESSES, …
OFFER MEASURE COST: $X, $Y, … $M, $N, …
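The 1:N chain in the entity model can be pictured as nested records. The following is a minimal Python sketch, not the actual catalog implementation: every class and field name is an assumption, "Aurora" is borrowed from the /api/1/measures example, and the price is a placeholder because the slides only show $X.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class OfferMeasureCost:
    unit_price: float  # placeholder; the slides only show "$X"


@dataclass
class OfferMeasure:
    label: str  # e.g. "core-days"
    costs: List[OfferMeasureCost] = field(default_factory=list)


@dataclass
class Offering:
    label: str  # e.g. "Compute"
    measures: List[OfferMeasure] = field(default_factory=list)


@dataclass
class InfrastructureService:
    name: str
    offerings: List[Offering] = field(default_factory=list)


@dataclass
class Provider:
    name: str
    services: List[InfrastructureService] = field(default_factory=list)


def walk(provider: Provider) -> Iterator[Tuple[str, str, str, str]]:
    """Yield one (provider, service, offering, measure) path per offer measure."""
    for service in provider.services:
        for offering in service.offerings:
            for measure in offering.measures:
                yield (provider.name, service.name, offering.label, measure.label)


# Hypothetical catalog with a single path and a made-up unit price.
catalog = Provider("Twitter DC", [
    InfrastructureService("Aurora", [
        Offering("Compute", [OfferMeasure("core-days", [OfferMeasureCost(1.0)])]),
    ]),
])
paths = list(walk(catalog))
```

Keeping cost as its own record, rather than a field on the measure, leaves room for more than one price per measure over time.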

Chargeback @Twitter
/api/1/measures
{
  "measures": [
    {
      "measure_id": 1,
      "measure_label": "core-days",
      "measure_unit_label": "per 1 core-day",
      "offering_id": 1,
      "offering_label": "Compute",
      "infrastructure_id": 1,
      "infrastructure_name": "Aurora"
    },
    {
      "measure_id": 2,
      "measure_label": "machine-days",
      "measure_unit_label": "per 1 machine-day",
      "offering_id": 2,
      "offering_label": "zone:aquila",
      "infrastructure_id": 8,
      "infrastructure_name": "Physical Infrastructure"
    },
    …
  ]
}
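A client of this endpoint would probably want to look measures up by name. The sketch below is illustrative only: the sample payload copies a subset of the fields from the two entries shown on the slide, and `index_measures` is a hypothetical helper, not part of any Twitter API.

```python
import json

# Sample payload modeled on the two entries visible on the slide.
payload = json.loads("""
{
  "measures": [
    {"measure_id": 1, "measure_label": "core-days",
     "offering_label": "Compute", "infrastructure_name": "Aurora"},
    {"measure_id": 2, "measure_label": "machine-days",
     "offering_label": "zone:aquila",
     "infrastructure_name": "Physical Infrastructure"}
  ]
}
""")


def index_measures(doc: dict) -> dict:
    """Map (infrastructure_name, measure_label) to measure_id."""
    return {
        (m["infrastructure_name"], m["measure_label"]): m["measure_id"]
        for m in doc["measures"]
    }


measure_ids = index_measures(payload)
```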

So, how do you incentivize the right behavior to improve efficiency of resource usage?

Total Cost of Ownership for Aurora
Cost of Physical Server ($X / day)
Total available Cores:
• Operational Overhead
• Headroom
• Production Used Cores
• Non-Prod Used Cores
• Quota Buffer (Underutilized Quota)
• Container Size Buffer (Underutilized Reservation)

Groupings highlighted in the build:
• Total used Cores
• Excess Cores (incl. DR, Spikes, Overallocation)
• Cores used by platform for operations & maintenance

$X core-day
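One plausible reading of these slides is that the fleet's daily cost is recovered from the cores tenants can actually be billed for, so overhead and headroom are folded into the core-day price. The sketch below illustrates that reading only; the formula, the function name, and all numbers are assumptions, since the slides give only $X.

```python
def core_day_price(server_cost_per_day: float, servers: int,
                   total_cores: int, non_billable_cores: int) -> float:
    """Spread the fleet's daily cost over the cores that can be billed.

    non_billable_cores stands for capacity no tenant pays for directly,
    such as operational overhead and headroom (an assumed grouping).
    """
    billable_cores = total_cores - non_billable_cores
    if billable_cores <= 0:
        raise ValueError("no billable cores to spread the cost over")
    return server_cost_per_day * servers / billable_cores


# Illustrative numbers only: 100 servers at $10/day with 2,400 cores,
# 400 of which are overhead or headroom.
price = core_day_price(10.0, 100, 2400, 400)
```

Under this reading, a smaller billable pool raises the unit price, which is one way excess capacity becomes visible to the teams paying the bill.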


Chargeback @Twitter
Metering Pipeline (ETL Job)
[Diagram: INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2 → INGEST METRICS → RAW FACT → TRANSFORMER → RESOLVED FACT → REPORT. Also shown: RESOURCE CATALOG, IDENTIFIER OWNERSHIP MAPPING, DATA FIDELITY]

Metrics Ingestor
Schema(client_identifier, offering_measure, volume, metadata, timestamp)
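The ingestor schema maps naturally onto a small record type. This is a hedged sketch: the field names come from the slide, while the Python types, the validation rules, and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class RawFact:
    client_identifier: str   # e.g. an Aurora job key
    offering_measure: str    # e.g. "core-days"
    volume: float            # amount consumed in the period
    timestamp: datetime      # day the usage applies to
    # Listed last only because dataclass fields with defaults must follow
    # fields without them; the slide lists metadata before timestamp.
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.client_identifier:
            raise ValueError("client_identifier is required")
        if self.volume < 0:
            raise ValueError("volume must be non-negative")


fact = RawFact(
    client_identifier="tweetypie.prod.tweetypie",
    offering_measure="core-days",
    volume=128.0,  # made-up volume
    timestamp=datetime(2017, 10, 1, tzinfo=timezone.utc),
    metadata={"infrastructure": "Aurora"},
)
```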

Transformer
1. Resolve Ownership
2. Cost Computation
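The two transformer steps can be sketched as a join against the ownership mapping followed by a multiplication by the unit price. This is an assumed shape, not the actual ETL job: the type names, the dictionary-based lookups, and the sample price are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class RawFact(NamedTuple):
    client_identifier: str
    offering_measure: str
    volume: float


class ResolvedFact(NamedTuple):
    client_identifier: str
    offering_measure: str
    volume: float
    owner: Optional[str]  # None when no owner is on record
    cost: float


def transform(facts: List[RawFact],
              ownership: Dict[str, str],
              unit_prices: Dict[str, float]) -> List[ResolvedFact]:
    resolved = []
    for f in facts:
        owner = ownership.get(f.client_identifier)         # 1. resolve ownership
        cost = f.volume * unit_prices[f.offering_measure]  # 2. compute cost
        resolved.append(ResolvedFact(*f, owner=owner, cost=cost))
    return resolved


raw = [
    RawFact("tweetypie.prod.tweetypie", "core-days", 100.0),
    RawFact("unknown.prod.job", "core-days", 4.0),  # hypothetical unowned job
]
out = transform(raw, {"tweetypie.prod.tweetypie": "TWEETYPIE"}, {"core-days": 0.5})
```

Facts with no known owner are kept with `owner=None` rather than dropped, so a later check can surface them.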

Data Fidelity & Reporting
1. Verify Data Integrity & Fidelity
2. Alert when things don't look the way they should
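The slides do not say which checks the fidelity stage runs. As a hedged illustration, two checks that fit the description are volume conservation between raw and resolved facts, and detection of identifiers with no owner. The function and its tuple layout are assumptions.

```python
from typing import List, Optional, Tuple

# Each resolved fact is (client_identifier, volume, owner);
# owner is None when ownership could not be resolved.
Fact = Tuple[str, float, Optional[str]]


def fidelity_issues(raw_total: float, resolved: List[Fact],
                    tolerance: float = 1e-6) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    issues = []
    resolved_total = sum(volume for _, volume, _ in resolved)
    if abs(resolved_total - raw_total) > tolerance:
        issues.append(f"volume mismatch: raw={raw_total} resolved={resolved_total}")
    for client_id, _, owner in resolved:
        if owner is None:
            issues.append(f"no owner for {client_id}")
    return issues


resolved_facts = [("a", 10.0, "team-a"), ("b", 5.0, None)]
issues = fidelity_issues(15.0, resolved_facts)
```

A non-empty result is what would feed the alerting step.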


Chargeback @Twitter
Reports
Customers

Infrastructure & Platform Operators
• Overall Cluster Growth
• Allocation vs. Utilization of resources by Client/Tenant

Finance & Execs
• Budget vs. Spend per Org
• Infrastructure PnL
• Overall Efficiency & Trends

Service Owners & Developers
• Team Bill
• Per Service Allocation vs. Utilization of Resources

CHARGEBACK DRILLDOWN FOR A TEAM
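A per-team drilldown amounts to grouping that team's resolved costs by service and then by measure. The sketch below assumes a flat row layout of (team, service, measure, cost); the layout and the numbers are hypothetical, and only the team and service names are taken from the identity examples in the deck.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (team, service, measure, cost) rows; the column choice is an assumption.
Row = Tuple[str, str, str, float]


def team_bill(rows: Iterable[Row], team: str) -> Dict[str, Dict[str, float]]:
    """Group one team's costs by service, then by measure."""
    bill = defaultdict(lambda: defaultdict(float))
    for row_team, service, measure, cost in rows:
        if row_team == team:
            bill[service][measure] += cost
    # Convert to plain dicts so the result compares and prints cleanly.
    return {service: dict(by_measure) for service, by_measure in bill.items()}


rows = [
    ("TWEETYPIE", "tweetypie", "core-days", 50.0),
    ("TWEETYPIE", "tweetypie", "core-days", 25.0),
    ("TWEETYPIE", "tweetypie", "machine-days", 10.0),
    ("ADS PREDICTION", "ads-prediction", "core-days", 7.0),
]
bill = team_bill(rows, "TWEETYPIE")
```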


Chargeback @Twitter
Learnings

1. Invest in data fidelity
• Trust in data is most important.
• Invest in monitoring & alerting for data inconsistencies
• Leverage this for detecting abnormal increase/decrease and notify users

2. Accurate Ownership Mapping
• Static mappings go out of date quickly
• Invest in systems (e.g., Kite) for users to manage it themselves

3. Logical grouping of resources
• Identifiers were too granular and teams were too broad.
• Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it

4. Change History
• Unit prices change over time
• Orgs / Teams change over time
• Resources get added / removed
• Change history is essential for consistency, which capacity planning depends on
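The first learning mentions detecting abnormal increases or decreases. A simple way to illustrate the idea is a day-over-day comparison per client identifier. The 50% threshold and the function below are assumptions for illustration, not the detection rule used at Twitter.

```python
from typing import Dict, List


def abnormal_changes(previous: Dict[str, float], current: Dict[str, float],
                     threshold: float = 0.5) -> List[str]:
    """Return identifiers whose usage moved by more than `threshold` (a fraction)."""
    flagged = []
    for client_id in sorted(set(previous) | set(current)):
        before = previous.get(client_id, 0.0)
        after = current.get(client_id, 0.0)
        if before == 0.0:
            changed = after > 0.0  # usage appeared where there was none
        else:
            changed = abs(after - before) / before > threshold
        if changed:
            flagged.append(client_id)
    return flagged


yesterday = {"a": 100.0, "b": 100.0, "c": 10.0}
today = {"a": 110.0, "b": 30.0, "d": 5.0}
flagged = abnormal_changes(yesterday, today)
```

Here "a" moves by 10% and is ignored, "b" drops by 70%, "c" disappears, and "d" is new.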


[Diagram: Kite components]
DASHBOARD (SINGLE PANE OF GLASS)
SERVICE IDENTITY MANAGER
RESOURCE PROVISIONING MANAGER
REPORTING
SERVICE LIFECYCLE WORKFLOWS
METADATA
CLIENT IDENTITY
RESOURCE QUOTA MANAGEMENT
METERING & CHARGEBACK
PROVIDER APIS & ADAPTERS
INFRASTRUCTURE & PLATFORM SERVICE

Kite @Twitter
10,000+ Client Identifiers
1,000+ Projects
100+ Teams
8 Infrastructure Services

Kite @Twitter
Client Identifier Management
Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership
• Capture Org Structure: Support org structure changes, project transfer workflows to ensure up-to-date ownership of identifiers
• Unify client identifier provisioning workflow: Enables single source of truth and reduces operator pain around provisioning and managing client identifiers.

IDENTITY ENTITY MODEL
BUSINESS OWNER (1:N) TEAM (1:N) PROJECT (1:N) SERVICE/SYSTEM ACCOUNT (1:N) <INFRA, CLIENTID>

Example 1
BUSINESS OWNER: INFRASTRUCTURE
TEAM: TWEETYPIE
PROJECT: tweetypie
SERVICE/SYSTEM ACCOUNT: tweetypie
<INFRA, CLIENTID>: <Aurora, tweetypie.prod.tweetypie>

Example 2
BUSINESS OWNER: REVENUE
TEAM: ADS PREDICTION
PROJECT: prediction
SERVICE/SYSTEM ACCOUNT: ads-prediction
<INFRA, CLIENTID>: <Aurora, ads-prediction.prod.campaign-x>

Entities are time varying dimensions
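Treating entities as time-varying dimensions means a bill for a past day should use the ownership that was in effect on that day. A common way to model this is validity intervals on each record. The sketch below assumes that approach; the field names, the dates, and the team names "team-a" and "team-b" are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class OwnershipRecord:
    client_identifier: str
    project: str
    team: str
    valid_from: date          # inclusive
    valid_to: Optional[date]  # exclusive; None means still current


def owner_on(records: List[OwnershipRecord], client_identifier: str,
             day: date) -> Optional[OwnershipRecord]:
    """Return the record in effect for the identifier on the given day."""
    for r in records:
        if (r.client_identifier == client_identifier
                and r.valid_from <= day
                and (r.valid_to is None or day < r.valid_to)):
            return r
    return None


# Hypothetical project transfer from team-a to team-b on 2017-07-01.
history = [
    OwnershipRecord("tweetypie.prod.tweetypie", "tweetypie", "team-a",
                    date(2017, 1, 1), date(2017, 7, 1)),
    OwnershipRecord("tweetypie.prod.tweetypie", "tweetypie", "team-b",
                    date(2017, 7, 1), None),
]
before = owner_on(history, "tweetypie.prod.tweetypie", date(2017, 6, 30))
after = owner_on(history, "tweetypie.prod.tweetypie", date(2017, 7, 1))
```

Closing the old record instead of overwriting it keeps historical reports reproducible after a transfer.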

Future Work

1. Capacity Planning
• Provide historic trends and help with forecast of capacity

2. Extend Quota Manager
• Onboard Hadoop, Storage and other systems

3. Enable project deprecation
• Detect unused resources, notify users, trigger deprecation process based on policy

