3. “To provide a conceptual framework
for designing a dispatch engine that
reacts to a request by gathering
various inputs, dispatching requests
with the inputs to some pricing
engines, then reassembling the results
into a form the original requestor can
comprehend.” – Andrei
16. Storm: tuning guidelines
• 1 worker per node per topology
• 1 executor per core for CPU-bound tasks
• 1-10 executors per core for IO-bound tasks
• Compute the total parallelism possible and distribute it amongst slow and fast tasks: high parallelism for slow tasks, low for fast tasks (see the configuration sketch below).
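As a rough illustration, here is how these guidelines might translate into topology configuration with the Storm 0.9-era API. The cluster size and the placeholder spout/bolt classes are assumptions made for the sketch, not our actual components:

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Tuple;

public class TuningSketch {
    // Do-nothing placeholders standing in for the real components.
    public static class LogSpout extends BaseRichSpout {
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector out) {}
        public void nextTuple() {}
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }
    public static class PassBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector out) {}
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }

    public static void main(String[] args) {
        int nodes = 4, coresPerNode = 8;       // assumed cluster size
        int totalCores = nodes * coresPerNode; // total parallelism budget

        TopologyBuilder builder = new TopologyBuilder();
        // IO-bound source: 1-10 executors per core is reasonable; use 2 here.
        builder.setSpout("log-spout", new LogSpout(), 2 * totalCores);
        // Fast CPU-bound step: keep its parallelism low.
        builder.setBolt("parse", new PassBolt(), totalCores / 4)
               .shuffleGrouping("log-spout");
        // Slow step: give it the bulk of the budget, ~1 executor per core.
        builder.setBolt("encode", new PassBolt(), totalCores)
               .shuffleGrouping("parse");

        Config conf = new Config();
        conf.setNumWorkers(nodes); // 1 worker per node per topology
        // StormSubmitter.submitTopology("logging", conf, builder.createTopology());
    }
}
```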
24. Stuff I learned the hard way
• Debugging is difficult (dbxtool > ddd)
• Always check the version numbers of open-source libraries
• Strike the right balance between planning and doing
• Use BCPC if you want to test things
• BASO is great
• Reading a book might be better than googling
27. Problem Statement
A customer is shouting at me!
How do I find what happened quickly?
How do I prevent it next time?
How can I anticipate entirely new problems?
28. Use Cases
(needed today)
• Debugging
– Goal: Investigate complaints by looking at the inputs that went into a specific request.
– What needs to be fixed: We do not log everything, so a lot of time is wasted trying to reproduce customer problems instead of having the data already there.
– Motivation: We once spent a week tracking down reproduction data because the logging subsystem cannot handle full selective BAEL logging in production.
29. Use Cases
(planning for tomorrow)
• Automated Request Audit
– Goal: Know the exact inputs, the path a request took through the input system, and the outputs provided (all based on the logs received).
– What needs to be fixed: We have no way to analyze the requests we receive except manually, one at a time. We cannot go back in time to perform hypothesis testing and automatic auditing of requests according to rules.
– Motivation: Recent malformed requests caused one of our daemons to throw an exception and crash because the number of scenarios did not match the number of dates in the input. It is not possible to see how many malformed requests we received in the past, or to detect this condition in production, without deploying new code into the live system.
30. Use Cases
(planning for tomorrow)
• Aggregation of end-to-end trends
– Goal: Anomaly (spike/dip) detection (define a window and build a historical distribution for the data).
– What needs to be fixed: We need to establish expected SLAs for each kind of request received, based on input sizes and estimates of downstream system performance.
– Motivation: The MARS team received a complaint about processing being too slow. We had no baseline, so we had to use trial and error to determine what could be pushed through the system. A lot of guesswork.
• Operational analysis of the dependent systems
– Goal: Capacity planning and performance optimization.
– What needs to be fixed: Problem detection by analyzing deviation from historical trends for processing rates, error rates, and response times.
– Motivation: When the downstream mortgage services started throwing errors, it took a lot of manual reproduction attempts to figure out why.
33. Definitions
• A log is some arbitrary sequence of events, ordered in time, representing state that we want to preserve for later retrieval.
• An event is a tuple (sketched as a value object below) representing an occurrence of:
– Input system (system type + specific instance)
– Event time (start and end time)
– Event ID and parent event ID (to establish causation)
– Location (OS and Bloomberg process/task/machine name)
– Privilege information (UUID)
– Event data – can be an arbitrary object (the input system provides direction on how to interpret the event data)
• Conceptually, the events are stored as a directed acyclic graph with a start node, where each node represents an event (see the MTT tool as an example).
• Input systems
– Other systems that provide the event stream
– Two main input system types:
• BAEL entries
• BAS requests
– The only currently targeted input instance is MARS
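To make the tuple concrete, here is one way the event could be modeled as a plain Java value object. This is an illustration only; the field names and types are assumptions, not our actual schema:

```java
// Illustrative value object for the event tuple defined above.
public class LogEvent {
    String systemType;      // input system type, e.g. BAS or BAEL
    String systemInstance;  // specific instance, e.g. MARS
    long startTimeMillis;   // event time: start
    long endTimeMillis;     // event time: end
    String eventId;         // this event's own ID
    String parentEventId;   // parent event ID, to establish causation (null at the root)
    String location;        // OS and Bloomberg process/task/machine name
    int uuid;               // privilege information
    byte[] data;            // arbitrary event data; the input system says how to interpret it
}
```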
35. Event feed – take responsibility for logging events
• MARS daemons – Send the actual log events to xaplog instances.
• xaplog instances – Receive log events and forward them to the Kafka instance.
• Kafka
– Middleware to queue messages; it is scalable and durable.
– Once Kafka accepts an event, the associated xaplog instance is freed of any further obligations (see the forwarding sketch below).
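A minimal sketch of the xaplog-to-Kafka handoff, assuming the Kafka 0.8-era synchronous producer API. The class name, broker-list handling, and topic argument are hypothetical:

```java
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class XaplogForwarder {
    private final Producer<byte[], byte[]> producer;

    public XaplogForwarder(String brokerList) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("serializer.class", "kafka.serializer.DefaultEncoder"); // raw byte[] payloads
        props.put("request.required.acks", "1"); // wait for the broker to accept the event
        producer = new Producer<byte[], byte[]>(new ProducerConfig(props));
    }

    // Keying by root request ID keeps a request chain in one partition.
    // Once send() returns without throwing, Kafka has accepted the event and
    // this xaplog instance is freed of any further obligations.
    public void forward(String topic, byte[] rootRequestId, byte[] serializedEvent) {
        producer.send(new KeyedMessage<byte[], byte[]>(topic, rootRequestId, serializedEvent));
    }
}
```

Requiring at least one broker acknowledgement is what gives the handoff its ownership semantics: xaplog only lets go of an event after Kafka has accepted it.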
36. Ingestion – group related events together
• Kafka – Collects events into two main queues.
– First queue: BAS messages
– Second queue: BAEL messages
– Log events are persisted to disk.
– Serves as a shock absorber to handle bursts in log event traffic (since it just stores the messages, it doesn't have to process them), so the rest of the system can be designed to handle the average load case.
• Storm Ingestion Topology – Groups the event stream by root request (see the grouping sketch below).
• Partitioner – Holds grouped events together.
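A minimal sketch of the grouping step using Storm's fieldsGrouping; the spout and bolt classes are placeholder stubs rather than our real components:

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class IngestionSketch {
    // Placeholder for a spout reading one of the two Kafka queues and
    // emitting (rootRequestId, event) tuples.
    public static class KafkaEventSpout extends BaseRichSpout {
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector out) {}
        public void nextTuple() {}
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("rootRequestId", "event"));
        }
    }
    public static class PartitionerBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector out) {
            // accumulate events keyed by input.getStringByField("rootRequestId")
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("bas-spout", new KafkaEventSpout(), 4);
        builder.setSpout("bael-spout", new KafkaEventSpout(), 4);
        // fieldsGrouping routes every event with the same root request ID to
        // the same partitioner executor, so related events stay together.
        builder.setBolt("partitioner", new PartitionerBolt(), 8)
               .fieldsGrouping("bas-spout", new Fields("rootRequestId"))
               .fieldsGrouping("bael-spout", new Fields("rootRequestId"));
    }
}
```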
37. Encoding – efficiently code the event stream at the binary level
• Partitioner – Writes the same request chain under the same rows in HBase.
– The data is split into three main content types:
• BAS/BAEL headers
• BAS string data (XML)
• BAEL string data (trace information)
• Storm Encoding Topology – Writes each group of events as one BLOB, with special coding tailored to the data type (e.g. header data, XML, text).
• Log warehouse – Encoded blobs are written to different tables for longer-term archiving (see the write sketch below).
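A sketch of the warehouse write, assuming the HBase 0.94-era client API. The table name, column family, and qualifiers are hypothetical; the point is that one row holds the whole request chain, with one qualifier per content type:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WarehouseWriter {
    // One row per root request; separate qualifiers let each content type
    // carry encoding tailored to it (headers vs. XML vs. trace text).
    public void writeEncodedGroup(HTable warehouse, String rootRequestId,
                                  byte[] headersBlob, byte[] basXmlBlob,
                                  byte[] baelTextBlob) throws IOException {
        Put put = new Put(Bytes.toBytes(rootRequestId)); // same chain, same row
        put.add(Bytes.toBytes("d"), Bytes.toBytes("headers"), headersBlob);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("bas_xml"), basXmlBlob);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("bael_text"), baelTextBlob);
        warehouse.put(put);
    }
}
```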
38. Indexing – speed up access to relevant fields for interactive querying
• Log warehouse – By storing similar data together with specialized encoding, it can significantly reduce storage costs.
• Storm Indexing Topology – Extracts the relevant subset of data to feed the indexes.
• Indexes – The underlying implementation of the indexes. Basic ones can be stored in HBase; more complicated ones can be stored in ElasticSearch/Solr (see the index-row sketch below).
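A sketch of what one of the basic HBase-backed indexes could look like; the key layout and all names are assumptions. Leading the row key with the indexed field (here the UUID) followed by the event time means a prefix scan returns matching requests in time order:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UuidIndexWriter {
    public void index(HTable uuidIndex, int uuid, long eventTimeMillis,
                      byte[] rootRequestId) throws IOException {
        // Row key = indexed field + event time, so rows sort by UUID, then time.
        byte[] rowKey = Bytes.add(Bytes.toBytes(uuid), Bytes.toBytes(eventTimeMillis));
        Put put = new Put(rowKey);
        // The cell value is only a pointer back to the warehouse row.
        put.add(Bytes.toBytes("i"), Bytes.toBytes("req"), rootRequestId);
        uuidIndex.put(put);
    }
}
```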
39. Querying – let users look up the event stream
• Indexes / log warehouse
– User queries would hit the indexes first.
– If additional data is needed and is not available in an index, the query would need to access the warehouse (see the lookup sketch below).
• xapqrylg – New daemons to marshal requests from the UIs.
• MTT UIs – Would be unchanged. More improvements can be added later.
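A sketch of the two-step lookup xapqrylg might perform, under the same hypothetical schema as the index sketch above: consult the index first, and touch the warehouse only when a match is found:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class QuerySketch {
    public Result lookupByUuid(HTable uuidIndex, HTable warehouse, int uuid)
            throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes(uuid));    // prefix scan over uuid+time keys
        scan.setStopRow(Bytes.toBytes(uuid + 1)); // (ignoring the MAX_VALUE edge case)
        ResultScanner hits = uuidIndex.getScanner(scan);
        try {
            Result first = hits.next();
            if (first == null) {
                return null; // nothing indexed for this UUID
            }
            byte[] rootRequestId = first.getValue(Bytes.toBytes("i"), Bytes.toBytes("req"));
            // Fetch the full event data from the warehouse only if needed.
            return warehouse.get(new Get(rootRequestId));
        } finally {
            hits.close();
        }
    }
}
```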
40. Phase I tasks
Replace MTT backend
• Code in xaplog to send events to Kafka queue
– Kafka & Storm will live on BCPC for the proof of concept; we need to see about production
– See if we can reuse what the pricing history team did.
• Maybe not; it should just be a simple push.
• Design Kafka queue layout (partitioning and topics)
– Two topics: BAS and BAEL
• Maybe three later (BAS lite, BAS XML, BAEL) to decouple the ingestion rates if better latency is needed?
– Look at the best settings and make sure DRQS 54369477 doesn’t apply
• Storm Ingestion topology & HBase schema (in Java)
– Write each header-data row separately and let the encoding aggregate them.
– Blobs do not need any ingestion right now, they can be written to target table directly.
• Storm Encoding topology & HBase schema (in Java)
– Keeping it simple for now. Split up XML blobs from rest of data.
– Store all non-blob data grouped by root request id (protobuf??)
– For blob data, do some basic XML-to-binary coding, and order responses and requests together as part of the key.
– How to ensure that log data fed in more than once only gets written once? (see the sketch after this list)
• Storm Indexing topology & HBase schema (in Java)
– A few simple indexes will live in HBase to allow query by UUID, date range, pricing #, and security.
– How to keep indexes synchronized with the warehouse tables?
• xapqrylg – read the HBase indexes and storage tables
– Reuse Kirill’s work on mttweb where it makes sense.
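On the write-once question above, one plausible answer is to make the writes idempotent: derive both the row key and the column qualifier deterministically from the data itself, so a replay rewrites the same cell instead of adding a duplicate. A sketch under the same hypothetical HBase schema as before:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IdempotentWrite {
    public void write(HTable warehouse, String rootRequestId, String eventId,
                      byte[] encodedEvent) throws IOException {
        // Key and qualifier are functions of the event, not of arrival order.
        byte[] qualifier = Bytes.toBytes("ev:" + eventId);
        Put put = new Put(Bytes.toBytes(rootRequestId));
        put.add(Bytes.toBytes("d"), qualifier, encodedEvent);
        warehouse.put(put); // replays overwrite; they never duplicate
    }
}
```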
43. Required Properties
1. Ownership – Accepts logging data and takes responsibility, so that input systems are freed from offering any guarantees after handoff (logging is not the main task of input systems, just a side effect).
– Makes it easy to generate IDs to link events in a tree.
– Two main causal link models can be considered (explicit is preferred; see the reconstruction sketch after this list):
• Explicit: each event carries a parent event ID as well as its own event ID.
• Implicit: events carry a root request ID and are then ordered by event time and ingestion order.
2. Durability – Reduce the chances of data loss, especially in the event of crashes.
3. Idempotence – Correctly handles the same input log data being sent into the system more than once:
– Due to failures, input systems might send the same data twice; client-side problems then become easy to handle: just send the data again.
– To support batch input of data from other sources ("bulk import"), to stand up another instance of the system or migrate from other systems in a consistent fashion.
– Replaying existing log data simplifies re-indexing and related side effects.
4. Time-invariance – Does not expect the event stream to be time-ordered (even though it usually will be). The output of the system might differ in between, but once the exact same overall data has been fed to the system, the outputs should be the same.
5. Avoiding lock-in – Allows easy export of data in bulk into a neutral form:
– for exporting into other systems or into another instance;
– we don't want the data to be stranded.
6. Scalable – As close to linear as possible, to improve performance by just adding more machines.
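A minimal sketch of the preferred explicit causal-link model, reusing the illustrative LogEvent class from the definitions slide: given events carrying (eventId, parentEventId), the request DAG is rebuilt by grouping children under their parent:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CausalLinks {
    public static Map<String, List<LogEvent>> childrenByParent(List<LogEvent> events) {
        Map<String, List<LogEvent>> children = new HashMap<String, List<LogEvent>>();
        for (LogEvent e : events) {
            if (e.parentEventId == null) {
                continue; // the root of the request has no parent
            }
            List<LogEvent> siblings = children.get(e.parentEventId);
            if (siblings == null) {
                siblings = new ArrayList<LogEvent>();
                children.put(e.parentEventId, siblings);
            }
            siblings.add(e);
        }
        return children;
    }
}
```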
44. Required Properties (cont'd)
7. High availability – Have some form of redundancy, so that if machines in the system fail the system can still operate, perhaps in a degraded state (performance-wise).
8. Manageable – Export metrics to support decisions on the operation of the system.
9. Schema-agnostic – Is as schema-less as possible:
– requires knowledge only of the fields it needs to index on;
– otherwise shouldn't care about the data being in a specific format;
– the input format should be akin to a nested JSON object, but with a parent ID to correlate to a parent, and then ordered by time.
10. Space-efficient – Ability to optimize binary storage to:
– reduce the disk space taken;
– improve read times;
– ...at the expense of increased complexity and CPU costs when writing the data.
45. Why Current Solutions Are Inadequate
• APDX (and TRACK – a functional subset of APDX)
– Collects only numerical metrics, with no ability to store arbitrary event data or causal relationships between events. It just counts events.
– It can be used in parallel, but does not come close to meeting our needs.
• Splunk
– Lightweight analysis done based on:
• {TEAM MOB2:SPLUNK TUTORIAL<GO>}
• http://rndx.prod.bloomberg.com/questions/9584/how-should-we-do-distributed-logging
– Main points that discourage further research:
• Splunk expects log lines only, with no arbitrary data.
– Hard to save space
• Cost is per (uncompressed) log volume – we expect to easily exceed 100 GiB of raw logging volume a day (supposedly that would be a one-time cost of $110k).
• Better suited as a higher-level tool that we could maybe use on top.
Editor's Notes
Hello everyone, my name is Simon Suo. I am a co-op student from the University of Waterloo and I have been working with Andrei on some exciting stuff over the past four months.
This presentation is meant to showcase everything that I was told to do, what I actually did, and what I should have done.
So what exactly was I told to do? There’s no better way to present this than showing the exact quote from the project proposal document I received. So here it is, in its glorious entirety.
Upon arrival, I was told that there were more urgent matters to tend to before the more grandiose plan could be executed: I would be focusing on the DEIMOS project instead of the PHOBOS project. For us mere mortals who do not possess Andrei's extraordinary sense of humor: PHOBOS stands for "proving how our bottleneck opposes speed", and DEIMOS stands for "data-driven evolution in marking operational substitutions". In plain English, they refer to the dispatcher redesign project and the scalable logging project, respectively. For those who are not aware, Phobos and Deimos are the names of the two moons of Mars, so it is quite clever actually. Good job, Andrei.
Let's look at the high-level architecture of such a scalable logging system. There are three major components to this system: data ingestion and buffering, computation and indexing, and finally storage.
To achieve the performance and scalability we need, we explored many cool new technologies and evaluated their effectiveness.
List of Technologies I got to play with:
Apache Kafka
Apache Storm
Apache HBase
Apache Cassandra
ZeroMQ
Cap’n Proto
Google Protocol Buffers
Google FlatBuffers