3. “To provide a conceptual framework
for designing a dispatch engine that
reacts to a request by gathering
various inputs, dispatching requests
with the inputs to some pricing
engines, then reassembling the results
into a form the original requestor can
comprehend.” – Andrei
16. Storm: tuning guidelines
• 1 worker per node per topology
• 1 executor per core for CPU-bound tasks
• 1-10 executors per core for IO-bound tasks
• Compute the total parallelism possible and distribute it amongst slow and fast tasks: high parallelism for slow tasks, low for fast tasks (see the configuration sketch below).
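As a rough illustration, here is how these guidelines might translate into topology configuration with the Storm 0.9-era API. The cluster size and the placeholder spout/bolt classes are assumptions made for the sketch, not our actual components:

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Tuple;

public class TuningSketch {
    // Do-nothing placeholders standing in for the real components.
    public static class LogSpout extends BaseRichSpout {
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector out) {}
        public void nextTuple() {}
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }
    public static class PassBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector out) {}
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }

    public static void main(String[] args) {
        int nodes = 4, coresPerNode = 8;       // assumed cluster size
        int totalCores = nodes * coresPerNode; // total parallelism budget

        TopologyBuilder builder = new TopologyBuilder();
        // IO-bound source: 1-10 executors per core is reasonable; use 2 here.
        builder.setSpout("log-spout", new LogSpout(), 2 * totalCores);
        // Fast CPU-bound step: keep its parallelism low.
        builder.setBolt("parse", new PassBolt(), totalCores / 4)
               .shuffleGrouping("log-spout");
        // Slow step: give it the bulk of the budget, ~1 executor per core.
        builder.setBolt("encode", new PassBolt(), totalCores)
               .shuffleGrouping("parse");

        Config conf = new Config();
        conf.setNumWorkers(nodes); // 1 worker per node per topology
        // StormSubmitter.submitTopology("logging", conf, builder.createTopology());
    }
}
```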
24. Stuff I learned the hard way
• Debugging is difficult (dbxtool > ddd)
• Always check the version numbers of open-source libraries
• Strike the right balance between planning and doing
• Use BCPC if you want to test things
• BASO is great
• Reading a book might be better than googling
27. Problem Statement
A customer is shouting at me!
How do I find what happened quickly?
How do I prevent it next time?
How can I anticipate entirely new problems?
28. Use Cases
(needed today)
• Debugging
– Goal: Investigate complaints by looking at the inputs that went into a specific request.
– What needs to be fixed: We do not log everything, so a lot of time is wasted trying to reproduce customer problems instead of having the data already there.
– Motivation: We once spent a week tracking down reproduction data because the logging subsystem cannot handle full selective BAEL logging in production.
29. Use Cases
(planning for tomorrow)
• Automated Request Audit
– Goal: Know the exact inputs, the path a request took through the input system, and the outputs provided (all based on the logs received).
– What needs to be fixed: We have no way to analyze the requests we receive except manually, one at a time. We cannot go back in time to perform hypothesis testing and automatic auditing of requests according to rules.
– Motivation: Recent malformed requests caused one of our daemons to throw an exception and crash because the number of scenarios did not match the number of dates in the input. It is not possible to see how many malformed requests we received in the past, or to detect this condition in production, without deploying new code into the live system.
30. Use Cases
(planning for tomorrow)
• Aggregation of end-to-end trends
– Goal: Anomaly (spike/dip) detection (define a window and build a historical distribution for the data).
– What needs to be fixed: We need to establish expected SLAs for each kind of request received, based on input sizes and estimates of downstream system performance.
– Motivation: The MARS team received a complaint about processing being too slow. We had no baseline, so we had to use trial and error to determine what could be pushed through the system. A lot of guesswork.
• Operational analysis of the dependent systems
– Goal: Capacity planning and performance optimization.
– What needs to be fixed: Problem detection by analyzing deviation from historical trends for processing rates, error rates, and response times.
– Motivation: When the downstream mortgage services started throwing errors, it took a lot of manual reproduction attempts to figure out why.
33. Definitions
• A log is some arbitrary sequence of events, ordered in time, representing state that we want to preserve for later retrieval.
• An event is a tuple (sketched as a value object below) representing an occurrence of:
– Input system (system type + specific instance)
– Event time (start and end time)
– Event ID and parent event ID (to establish causation)
– Location (OS and Bloomberg process/task/machine name)
– Privilege information (UUID)
– Event data – can be an arbitrary object (the input system provides direction on how to interpret the event data)
• Conceptually, the events are stored as a directed acyclic graph with a start node, where each node represents an event (see the MTT tool as an example).
• Input systems
– Other systems that provide the event stream
– Two main input system types:
• BAEL entries
• BAS requests
– The only currently targeted input instance is MARS
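To make the tuple concrete, here is one way the event could be modeled as a plain Java value object. This is an illustration only; the field names and types are assumptions, not our actual schema:

```java
// Illustrative value object for the event tuple defined above.
public class LogEvent {
    String systemType;      // input system type, e.g. BAS or BAEL
    String systemInstance;  // specific instance, e.g. MARS
    long startTimeMillis;   // event time: start
    long endTimeMillis;     // event time: end
    String eventId;         // this event's own ID
    String parentEventId;   // parent event ID, to establish causation (null at the root)
    String location;        // OS and Bloomberg process/task/machine name
    int uuid;               // privilege information
    byte[] data;            // arbitrary event data; the input system says how to interpret it
}
```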
35. Event feed – take responsibility for logging events
• MARS daemons – Send the actual log events to xaplog instances.
• xaplog instances – Receive log events and forward them to the Kafka instance.
• Kafka
– Middleware to queue messages; it is scalable and durable.
– Once Kafka accepts an event, the associated xaplog instance is freed of any further obligations (see the forwarding sketch below).
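A minimal sketch of the xaplog-to-Kafka handoff, assuming the Kafka 0.8-era synchronous producer API. The class name, broker-list handling, and topic argument are hypothetical:

```java
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class XaplogForwarder {
    private final Producer<byte[], byte[]> producer;

    public XaplogForwarder(String brokerList) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);
        props.put("serializer.class", "kafka.serializer.DefaultEncoder"); // raw byte[] payloads
        props.put("request.required.acks", "1"); // wait for the broker to accept the event
        producer = new Producer<byte[], byte[]>(new ProducerConfig(props));
    }

    // Keying by root request ID keeps a request chain in one partition.
    // Once send() returns without throwing, Kafka has accepted the event and
    // this xaplog instance is freed of any further obligations.
    public void forward(String topic, byte[] rootRequestId, byte[] serializedEvent) {
        producer.send(new KeyedMessage<byte[], byte[]>(topic, rootRequestId, serializedEvent));
    }
}
```

Requiring at least one broker acknowledgement is what gives the handoff its ownership semantics: xaplog only lets go of an event after Kafka has accepted it.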
36. Ingestion – group related events together
• Kafka – Collects events into two main queues.
– First queue: BAS messages
– Second queue: BAEL messages
– Log events are persisted to disk.
– Serves as a shock absorber to handle bursts in log event traffic (since it just stores the messages, it doesn't have to process them), so the rest of the system can be designed to handle the average load case.
• Storm Ingestion Topology – Groups the event stream by root request (see the grouping sketch below).
• Partitioner – Holds grouped events together.
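A minimal sketch of the grouping step using Storm's fieldsGrouping; the spout and bolt classes are placeholder stubs rather than our real components:

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class IngestionSketch {
    // Placeholder for a spout reading one of the two Kafka queues and
    // emitting (rootRequestId, event) tuples.
    public static class KafkaEventSpout extends BaseRichSpout {
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector out) {}
        public void nextTuple() {}
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("rootRequestId", "event"));
        }
    }
    public static class PartitionerBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector out) {
            // accumulate events keyed by input.getStringByField("rootRequestId")
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("bas-spout", new KafkaEventSpout(), 4);
        builder.setSpout("bael-spout", new KafkaEventSpout(), 4);
        // fieldsGrouping routes every event with the same root request ID to
        // the same partitioner executor, so related events stay together.
        builder.setBolt("partitioner", new PartitionerBolt(), 8)
               .fieldsGrouping("bas-spout", new Fields("rootRequestId"))
               .fieldsGrouping("bael-spout", new Fields("rootRequestId"));
    }
}
```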
37. Encoding – efficiently code the event stream at the binary level
• Partitioner – Writes the same request chain under the same rows in HBase.
– The data is split into three main content types:
• BAS/BAEL headers
• BAS string data (XML)
• BAEL string data (trace information)
• Storm Encoding Topology – Writes each group of events as one BLOB, with special coding tailored to the data type (e.g. header data, XML, text).
• Log warehouse – Encoded blobs are written to different tables for longer-term archiving (see the write sketch below).
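A sketch of the warehouse write, assuming the HBase 0.94-era client API. The table name, column family, and qualifiers are hypothetical; the point is that one row holds the whole request chain, with one qualifier per content type:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WarehouseWriter {
    // One row per root request; separate qualifiers let each content type
    // carry encoding tailored to it (headers vs. XML vs. trace text).
    public void writeEncodedGroup(HTable warehouse, String rootRequestId,
                                  byte[] headersBlob, byte[] basXmlBlob,
                                  byte[] baelTextBlob) throws IOException {
        Put put = new Put(Bytes.toBytes(rootRequestId)); // same chain, same row
        put.add(Bytes.toBytes("d"), Bytes.toBytes("headers"), headersBlob);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("bas_xml"), basXmlBlob);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("bael_text"), baelTextBlob);
        warehouse.put(put);
    }
}
```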
38. Indexing – speed up access to relevant fields for interactive querying
• Log warehouse – By storing similar data together with specialized encoding, it can significantly reduce storage costs.
• Storm Indexing Topology – Extracts the relevant subset of data to feed the indexes.
• Indexes – The underlying implementation of the indexes. Basic ones can be stored in HBase; more complicated ones can be stored in ElasticSearch/Solr (see the index-row sketch below).
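A sketch of what one of the basic HBase-backed indexes could look like; the key layout and all names are assumptions. Leading the row key with the indexed field (here the UUID) followed by the event time means a prefix scan returns matching requests in time order:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UuidIndexWriter {
    public void index(HTable uuidIndex, int uuid, long eventTimeMillis,
                      byte[] rootRequestId) throws IOException {
        // Row key = indexed field + event time, so rows sort by UUID, then time.
        byte[] rowKey = Bytes.add(Bytes.toBytes(uuid), Bytes.toBytes(eventTimeMillis));
        Put put = new Put(rowKey);
        // The cell value is only a pointer back to the warehouse row.
        put.add(Bytes.toBytes("i"), Bytes.toBytes("req"), rootRequestId);
        uuidIndex.put(put);
    }
}
```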
39. Querying – let users look up the event stream
• Indexes / log warehouse
– User queries would hit the indexes first.
– If additional data is needed and is not available in an index, the query would need to access the warehouse (see the lookup sketch below).
• xapqrylg – New daemons to marshal requests from the UIs.
• MTT UIs – Would be unchanged. More improvements can be added later.
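A sketch of the two-step lookup xapqrylg might perform, under the same hypothetical schema as the index sketch above: consult the index first, and touch the warehouse only when a match is found:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class QuerySketch {
    public Result lookupByUuid(HTable uuidIndex, HTable warehouse, int uuid)
            throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes(uuid));    // prefix scan over uuid+time keys
        scan.setStopRow(Bytes.toBytes(uuid + 1)); // (ignoring the MAX_VALUE edge case)
        ResultScanner hits = uuidIndex.getScanner(scan);
        try {
            Result first = hits.next();
            if (first == null) {
                return null; // nothing indexed for this UUID
            }
            byte[] rootRequestId = first.getValue(Bytes.toBytes("i"), Bytes.toBytes("req"));
            // Fetch the full event data from the warehouse only if needed.
            return warehouse.get(new Get(rootRequestId));
        } finally {
            hits.close();
        }
    }
}
```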
40. Phase I tasks
Replace MTT backend
• Code in xaplog to send events to Kafka queue
– Kafka & Storm will live on BCPC for the proof of concept; we need to see about production
– See if we can reuse what the pricing history team did.
• Maybe not; it should just be a simple push.
• Design Kafka queue layout (partitioning and topics)
– Two topics: BAS and BAEL
• Maybe three later (BAS lite, BAS XML, BAEL) to decouple the ingestion rates if better latency is needed?
– Look at the best settings and make sure DRQS 54369477 doesn’t apply
• Storm Ingestion topology & HBase schema (in Java)
– Write each header-data row separately and let the encoding aggregate them.
– Blobs do not need any ingestion right now, they can be written to target table directly.
• Storm Encoding topology & HBase schema (in Java)
– Keeping it simple for now. Split up XML blobs from rest of data.
– Store all non-blob data grouped by root request id (protobuf??)
– For blob data, do some basic XML-to-binary coding, and order responses and requests together as part of the key.
– How to ensure that log data fed in more than once only gets written once? (see the sketch after this list)
• Storm Indexing topology & HBase schema (in Java)
– A few simple indexes will live in HBase to allow query by UUID, date range, pricing #, and security.
– How to keep indexes synchronized with the warehouse tables?
• xapqrylg – read the HBase indexes and storage tables
– Reuse Kirill’s work on mttweb where it makes sense.
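On the write-once question above, one plausible answer is to make the writes idempotent: derive both the row key and the column qualifier deterministically from the data itself, so a replay rewrites the same cell instead of adding a duplicate. A sketch under the same hypothetical HBase schema as before:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IdempotentWrite {
    public void write(HTable warehouse, String rootRequestId, String eventId,
                      byte[] encodedEvent) throws IOException {
        // Key and qualifier are functions of the event, not of arrival order.
        byte[] qualifier = Bytes.toBytes("ev:" + eventId);
        Put put = new Put(Bytes.toBytes(rootRequestId));
        put.add(Bytes.toBytes("d"), qualifier, encodedEvent);
        warehouse.put(put); // replays overwrite; they never duplicate
    }
}
```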
43. Required Properties
1. Ownership – Accepts logging data and takes responsibility, so that input systems are freed from offering any guarantees after handoff (logging is not the main task of input systems, just a side effect).
– Makes it easy to generate IDs to link events in a tree.
– Two main causal link models can be considered (explicit is preferred; see the reconstruction sketch after this list):
• Explicit: each event carries a parent event ID as well as its own event ID.
• Implicit: events carry a root request ID and are then ordered by event time and ingestion order.
2. Durability – Reduce the chances of data loss, especially in the event of crashes.
3. Idempotence – Correctly handles the same input log data being sent into the system more than once:
– Due to failures, input systems might send the same data twice; client-side problems then become easy to handle: just send the data again.
– To support batch input of data from other sources ("bulk import"), to stand up another instance of the system or migrate from other systems in a consistent fashion.
– Replaying existing log data simplifies re-indexing and related side effects.
4. Time-invariance – Does not expect the event stream to be time-ordered (even though it usually will be). The output of the system might differ in between, but once the exact same overall data has been fed to the system, the outputs should be the same.
5. Avoiding lock-in – Allows easy export of data in bulk into a neutral form:
– for exporting into other systems or into another instance;
– we don't want the data to be stranded.
6. Scalable – As close to linear as possible, to improve performance by just adding more machines.
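A minimal sketch of the preferred explicit causal-link model, reusing the illustrative LogEvent class from the definitions slide: given events carrying (eventId, parentEventId), the request DAG is rebuilt by grouping children under their parent:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CausalLinks {
    public static Map<String, List<LogEvent>> childrenByParent(List<LogEvent> events) {
        Map<String, List<LogEvent>> children = new HashMap<String, List<LogEvent>>();
        for (LogEvent e : events) {
            if (e.parentEventId == null) {
                continue; // the root of the request has no parent
            }
            List<LogEvent> siblings = children.get(e.parentEventId);
            if (siblings == null) {
                siblings = new ArrayList<LogEvent>();
                children.put(e.parentEventId, siblings);
            }
            siblings.add(e);
        }
        return children;
    }
}
```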
44. Required Properties (cont'd)
7. High availability – Have some form of redundancy, so that if machines in the system fail the system can still operate, perhaps in a degraded state (performance-wise).
8. Manageable – Export metrics to support decisions on the operation of the system.
9. Schema-agnostic – Is as schema-less as possible:
– requires knowledge only of the fields it needs to index on;
– otherwise shouldn't care about the data being in a specific format;
– the input format should be akin to a nested JSON object, but with a parent ID to correlate to a parent, and then ordered by time.
10. Space-efficient – Ability to optimize binary storage to:
– reduce the disk space taken;
– improve read times;
– ...at the expense of increased complexity and CPU costs when writing the data.
45. Why Current Solutions Are Inadequate
• APDX (and TRACK – a functional subset of APDX)
– Collects only numerical metrics, with no ability to store arbitrary event data or causal relationships between events. It just counts events.
– It can be used in parallel, but does not come close to meeting our needs.
• Splunk
– Lightweight analysis done based on:
• {TEAM MOB2:SPLUNK TUTORIAL<GO>}
• http://rndx.prod.bloomberg.com/questions/9584/how-should-we-do-distributed-logging
– Main points that discourage further research:
• Splunk expects log lines only, with no arbitrary data.
– Hard to save space
• Cost is per (uncompressed) log volume – we expect to easily exceed 100 GiB of raw logging volume a day (supposedly that would be a one-time cost of $110k).
• Better suited as a higher-level tool that we could maybe use on top.
Editor's Notes
Hello everyone, my name is Simon Suo. I am a co-op student from the University of Waterloo and I have been working with Andrei on some exciting stuff over the past four months.
This presentation is meant to showcase everything that I was told to do, what I actually did, and what I should have done.
So what exactly was I told to do? There’s no better way to present this than showing the exact quote from the project proposal document I received. So here it is, in its glorious entirety.
Upon arrival, I was told that there were more urgent matters to tend to before the more grandiose plan could be executed: I would be focusing on the DEIMOS project instead of the PHOBOS project. For us mere mortals who do not possess Andrei's extraordinary sense of humor: PHOBOS stands for "proving how our bottleneck opposes speed", and DEIMOS stands for "data-driven evolution in marking operational substitutions". In plain English, they refer to the dispatcher redesign project and the scalable logging project, respectively. For those who are not aware, Phobos and Deimos are the names of the two moons of Mars, so it is quite clever actually. Good job, Andrei.
Let's look at the high-level architecture of such a scalable logging system. There are three major components to this system: data ingestion and buffering, computation and indexing, and finally storage.
To achieve the performance and scalability we need, we explored many cool new technologies and evaluated their effectiveness.
List of Technologies I got to play with:
Apache Kafka
Apache Storm
Apache HBase
Apache Cassandra
ZeroMQ
Cap’n Proto
Google Protocol Buffers
Google FlatBuffers