Stock Market Order Flow Reconstruction Using HBase on AWS
Aaron Carreras, HBaseCon – San Francisco, May 2015
About Presenter
•  Director of Enterprise Data Platforms at FINRA
•  Data Ingestion, Processing and Management
WHAT DO WE DO?
• Collect and Create
•  33B events/day
•  18 national exchanges
•  Equities, Options and Fixed Income
•  Reconstruct the market from trillions of events spanning years
• Detect & Investigate
•  Identify market manipulations, insider trading, fraud and compliance violations
• Enforce & Discipline
•  Ensure rule compliance 
•  Fine and bar broker dealers
•  Refer matters to the SEC and other authorities
[Diagram: order flow participants – Firm, Exchange, TRF, Dark Pool]
What a stock trade looks like to the investor
Example of what is actually happening
Ingest/Access Patterns
Configurations/Approaches in Common
Logical Architecture
CDH 4.5; HBase 0.94.6; EC2 hs1.8xlarge – 16 vCPU, 117 GiB RAM, 24 drives x 2,000 GB
Row Key Design & Pre-splitting
•  Salt our row keys (see the sketch below)
o  Our “natural” keys are monotonically increasing
o  Row Key = salt(PK) + PK
•  Pre-split
o  Better control of the distribution of data across regions
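A minimal sketch of this scheme, assuming a one-byte hash-based salt and 16 buckets; the bucket count, table name, and column family are illustrative, not the actual production values:

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SaltedKeyExample {
  private static final int BUCKETS = 16;  // illustrative bucket count

  // Row Key = salt(PK) + PK: a deterministic salt spreads monotonically
  // increasing natural keys across all regions instead of hot-spotting the last one.
  static byte[] saltedRowKey(byte[] naturalKey) {
    byte salt = (byte) ((Arrays.hashCode(naturalKey) & 0x7fffffff) % BUCKETS);
    byte[] rowKey = new byte[naturalKey.length + 1];
    rowKey[0] = salt;
    System.arraycopy(naturalKey, 0, rowKey, 1, naturalKey.length);
    return rowKey;
  }

  // Pre-split points: one region per salt bucket.
  static byte[][] splitKeys() {
    byte[][] splits = new byte[BUCKETS - 1][];
    for (int i = 1; i < BUCKETS; i++) {
      splits[i - 1] = new byte[] { (byte) i };
    }
    return splits;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);                  // 0.94-era client API
    HTableDescriptor desc = new HTableDescriptor("events");   // hypothetical table name
    desc.addFamily(new HColumnDescriptor("d"));
    admin.createTable(desc, splitKeys());                     // pre-split at bucket boundaries
    admin.close();
  }
}
```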
Compactions & Splitting Configurations

Parameter | Default | Override
hbase.hregion.majorcompaction | 7 days | 0 (disabled)
hbase.hstore.compactionThreshold | 3 | 10
hbase.hstore.compaction.max | 10 | 15
hbase.hregion.max.filesize | 10 GB | 200 GB
RegionSplitPolicy | IncreasingToUpperBoundRegionSplitPolicy | ConstantSizeRegionSplitPolicy
hbase.hstore.useExploringCompation | false | true
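As a rough illustration only (these settings normally live in hbase-site.xml on the region servers, not in client code), the overrides above map to configuration keys roughly like this; the split-policy property name is my assumption of the standard HBase key:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionSplitOverrides {
  public static Configuration overrides() {
    Configuration conf = HBaseConfiguration.create();
    // Disable time-based major compactions; trigger them explicitly instead.
    conf.setLong("hbase.hregion.majorcompaction", 0);
    // Tolerate more store files before minor compactions kick in, and
    // compact more files per pass when they do.
    conf.setInt("hbase.hstore.compactionThreshold", 10);
    conf.setInt("hbase.hstore.compaction.max", 15);
    // A very large max region size plus the constant-size split policy
    // effectively stops automatic splits once the table is pre-split.
    conf.setLong("hbase.hregion.max.filesize", 200L * 1024 * 1024 * 1024);
    conf.set("hbase.regionserver.region.split.policy",
        "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
    // Key name as shown on the slide (0.94 backport of the exploring compaction policy).
    conf.setBoolean("hbase.hstore.useExploringCompation", true);
    return conf;
  }
}
```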
OS Configuration Considerations
§  Some of these may not be relevant to you depending on your OS/version, but they are worth confirming

Parameter | Setting
redhat_transparent_hugepage/defrag | never
nofile/nproc ulimit | 32768
tcp_low_latency | 1 (enabled)
vm.swappiness | 0 (disabled)
selinux | disabled
IPv6 | no (disabled)
iptables | off/stop
Other Hadoop Configuration Considerations

Where | Parameter | Setting
core-site.xml | ipc.client.tcpnodelay | true
core-site.xml | ipc.server.tcpnodelay | true
hdfs-site.xml | dfs.client.read.shortcircuit | true
hdfs-site.xml | fs.s3a.buffer.dir | [machine specific]
hbase-site.xml | hbase.snapshot.master.timeoutMillis | 1800000
hbase-site.xml | hbase.snapshot.master.timeout.millis | 1800000
hbase-site.xml | hbase.master.cleaner.interval | 600000 (ms)
Use Case ‘A’: Patterns
Use Case ‘A’: Background
•  Create graphs for historical market event data (trillion records)
•  Basically a batch process
o  Each batch had ~4 billion events
o  Related events may span batches (e.g., the root could arrive later, children may be corrected, etc.)
•  Back process the prior 18 months (540 batches)
•  Complete the project given the … and …
Use Case ‘A’: Utilize Bulk Loads
•  Back processing and the ongoing update process are 100% bulk HFile load (see the sketch below)
•  Our column families and processing aligned with this approach by splitting the linkage and content into separate column families
•  Eliminates Puts completely, along with the WAL writes, memstore flushes, and additional compactions that often accompany them
[Diagram: HFile bulk load flow]
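A hedged sketch of that flow against the 0.94-era bulk-load API (table name, job name, and paths are made up): a MapReduce job writes region-partitioned, sorted HFiles, and LoadIncrementalHFiles then hands them to the region servers atomically, bypassing the write path entirely.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Phase 1: configure a MapReduce job (mapper omitted here) whose output is a
  // set of HFiles under hfileDir, sorted and partitioned to match the table's regions.
  static Job configureHFileJob(Configuration conf, HTable table, Path hfileDir) throws Exception {
    Job job = Job.getInstance(conf, "event-hfile-generation");  // hypothetical job name
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    HFileOutputFormat.configureIncrementalLoad(job, table);     // sets reducer, partitioner, output format
    FileOutputFormat.setOutputPath(job, hfileDir);
    return job;
  }

  // Phase 2: move the finished HFiles into the table. No Puts, so no WAL
  // writes, no memstore flushes, and none of the compactions they trigger.
  static void bulkLoad(Configuration conf, HTable table, Path hfileDir) throws Exception {
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
  }
}
```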
Use Case ‘A’: Optimize Gets
•  Used sorted / partitioned batched Gets (see the sketch below)
o  Minimize required RPC calls
o  Sorting lets us make better use of the block cache
•  Allocate more on-heap memory for reads
Parameter | Default | Override
hfile.block.cache.size | .4 | .65
hbase.regionserver.global.memstore.upperLimit | .4 | .15
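A minimal sketch of the batched-Get pattern (partitioning by region is simplified here to fixed-size batches, and the batch size is an illustrative choice):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedGetSketch {

  // Fetch many rows with few RPCs: sort keys so adjacent Gets hit the same
  // regions and cached blocks, then issue them in bounded multi-get batches.
  static List<Result> fetch(HTable table, List<byte[]> rowKeys, int batchSize) throws Exception {
    Collections.sort(rowKeys, Bytes.BYTES_COMPARATOR);  // HBase byte order
    List<Result> results = new ArrayList<Result>();
    for (int i = 0; i < rowKeys.size(); i += batchSize) {
      List<Get> batch = new ArrayList<Get>();
      for (byte[] key : rowKeys.subList(i, Math.min(i + batchSize, rowKeys.size()))) {
        batch.add(new Get(key));
      }
      Collections.addAll(results, table.get(batch));    // one multi-get round trip per batch
    }
    return results;
  }
}
```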
Use Case ‘B’: Patterns
Use Case ‘B’: Background
•  Not a once-a-day batch process; it must process the data as it arrives
o  200+ business rules covering data validation, creating/breaking linkages, and identifying compliance issues within SLA
o  Progressively build the tree
•  The different processes required different access paths, sometimes requiring multiple copies of some portions of the data
Use Case ‘B’: Put Strategy
•  HFiles for the incremental processing didn’t fit as well here
•  Partitioned batch Puts (see the sketch below)
•  memstore vs. block cache split 50/50
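A sketch of the partitioned batch-Put approach, assuming the salted keys from earlier; the table name, column family, buffer size, and batch contents are all illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "daily_events");  // hypothetical daily table
    table.setAutoFlush(false);                         // buffer writes client-side
    table.setWriteBufferSize(8L * 1024 * 1024);        // illustrative 8 MB write buffer

    // One partition's worth of arriving events, written as a single batch of Puts.
    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 10000; i++) {
      byte[] rowKey = Bytes.toBytes(String.format("%02d-%010d", i % 16, i)); // salt + natural key
      Put put = new Put(rowKey);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("event-" + i));
      batch.add(put);
    }
    table.put(batch);
    table.flushCommits();  // push any remaining buffered mutations
    table.close();
  }
}
```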
Use Case ‘B’: Scan
•  Scan
o  Distinct Daily tables along with a single Historical table to more naturally support the processing
o  Scan Daily tables only
o  Switched from Get to Scan for rows with millions of columns (see the sketch below)
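A sketch of why Scan helps for very wide rows: Scan.setBatch caps the number of columns returned per Result, so a row with millions of columns streams back in chunks instead of being materialized by a single Get. Table name, key range, and sizes are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "daily_20150501");  // hypothetical daily table

    Scan scan = new Scan(Bytes.toBytes(args[0]), Bytes.toBytes(args[1]));  // start/stop row
    scan.setCaching(100);  // rows (or row chunks) fetched per RPC
    scan.setBatch(1000);   // at most 1000 columns per Result, so wide rows come back in pieces

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result chunk : scanner) {
        // process up to 1000 cells of the current row per iteration
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}
```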
Backup and DR
HBase Backup to S3
•  HBase ExportSnapshot to S3 didn’t really support our use case
•  Significant updates to ExportSnapshot for S3 (the overall backup flow is sketched below)
o  Support for S3A (HADOOP-10400)
o  Remove the expensive rename operation on S3 (HBASE-11119)
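Roughly, the backup path looks like this (a sketch; snapshot, table, and bucket names are made up). The snapshot itself is taken through the admin API and only references existing HFiles; the patched ExportSnapshot tool then copies those files to S3 over s3a as a MapReduce job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotBackupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Point-in-time snapshot of the table; no data is copied at this step.
    admin.snapshot("events_20150501", "events");
    admin.close();

    // The copy to S3 is then done with the (patched) ExportSnapshot tool,
    // typically invoked as a MapReduce job, e.g.:
    //   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    //     -snapshot events_20150501 -copy-to s3a://backup-bucket/hbase
  }
}
```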
Disaster Recovery
o  AWS provides multiple Availability Zones (AZs) in different geographic regions
o  HBase snapshots are backed up to S3 and to a separate cluster in a different AZ
o  S3 buckets are backed up from one region to another for cross-region redundancy
Running Hadoop on AWS
Lessons Learned
Running Hadoop on AWS
•  S3
o  For now at least, s3a is probably the file system implementation you want to use (if you are not using EMR)
o  Rename is not a logical (metadata-only) operation on S3 and is therefore expensive
o  Eventual consistency should be accounted for
o  Consider turning on S3 versioning
•  Instance Types / Topology
o  The number of virtual instances on a single physical host impacts fault tolerance
o  Tradeoff between network performance and availability/capacity
•  Region – Availability Zone – Placement Group
o  Be aware that Availability Zone identifiers are intentionally inconsistent across accounts
Questions?
