HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS

  1. Stock Market Order Flow Reconstruction Using HBase on AWS
     Aaron Carreras, HBaseCon – San Francisco, May 2015
  2. About the Presenter
     •  Director of Enterprise Data Platforms at FINRA
     •  Data ingestion, processing and management
  3. What Do We Do?
     •  Collect and Create
        o  33B events/day
        o  18 national exchanges
        o  Equities, options and fixed income
        o  Reconstruct the market from trillions of events spanning years
     •  Detect & Investigate
        o  Identify market manipulation, insider trading, fraud and compliance violations
     •  Enforce & Discipline
        o  Ensure rule compliance
        o  Fine and bar broker-dealers
        o  Refer matters to the SEC and other authorities
     (Diagram: order flow among firms, exchanges, dark pools and the TRF)
  4. What a stock trade looks like to the investor
  5. Example of what is actually happening
  6. Ingest/Access Patterns
  7. Configurations/Approaches in Common
  8. Logical Architecture
     CDH 4.5; HBase 0.94.6; EC2 hs1.8xlarge – 16 vCPUs, 117 GiB RAM, 24 drives x 2,000 GB
  9. Row Key Design & Pre-splitting
     •  Salt our row keys (see the sketch below)
        o  Our “natural” keys are monotonically increasing
        o  Row Key = salt(PK) + PK
     •  Pre-split tables
        o  Better control over the distribution of data across regions
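A minimal sketch of the salt(PK) + PK scheme with matching pre-split points. The one-byte hash salt and the 32-bucket count are illustrative assumptions; the slide does not specify the salt function or bucket count:

     import java.nio.charset.Charset;
     import org.apache.hadoop.hbase.util.Bytes;

     public class SaltedKeys {

         // Illustrative bucket count; one salt bucket per pre-split region.
         static final int BUCKETS = 32;

         // Row Key = salt(PK) + PK: a deterministic one-byte prefix spreads
         // monotonically increasing natural keys across all regions.
         static byte[] saltedKey(String naturalKey) {
             byte salt = (byte) ((naturalKey.hashCode() & 0x7fffffff) % BUCKETS);
             return Bytes.add(new byte[] { salt },
                 naturalKey.getBytes(Charset.forName("UTF-8")));
         }

         // Split points aligned with the salt, so each bucket starts in
         // its own region when passed to createTable().
         static byte[][] splitPoints() {
             byte[][] splits = new byte[BUCKETS - 1][];
             for (int i = 1; i < BUCKETS; i++) {
                 splits[i - 1] = new byte[] { (byte) i };
             }
             return splits;
         }
     }

Because the salt is derived from the PK, point reads can recompute it; only full scans need to fan out across all buckets.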
  10. Compactions & Splitting Configurations (default → override)
      hbase.hregion.majorcompaction: 7 days → 0 (disabled)
      hbase.hstore.compactionThreshold: 3 → 10
      hbase.hstore.compaction.max: 10 → 15
      hbase.hregion.max.filesize: 10 GB → 200 GB
      RegionSplitPolicy: IncreasingToUpperBoundRegionSplitPolicy → ConstantSizeRegionSplitPolicy
      hbase.hstore.useExploringCompation: false → true
      (A per-table sketch of the size/split overrides follows below.)
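The file-size and split-policy overrides can also be pinned per table at creation time. A sketch against the 0.94-era client API; the table name is illustrative, and the two families anticipate the linkage/content split described on the bulk-load slide:

      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.HColumnDescriptor;
      import org.apache.hadoop.hbase.HTableDescriptor;
      import org.apache.hadoop.hbase.client.HBaseAdmin;

      public class CreateOrderFlowTable {
          public static void main(String[] args) throws Exception {
              HTableDescriptor desc = new HTableDescriptor("order_flow"); // illustrative
              desc.addFamily(new HColumnDescriptor("l")); // linkage
              desc.addFamily(new HColumnDescriptor("c")); // content

              // Let region store files reach 200 GB before a split is considered.
              desc.setMaxFileSize(200L * 1024 * 1024 * 1024);

              // Split on absolute size only, matching the override above.
              desc.setValue(HTableDescriptor.SPLIT_POLICY,
                  "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");

              HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
              admin.createTable(desc, SaltedKeys.splitPoints()); // pre-split per slide 9
              admin.close();
          }
      }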
  11. OS Configuration Considerations
      §  Some of these may not be relevant depending on your OS/version, but they are worth confirming:
      redhat_transparent_hugepage/defrag: never
      nofile/nproc ulimit: 32768
      tcp_low_latency: 1 (enabled)
      vm.swappiness: 0 (disabled)
      selinux: disabled
      IPv6: disabled
      iptables: off/stopped
  12. Other Hadoop Configuration Considerations
      core-site.xml:  ipc.client.tcpnodelay = true
      core-site.xml:  ipc.server.tcpnodelay = true
      hdfs-site.xml:  dfs.client.read.shortcircuit = true
      hdfs-site.xml:  fs.s3a.buffer.dir = [machine specific]
      hbase-site.xml: hbase.snapshot.master.timeoutMillis = 1800000
      hbase-site.xml: hbase.snapshot.master.timeout.millis = 1800000
      hbase-site.xml: hbase.master.cleaner.interval = 600000 (ms)
      (An hbase-site.xml fragment for the snapshot settings follows below.)
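As an illustration, the snapshot-related entries above take the usual Hadoop property form in hbase-site.xml; presumably both timeout spellings are set because the key name differs across HBase versions:

      <!-- hbase-site.xml fragment; values as shown on the slide -->
      <property>
        <name>hbase.snapshot.master.timeoutMillis</name>
        <value>1800000</value>
      </property>
      <property>
        <name>hbase.snapshot.master.timeout.millis</name>
        <value>1800000</value>
      </property>
      <property>
        <name>hbase.master.cleaner.interval</name>
        <value>600000</value>
      </property>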
  13. Use Case ‘A’: Patterns
  14. Use Case ‘A’: Background
      •  Create graphs for historical market event data (trillions of records)
      •  Essentially a batch process
         o  Each batch had ~4 billion events
         o  Related events may span batches (e.g., the root could arrive later, children may be corrected, etc.)
      •  Back-process the prior 18 months (540 batches)
      •  Complete the project given the … and …
  15. Use Case ‘A’: Utilize Bulk Loads
      •  Back-processing and the ongoing update process are 100% bulk HFile loads (see the sketch below)
      •  Our column families and processing aligned with this approach by splitting the linkage and content into separate column families
      •  Eliminates Puts completely, along with the WAL writes, memstore flushes, and additional compactions that often accompany them
      (Diagram: HFile bulk load path)
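A sketch of the bulk-load flow, assuming a MapReduce job whose mapper emits sorted KeyValues (the mapper is omitted, and the table name is illustrative):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
      import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class BulkLoadDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "order_flow"); // illustrative name

              // This call wires in the partitioner/reducer that produce one
              // HFile set per region, so the final load is a pure file move.
              Job job = Job.getInstance(conf, "order-flow-bulk-load");
              HFileOutputFormat.configureIncrementalLoad(job, table);
              Path out = new Path(args[0]);
              FileOutputFormat.setOutputPath(job, out);

              if (job.waitForCompletion(true)) {
                  // Adopt the HFiles into the regions: no Puts, no WAL
                  // writes, no memstore flushes.
                  new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
              }
              table.close();
          }
      }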
  16. Use Case ‘A’: Optimize Gets
      •  Used sorted / partitioned batched Gets (see the sketch below)
         o  Minimizes the number of RPC calls required
         o  Sorting keeps adjacent rows together, improving block cache reuse
      •  Allocate more on-heap memory for reads (default → override):
         hfile.block.cache.size: 0.4 → 0.65
         hbase.regionserver.global.memstore.upperLimit: 0.4 → 0.15
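A minimal sketch of sorted, partitioned batch Gets against the 0.94 client API; the chunk size is an illustrative assumption:

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class BatchedGets {
          static List<Result> fetch(HTable table, List<byte[]> rowKeys) throws Exception {
              // Sort keys so each batch touches contiguous blocks within
              // a region, which improves block cache hit rates.
              Collections.sort(rowKeys, Bytes.BYTES_COMPARATOR);
              int chunk = 1000; // illustrative partition size
              List<Result> out = new ArrayList<Result>(rowKeys.size());
              for (int i = 0; i < rowKeys.size(); i += chunk) {
                  List<Get> gets = new ArrayList<Get>();
                  for (byte[] row : rowKeys.subList(i, Math.min(i + chunk, rowKeys.size()))) {
                      gets.add(new Get(row));
                  }
                  // One multi-get RPC batch instead of one RPC per row.
                  Collections.addAll(out, table.get(gets));
              }
              return out;
          }
      }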
  17. Use Case ‘B’: Patterns
  18. Use Case ‘B’: Background
      •  Not a once-a-day batch process; it must process the data as it arrives
         o  200+ business rules covering data validation, creating/breaking linkages, and identifying compliance issues within SLA
         o  Progressively build the tree
      •  The different processes required different access paths, sometimes requiring multiple copies of some portions of the data
  19. Use Case ‘B’: Put Strategy
      •  Bulk HFile loads didn’t fit the incremental processing as well here
      •  Partitioned batch Puts instead (see the sketch below)
      •  Memstore vs. block cache split 50/50
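A minimal sketch of partitioned batch Puts using the 0.94 client API; the partition size is an illustrative assumption:

      import java.util.List;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;

      public class BatchedPuts {
          static void write(HTable table, List<Put> puts) throws Exception {
              // Buffer writes client-side; flush one partition at a time
              // instead of paying a round trip per Put.
              table.setAutoFlush(false);
              int chunk = 1000; // illustrative partition size
              for (int i = 0; i < puts.size(); i += chunk) {
                  table.put(puts.subList(i, Math.min(i + chunk, puts.size())));
                  table.flushCommits(); // one RPC batch per partition
              }
          }
      }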
  20. Use Case ‘B’: Scan
      •  Scan
         o  Distinct daily tables alongside a single historical table, to more naturally support the processing
         o  Scan the daily tables only
         o  Switched from Get to Scan for rows with millions of columns (see the sketch below)
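A sketch of reading one very wide row in column chunks via Scan.setBatch(), so no single RPC has to materialize millions of cells at once; the batch size is illustrative:

      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class WideRowScan {
          static void scanWideRow(HTable table, byte[] row) throws Exception {
              // Start at the row, stop at its immediate successor: the
              // scan covers exactly this one row.
              Scan scan = new Scan(row, Bytes.add(row, new byte[] { 0 }));
              scan.setBatch(10000); // at most 10k columns per Result (illustrative)
              scan.setCaching(1);   // each next() already carries a large payload
              ResultScanner scanner = table.getScanner(scan);
              try {
                  for (Result chunk : scanner) {
                      // process one 10k-column slice of the row
                  }
              } finally {
                  scanner.close();
              }
          }
      }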
  21. Backup and DR
  22. HBase Backup to S3
      •  HBase ExportSnapshot to S3 didn’t really support our use case out of the box (the stock invocation is shown below)
      •  Significant updates to ExportSnapshot for S3:
         o  Support for S3A (HADOOP-10400)
         o  Removal of the expensive rename operation on S3 (HBASE-11119)
      (Diagram: cluster snapshots exported to S3)
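For reference, a stock snapshot-and-export flow looks roughly like this; the snapshot, bucket, and mapper count are illustrative:

      # take a snapshot from the HBase shell
      snapshot 'order_flow', 'order_flow-snap-20150501'

      # copy it to S3 over the s3a filesystem
      hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot order_flow-snap-20150501 \
        -copy-to s3a://example-backup-bucket/hbase \
        -mappers 16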
  23. Disaster Recovery
      o  AWS provides multiple Availability Zones (AZs) within each geographic region
      o  HBase snapshots are backed up to S3 and to a separate cluster in a different AZ
      o  S3 buckets are replicated from one region to another for cross-region redundancy
  24. Running Hadoop on AWS: Lessons Learned
  25. Running Hadoop on AWS
      •  S3
         o  For now at least, s3a is probably the filesystem implementation you want to use (if you are not using EMR)
         o  Rename is not a logical operation on S3 and is therefore expensive
         o  Eventual consistency should be accounted for
         o  Consider turning on S3 versioning
      •  Instance Types / Topology
         o  The number of virtual instances on a single physical host impacts fault tolerance
         o  Trade-off between network performance and availability/capacity
      •  Region – Availability Zone – Placement Group
         o  Be aware that Availability Zone identifiers are intentionally inconsistent across accounts
  26. Questions?