FSV307-Capital Markets Discovery How FINRA Runs Trade Analytics and Surveillance on AWS.pdf

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Capital Markets Discovery: How FINRA Runs
Trade Analytics and Surveillance on AWS
Robert Kissell
Sr. Solutions Architect
WWPS Federal Financials
AWS
John Hitchingham
Sr. Director Engineering
FINRA
November 27, 2017
FSV307

Four pillars of the data lake
Scale
• Store and analyze all
data centrally
• Ingest data quickly
without predefined
schemas
• Separate storage and
compute, scaling each
component as needed
Cost
• Pay only for what you
need
• Use only the services you
need
• Utilize diverse services/
features to optimize cost
Security
• Encryption at each step
• Explicit control of egress
and ingress points
• Compliance and
Governance of Data
access using AWS native
services/features
Agility
• Big data does not mean
just batch processing
• Mix and match on-
premises and cloud
• Custom development and
managed services

Data lake
Central Storage
Secure, cost-effective
storage in
Amazon S3
Data Ingestion
Get your data into S3 quickly and securely
Kinesis Firehose, Direct Connect,
AWS Snowball, Database Migration Service
Processing & Analytics
Use of predictive and prescriptive analytics
to gain better understanding
DynamoDB
Elasticsearch Service
Athena, Amazon QuickSight, Amazon EMR,
Amazon Redshift
Protect & Secure
Use entitlements to ensure data is
secure and users’ identities are
verified
Catalog & Search
Access and search metadata
Access & User Interface
Give your users easy and secure access

FINRA’s Data Lake
Surveilling markets with FINRA’s multi-petabyte enterprise-grade data
lake

Market regulation—analytics pipeline
Validation
Prepare for
Analytics
(ETL)
Run Automated
Detection
Models
Interactive
Analytics
Regulatory
Analyst
Explore
Investigate
Regulatory
Follow up
BDs Exchanges Reference
Data Providers
Trade execution records
Market reference data
Data
Scientist
Develop
Models
75B+ events
20+ PB of Data
3 Yrs Prod on Cloud
Major Exchange Clients

Cloud journey—data puddles to data lake
Database 1
Storage
Query/Compute
Catalog
Database 2
Storage
Query/Compute
Catalog
Database n
Storage
Query/Compute
Catalog
Storage
Query/
Compute
Catalog
EMR, Lambda, EMR Presto, EMR HBase
FINRA
herd
Hive
metastore
Silo
Amazon
S3
Scales

http://finraos.github.io/herd
Unified catalog
• Schemas
• Versions
• Encryption type
• Storage policies
Lineage and Usage
• Track publishers and consumers
• Easily identify jobs and derived data sets
Shared Metastore
• Common definition of tables and partitions
• Use with Spark, Presto, Hive, etc.
• Faster instantiation of clusters
Herd catalog—for centralized data management

Trades Surveillance
2017-03-01 v1
2017-03-02 v1
2017-03-01 v1
2017-03-02 v1
Regulatory
conclusion
Lineage
1
Trades Surveillance
2017-03-01 v1
2017-03-02 v1
2017-03-01 v1
2017-03-02 v1
Regulatory
conclusion
2 2017-03-01 v2
v2 Data Version
?
?
Example—lineage and data versioning
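The lineage and versioning idea in this example can be sketched in a few lines. This is a minimal illustration, not herd's API: the `DataVersion` and `LineageCatalog` names and their methods are hypothetical. It assumes that a derived data set becomes suspect as soon as any of its inputs is superseded by a newer version, which is one way to read the "?" markers in the diagram.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataVersion:
    """One versioned partition of a data set (hypothetical model)."""
    dataset: str     # e.g. "Trades"
    partition: str   # e.g. "2017-03-01"
    version: int     # e.g. 1


@dataclass
class LineageCatalog:
    """Tracks which inputs each data version was derived from."""
    parents: dict = field(default_factory=dict)  # DataVersion -> set of inputs
    latest: dict = field(default_factory=dict)   # (dataset, partition) -> newest version

    def register(self, item, derived_from=()):
        key = (item.dataset, item.partition)
        self.latest[key] = max(self.latest.get(key, 0), item.version)
        self.parents[item] = set(derived_from)

    def is_stale(self, item):
        """True if any direct or transitive input has a newer version."""
        for parent in self.parents.get(item, ()):
            key = (parent.dataset, parent.partition)
            if self.latest.get(key, parent.version) > parent.version:
                return True
            if self.is_stale(parent):
                return True
        return False


catalog = LineageCatalog()
trades_v1 = DataVersion("Trades", "2017-03-01", 1)
surveillance_v1 = DataVersion("Surveillance", "2017-03-01", 1)
catalog.register(trades_v1)
catalog.register(surveillance_v1, derived_from=[trades_v1])
stale_before = catalog.is_stale(surveillance_v1)

# A corrected v2 of the 2017-03-01 trades arrives.
catalog.register(DataVersion("Trades", "2017-03-01", 2))
stale_after = catalog.is_stale(surveillance_v1)
```

With lineage recorded at registration time, finding every surveillance output and regulatory conclusion affected by a restated input becomes a graph walk rather than a manual search.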

Files
Ingest
Define
Record
Legal Hold?
No IAM
role with
delete on bucket
Review/Approve
Process
Tag files
For delete
DM Managed
Amazon
S3 Bucket
Trade Reports
OATS Orders
Model Outputs
Delete
Delete files call
Herd—foundation for records management
Files
Herd
DM
Metadata
All deletes
via policy
based on tags
Register
Object
Store
file(s)
Set Record Flag
Set Record Period
Set Record Owner
Set / Clear
Legal Hold
Gen list of
Records eligible
for deletion
File life on Amazon S3
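The "Gen list of Records eligible for deletion" step can be illustrated with a small sketch. The field names (`is_record`, `retention_days`, `legal_hold`) and the rule itself are assumptions made for illustration, based on the flags shown on the slide: a file is eligible only when it is a record, its record period has elapsed, and no legal hold is set. Non-record files are simply left out of this list here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class StoredFile:
    """Catalog metadata for one registered file (hypothetical fields)."""
    key: str
    registered_on: date
    is_record: bool = False
    retention_days: Optional[int] = None  # the "record period"
    record_owner: Optional[str] = None
    legal_hold: bool = False


def eligible_for_deletion(f: StoredFile, today: date) -> bool:
    """A record is eligible once its period has elapsed and no hold applies."""
    if f.legal_hold or not f.is_record or f.retention_days is None:
        return False
    return f.registered_on + timedelta(days=f.retention_days) <= today


today = date(2017, 11, 27)
files = [
    StoredFile("trade-reports/a", date(2010, 1, 1), True, 6 * 365, "owner-1"),
    StoredFile("oats-orders/b", date(2010, 1, 1), True, 6 * 365, "owner-1",
               legal_hold=True),
    StoredFile("model-outputs/c", date(2017, 1, 1), True, 6 * 365, "owner-2"),
]

# The eligible list would then go to review/approval before files are tagged.
eligible = [f.key for f in files if eligible_for_deletion(f, today)]
```

In the flow on the slide, the list only feeds the review and approval step; the actual deletes happen through a tag-based policy, since no IAM role holds delete permission on the bucket.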

Universal data catalog—explore data
Analysts Data Scientists Developers
Built on

Catalog &
Storage
ETL
Normalize, Enrich, Reformat
Human
Analytics
Validation
Ingest
Broker Dealers
Exchanges
Third-Party
Providers
Data
Files
Analyst
Data Scientist
Regulatory User
Detection models (Patterns)
Automated Surveillance
Legend: P = Processing Pipeline, A = Analytics
Analytic data processing pipeline
on the data lake

ETL execution
Input Data Input Data Input Data Input Data Input Data
Job1 Job2 Job3
Job4 Job5 Job6 JobN
…
Output Data Output Data Output Data Output Data Output Data
Amazon
S3
Amazon
S3
Amazon
EMR
Orchestration
Data Location
Registration
Per-Second Billing, Spot, Hive (Deprecated), Spark
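As a rough sketch of how one ETL job might run on a transient, Spot-backed EMR cluster, the function below builds a request for the EMR `RunJobFlow` API. It only constructs the request and does not call AWS. The bucket path, job name, release label, instance type, bid price, and default role names are placeholder assumptions, not FINRA's configuration.

```python
def build_transient_emr_request(job_name, script_s3_uri, core_nodes,
                                instance_type="r4.4xlarge",
                                spot_bid_usd="0.50"):
    """Build RunJobFlow parameters for a cluster that runs one Spark step
    and then terminates. All defaults are illustrative placeholders."""
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-5.10.0",          # assumed release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "SPOT", "BidPrice": spot_bid_usd,
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Transient cluster: shut down when the step list is done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": job_name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_emr_request(
    "etl-job1", "s3://example-bucket/jobs/etl_job1.py", core_nodes=10)
```

With credentials configured, an orchestrator could submit this via `boto3.client("emr").run_job_flow(**request)`. Input and output locations stay in Amazon S3, so each job can size its own cluster and release it when finished.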

Dynamic processing
[Chart: Daily Order Volume (Billions), 11/1 to 11/29]
[Chart: Amazon EMR compute on Amazon EC2; Compute Nodes (0 to 12,000) by Hour of Day, 2016-10-17 to 2016-10-24]
20k – 25k EC2 nodes per day
93% of EC2 is on EMR
Avg EC2 node: 3 cores
Avg EC2 uptime: 3 hours
96% of EC2 nodes live < 24 hrs
Over 50k nodes on peak day
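A quick back-of-the-envelope calculation shows what these figures imply. It assumes the 3-hour average uptime applies uniformly across the 20k to 25k daily nodes, which the slide does not state.

```python
# Figures taken from the slide.
nodes_per_day = (20_000, 25_000)
avg_uptime_hours = 3

# Total node-hours consumed per day (low and high end of the range).
node_hours = tuple(n * avg_uptime_hours for n in nodes_per_day)

# The same work expressed as nodes running around the clock.
always_on_equivalent = tuple(h / 24 for h in node_hours)

print(node_hours)            # (60000, 75000)
print(always_on_equivalent)  # (2500.0, 3125.0)
```

Under that assumption, the daily workload equals roughly 2,500 to 3,125 nodes running continuously, while a fixed fleet sized for the peak day would need capacity for more than 50k nodes.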

Interactive analytics—fundamentals
Data
Analyst
Data
Scientist
JDBC/ODBC
Client
JDBC/ODBC
Client
Table 1
Table 2
AuthN
AuthZ
Metastore
Table N
Logical “Database” = 4+ PB
Amazon EMR

Achieving interactive query
Query | Table size (rows) | Output size (rows) | ORC | TXT/BZ2
select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) | 2469171608 | 1 | 4s | 1m56s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 12 | 3s | 1m51s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2469171608 | 8364 | 5s | 2m5s
select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 | 2469171608 | 760 | 10s | 2m3s
Test Config:
Presto 0.167.0.6t (Teradata) On EMR
Data on S3 (external tables)
Cluster size: 60 worker node x r4.4xlarge
Key points:
Use ORC (Or Parquet) for performant query
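To put the key point in numbers, the short script below converts the timings from the table to seconds and computes the ORC speedup over TXT/BZ2 for each of the four queries. The helper name `to_seconds` is illustrative only.

```python
import re


def to_seconds(text):
    """Convert a duration such as '1m56s' or '4s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


# (ORC, TXT/BZ2) timings, in the order the queries appear in the table.
timings = [("4s", "1m56s"), ("3s", "1m51s"), ("5s", "2m5s"), ("10s", "2m3s")]

speedups = [round(to_seconds(txt) / to_seconds(orc), 1) for orc, txt in timings]
print(speedups)  # [29.0, 37.0, 25.0, 12.3]
```

On this test configuration ORC is about 12x to 37x faster, and the smallest gain is on the `select *` query, which reads every column.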

User A JDBC/ODBC
Client Table 1
Table 2
Metastore
Table N
Logical “Database”
JDBC/ODBC
Client
User B
JDBC App
Cluster A
Cluster B
Cluster N
Still One Copy
Of Data
Scaling out interactive query

FINRA’s interactive Big Data portfolio
Data Lake
Diver MIRS DOMT User-Directed FOLA Marketspace
Crosstab UI
Personal marts -
billions of rows
Domain-specific
interactive reports
and visualizations
Visualize
depth of market
Investigation
and data profiling
via SQL
Retrieve market
events to render
order lifecycle
Exception and
alert viewer

Data science ecosystem on data lake
Data
Scientist
JDBC/ODBC
Client
Logical ‘Database’
EMR Cluster Source
Data
Spark Cluster
DS-in-a-box
Data
Scientist
Notebook
Interface
Data
Scientist
Catalog
Notebook or Shell
Personal
Data Marts
Explore

Example—cross-market surveillance
NASDAQ
PSX
NYSE
AMEX
ARCA
OATS
TRF
ISG Audit
Trail
Cross-market Data Model
Unifies market
data into five
major events:
orders,
reports,
cancels,
trades, and
quotes.
Captures
events and
attributes
required for
patterns.
Provides
consistent
cross market
participant
definition.
Propagates
participant
information as
an order is
routed from
Firm to
Exchange and
from
Exchange to
Exchange
Calculates
open interest
for all orders
at any given
time during
the day
ETL
Data
Cross Market
Surveillance Models
(automated)
Depth of Market Tool
& Diver
(interactive)
Use Use

Surveillance execution (like ETL)
Input Data Input Data Input Data Input Data Input Data
Pattern1 Pattern2 Pattern3 Pattern4 Pattern5 Pattern6 PatternN…
Output Data Output Data Output Data Output Data Output Data
Amazon
S3
Amazon
EMR
Orchestration
Data Location
Registration
Fwk
Mgr
Dev Ops
Per-Second Billing, Spot, Hive (Deprecated), Spark
Amazon
S3

Surveillance evolution
 | Before Cloud | Cloud v1 | Cloud v2
Execution Engine | Relational DB | Hive, Spark | Spark
Language | SQL | SQL (HiveQL, Spark SQL) | Scala, Python, R, SQL, Java
Production Logic | SQL w/ some scripting | SQL w/ some scripting | ML model (H2O, MLlib)
Data Catalog | N/A | Catalog provides schema/location | Create dataframes; Catalog provides schema/location
Data Framework | N/A | N/A | Data manipulated as dataframe; API for common manipulations
today

FINRA’s dynamic surveillance platform
Data Engineering Model Selection
ML Framework
Data Framework
Trained
Model
Scoring
Algorithms
EGR
Python, R,
Scala, SQL
Scala
Python
Scala, Python, R
Test
Chosen
Model
Data
Observation-1
Observation-2
Observation-n
…
Notebook
Promotion
Data Lake
Amazon
EC2
Amazon
EC2
Amazon
S3
Model Development Prod
FINRA
herd
Python, R,
Scala
Data Framework
Scala
Python
Iterative

Security
Isolation
• VPC isolation
• Security Groups
• VPC Endpoints
• SDLC Isolation (Accts)
Encryption
• AWS KMS
• EMR Security Configs
• S3 SSE
• S3 KMS
• EBS KMS
Monitoring
• AWS CloudTrail
• Splunk
• Nagios
AuthN/AuthZ
• Role-based access
• IAM ADFS Federation
• Temporary token access
• AD LDAP Integration (Apps)

Compliance—consistency, transparency
Compliance
Reports
FINRA Provision Tool
Compliant Stack Configs
FINRA Portus Tool
Approved Security Groups
Dev Account
QC Account
Prod Account
Security EA
FINRA IAMUS Tool
IAM Role Templates
Development Tools
Dev
Teams
Automated
Deploy
Automated
Deploy
Configs / Chg Events
CloudTrail
Policies
Audits: Reg SCI, SoX, SOC2, SEC

Reporting/
Investigation
Data Science
Machine Learning
Data Management Data Processing Pipeline
Improved
Cost Reduction
Security
Regulatory Compliance



Achieved
Simplified
Benefits of a data lake implementation

FINRA Presentations re:Invent 2017
FSV307 – Capital Markets Discovery: How FINRA Runs Trade Analytics and Surveillance on AWS
The FINRA analytics platform unlocks the value in capital markets data by accelerating trade analytics and providing a foundation for machine
learning at scale. Monday, Nov 27, 10:45 a.m. – 11:45 a.m. Venetian, Level 5, Palazzo P
SID326 – AWS Security State of the Union
Steve Schmidt, chief information security officer of AWS, addresses the current state of security in the cloud. As part of this presentation,
John Brady (CISO of FINRA) shares the FINRA journey to the cloud. Wednesday, Nov 29, 12:15 p.m. – 1:15 p.m. MGM, Level 3, Premier
Ballroom 316
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS
Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering
data scientists with tools they need to be effective. Wednesday, Nov 29, 11:30 a.m. – 12:30 p.m. Aria, Level 3, Juniper 3
ENT328 – FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud
The Financial Industry Regulatory Authority (FINRA) Technology Group has changed its customers' relationships with data by creating a
managed data lake. Thursday, Nov 30, 1 p.m. – 2 p.m. MGM, Level 3, Premier Ballroom 319
DEV335 – Manage Infrastructure Securely at Scale and Eliminate Operational Risks
Managing AWS and hybrid environments securely and safely while having actionable insights is an operational priority and business driver for
all customers. Thursday, Nov 30, 4 p.m. – 5 p.m. Venetian, Level 2, Venetian E

Thank you!
