Presto Summit 2018 - 08 - FINRA

Presto at FINRA – supporting market surveillance at scale (John Hitchingham, FINRA)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)

Published in: Data & Analytics

  1. Presto at FINRA – Supporting market surveillance at scale. John Hitchingham, FINRA Engineering, John.Hitchingham@finra.org
  2. Market Regulation surveillance workflow: BDs, Exchanges, Reference Data Providers; 100B+ events; 25+ PB of data; 3+ yrs prod; major exchange clients; market manipulation, insider trading, fraud, abuse
  3. Data volume. Incoming records:
      • 6000+ business objects
      • 7+ million data partitions
      • 160+ million data objects
      • 25+ data publishers
      • 5+ PB of data
  4. Data Fragmentation makes analytics difficult
  5. Scale by separating storage and compute
  6. Cloud Migration – Siloed Databases to Data Lake
  7. Workflow in AWS Cloud
  8. Herd Catalog: http://finraos.github.io/herd/
  9. ETL
  10. Isolate workloads and tune capacity per process
  11. Interactive
  12. Ad-Hoc query design for data lake
  13. Main production “data warehouse” bucket growth…
  14. Managed Data Lake (MDL) – Data Lake “in a box”. Just released as open source. Data lake implementation on AWS, featuring Presto as query endpoint. https://finraos.github.io/herd-mdl/
  15. Portfolio of interactive apps on data lake
  16. Data Science Ecosystem
  17. Query tool use at FINRA
      • Tools: Hive, Spark, Presto, HBase
      • Status: Hive deprecated; Spark general use; Presto general use; HBase limited use
      • Used for: ETL/ELT; ETL/ELT (replace Hive); Data Science; Machine Learning; Data Engineering; Data Profiling; BI Reporting; Custom Apps requiring rapid “indexed” lookups
  18. Future exploration with Presto
      o CBO
      o AuthN/AuthZ
        • Hive metastore – column, row – Ranger?
        • Federated database access (Postgres) – model to control authorization unique to principal
        • Federated AuthN (SAML, OAuth)
      o Athena?
