
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud

FINRA’s Data Lake unlocks the value in its data to accelerate analytics and machine learning at scale. FINRA’s Technology group has changed its customers’ relationship with data by creating a Managed Data Lake that enables discovery on petabytes of capital markets data while saving time and money over traditional analytics solutions. FINRA’s Managed Data Lake includes a centralized data catalog and separates storage from compute, allowing users to query petabytes of data in seconds. Learn how FINRA uses Spot Instances and services such as Amazon S3, Amazon EMR, Amazon Redshift, and AWS Lambda to provide the “right tool for the right job” at each step in the data processing pipeline. All of this is done while meeting FINRA’s security and compliance responsibilities as a financial regulator.


  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Bob Griffiths, AWS Solutions Architect Manager; John Hitchingham, FINRA Engineering. August 14, 2017. FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
  2. Overview of Big Data Services
  3. What is big data? When your data sets become so large and complex that you have to start innovating around how to collect, store, process, analyze, and share them.
  4. Big Data services on AWS. Collect: AWS Import/Export, AWS Direct Connect, Amazon Kinesis, AWS Database Migration Service, AWS Data Pipeline, AWS Snowball. Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB. Process & Analyze: Amazon EMR, Amazon EC2, Amazon Machine Learning, Amazon Redshift, Amazon Kinesis Analytics, Amazon QuickSight, Amazon Elasticsearch Service, Amazon Athena, AWS Glue.
  5. Scale as your data and business grows. The volume, variety, and velocity at which data is being generated leave organizations with new questions to answer.
  6. Data Lake components. Central Storage: secure, cost-effective storage in Amazon S3. Data Ingestion: get your data into S3 quickly and securely (Kinesis Firehose, Direct Connect, Snowball, Database Migration Service). Catalog & Search: access and search metadata (DynamoDB, Elasticsearch Service). Access & User Interface: give your users easy and secure access (API Gateway, Directory Service, Cognito). Processing & Analytics: use predictive and prescriptive analytics to gain better understanding (Athena, QuickSight, EMR, Amazon Redshift). Protect & Secure: use entitlements to ensure data is secure and users’ identities are verified (IAM, CloudWatch, CloudTrail, KMS).
  7. Scale. Store and analyze all your data—structured and unstructured—from all of your sources, in one centralized location at low cost. Quickly ingest data without needing to force it into a predefined schema, enabling ad hoc analysis by applying schemas on read, not write. Separating your storage and compute allows you to scale each component as required and attach multiple data processing and analytics services to the same data set.
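The schema-on-read idea above can be sketched in a few lines of Python: raw records land in storage untyped, and each consumer applies its own schema at query time. The field names, types, and sample data here are hypothetical, purely for illustration.

```python
import csv
import io
from datetime import date

# Raw data is stored as-is; no schema was enforced at write time.
RAW = "2016-08-09,AAPL,100.5\n2016-08-09,MSFT,98.2\n2016-08-10,AAPL,101.0\n"

def read_with_schema(raw, schema):
    """Apply a (name, parser) schema to raw CSV rows at read time."""
    for row in csv.reader(io.StringIO(raw)):
        yield {name: parse(value) for (name, parse), value in zip(schema, row)}

# One consumer's schema; another consumer could read the same bytes differently.
trade_schema = [("trade_date", date.fromisoformat),
                ("symbol", str),
                ("price", float)]

trades = list(read_with_schema(RAW, trade_schema))
aug9 = [t for t in trades if t["trade_date"] == date(2016, 8, 9)]
```

Because the schema lives with the reader rather than the writer, ingest never blocks on schema design, which is the point the slide makes.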
  8. Cost. Use only the services you need • Scale only the services you need • Pay only for what you use • Discounts through Reserved Instances, instance types including Spot, and upfront commitments.
  9. Security and scale. Visibility and control of all APIs and retrievals • Encryption of all data at each step • Store an exabyte of data or more in Amazon S3 • Analyze GB to PB using standard tools • Control egress and ingress points using VPCs.
  10. Agility & actionable insights. Big data does not mean just batch: it can be streamed in, processed in real time, and used to respond quickly to requests and actionable events, generating business value. You can mix and match on-premises and cloud, and custom development and managed services.
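As a toy illustration of the batch-versus-stream point (not FINRA’s actual pipeline), the same handler can serve both modes: one pass over a stored batch, or incremental processing as events arrive from a streaming source. The alert rule and field names are invented for the sketch.

```python
from typing import Iterable, Iterator

def surveil(events: Iterable[dict]) -> Iterator[dict]:
    """Flag suspiciously large orders; works identically on a batch or a stream."""
    for event in events:
        if event["qty"] >= 1_000_000:
            yield {"alert": "large_order", **event}

# Batch mode: the whole stored data set at once.
batch = [{"id": 1, "qty": 500}, {"id": 2, "qty": 2_000_000}]
alerts = list(surveil(batch))

def live_feed():
    """Stand-in for a streaming source such as a Kinesis shard."""
    yield {"id": 3, "qty": 1_500_000}

# Stream mode: events are handled one by one as they arrive.
live_alerts = list(surveil(live_feed()))
```

Writing the logic once over an iterable is one common way to keep batch and real-time paths from diverging.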
  11. FINRA’s Managed Data Lake
  12. FINRA’s experience. To solve its market regulation challenges, over the past three years FINRA’s Technology team has pioneered a managed cloud service to operate big data workloads and perform analytics at large scale. The results of FINRA’s innovations have been significant: a 30% operating cost reduction, in both labor and infrastructure; a 5x increase in operational resiliency; and the business is able to perform analytics at an unprecedented scale and depth. To achieve these gains and operate its big data ecosystem, FINRA Technology has built a set of cutting-edge tools, processes, and know-how.
  13. Legacy pain points – infrastructure and ops. Did not scale well as volumes and workloads increased • Duplication of effort in data management (data lifecycle, retention, versioning, etc.) • Data sync issues, with manual effort to keep data in sync • Costly system maintenance and upgrades.
  14. Legacy pain points – analytics and data science. Business analysts, data scientists, and data analysts: What data do we have? What format is it in? Where do I get it? Data engineers and ops: Get this data for them… not on disk, pull from tape… wait for tapes from offsite… prepare and format. Then: Oops, I need more data… repeat! I need data in a different format… repeat! And so on.
  15. Summary of cloud drivers • Fast-growing data volumes YoY • High cost of pre-building for peak • Escalating costs of in-house technology infrastructure • Long time-to-market for finding insights in data • Appliance platforms facing obsolescence and end-of-life as a result of new big data technologies. Keep spending more on legacy infrastructure, or redirect dollars to the core business of regulation?
  16. FINRA cloud program business objectives • Discover data easily • Access (all the) data easily • Increase the power of analytic tools • Make data processing resilient • Make data processing cost-effective. Could this be achieved in the cloud?
  17. Cloud architectural principles. Manage data consistently: • Define, store, and share our data as an enterprise asset • All data should be enabled for analytics • Protect data in a holistic manner (data at rest and data in transit). Integrate our portfolio: • Shared solutions for common business processes across the organization • All “business” data in the cloud will be tracked by a centralized Data Management System so that FINRA can manage the data lifecycle in a productive and cost-effective manner • All FINRA-developed applications will have service interfaces. Operational resiliency: • Multi-AZ components and failover • Auto-scaling and load balancing to achieve high availability • No logon to servers or services for routine operations • Applications should include automated operations jobs to handle known failure scenarios, recovery, data issues, and notifications.
  18. From data puddles to Data Lake. In the data center, FINRA ran silos: each database (Database 1 through Database n) bundled its own storage, query/compute, and catalog. In AWS, a single Amazon S3 store and a central catalog (herd plus a Hive metastore) serve many compute engines (EMR Spark, EMR Presto, EMR HBase, Lambda), and each layer scales independently.
  19. Data processing stream on the Data Lake. Data files are ingested from broker-dealers, exchanges, and third-party providers, then validated, then run through ETL (normalize, enrich, reformat). Results feed human analytics (analysts, data scientists, regulatory users) and automated surveillance (patterns), all over centralized catalog & storage: a centralized catalog, hundreds of EMR clusters, and as many Lambda functions as needed.
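The pipeline stages on this slide (ingest, validate, ETL, then analytics) can be sketched as a chain of small steps. In FINRA’s system each stage would run as Lambda functions or EMR jobs; the function names, record fields, and rules below are illustrative only.

```python
def ingest(files):
    """Collect raw records from incoming files (here: in-memory stand-ins)."""
    return [rec for f in files for rec in f]

def validate(records):
    """Drop records missing required fields."""
    return [r for r in records if "symbol" in r and "price" in r]

def normalize(records):
    """ETL step: normalize symbols and enrich with a derived field."""
    return [{**r,
             "symbol": r["symbol"].upper(),
             "notional": r["price"] * r.get("qty", 0)}
            for r in records]

# One bad record (missing price) is filtered out by validation.
files = [[{"symbol": "aapl", "price": 100.0, "qty": 10}, {"symbol": "msft"}]]
clean = normalize(validate(ingest(files)))
```

Each stage takes records in and hands records on, so stages can be scaled or replaced independently, mirroring the slide’s separation of validation, ETL, and analytics.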
  20. Power of parallelization. ETL jobs (Job 1 through Job n) each read their own input and write their own result; workloads run in parallel with workload isolation to meet SLAs.
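The workload-isolation pattern above — one independent compute environment per ETL job, all running concurrently — can be mimicked locally with a thread pool, where each submitted job stands in for a separate EMR cluster. The job names and data are made up for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def etl_job(name, rows):
    """Each job transforms its own input in isolation (stand-in for one cluster)."""
    return name, [r * 2 for r in rows]

jobs = {"job1": [1, 2], "job2": [3, 4], "job3": [5]}

# Run all jobs in parallel; a slow or failing job cannot delay the others' SLAs.
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    futures = [pool.submit(etl_job, name, rows) for name, rows in jobs.items()]
    results = dict(f.result() for f in futures)
```

Because each job owns its input and result, there is no shared mutable state between jobs — the same property that lets FINRA run hundreds of clusters side by side.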
  21. Processing scales to meet demand. [Charts: daily order volume in billions across November, ranging roughly 0–5 billion per day; and EMR compute nodes on EC2 by hour of day for 2016-10-17 through 2016-10-24, scaling between 0 and roughly 12,000 nodes.]
  22. Catalog for centralized data management. Unified catalog: • Schemas • Versions • Encryption type • Storage policies. Lineage and usage: • Track publishers and consumers • Easily identify jobs and derived data sets. Shared metastore: • Common definition of tables and partitions • Use with Spark, Presto, Hive, etc. • Faster instantiation of clusters.
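A unified catalog like the one described — schema versions plus producer/consumer lineage between data sets — can be modeled minimally as below. FINRA’s herd catalog tracks far more (encryption type, storage policies, partitions), and the class and its API here are purely illustrative.

```python
class Catalog:
    """Toy metadata catalog: schema versions plus lineage to source data sets."""

    def __init__(self):
        self.schemas = {}   # data set name -> list of schema versions
        self.lineage = {}   # derived data set -> names of its source data sets

    def register(self, name, schema, sources=()):
        """Register a new schema version; record lineage for derived sets."""
        self.schemas.setdefault(name, []).append(schema)
        if sources:
            self.lineage[name] = list(sources)

    def latest(self, name):
        """Consumers always read the newest schema version."""
        return self.schemas[name][-1]

cat = Catalog()
cat.register("orders", {"symbol": "string", "qty": "bigint"})
cat.register("orders", {"symbol": "string", "qty": "bigint", "venue": "string"})
cat.register("daily_volume", {"symbol": "string", "total": "bigint"},
             sources=["orders"])
```

Keeping this metadata in one place is what lets many engines (Spark, Presto, Hive) share table definitions and what makes derived data sets traceable to their inputs.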
  23. Catalog and the Data Lake ecosystem. Analysts and data scientists explore and use data through a Data Catalog UI backed by the Data Catalog and Hive metastore. Processing jobs (validation, ETL, machine surveillance on Lambda and EMR) request object info (optionally DDL) from the catalog via custom handlers, read the object/file from object storage (S3), and register their output back in the catalog. Interactive analytics engines (EMR, Redshift Spectrum) get DDL from the catalog, query the data, and store results back to S3.
  24. Analytics – one-stop shop for data. Data analysts and data scientists connect through JDBC clients, with AuthN and AuthZ, to a logical “database” of tables (Table 1 through Table N) defined in the shared metastore.
  25. Achieve interactive query speed with the Data Lake. All queries ran against TABLE_1 (2,469,171,608 rows):
      select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) — output 1 row; ORC 4s, TXT/BZ2 1m56s
      select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 — output 12 rows; ORC 3s, TXT/BZ2 1m51s
      select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 — output 8,364 rows; ORC 5s, TXT/BZ2 2m5s
      select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 — output 760 rows; ORC 10s, TXT/BZ2 2m3s
      Test config: Presto (Teradata) on EMR; data on S3 (external tables); cluster of 60 worker nodes x r4.4xlarge. Key point: use ORC (or Parquet) for performant queries.
  26. Grow the data store with no work. [Chart: main production data store (bucket on S3) growing to roughly 4.5 PB.] • Data footprint grows seamlessly • All data is accessible for interactive (or batch) query from the moment it is stored.
  27. Or scale out with multiple clusters… User A and User B connect through JDBC clients (or a JDBC app) to Cluster A, Cluster B, and Cluster C, all against the same logical “database” and shared metastore, with AuthN. Still one copy of the data!
  28. Data needs for data science and ML • Allow discovery & exploration • Bring disparate sources of data together • Allow users to focus on the problem, not the infrastructure • Safeguard information with a high degree of security and least-privilege access.
  29. A single way to access all of the data. Before the Data Lake, each data scientist went through a data engineer to reach each logical data repository (1 through N). With the Data Lake, data scientists reach a single logical data repository themselves, accelerating discovery through self-service.
  30. Data science on the Data Lake. Data scientists work from a notebook interface or shell — via a JDBC client, a notebook against a Spark cluster or EMR cluster, or a “DS-in-a-box” environment — against the logical “database” and catalog, with AuthN. Still one copy of the data!
  31. Universal Data Science Platform (UDSP) • Environment (EC2) for each data scientist • Simple provisioning interface • Right instance type (memory or GPU) for the job • Access to all the data in the Data Lake • Shut off when not in use for savings • Secure (LDAP AuthN/Z + encryption).
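The “right instance for the job” choice on this slide reduces to a small provisioning rule. The instance families below are real EC2 types of that era (the deck itself mentions r4.4xlarge), but the thresholds and the function are hypothetical, not FINRA’s actual provisioning logic.

```python
def pick_instance(needs_gpu: bool, memory_gb: int) -> str:
    """Map a data scientist's stated needs to an EC2 instance family (illustrative)."""
    if needs_gpu:
        return "p2.xlarge"      # GPU instance for deep learning (CUDA, TensorFlow)
    if memory_gb > 244:
        return "x1.16xlarge"    # very large in-memory data sets
    return "r4.4xlarge"         # memory-optimized default for analytics

# A simple provisioning interface would collect these two answers and launch
# the chosen type, then stop the instance when idle to save cost.
choice = pick_instance(needs_gpu=False, memory_gb=64)
```

Encoding the decision in one function is what makes a “simple provisioning interface” possible: the user states needs, the platform picks hardware.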
  32. UDSP – Inventory – not just R • R 3.2.5, Python (2.7.12 and 3.4.3) • Packages: R 300+, Python 100+ • Tools for building packages: gcc, gfortran, make, java, maven, ant… • IDEs: Jupyter, RStudio Server • Deep learning: CUDA, cuDNN (if GPU present); Theano, Caffe, Torch; TensorFlow.
  33. Some business benefits of the Data Lake  Market volume changes are no longer disruptive technology events  Regulatory analysts can now interactively analyze 1000x more market events (billions of rows vs. millions before)  Easily reprocess data when there are upstream data errors — finding capacity used to take weeks and can now be done in a day or days  Querying order route detail went from tens of minutes to seconds  Quicker turnaround to provide data for oversight  Machine learning model development is easier.
  34. Want to hear more? Feel free to contact me:
  35. Thank you!