Next-gen Data Flow Platform for the Enterprise
Santosh Bardwaj
Vice President, Advanced Analytics
The opinions expressed in this presentation are those of the presenters, in their individual capacities, and not necessarily those of Discover.

Agenda
1. What it takes to build an enterprise-ready platform
2. Discover’s next-gen data ingestion platform built on NiFi
3. Challenges and how we overcame them
4. Next steps with the platform

Discover is a leading U.S. direct bank & payments partner

Deposits & Lending
 $60Bn in Credit Card Receivables
 $37Bn Consumer Deposits
 $9Bn Private Student Loans
 $7Bn Personal Loans
 Leading Cash Rewards
 1 in 4 Households(1)

Payments
 $183Bn Payment Services Volume
 185+ Countries/Territories

Note(s): Balances as of March 31, 2017; volume based on the trailing four quarters ending 1Q17; direct-to-consumer deposits include affinity deposits.
(1) TNS’ Consumer Payment Strategies Study

Advancing our data-analytic capabilities

Five advanced analytics capabilities, built around a foundation of a continuous data pipeline and a hybrid data-analytic lake, taking us from hours to minutes:
 Ingest, classify and transform data from “source to insight” in minutes
 Centralize data, next-generation analytic tools and reporting on the Hadoop Data Lake
 Extend the Data Lake and advanced analytic stack on the Cloud to enable speed to market
 Operationalize business use cases leveraging advanced analytic capabilities
 Provide real-time customer insight and rapid deployment of new strategies into the decision engines

Unified data ingestion platform built on NiFi

Unified data ingestion platform
 Ingest data from source systems
 Push to the Enterprise Data Lake
 Governed process leveraging common, reusable templates

What is NiFi?
 Enables automated data flow management
 Acquires data from producers
 Delivers to consumers while orchestrating the flow

Why we chose NiFi to build our data ingestion platform
 Scalable and customizable
 Provenance
 Promotes reuse
 Secure
 User interface (drag & drop)

The next-gen platform built on NiFi and Spark is designed to streamline our data pipeline into a near real-time paradigm

Legacy flow (~24 hours, nightly batch): Operational Database → database file extracts over SFTP → ETL Grid → Raw Data Lake (flat files; limited user access and tools) → ETL Grid → Enterprise DW (source of truth)

Next-gen flow (minutes, nightly batch to near real-time), on the Hortonworks-based Enterprise Data Lake:
 Phase 1, “True Sourcing”: NiFi ingests raw data into the Lake as the source of truth
 Phase 2, “Enriched Sourcing”: NiFi and Spark produce the enriched source of truth (Source of Truth - Enriched)

We are also extending the capability of the platform into the cloud

[Architecture diagram: batch sources feed the on-premise Data Lake (Hortonworks) in mini-batches, while an Event Bus (Kafka) carries real-time events; both feed an AWS Data Lake (Amazon S3, Hortonworks, Spark) that supports model scoring/decisioning, real-time analytics, history, and a real-time Operational Data Store.]

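As a concrete illustration of the Event Bus leg of this architecture, here is a minimal Java sketch of a Kafka consumer; the broker address, topic name, and consumer group are illustrative assumptions, not Discover's configuration.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

// Minimal sketch of the real-time leg: consume events from the Event Bus.
// Broker, topic, and group id are hypothetical.
public class EventBusConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "lake-ingest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("card-transactions"));
            while (true) {
                // Poll the Event Bus and hand each event to the ingestion flow.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    System.out.printf("%s -> %s%n", rec.key(), rec.value());
                }
            }
        }
    }
}
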
Data Flow Categorization within the Hadoop Data Lake
 System of Record (SOR)
 Source of Truth (SOT)
 Source of Truth – Enriched (SOT-E)

Detail flow and foundational components

Flow stages: SOURCES → RAW → SOR → SOT → SOT-E

Capabilities along the flow:
 Source files land in a landing area with a file catalog
 Conversion to a standard format
 Schema evolution: apply schema changes
 Raw data made consumable
 Technical and business metadata
 DQ checks
 Data enrichment (business transformation)
 Ability to export data out of the Lake

Foundational components: continuous integration, monitoring, data lineage, data governance, exception handling, security, data reconciliation

Ingesting complex data: how complex?

The format of files will vary; some are easy to consume, others hard.

Example: records with dynamic arrays/vectors of primitives or strings
Schema: First Name, Last Name, Array_size of Sibling_Name[], Sibling_Name[0-N], City
Data:
John, Doe, 2, Susie, Chris, Chicago
Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta
Frank, Smith, 1, Ralph, Toronto

Example: records with an array of Struct data types
Schema: First Name, Array_size of CompanyStruct[], CompanyStruct.Name, CompanyStruct.City, CompanyStruct.YearsWorked, Age
Data:
John, 1, Discover, Chicago, 3, 44
Mary, 3, Sales Unlimited, Dallas, 2, Auditors R’ Us, Atlanta, 5, Discover, Chicago, 4, 35

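To show concretely why such records break fixed-column tooling, here is a short Java sketch (our own illustration, not Discover's processor) for the sibling-array schema above: every field after the array shifts position by the array length, so the parser must read the size field before it knows where the city lives.

import java.util.Arrays;
import java.util.List;

// Parses records shaped like: First_Name, Last_Name, Array_size, Sibling_Name[0..N-1], City
public class DynamicArrayRecordParser {

    record Person(String firstName, String lastName, List<String> siblings, String city) {}

    static Person parse(String line) {
        String[] f = line.split(",\\s*");
        int n = Integer.parseInt(f[2]);            // dynamic array length
        List<String> siblings = List.of(Arrays.copyOfRange(f, 3, 3 + n));
        String city = f[3 + n];                    // trailing fields shift by n
        return new Person(f[0], f[1], siblings, city);
    }

    public static void main(String[] args) {
        System.out.println(parse("John, Doe, 2, Susie, Chris, Chicago"));
        System.out.println(parse("Mary, Johnston, 3, Ashley, Tom, Mike, Atlanta"));
        System.out.println(parse("Frank, Smith, 1, Ralph, Toronto"));
    }
}
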
Our solution: a custom NiFi processor to handle complex data types

[Flow diagram: within a NiFi process group, a Spark converter applies the Discover schema (schema.json) to the raw source file (Data File.001), producing an Avro schema (Data File.avsc) and Avro data (Data File 001.avro) that flow through the ingestion pipeline from System of Record to Source of Truth.]

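As a rough illustration of what such a processor involves (the actual Discover processor is in-house and not shown here; the class and method names below are hypothetical), a skeleton built on NiFi's AbstractProcessor API could look like this:

import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

// Skeleton of a custom processor in the shape NiFi's API expects; the parsing
// itself (toAvro) is a placeholder for the schema-driven conversion.
public class ConvertComplexToAvro extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Records converted to Avro").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("Records that could not be parsed").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS, REL_FAILURE);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        try {
            // Rewrite the FlowFile content: parse the complex source record
            // (dynamic arrays, structs) and re-serialize it as Avro.
            FlowFile converted = session.write(flowFile,
                    (in, out) -> out.write(toAvro(in.readAllBytes())));
            session.transfer(converted, REL_SUCCESS);
        } catch (Exception e) {
            getLogger().error("Conversion failed", e);
            session.transfer(flowFile, REL_FAILURE);
        }
    }

    private byte[] toAvro(byte[] raw) {
        return raw; // placeholder for schema-driven parsing + Avro encoding
    }
}

Packaged as a NAR, a processor like this shows up in the drag-and-drop UI alongside the built-in processors, which is what makes it reusable across flows.
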
Continuous improvement of real-time data ingestion using NiFi

NiFi Ingestion Flow Version I
 Source: flat file; destination: Hadoop
 Latency: ~24 hours

NiFi Ingestion Flow Version II
 Source: Event Bus; destination: Hadoop
 Complex logic, limited scale

NiFi Ingestion Flow Version III
 Source: Event Bus; destination: Hadoop
 Custom NiFi processor developed in-house; reusable and scalable
 Latency: seconds

ETL on Hadoop progression

Data enrichment from SOR to SOT (~600 jobs), run time by version:
 Version I: traditional ETL tool (~18 hours)
 Version II: ETL on HiveQL (~8 hours)
 Version III: ETL on Spark, hand-coded (~1 hour)
 Coming soon: automated (flow-based) ETL on Spark

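For flavor, a hand-coded Spark enrichment job of the Version III kind might look like the sketch below; the paths, column names, and transformations are invented for illustration and are not Discover's actual jobs.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_date;

// Sketch of a hand-coded SOR-to-SOT enrichment job (Version III style).
public class SorToSotEnrichment {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sor-to-sot-enrichment")
                .getOrCreate();

        // Read standardized Avro from the SOR zone (needs the spark-avro package).
        Dataset<Row> sor = spark.read().format("avro").load("/lake/sor/accounts");

        // Typical enrichment: type conversion, derived columns, basic DQ filtering.
        Dataset<Row> sot = sor
                .withColumn("open_date", to_date(col("open_date_str"), "yyyy-MM-dd"))
                .withColumn("balance_usd", col("balance_cents").divide(100))
                .filter(col("account_id").isNotNull());

        sot.write().mode("overwrite").format("orc").save("/lake/sot/accounts");
        spark.stop();
    }
}
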
Upcoming enhancements to our data pipeline
 Integrating data quality and catalog into the NiFi flow
 Custom processors to parse complex data structures
 Enterprise-scale ETL on Hadoop using Spark
 Self-service data pipelines
 Integrating batch and real-time data pipelines

Hiring Data Engineers
Q & A


Editor's Notes

  • #5 Discover has a tradition of operating on mature data-analytic platforms such as TD and SAS; those platforms are proprietary and expensive. Since the beginning of this decade, key trends have influenced the future of the industry: Big Data, open-source tools, real-time analytics, and cloud. On the business side: reinvent our key decisioning platforms, such as Fraud, Credit decisioning, and Collections, for faster and richer data, better-quality insights, and faster development and deployment. The technology foundation consists of Hadoop and a new data pipeline; collectively these should improve our speed to market from days/hours to minutes.
  • #11 Multiple record formats within a single file. Records will contain complex data structures (sub-records, dynamic arrays/vectors). Formats include fixed-width, single- and multi-delimited, and mainframe.
  • #12 Systematically convert source files to a standard format with schema information attached. Apply our own “Discover Schema” (stored in JSON) to the raw source file (or use a copybook for mainframe files). Feed the source data and the “Discover Schema” into a Spark application; the schema is needed so our converter knows how to parse the incoming data file. The output is an Avro data file along with its corresponding .avsc schema, which are then passed on to the ingestion pipeline for Hive loading and further processing.
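A minimal sketch of that conversion step (the schema and field names are invented; the real “Discover Schema” translation is proprietary): populate an Avro GenericRecord from a parsed source record, then write the .avro data file alongside its .avsc schema.

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch of the conversion step: populate an Avro GenericRecord from a parsed
// source record and emit the .avro data file plus its .avsc schema.
public class AvroConversionSketch {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
                + "{\"name\":\"first_name\",\"type\":\"string\"},"
                + "{\"name\":\"city\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("first_name", "John");
        rec.put("city", "Chicago");

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("DataFile001.avro"));
            writer.append(rec);
        }
        // The companion .avsc is simply the schema serialized as JSON.
        Files.writeString(Paths.get("DataFile.avsc"), schema.toString(true));
    }
}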