Lessons Learned from Modernizing USCIS Data Analytics Platform
A Journey to Modernization
Shawn Benjamin & Prabha Rajendran
Problems Faced
• The Informatica ETL pipeline was brittle
• Lengthy Informatica ETL development cycle
• Lengthy load-time workflows for data ingestion
• No ability to deliver real-time / near-real-time data
• No data science platform
Legacy Architecture
[Diagram: source systems feed a data warehouse; data scientists, statisticians, and business analysts work from extracted data, applying R/Python code to produce business insight.]
[Diagram: the legacy SMART/eCISCOR environment, showing source systems (C3 CON, eCIMS, C3 LANs, C4, NFTS, ELIS, NASS, CPMS, Pay.gov, AR-11, RAPS, MFAS, VSS, ATS, PCTS, SNAP, RNACS, and others) feeding active and decommissioned ODSes, eCISCOR, data marts (Benefits, Scheduler, Payment, Validation), SMART subject areas, SAS libraries, and direct connects, consumed by users through BI tools. A January 2016 snapshot tallies data sources, SMART subject areas, data marts, direct connects, users, and 28 ETL processes.]
Successful Proof of Concept
• Implemented a Databricks private-cloud VPC (26 nodes) in AWS
• Connected the Databricks cluster to the Oracle database
• Created all relevant tables in Hive metadata pointing to the Oracle database
• Copied the relevant tables from Oracle to S3 using Scala code (sketched below); data is stored in the Apache Parquet columnar format. For context, a 120-million-row, 83-column table can be dumped to S3 in about 10 minutes.
• Identified an appropriate partition scheme; large tables were partitioned to optimize Spark query performance
• Created multiple notebooks to perform data analysis and visualize the results, e.g., a histogram of case life-cycle duration
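The Oracle-to-S3 copy step boils down to a JDBC read followed by a Parquet write. A minimal Scala sketch of that pattern, assuming a Databricks notebook (where `spark` and `dbutils` are predefined, and the Oracle JDBC driver is installed on the cluster) and hypothetical connection details, table name, and partition columns:

```scala
// Read one table from Oracle over JDBC and write it to S3 as Parquet.
// The JDBC URL, secret scope, table name, and S3 path are hypothetical placeholders.
val oracleDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/SERVICE")  // placeholder host/service
  .option("dbtable", "SMART.CASE_HISTORY")                        // hypothetical table name
  .option("user", dbutils.secrets.get("oracle-scope", "user"))    // credentials from a secret scope
  .option("password", dbutils.secrets.get("oracle-scope", "password"))
  .option("fetchsize", "10000")          // larger fetch size speeds up the JDBC read
  .option("numPartitions", "16")         // parallel JDBC reads
  .option("partitionColumn", "CASE_ID")  // numeric column used to split the read
  .option("lowerBound", "1")
  .option("upperBound", "120000000")
  .load()

// Write to S3 in Parquet, partitioned to optimize downstream Spark queries.
oracleDf.write
  .mode("overwrite")
  .partitionBy("FISCAL_YEAR")            // hypothetical partition column
  .parquet("s3://example-bucket/smart/case_history/")
```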
Current Databricks Implementation
[Diagram: an S3 data lake and Databricks Lakehouse serve data scientists, statisticians, and business analysts, who produce business insight directly on the platform.]
• 75 Data Sources
• 35 Application Interfaces
• 7 Data Marts
• 4 BI Tools
• 6,086 Tableau Dashboards
• 118 SMART Subject Areas
• 56 SAS Libraries
• 6,233 Users
Databricks Accomplishments
• Implementation of Delta Lake
• Easy integration with OBIEE, SAS, and Tableau using native connectors
• Integration with GitHub for continuous integration & deployment
• Automated account provisioning
• Machine learning (MLflow) integration
Change Data Capture using Delta Lake
Databricks Delta – Success Factors
• Faster ingestion of CDC changes (see the sketch after this list)
• Resiliency
• Improved data quality, reporting availability, and runtime performance
• Schema evolution: additional columns can be added without rewriting the entire table
Databricks Delta – Lessons Learned
• Storage requirements increased
• VACUUM and OPTIMIZE are mandatory to maintain performance
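The CDC flow described above can be expressed as a Delta Lake MERGE followed by periodic table maintenance. A minimal Scala sketch, assuming hypothetical table names, key column, and change-record layout (an `op` flag marking inserts, updates, and deletes):

```scala
// Apply a batch of CDC changes to a Delta table with MERGE, then maintain the table.
import io.delta.tables.DeltaTable

// Latest batch of CDC records landed in S3 by the ingestion job (placeholder path).
val changes = spark.read.parquet("s3://example-bucket/cdc/case_history/latest/")

val target = DeltaTable.forName(spark, "smart.case_history")   // hypothetical Delta table

target.as("t")
  .merge(changes.as("c"), "t.case_id = c.case_id")
  .whenMatched("c.op = 'D'").delete()          // deletes from the source system
  .whenMatched("c.op = 'U'").updateAll()       // updates overwrite the existing row
  .whenNotMatched("c.op = 'I'").insertAll()    // new rows are inserted
  .execute()

// Periodic maintenance: compact small files and clean up old versions.
spark.sql("OPTIMIZE smart.case_history")
spark.sql("VACUUM smart.case_history RETAIN 168 HOURS")
```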
Unified Data Analytical Platform – Tableau
Unified Data Analytical Platform – OBIEE/SAS
Data Science Experiments using ML
[Figures: prediction model samples and ML graphs produced after running the models.]
Text and Log Mining
[Figure: sentiment analysis results, scored from 0 (negative sentiment) to 1 (positive sentiment).]
Time Series Models and H2O Integration
Integrated H2O with Databricks and built a model that predicts the count of N-400 "no shows" using traditional time-series forecasting, in order to identify inefficiencies in normal day-to-day planning and operations.
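As a rough illustration of the kind of model described, the sketch below builds lag features over the daily counts and trains an H2O GBM through Sparkling Water's Spark ML-style estimators. The table, column names, and lag choices are hypothetical, and this is not necessarily the exact modeling approach the team used:

```scala
// Forecast daily N-400 "no show" counts from lagged features with an H2O GBM
// via Sparkling Water, assuming a Databricks notebook with Sparkling Water installed.
import ai.h2o.sparkling.H2OContext
import ai.h2o.sparkling.ml.algos.H2OGBM
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val hc = H2OContext.getOrCreate()   // start the H2O cluster alongside Spark

// Daily no-show counts with simple lag features as predictors (hypothetical table/columns).
val w = Window.orderBy("appointment_date")
val history = spark.table("smart.n400_no_shows")
  .withColumn("lag_1", lag("no_show_count", 1).over(w))
  .withColumn("lag_7", lag("no_show_count", 7).over(w))
  .na.drop()

val gbm = new H2OGBM()
  .setLabelCol("no_show_count")
  .setFeaturesCols(Array("lag_1", "lag_7"))

val model = gbm.fit(history)             // train on the historical counts
val forecast = model.transform(history)  // predictions appended as a new column
```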
Enabling Security & Governance
Access Control (ACLs), Credentials Passthrough, and Secrets Management:
• Control user access to data using the Databricks view-based access control model (table- and schema-level ACLs)
• Control user access to clusters that are not enabled for table access control
• Enforced data-object privileges at the onboarding phase
• Used the Databricks secrets manager to store credentials and reference them in notebooks (sketched below)
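A minimal Scala sketch of two of the practices above, assuming a Databricks cluster with table access control enabled and hypothetical secret scope, group, and table names:

```scala
// 1. Reference credentials from the Databricks secrets manager instead of hard-coding them.
val jdbcUser     = dbutils.secrets.get(scope = "oracle-scope", key = "user")
val jdbcPassword = dbutils.secrets.get(scope = "oracle-scope", key = "password")

// 2. Enforce table- and view-level object privileges with SQL GRANT/DENY statements
//    (the groups, table, and view names are placeholders).
spark.sql("GRANT SELECT ON TABLE smart.case_history TO `data-scientists`")
spark.sql("GRANT SELECT ON VIEW smart.case_history_redacted TO `statisticians`")
spark.sql("DENY SELECT ON TABLE smart.case_history TO `statisticians`")
```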
Databricks Management API Usage
Integrated the Databricks Management API with Jenkins and other scripting tools to automate all of our administration and management tasks:
• Cluster/jobs management → create, delete, and manage clusters, and get the execution status of daily scheduled jobs, which enabled automated monitoring
• Library/secret management → easily upload third-party libraries and manage encrypted scopes/credentials used to connect to source and target endpoints
• Integration and deployments → API integration with Git and Jenkins for continuous integration and continuous deployment
• Enabled the MLflow Tracking API for our data science experiments
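A minimal Scala sketch of the monitoring piece, polling the Databricks Jobs REST API (2.1 `runs/list`) so a Jenkins stage can alert on failed scheduled runs. The workspace URL and token variable are hypothetical placeholders:

```scala
// Poll the Databricks Jobs API for recent run status from a scripted Jenkins stage.
import java.net.{HttpURLConnection, URL}
import scala.io.Source

val workspaceUrl = "https://example-workspace.cloud.databricks.com"   // placeholder
val token = sys.env("DATABRICKS_TOKEN")   // personal access token injected by Jenkins

// List the most recent job runs (Jobs API 2.1).
val conn = new URL(s"$workspaceUrl/api/2.1/jobs/runs/list?limit=25")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestProperty("Authorization", s"Bearer $token")

val body = Source.fromInputStream(conn.getInputStream).mkString
conn.disconnect()

// A real monitor would parse the JSON and page through results; here we simply
// flag the response if any run reports a FAILED result_state.
if (body.contains("FAILED"))
  println("At least one scheduled job run failed; see the Jobs UI for details.")
else
  println("No failed runs in the most recent batch.")
```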
Lessons Learned Through This Journey
• Training plan
• Cloud-based experience
• Subject matter expertise
• Automation
Success Strategy
Success criteria and the benefits delivered:
Performance
• Auto-scalability leveraging on-demand and spot instances
• Efficient processing of large datasets, comparable to RDBMS systems
• Scalable read/write performance on S3
Support for a variety of statistical programming languages
• Data science platform (R, Python, Scala, and SQL)
• Supports MLlib for machine learning and deep learning
Integration with existing tools
• Allows connections to industry-standard technologies via ODBC/JDBC connections and built-in connectors
Easily integrate new data sources
• Supports seamless integration with data-streaming technologies such as Kafka/Kinesis using Spark Streaming, for both structured and unstructured data (see the sketch after this list)
• Leverages S3 extensively
Secure
• Supports integration with multiple single sign-on platforms
• Supports native encryption/decryption features (AES-256 and KMS)
• Supports an access control layer (ACL)
• Implemented in the USCIS private cloud
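A minimal Scala sketch of the streaming integration noted under "Easily integrate new data sources", assuming Spark 3.1+ Structured Streaming on Databricks with hypothetical brokers, topic, schema, and paths:

```scala
// Read JSON events from a Kafka topic and append them to a Delta table.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val eventSchema = new StructType()
  .add("case_id", StringType)
  .add("status", StringType)
  .add("event_time", TimestampType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  // placeholder brokers
  .option("subscribe", "case-events")                              // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")

events.writeStream
  .format("delta")
  .option("checkpointLocation", "s3://example-bucket/checkpoints/case-events/")
  .toTable("smart.case_events_stream")   // append into a Delta table
```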
Questions!