Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)
June 20, 2018
Presenters:
Kevin Martelli (KPMG), Managing Director - Data and Analytics
Balaji Wooputer (Freddie Mac), Director - Risk Analytics
© Freddie Mac 2
Freddie Mac
• Freddie Mac makes home possible for millions of families and individuals by providing mortgage capital to lenders.
• Since our creation in 1970, we've made housing more accessible and affordable for homebuyers and renters in communities nationwide.
• We are building a better housing finance system for homebuyers, renters, lenders, and taxpayers.
Recap: Challenge & Objective
Challenge: Quickly integrating multiple variations of vendors' semi-structured and structured loan-level data in order to make quicker and better business decisions to Re-Imagine the Mortgage Experience.
Objective: Design, develop and implement a self-learning (AI), highly flexible common data engineering framework to automate the design of the entire data munging process.
• Reusability
• Extensibility
• Reduce Development Cost
• Speed to Market
Success = Automating the design for data munging and integration in a scalable way, while reducing the time to implement a data application by 50% to 70%, thereby allowing data scientists and business analysts to easily access data.
Link (2017): https://www.youtube.com/watch?v=ct6gydYAQr4
Common Data Engineering Framework (CDF)
Data Enablement & Profiling Framework
Layers: Data Sources, Data Integration, Integration and Execution, Information Access, User Engagement
Data Preparation
ERD
Execution
Information Access: Analytics Tools, Actian Matrix, Hadoop Tools, Web Services
User Engagement: Guided Reporting and Portfolio Analytics, Customized Dashboard, Consumption
Governance:
- Business glossary
- Data security and privacy rules
- Data flows lineage, transformation rules
- Data models and standards
- Data quality and data profiling standards
Security, Platform Mgmt, Business Continuity:
- Authentication & identity mgt. (Kerberos)
- Data masking, encryption
- Availability, Backup, Recovery
- Authorization and audit (data level)
- Patch, upgrade, operations
- Job scheduling
Enterprise Loan Application Data Sources
Metadata
XSD
Business Requirement
Other Data Sources
Third-Party Data
Other Business Factors
Shared Operational Data
Operational Data Store
Data Mart
EDW
Job Automation
Vendor Provided
ODI
Loan Application XML
JSON
Raw Data Repository
Data Exploration & Landing
Data Dictionary
Data Model
Profiling
Data sources: structured, semi-structured and unstructured.
- Create data plan and data dictionary (Automate)
- Understand and identify data lineage
- Define business requirements & rules
- Data transformation, profiling & maturity
- Develop data work flows and schedules
- Transform data & connect database to backend platform
- Define and manage model & program execution lifecycle
- Real-time model & program execution & scoring
- Model automated learnings
- Performance, log management for models and programs.
Execution Enablement
Execution Logs
- Design and create data visualization
- Business Intelligence and reporting
- Feedback upstream systems
- Insights realized
- Connect Backend and Frontend systems
- Tools for additional data processing & analytics
- Generate additional insights from enabled data.
AUTOMATION
Part of the Common Data Engineering Framework
CDF (2017)
1. Ingest
2. Discover
3. REQs. (Domain Knowledge)
4. Transform & Models
5. Action
Data Discovery
Transformation Rules
Analytical Data Model
Data Munging
Data Integration
Data Model
Analytics Modeling
Data Visualization
2017 HW Summit: https://www.youtube.com/watch?v=ct6gydYAQr4
CDF Data Model (2017)
Data Discovery
Business Knowledge
Domain and Data Science SMEs
Customized Enablement
Configuration
Data Model
CDF Model Development Life Cycle
Common Data Framework – Model Automation
EXTRACT INFER CLASSIFY
Data Processing
• Data parsing
• Data cleaning
Feature Engineering & Selection
• Exploratory data analysis
• Variable transformation
• Recursive feature elimination
• Dimension reduction
Model Development
• Code development
• Model training
• Hyper-parameter tuning
Model Selection
• Model taxonomy
• Create comparison tables
• Identify candidates
• Select champion/challenger
Model Validation & Testing
• Define validation and testing methods
• Error analysis
• Identify business case calculation
• Model refinement
Model Deployment
• Finalize production code
• Develop the model documentation
• Develop model monitoring plan
• Develop the action plan
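The Model Selection step (comparison table, candidates, champion/challenger) can be illustrated with a minimal sketch. It assumes candidates are ranked on a single validation metric; the slides do not say which metric or rule was used. The model names and scores are invented for illustration and are not results from the talk.

```python
def rank_candidates(results, metric="accuracy"):
    """Return candidate models sorted best-first on the chosen validation metric."""
    return sorted(results, key=lambda row: row[metric], reverse=True)


# Hypothetical comparison table: made-up scores, for illustration only.
comparison_table = [
    {"model": "candidate_a", "accuracy": 0.61},
    {"model": "candidate_b", "accuracy": 0.66},
    {"model": "candidate_c", "accuracy": 0.58},
]

ranked = rank_candidates(comparison_table)
champion, challenger = ranked[0], ranked[1]
print("champion:", champion["model"], "| challenger:", challenger["model"])
```

Keeping the runner-up as challenger lets it be re-evaluated against the champion when new data arrives.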
Model Framework Architecture
Data Platform
Model Framework Model Development
Model Warehouse
Batch Process Streaming
Insight Service as API
Model Management Considerations:
• Reusability and Model Build
• Model Versions
• Evaluation Framework (A/B testing)
• Feature Extraction
• Internal/External API
• Message Broker
CDF Next Gen – Model Robot
Idea
Strategy
Customer
• Automatically discovers the semantic layer from data discovery and generates the data model through a machine-learning-driven approach, thereby significantly reducing the need for manual data modeling.
• Enhances old-fashioned, rigid, pre-modeled and pre-configured data models into flexible, adaptive data entities.
• Accelerates data integration, resulting in faster data insights for users.
Model Robot is “Fast, Flexible Autogenerated Data Model”
Data Modeling and Analytics Package
• Spark MLlib (Naïve Bayes + Random Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas
NLP Package
• Gensim
• NLTK
Model Prediction Output (Data Model)
HDP-HDFS
User Input Model Configuration
Vendor Input JSON
Vendor Input XML
Existing Data Model
Training/Prediction
Target Label For Training
Data Profiling and Discovery
Feature Engineer
Model Training
Model Ensemble
Model Prediction
Metric
Persist
Model Classification Engine
Model Training
1. Based on Learning Source Descriptions for Data Integration
2. Leverage the model framework
3. Use vendor data for training
4. 3 core features: Numerical, Text, Relationship
Doan AH, Domingos P, Levy A (2000) Learning source descriptions for data integration. In: Proc WebDB Workshop, pp. 81–92
Model:
Training data: 25k JSON/XML files, 300+ attributes
Testing data: 1k JSON/XML files, 300+ attributes
Training vs. validation data split: 80% vs. 20%
Prediction accuracy: 69%
Mapped 22 target labels
E.g. 6,000,000 numerical values and 9,000 text values.
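The pipeline lists a Model Ensemble step but does not say how the individual models are combined. One common option is soft voting: a weighted average of each model's per-label scores. The sketch below assumes that approach. The `ensemble` function, the equal weights and the scores are all hypothetical.

```python
from collections import defaultdict


def ensemble(predictions, weights=None):
    """Combine per-model label scores by weighted averaging (soft voting).

    predictions: {model_name: {label: score}}
    weights:     {model_name: weight}; defaults to equal weights.
    """
    weights = weights or {name: 1.0 for name in predictions}
    total = sum(weights[name] for name in predictions)
    combined = defaultdict(float)
    for name, scores in predictions.items():
        for label, score in scores.items():
            combined[label] += weights[name] * score / total
    best = max(combined, key=combined.get)
    return best, dict(combined)


# Invented scores for one source attribute, for illustration only.
scores = {
    "random_forest": {"AssetAccountBalance": 0.7, "AccountType": 0.3},
    "naive_bayes": {"AssetAccountBalance": 0.4, "AccountType": 0.6},
    "cosine_similarity": {"AssetAccountBalance": 0.8, "AccountType": 0.2},
}

label, combined = ensemble(scores)
print(label, combined)
```

Passing `weights` would let a model that is more reliable for a given attribute type count for more.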
CDF Next Gen (2018) – Model Robot
Data Discovery
SME Confirmation and Update
Predicted Data Model
Continuous Learning
Feature Engineer
Training
Prediction
Spark ML
Business Knowledge
CDF Model Robot Design (High-level)
Data Discovery
Feature Extraction
Numeric-based features
• Min, max, mean, median...
Text-based features
• TF-IDF
• POS tagging
• NER tagging
Relationship-based features
• Depth
• Number of neighbors
• XPath
Modeling
• Random Forest for numeric features
• Naïve Bayes for text features
• Cosine similarity for relationships
Model Ensemble
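The numeric-based features can be sketched with Python's standard library. This is a minimal illustration, not the framework's code. The `numeric_features` helper is hypothetical, and the choice of population standard deviation and quartiles is an assumption. The sample values are the AvailableBalance values shown on the prediction slide.

```python
import statistics


def numeric_features(values):
    """Summarize one numeric attribute as a small feature dictionary."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4)  # quartile cut points
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.pstdev(ordered),  # population std; an assumption
        "p25": q1,
        "p75": q3,
    }


features = numeric_features([5400, 6000, 3000, 1500])
print(features)
```

Each attribute then becomes one fixed-length row, which is the shape a Random Forest classifier expects.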
CDF Autogenerated Data Model Prediction
Data Discovery
Feature Extraction
SME Review and Update
New data for the same business area
/path/to/a AvailableBalance [5400, 6000, 3000, 1500, …]
/path/to/b AssetType [MNMT, …]
Numeric (Random Forest):
• Min, Max
• Mean, STD
• Percentiles
• …
Free-form text (Naïve Bayes):
• Tokenize
• TF-IDF
• …
Path+Name (Cosine similarity):
• Tokenize
• TF-IDF
• …
Prediction
Source             Predicted
AvailableBalance   AssetAccountBalance
AssetType          AccountType
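The Path+Name branch (tokenize, TF-IDF, cosine similarity) can be sketched as a toy example. It assumes names are split on camelCase boundaries and compared against only the two target labels in the table, whereas the real model maps to 22 labels. The helper names and the smoothed IDF formula are assumptions, not the presenters' implementation.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Split a path or camelCase attribute name into lowercase tokens."""
    return [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)]


def tfidf(tokens, doc_freq, n_docs):
    """Term frequency times a smoothed inverse document frequency."""
    return {
        tok: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(tok, 0))) + 1)
        for tok, tf in Counter(tokens).items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def best_match(source, targets):
    """Return the target label whose name is most similar to the source name."""
    docs = [tokenize(t) for t in targets]
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    query = tfidf(tokenize(source), doc_freq, len(docs))
    scored = {t: cosine(query, tfidf(d, doc_freq, len(docs))) for t, d in zip(targets, docs)}
    return max(scored, key=scored.get)


TARGETS = ["AssetAccountBalance", "AccountType"]
print(best_match("AvailableBalance", TARGETS))  # AssetAccountBalance
print(best_match("AssetType", TARGETS))         # AccountType
```

In this toy setting the name similarity alone reproduces both rows of the table. In practice it would be one input to the ensemble alongside the numeric and text models.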
CDF - Outcomes
Cost
Reusability
Extensible
Collaboration
Trust
Rules
Agility
Automation
Analytics
Data
Reconcile
Data
Data
Business Rules
New
Customer
Insights
CDF Business Value Delivered
2016
• Developed a generic data engineering framework.
• Reduced the “Data Munging” time by 50%.
• Reports available 3 to 4 months after going live.
2017
• Reduced data engineering timeline by an additional 25%.
• Report generated the next business day.
2018
• Develop a self-learning AI data model.
• Reduce data engineering and data model effort timeline (target: 25%).
Thank You
2018 Hortonworks Summit
Balaji Wooputur & Kevin Martelli
Jun/20/2018
CDF Conceptual Architecture
ElementTree
Sqoop/JDBC
Data Processing (PySpark)
Oozie/Ranger/Ambari
ElementTree
Partitioned ORC Tables
Data Discovery Output
Analytics Modeling
Sqoop/ODI
Data Profiling and Discovery
Feature Engineer
Model Training
Model Ensemble
Timeline:
• Automated data ingestion
• Leverage significant domain knowledge to enable (Manual)
• Use learnings to decrease domain knowledge and accelerate Analytical Data Model (ADM)
RAW
Accelerate Time To Value
Data Profiling
Flat File
RAW
DataFrame (SparkSQL): Partition 1 … Partition N
Data Modeling and Analytics Package
• Spark MLlib (Naïve Bayes + Random Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas
NLP Package
• Gensim
• NLTK
Resilient Distributed Dataset: Partition 1 … Partition N
Data model
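The architecture names ElementTree for parsing and a Data Discovery Output. The sketch below shows what that discovery step might produce, assuming the output maps each leaf element path to its observed values. The sample XML, the element names and the `discover` function are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical vendor payload; element names are illustrative only.
SAMPLE_XML = """
<Loan>
  <Assets>
    <Asset><AssetType>MNMT</AssetType><AvailableBalance>5400</AvailableBalance></Asset>
    <Asset><AssetType>MNMT</AssetType><AvailableBalance>6000</AvailableBalance></Asset>
  </Assets>
</Loan>
"""


def discover(xml_text):
    """Map each leaf element path to the list of text values found under it."""
    root = ET.fromstring(xml_text.strip())
    values = defaultdict(list)

    def walk(node, path):
        children = list(node)
        if not children:  # leaf element: record its text value
            values[path].append((node.text or "").strip())
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    return dict(values)


discovered = discover(SAMPLE_XML)
for path, vals in discovered.items():
    depth = path.count("/")  # relationship-style feature: depth in the tree
    print(path, depth, vals)
```

The path strings and per-path value lists are the raw material for the numeric, text and relationship features described earlier in the deck.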
CDF Autogenerated Data Model Training
• Training data:
» 27,467 JSON/XML files
– 338 attributes
– 6,000,000 numerical values
– 9,000 text values
» 22 target labels
» Training vs. validation data split: 80% vs. 20%
• Testing data:
» 1,150 JSON/XML files
• Overall prediction accuracy:
» 69%
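The 80%/20% training/validation split and the accuracy figure can be made concrete with a short sketch. It assumes a simple shuffled split at file level. The slides do not say how the split was drawn, and the file names are placeholders.

```python
import random


def split_files(files, train_fraction=0.8, seed=0):
    """Shuffle the files reproducibly and split them into training and validation sets."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of attributes whose predicted label equals the confirmed label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


files = [f"file_{i}" for i in range(27467)]  # placeholder names for the 27,467 files
train, validation = split_files(files)
print(len(train), len(validation))
```

With 27,467 files this yields 21,973 training and 5,494 validation files. The 69% figure would then be `accuracy` computed on the separate 1,150-file test set.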
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)

  • 1. Advanced Machine Learning Data Integration with Common Data Framework (Model Robot) June 20, 2018 Presenters: Kevin Martelli (KPMG) Managing Director - Data and Analytics Balaji Wooputer (Freddie Mac) Director – Risk Analytics
  • 2. © Freddie Mac 2  Freddie Mac makes home possible for millions of families and individuals by providing mortgage capital to lenders.  Since our creation in 1970, we've made housing more accessible and affordable for homebuyers and renters in communities nationwide.  We are building a better housing finance system for homebuyers, renters, lenders, and taxpayers. Freddie Mac
  • 3. © Freddie Mac 3 Recap: Challenge & Objective. Challenge: Quickly integrating multiple variations of vendors’ semi-structured and structured loan-level data in order to make quicker and better business decisions to Re-Imagine the Mortgage Experience. Objective: Design, develop and implement a self-learning (AI), highly flexible common data engineering framework to automate the design of the entire data munging process. Core principles: Reusability, Extensibility, Reduce Development Cost, Speed to Market. Success = Automating the design for data munging and integration in a scalable way, while reducing the time to implement a data application by 50% to 70%, thereby allowing data scientists and business analysts to easily access data. 2017 link: https://www.youtube.com/watch?v=ct6gydYAQr4
  • 4. © Freddie Mac 4  Data Enablement & Profiling Framework Common Data Engineering Framework (CDF) Data Enablement & Profiling Framework Data Sources Data Integration Integration and Execution Information Access User Engagement Data Preparation ERD Execution Information Access Analytics Tools Actian Matrix Hadoop Tools Web Services User Engagement Guided Reporting and Portfolio Analytics Customized Dashboard Consumption Governance Business glossary Data security and privacy rules Data flows lineage, transformation rules Data models and standards Data quality and data profiling standards Authentication & identity mgt. (Kerberos) Data, masking, encryption Availability, Backup, Recovery Authorization and audit (data level) Patch, upgrade, operations Job scheduling Security, Platform Mgmt Business Continuity Enterprise Loan Application Data Sources Metadata XSD Business Requirement Other Data Sources Third-Party Data Other Business Factors Shared Operational Data Operational Data Store Data Mart EDW Job Automation Vendor Provided ODI Loan Application XML JSON Raw Data Repository Data Exploration & Landing Data Dictionary Data Model Profiling Data sources structured, semi- structured and unstructured. - Create data plan and data dictionary (Automate) - Understand and identify data lineage - Define business requirements & rules - Data transformation, profiling & maturity - Develop data work flows and schedules - Transform data & connect database to backend platform - Define and manage model & program execution lifecycle - Real-time model & program execution & scoring - Model automated learning's - Performance, log management for models and programs. Execution Enablement Execution Logs - Design and create data visualization - Business Intelligence and reporting Feedback upstream systems - Insights realized - Connect Backend and Frontend systems - Tools for additional data processing & analytics - Generate additional insights from enabled data. 
    AUTOMATION: Part of the Common Data Engineering Framework
  • 5. © Freddie Mac 5 CDF (2017): Data Discovery, Transformation Rules, Analytical Data Model. 1. Ingest 2. Discover 3. REQs. 4. Transform & Models 5. Action. Domain Knowledge, Data Munging, Data Integration, Data Model, Analytics Modeling, Data Visualization. 2017 HW Summit: https://www.youtube.com/watch?v=ct6gydYAQr4
  • 6. © Freddie Mac 6 Data Discovery Business Knowledge Domain and Data Science SME’s Customized Enablement Configuration Data Model CDF Data Model (2017)
  • 7. © Freddie Mac 7 CDF Model Development Life Cycle Data Processing Feature Engineering & Selection Model Development Model Selection Model Validation & Testing Model Deployment • Finalize production code • Develop the model documentation • Develop model monitoring plan • Develop the action plan • Define validation and testing methods • Error analysis • Identify business case calculation • Model refinement • Code development • Model training • Hyper-parameter tuning • Model taxonomy • Create comparison tables • Identify candidates • Select champion/ challenger • Exploratory data analysis • Variable transformation • Recursive feature elimination • Dimension reduction • Data parsing • Data cleaning EXTRACT INFER CLASSIFY Common Data Framework – Model Automation
  • 8. © Freddie Mac 8 Model Framework Architecture Data Platform Model Framework Model Development Model Warehouse Batch Process Streaming Insight Service as API Model Management Considerations: • Reusability and Model Build • Model Versions • Evaluation Framework (A/B) testing • Feature Extraction • Internal/External API • Message Broker
  • 9. © Freddie Mac 9 CDF Next Gen – Model Robot Idea Strategy Customer  Automatically discovers semantic layer from data discovery and generates data model through a machine-learning driven approach thereby significantly reducing need for manual data modeling.  Enhance old fashioned rigid, pre-modeled and pre-configured data models into flexible, adaptive data entities.  Accelerates data integration resulting in faster data insights to users Model Robot is “Fast, Flexible Autogenerated Data Model”
  • 10. © Freddie Mac 10 Data Modeling and Analytics Package: Spark MLlib (Naïve Bayes + Random Forest), Scikit-learn, NumPy, SciPy, Pandas. NLP Package: Gensim, NLTK. Model Prediction Output (Data Model) HDP-HDFS User Input Model Configuration Vendor Input JSON Vendor Input XML Existing Data Model Training/Prediction Target Label For Training Data Profiling and Discovery Feature Engineer Model Training Model Ensemble Model Prediction Metric Persist Model Classification Engine Model Training 1. Based on Learning Source Descriptions for Data Integration 2. Leverage the model framework 3. Use vendor data for training 4. 3 core features: Numerical, Text, Relationship. Doan AH, Domingos P, Levy A (2000) Learning source descriptions for data integration. In: Proc WebDB Workshop, pp. 81–92. Model: Training data: 25k JSON/XML files, 300+ attributes. Testing data: 1k JSON/XML files, 300+ attributes. Training vs test data split: 80% vs 20%. Prediction accuracy: 69%. Mapped 22 target labels. E.g. 6,000,000 numerical values and 9,000 text values.
  • 11. © Freddie Mac 11 CDF Next Gen (2018) – Model Robot Data Discovery SME Confirmation and Update Predicted Data Model Continuous Learning Feature Engineer Training Prediction Spark ML Business Knowledge
  • 12. © Freddie Mac 12 CDF Model Robot Design (High-level). Data Discovery. Feature Extraction: numeric-based features (min, max, mean, median...); text-based features (TF-IDF, POS tagging, NER tagging); relationship-based features (depth, number of neighbors, XPath). Modeling: Random Forest for numeric features, Naïve Bayes for text features, cosine similarity for relationships. Model Ensemble.
  • 13. © Freddie Mac 13 CDF Autogenerated Data Model Prediction. Data Discovery on new data for the same business area: /path/to/a AvailableBalance [5400, 6000, 3000, 1500, …]; /path/to/b AssetType [MNMT, …]. Feature Extraction: Numeric (min, max, mean, STD, percentiles, …) with Random Forest; free-form text (tokenize, TF-IDF, …) with Naïve Bayes; Path+Name (tokenize, TF-IDF, …) with cosine similarity. Prediction, source to predicted: AvailableBalance to AssetAccountBalance; AssetType to AccountType. SME Review and Update.
  • 14. © Freddie Mac 14 CDF - Outcomes Cost Reusability Extensible Collaboration Trust Rules Agility Automation Analytics Data Reconcile Data Data Business Rules New Customer Insights
  • 15. © Freddie Mac 15 CDF Business Value Delivered. 2016: Developed a generic data engineering framework. Reduced the “Data Munging” time by 50%. Reports available 3 to 4 months after going live. 2017: Reduced data engineering timeline by an additional 25%. Report generated the next business day. 2018: Develop a self-learning AI data model. Reduce data engineering and data model effort timeline (target 25%).
  • 16. © Freddie Mac 16 Thank You 2018 Hortonworks Summit Balaji Wooputur & Kevin Martelli Jun/20/2018
  • 17. © Freddie Mac 17 CDF Conceptual Architecture ElementTree Sqoop/JDBC Data Processing (PySpark) Oozie/Ranger/Ambari ElementTree Partitioned ORC Tables Data Discovery Output Analytics Modeling Sqoop/ODI Data Profiling and Discovery Feature Engineer Model Training Model Ensemble Timeline: • Automated data ingestion • Leverage significant domain knowledge to enable (Manual) • Use learnings to decrease domain knowledge and accelerated Analytical Data Model (ADM) RAW Accelerate Time To Value Data Profiling Flat File RAW DataFrame (SparkSQL) Partition 1 … Partition N Data Modeling and Analytics Package • Spark MLlib (Naïve Bayes + Random Forest) • Scikit-learn • NumPy, SciPy • Pandas NLP Package • Gensim • NLTK Resilient Distributed Dataset Partition 1 … Partition N Data model
  • 18. © Freddie Mac 18 CDF Autogenerated Data Model Training. Training data: 27,467 JSON/XML files; 338 attributes; 6,000,000 numerical values; 9,000 text values; 22 target labels; training vs. validation data split: 80% vs. 20%. Testing data: 1,150 JSON/XML files. Overall prediction accuracy: 69%.
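
The Path+Name branch on slides 12 and 13 (tokenize, TF-IDF, cosine similarity) can be illustrated with a small standard-library sketch. This is not the presenters' implementation, which the slides say runs on Spark MLlib with a Random Forest and Naïve Bayes ensemble; the function names, the CamelCase tokenizer, and the smoothed IDF formula below are all assumptions made for illustration, and only the name-similarity branch is shown.

```python
import math
import re
from collections import Counter


def tokenize(path_and_name):
    """Split a path or CamelCase attribute name into lowercase tokens."""
    # Insert a space at lower-to-upper boundaries, then split on non-alphanumerics.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", path_and_name)
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]


def tfidf_vectors(documents):
    """Return one sparse {token: weight} vector per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed form).
            tok: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_labels(source_names, target_labels):
    """Map each source attribute to its most similar target label."""
    docs = [tokenize(name) for name in source_names + target_labels]
    vectors = tfidf_vectors(docs)
    src_vecs = vectors[:len(source_names)]
    tgt_vecs = vectors[len(source_names):]
    predictions = {}
    for name, src in zip(source_names, src_vecs):
        scores = [cosine(src, tgt) for tgt in tgt_vecs]
        best = max(range(len(target_labels)), key=scores.__getitem__)
        predictions[name] = (target_labels[best], scores[best])
    return predictions


if __name__ == "__main__":
    # Attribute names taken from slide 13; everything else is illustrative.
    result = predict_labels(["AvailableBalance", "AssetType"],
                            ["AssetAccountBalance", "AccountType"])
    for source, (label, score) in result.items():
        print(f"{source} -> {label} (similarity {score:.2f})")
```

On the four names from slide 13, this sketch pairs AvailableBalance with AssetAccountBalance and AssetType with AccountType, because each pair shares a token. In the design the slides describe, such a score would be only one input to the ensemble, with an SME reviewing the predicted mapping.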

Editor's Notes

  1. Good afternoon, everyone. How is everyone doing? Welcome to the Freddie Mac and KPMG case study session on Advanced ML Data Integration with Common Data Framework (Model Robot). Glad to be here to see great ideas and innovation sessions. This is the 2nd year in a row for Freddie Mac and KPMG to present at HW Summit. My name is Balaji Wooputur, and I work at Freddie Mac as Risk Analytics Director. I’m heading the Risk Analytics team for the SF Risk division. Freddie Mac has partnered with KPMG for the past couple of years on the HW solution. I’m here today with Kevin Martelli from KPMG. Kevin and I will co-present today’s session. Anyone in the audience from last year’s CDF session? I’m going to cover the following in today’s session: first, who we are, a recap of the 2017 project objective, the 2017 “Patent pending” CDF, the 2018 CDF (Model Robot), and the CDF Model life cycle and Model framework, which extend and reuse components from our CDF. Let me start with who we are. Kevin is a managing director heading KPMG’s Big Data Software Engineering Team in the KPMG COE for D&A Light. (Kevin: greetings and welcome message.)
  2. How many of you have had a mortgage experience? Freddie Mac was created in 1970 to expand the secondary market for mortgages in the US. Freddie Mac makes homeownership and rental housing financing more accessible and affordable. Operating in the secondary mortgage market, we keep mortgage capital flowing by purchasing mortgage loans from lenders so they in turn can provide more loans to qualified borrowers. Freddie Mac’s initiative to “reimagine the mortgage experience”: ways we’re putting into action the feedback, insights, and opinions to get to loan closing faster and save money. Our mission to provide liquidity, stability, and affordability to the U.S. housing market in all economic conditions extends to all communities from coast to coast. Mortgage loan manufacturing consists of loan origination, loan closing, and servicing the loan after purchase. Freddie Mac pools loans, then securitizes and sells them as MBS (mortgage-backed securities) to global investors. Now, I’ll hand over to Kevin.
  3. Kevin. Biggest challenges: understanding and processing the datasets; a variety of vendors, and not very standardized; time consuming to understand the datasets. Do many of the people in the audience have the same problems? 60% of time goes to cleaning, organizing and collecting data (least enjoyable). To resolve the challenge, KPMG and Freddie have been working on a program over the last couple of years. First we focused on the foundation, which automated processes but also allowed us to obtain data sets that could then be used for training the models to more fully automate the process. Framework built on 4 core principles.
  4. Kevin. This is a busy slide and we will not spend the time to review all aspects. The idea is to show a conceptual flow of the complexity of producing and consuming an insight. There is the standard data flow of identification of data sources, ingestion, ... And then there are all the supporting processes: quality, security, lineage, etc. In a perfect world all these processes work together perfectly, but we all know that is not the case. Help compensate for deficiencies found in other areas.
  5. The CDF is broken down into three main components. We discussed these in detail during the HW summit last year. I want to provide a quick overview, as it is important to understand the foundation before we discuss the intelligent processing that was added. The initial framework had 3 main components that align to the model above: Data Discovery, Transformation or Business Rules, and Analytical Model. In Data Discovery, the program would automatically ingest semi-structured data from the vendors (mainly JSON and XML) and produce insights into the data. It would provide sample data values, min, max and mean of values, nulls, outliers, where it fell in the object definition, etc. The Data Discovery output would allow domain experts to better understand the data in order to make determinations on how to link the data to the target data model. Transformation rules are rules that business users can apply to the data (e.g. derive new attributes, standardize data, data transformations, etc.). Once the data is discovered and transformation rules are applied, data is fed into the analytical data model, which then automatically updates or creates new tables in Hive. Although there are parts which are automated, this is still a human-intensive process; hence the need to add more intelligence to the framework.
  6. This slide shows the overall flow of the CDF. We wanted to highlight the middle section, where a lot of manual effort and time was required from domain experts and SMEs in order to produce a usable data model. SMEs and domain experts leverage the discovery output and perform mapping based on their knowledge as well as the Data Discovery output. This involves finding out who generated the file, and a lot of communication. For 1 or 2 data sources this is OK, but as sources increase a lot of time is spent, and the manual effort is risky and error-prone. As a result, the team wanted to further automate these processes; hence the Model Robot. The idea of Model Robot was to automate these human-intensive processes (through past learnings), while still keeping the human in the loop, but to a lesser extent: more for validation than for creation, in order to accelerate the time from ingestion to realized business value. 22 attributes.
  7. As we started to add intelligence to the framework, we followed a specific, defined model development framework. The lifecycle is broken up into 6 main stages. Data Processing: put the data into a format in which we can better understand it. Feature Selection (research paper): an important part of the overall lifecycle of the framework; which features do I want to leverage? The data science team was able to leverage some standard practices, such as dimension reduction, variable transformation, etc. Model selection is important because selecting the wrong model can lead to wasted time as you try to leverage and refine the model for accuracy. Having a larger team to leverage helps to identify models that have been successful in the past on similar data sets and problems. Model Test and Validation? Once everything is completed you need to deploy, which is not always easy on a Hadoop ecosystem. We will get more into that on the next slide. At the bottom we have a small workflow of the model components. We built them as components to enable reusability.
  8. Model Management. Native support in Hadoop... Once we had built the model, we needed a manageable way to deploy and leverage it within the Hadoop ecosystem. There is not a straightforward way to accomplish this task. Other analytical packages such as SAS have applications to help manage this. If you are using native Hadoop, how do you manage the models that you deploy? How do you track the version? How do you do A/B testing? How do you execute the model? How do you stream and run in batch?
  9. Balaji. Data Insight (identify noise, separate noise data, leverage business context of the data and dynamic modeling). Semantic Layer: vendor metadata is not standardized. Vendor datasets are semi-structured data (key-value pairs: JSON and dynamic XML). Data Veracity: SPOT (Single Point of Truth) with data governance emerged. Data management principles (metadata standardization, enterprise naming standards, data types, etc.). Data Model: semi-structured dataset to confirmed data model; to meet organization data model standards, it has to be applied to the MPP reporting platform.
  10. We are living in the “world of information”. Decompose the information. Identify noise in the data. Segregate the actual meaningful data from the noise in the data. Bottom left: reference, Doan AH. Bottom right: walk through sample data with sizing, etc. Evaluate meaningful data with domain features of the data. Predict the model with the existing data model. Bring the human into the loop to verify/validate the predicted model.
  11. Balaji. Training outputs are provided to SMEs and domain experts. Transforming raw data into features that better represent it; identifying the factors and attributes useful for modeling. Train data -> outputs predicted model (human in the loop) for continuous learning. Let me dive deep into the Model Robot design and flow. JSON, XML (dynamic containers), key-value pairs. Data in semi-structured and schema-less formats (run Data Discovery); output metadata and profiled data (ranges, types, etc.).
  12. Balaji: Feature extraction. Numeric features: min, max, mean, median… Text features: TF-IDF, POS tagging, NER tagging… Attribute names. Relationship-based: XPath, depth, number of neighbors, etc.
  13. Balaji – Prediction Model This is the moment you all are waiting for. CDF AI (Model Robot) here..
  14. To summarize CDF outcomes: collaboration, trust, automation, analytics and data ready…. CDF is a one-stop framework for ingestion of any data type/format, with Data Discovery, Data Model and Data Engineering. We are proud today to say our risk analysis is equipped with “Intraday and Day 1 data insights”. CDF framework core components are reused and extended (Data Model), reducing cost for data integration/engineering. On the maturity model we are at 4. Fine-tune ourselves to complete 4 and go to 5.
  15. Business value delivered. Using the generic data engineering framework approach for our next product offering in 2016, we reduced “Data Munging” time by 50% using automation, enabling analysts to generate reports within a month of release. Our subsequent product launch in 2017 resulted in reducing the data engineering timeline by 25%, allowing reports to be generated and reviewed the next business day. Share our 2017 success story of business outcomes on addressing loan risk and providing actionable feedback to customers on the loan origination process. Data insights on loan quality, Automated Collateral Evaluation (ACE). Get to closing faster (no need for a traditional appraisal). Save money (no appraisal fee). Immediate certainty (automatically eligible for collateral rep and warranty relief).
  16. Flat files are loaded on a shared drive and manually uploaded to HDFS. Nothing like XSD files.
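
The Data Discovery output described in note 5 (sample values, min, max, mean, nulls, position in the object definition) and the relationship features in note 12 (XPath, depth, number of neighbors) can be sketched for XML input with the standard library's ElementTree, which slide 17 names. This is a minimal sketch, not the CDF code: the `<Loan>`/`<Asset>` layout is hypothetical, "neighbors" is assumed to mean sibling count, and the report fields are invented for illustration. The attribute names and sample values echo slide 13.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import defaultdict


def walk(element, path, depth, n_siblings, out):
    """Record (text, depth, sibling count) for every leaf, keyed by its path."""
    children = list(element)
    if not children:
        out[path].append((element.text, depth, n_siblings))
        return
    for child in children:
        walk(child, f"{path}/{child.tag}", depth + 1, len(children) - 1, out)


def to_number(text):
    """Return float(text), or None when the text is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def profile(xml_text):
    """Build a per-path discovery report for one XML document."""
    root = ET.fromstring(xml_text.strip())
    leaves = defaultdict(list)
    walk(root, f"/{root.tag}", 1, 0, leaves)
    report = {}
    for path, rows in leaves.items():
        texts = [t.strip() for t, _, _ in rows if t and t.strip()]
        numbers = [n for n in map(to_number, texts) if n is not None]
        entry = {
            "depth": rows[0][1],        # relationship feature: nesting level
            "neighbors": rows[0][2],    # relationship feature: sibling count (assumed meaning)
            "count": len(rows),
            "nulls": len(rows) - len(texts),
            "sample": texts[:3],
        }
        # Numeric statistics only when every non-null value parses as a number.
        if texts and len(numbers) == len(texts):
            entry.update(min=min(numbers), max=max(numbers),
                         mean=statistics.mean(numbers),
                         median=statistics.median(numbers))
        report[path] = entry
    return report


# Hypothetical vendor payload; values borrowed from the slide 13 example.
SAMPLE = """
<Loan>
  <Asset><AssetType>MNMT</AssetType><AvailableBalance>5400</AvailableBalance></Asset>
  <Asset><AssetType>MNMT</AssetType><AvailableBalance>6000</AvailableBalance></Asset>
  <Asset><AssetType/><AvailableBalance>3000</AvailableBalance></Asset>
</Loan>
"""

if __name__ == "__main__":
    for path, entry in profile(SAMPLE).items():
        print(path, entry)
```

A report like this is the kind of per-attribute summary a domain expert could review when mapping source attributes to the target data model, and the numeric fields correspond to the numeric-based features listed on slide 12.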