Freddie Mac and KPMG will share an innovative solution to accelerate data model (ERM) development and data integration on a highly distributed, in-memory computing platform. The machine learning component (PySpark) of the framework executes against evolving semi-structured and structured data sets to learn and automate data mapping from various sources to a targeted schema. As a result, it significantly reduces the manual analysis, design, and development effort, and establishes faster data integration across a variety of complex and high-volume datasets.
The solution leverages various components of the Hadoop data platform. Sqoop imports the data into the platform, and PySpark processes it. In addition, the application includes a PySpark ML model that runs as a continuous Spark job to process the ingested semi-structured data and intelligently map it into the proper Hive tables. All of this is scheduled through Oozie.
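To make that flow concrete, here is a minimal PySpark sketch of the processing step described above, assuming hypothetical landing paths and table names (not the production job): it reads semi-structured JSON that has already been landed on HDFS, does light cleanup, and persists the result to a Hive table for downstream mapping.

```python
# Minimal sketch of the ingest-and-persist step; paths and table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("cdf-ingest-sketch")
         .enableHiveSupport()        # needed so writes land in Hive tables
         .getOrCreate())

# Read semi-structured vendor data already landed on HDFS (e.g., via Sqoop or a file drop).
raw = spark.read.json("hdfs:///data/landing/vendor_feed/")   # schema inferred at read time

# Light processing before persisting: drop fully-null records and tag the load date.
cleaned = raw.dropna(how="all").withColumn("load_dt", F.current_date())

# Persist into a Hive table that the downstream mapping jobs can query.
cleaned.write.mode("append").saveAsTable("staging.vendor_feed_raw")
```

In practice a job like this would be one action in a larger Oozie workflow, with the ML mapping step scheduled after it.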
Speakers
Kevin Martelli, KPMG, Managing Director
Balaji Wooputur, Freddie Mac, Director – Risk Analytics
Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)
1. Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)
June 20, 2018
Presenters:
Kevin Martelli (KPMG)
Managing Director - Data and Analytics
Balaji Wooputur (Freddie Mac)
Director – Risk Analytics
Good afternoon, everyone. How is everyone doing?
Welcome to the Freddie Mac and KPMG case study session on Advanced ML Data Integration with Common Data Framework (Model Robot).
Glad to be here among so many great ideas and innovation sessions.
This is the second year in a row that Freddie Mac and KPMG have presented at the HW Summit.
My name is Balaji Wooputur, and I work at Freddie Mac as Risk Analytics Director, heading the Risk Analytics team for the SF Risk division. Freddie Mac has partnered with KPMG on the HW solution for the past couple of years.
I'm here today with Kevin Martelli from KPMG. Kevin and I will co-present today's session.
Is anyone in the audience from last year's CDF session?
I'm going to cover the following in today's session: first, who we are; a recap of the 2017 project objective; the 2017 "patent pending" CDF; the 2018 CDF (Model Robot); and the CDF model lifecycle and model framework, which extends and reuses components from our CDF.
Let me start with who we are.
Kevin is a Managing Director heading KPMG's Big Data Software Engineering team in the KPMG COE for D&A Light. [Kevin: greetings and welcome message]
How many of you have mortgage experience?
Freddie Mac was created in 1970 to expand the secondary market for mortgages in the US. Freddie Mac makes homeownership and rental housing financing more accessible and affordable.
Operating in the secondary mortgage market, we keep mortgage capital flowing by purchasing mortgage loans from lenders so they in turn can provide more loans to qualified borrowers.
Freddie Mac's initiative to "reimagine the mortgage experience" covers the ways we're putting feedback, insights, and opinions into action to get loans to closing faster and save money.
Our mission to provide liquidity, stability, and affordability to the U.S. housing market in all economic conditions extends to all communities from coast to coast.
Mortgage loan manufacturing consists of loan origination, loan closing, and servicing the loan after purchase.
Freddie Mac pools loans, then securitizes and sells them as MBS (mortgage-backed securities) to global investors.
Now, I'll hand over to Kevin.
Kevin
Biggest challenges: understanding and processing the datasets, which come from a variety of vendors and are not very standardized.
It is time consuming to understand the datasets.
Do many of the people in the audience have the same problems?
60% of time is spent cleaning, organizing, and collecting data (the least enjoyable part).
To resolve this challenge, KPMG and Freddie Mac have been working on a program over the last couple of years. First we focused on the foundation, which automated processes but also allowed us to obtain data sets that could then be used for training the models to more fully automate the process.
Framework built on 4 core principles
Kevin
This is a busy slide and we will not spend the time to review all aspects.
The idea is to show a conceptual flow of the complexity of producing and consuming an insight.
There is the standard data flow of identification of data sources, ingestion, and so on.
And then there are all the supporting processes – quality, security, lineage, etc.
In a perfect world all these processes work together perfectly but we all know that is not the case.
- Help compensate for deficiencies found in other areas
The CDF is broken down into three main components. We discussed these in detail during the DW Summit last year. I want to provide a quick overview, as it is important to understand the foundation before we discuss the intelligence processing that was added.
The initial framework had 3 main components that align to the model above: Data Discovery, Transformation or Business Rules, and Analytical Model.
In Data Discovery, the program would automatically ingest semi-structured data from the vendors (mainly JSON and XML) and produce insights into the data. It would provide sample data values; min, max, and mean of values; nulls; outliers; where the attribute fell in the object definition; etc. The Data Discovery output would allow domain experts to better understand the data in order to make determinations on how to link the data to the target data model.
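As an illustration, here is a small PySpark sketch of the kind of per-attribute profile Data Discovery produces (min, max, mean, null counts). The landing path is a hypothetical placeholder, and nested fields are skipped to keep the sketch short.

```python
# Simplified per-attribute profiling sketch; not the actual Data Discovery implementation.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import NumericType, StringType

spark = SparkSession.builder.appName("cdf-discovery-sketch").getOrCreate()
df = spark.read.json("hdfs:///data/landing/vendor_feed/")   # hypothetical landing path

row_count = df.count()
for field in df.schema.fields:
    # Keep it simple: profile only flat string/numeric attributes, skip structs and arrays.
    if not isinstance(field.dataType, (NumericType, StringType)):
        continue
    col = F.col(field.name)
    aggs = [F.min(col).alias("min"),
            F.max(col).alias("max"),
            F.count(F.when(col.isNull(), 1)).alias("nulls")]
    if isinstance(field.dataType, NumericType):
        aggs.append(F.avg(col).alias("mean"))
    stats = df.agg(*aggs).first().asDict()
    stats.update(attribute=field.name, rows=row_count)
    print(stats)
```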
Transformation rules are rules that business users can apply to the data (i.e., transformations such as deriving new attributes, standardizing data, other data transformations, etc.).
Once the data is discovered and transformation rules are applied, the data is fed into the analytical data model, which then automatically updates or creates new tables in Hive.
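A small sketch of that step, assuming hypothetical column and table names: a derived attribute and a standardization rule are applied, and the result is written out as a Hive table.

```python
# Hypothetical transformation rules followed by the analytical-model load into Hive.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("cdf-transform-sketch")
         .enableHiveSupport().getOrCreate())

discovered = spark.table("staging.vendor_feed_raw")          # output of the earlier steps

transformed = (discovered
               # derived attribute: loan-to-value ratio from two source fields
               .withColumn("ltv", F.col("loan_amount") / F.col("property_value"))
               # standardization: uppercase state codes so they match the target model
               .withColumn("property_state", F.upper(F.col("property_state"))))

# The analytical data model step then creates or updates the corresponding Hive table.
transformed.write.mode("overwrite").saveAsTable("conformed.loan_attributes")
```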
Although some parts are automated, this is still a human-intensive process; hence the need to add more intelligence to the framework.
This slide shows the overall flow of the CDF. We wanted to highlight the middle section, where a lot of manual effort and time was required from domain experts and SMEs in order to produce a usable data model.
SMEs and domain experts leverage the discovery output and perform the mapping based on their knowledge, the Data Discovery output, knowledge of who generated the file, and a lot of communication.
For one or two data sources this is OK, but as sources increase a lot of time is spent, and the manual effort is risky and error prone.
As a result, the team wanted to further automate these processes; hence the Model Robot. The idea of the Model Robot was to automate these human-intensive processes (through past learnings) while still keeping the human in the loop, but to a lesser extent, more for validation than creation, in order to accelerate the time from ingestion to realized business value.
22 attributes
As we started to add the intelligence to the framework, we followed a specific, defined model development framework:
The lifecycle is broken up into 6 main stages.
Data Processing – the data is put into a format in which we can better understand it.
Feature Selection – (research paper) an important part of the overall lifecycle of the framework: which features do I want to leverage? The data science team was able to leverage some standard practices, such as dimension reduction, variable transformation, etc.
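As a simple illustration of one of those standard practices, here is a hedged PySpark sketch of dimension reduction with PCA over a few hypothetical profile features; it is not the team's actual feature set or pipeline.

```python
# Dimension reduction sketch with PCA; the feature columns are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA

spark = SparkSession.builder.appName("cdf-feature-selection-sketch").getOrCreate()
feats = spark.createDataFrame(
    [(1.0, 0.2, 30.0), (2.0, 0.1, 25.0), (3.0, 0.4, 40.0)],
    ["min_len", "null_ratio", "mean_value"])                 # hypothetical profile features

assembler = VectorAssembler(inputCols=feats.columns, outputCol="features")
pca = PCA(k=2, inputCol="features", outputCol="reduced")     # keep 2 principal components
model = pca.fit(assembler.transform(feats))
model.transform(assembler.transform(feats)).select("reduced").show(truncate=False)
```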
Model selection is important because selecting the wrong model can waste time as you try to leverage and refine the model for accuracy. Having a larger team to draw on helps to identify models that have been successful in the past on similar data sets and problems.
Model Test and Validation?
Once everything is completed you need to deploy, which is not always easy on a Hadoop ecosystem. We will get more into that on the next slide.
At the bottom we have a small workflow of the model components. We built them as components to enable reusability.
Model Management Native Support in Hadoop….
Once we had built the model, we needed a manageable way to deploy and leverage it within the Hadoop ecosystem. There is not a straightforward way to accomplish this task; other analytical packages, such as SAS, have applications to help manage models.
If you are using native Hadoop, how do you manage the models that you deploy?
How do you track the version?
How do you do A/B testing?
How do you execute the model?
How do you stream and run in batch?
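One simple pattern, sketched below under the assumption of plain HDFS and Spark ML (not necessarily what runs in production), is to save each trained pipeline under a versioned path so that older models stay available for A/B comparison and rollback, and scoring jobs load whichever version they are pinned to.

```python
# Hedged sketch of model versioning on plain HDFS; the pipeline and paths are hypothetical.
import datetime
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("cdf-model-mgmt-sketch").getOrCreate()

train = spark.createDataFrame(
    [(4.0, 1.0, "loan_amount"), (2.0, 0.0, "zip_code"), (5.0, 1.0, "loan_amount")],
    ["feat_a", "feat_b", "target_attribute"])                # toy training rows

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="target_attribute", outputCol="label"),
    VectorAssembler(inputCols=["feat_a", "feat_b"], outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)

# Version by timestamp so older models remain available for comparison or rollback.
version = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
path = "hdfs:///models/model_robot/" + version
model.write().overwrite().save(path)

# Batch or streaming scoring jobs load whichever version they are pinned to.
scorer = PipelineModel.load(path)
```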
Balaji
Data Insight (Identify noise, separate noise data, leverage business context of the data and dynamic modeling)
Semantic Layer – Vendor metadata is not standardized. Vendor datasets are semi-structured data (key–value pairs – JSON and dynamic XML).
Data Veracity – SPOT (Single Point of Truth) with data governance emerged, along with data management principles (metadata standardization, enterprise naming standards, data types, etc.).
Data Model – Semi-structured datasets have to be conformed to a data model that meets the organization's data model standards and applied to the MPP reporting platform.
We are living in the "world of information."
Decompose the information: identify the noise in the data and segregate the actual meaningful data from the noise.
Bottom left - “Reference Doan AH”
Bottom right – Walkthrough sample data with sizing etc..
Evaluating meaningful data with domain features of the data
Predicting the model using the existing data model.
Bringing a human into the loop to verify/validate the prediction model.
Balaji – Training outputs provided to SMEs and domain experts.
Transforming raw data into features that better represent the data.
Identifying which attributes/factors are useful for modeling.
Training data outputs a predicted model (human in the loop) for continuous learning.
Let me dive deeper into the Model Robot design and flow.
- JSON, XML (dynamic containers), key–value pairs
- Data arrives in semi-structured, schema-less formats; running Data Discovery outputs metadata and profiled data (ranges, types, etc.).
Balaji – Feature extraction (a short sketch follows after this list)
Numeric features
min, max, mean, median…
Text features
TF-IDF, POS tagging, NER tagging…
Attribute names
Relationship-based
XPath
Depth
Number of neighbors
Etc.
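Here is a hedged sketch of how a few of these features could be computed per discovered attribute in PySpark. The attribute names, XPaths, and profiled stats are hypothetical, and POS/NER tagging is omitted for brevity.

```python
# Per-attribute feature extraction sketch: depth from the XPath, TF-IDF on the name,
# and pass-through numeric profile stats. All inputs are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("cdf-feature-extraction-sketch").getOrCreate()

# One row per discovered attribute: its name, its XPath, and profiled numeric stats.
attrs = spark.createDataFrame(
    [("loanAmount", "/loan/terms/loanAmount", 1000.0, 900000.0, 245000.0),
     ("borrowerCity", "/loan/borrower/address/city", None, None, None)],
    ["attr_name", "xpath", "min_val", "max_val", "mean_val"])

# Relationship-based feature: depth of the attribute in the source document.
attrs = attrs.withColumn("depth", F.size(F.split(F.col("xpath"), "/")) - 1)

# Text features on the attribute name: tokenize and weight with TF-IDF.
tokens = Tokenizer(inputCol="attr_name", outputCol="name_tokens").transform(attrs)
tf = HashingTF(inputCol="name_tokens", outputCol="name_tf", numFeatures=256).transform(tokens)
tfidf = IDF(inputCol="name_tf", outputCol="name_tfidf").fit(tf).transform(tf)

tfidf.select("attr_name", "depth", "min_val", "max_val", "mean_val", "name_tfidf").show()
```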
Balaji – Prediction Model
This is the moment you have all been waiting for: the CDF AI (Model Robot).
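To show the idea rather than the exact production model, here is a hedged PySpark sketch: a classifier trained on previously validated mappings suggests a target column for each new source attribute, and SMEs validate the suggestions instead of building the mappings by hand. The choice of model, the features (just the attribute name here), and all names are illustrative assumptions.

```python
# Hedged sketch of the prediction step: suggest a target column for each source attribute.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, Tokenizer, HashingTF, IndexToString
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("cdf-model-robot-sketch").getOrCreate()

# Historical, SME-validated mappings: source attribute name -> target schema column.
history = spark.createDataFrame(
    [("loanAmt", "loan_amount"), ("loan_amount_usd", "loan_amount"),
     ("borrCity", "borrower_city"), ("borrower_city_nm", "borrower_city")],
    ["source_attr", "target_column"])

label_indexer = StringIndexer(inputCol="target_column", outputCol="label").fit(history)
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="source_attr", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="features", numFeatures=256),
    RandomForestClassifier(numTrees=20),
])
model = pipeline.fit(label_indexer.transform(history))

# Score the attributes from a new vendor feed and translate predictions back to names.
new_attrs = spark.createDataFrame([("loanAmountTotal",), ("cityOfBorrower",)], ["source_attr"])
suggested = IndexToString(inputCol="prediction", outputCol="predicted_target",
                          labels=label_indexer.labels).transform(model.transform(new_attrs))

# Human in the loop: SMEs review these suggested mappings (and their probabilities)
# instead of building every mapping by hand.
suggested.select("source_attr", "predicted_target", "probability").show(truncate=False)
```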
To summarize the CDF outcomes:
Collaboration, trust, automation, analytics, and data readiness…
CDF is a one-stop shop for ingesting data of any type/format, with Data Discovery, Data Model, and Data Engineering.
We are proud to say today that our risk analysis is equipped with "intraday and day-1 data insights."
CDF framework core components are reused and extended (Data Model), reducing the cost of data integration/engineering.
On the maturity model we are at level 4; we are fine-tuning ourselves to complete level 4 and move to level 5.
Business value delivered
Using the generic data engineering framework approach for our next product offering in 2016, we reduced “Data Munging” time by 50% using automation, enabling analysts to generate reports within a month of release.
Our subsequent product launch in 2017 resulted in reducing the data engineering timeline by 25% allowing for reports to be generated and reviewed next business day.
We will share our 2017 success story: the business outcome of addressing loan risk and providing actionable feedback to customers on the loan origination process, with data insights on loan quality.
Automated Collateral Evaluation (ACE)
Get to closing faster – no need for a traditional appraisal
Save money – no appraisal fee
Immediate certainty – automatically eligible for collateral rep and warranty relief
Flat files are loaded on a share drive and manually uploaded to HDFS.
There is nothing like XSD files available from the vendors.