As machine learning becomes more pervasive in the industry, data scientists and quants are realizing the challenges and limitations of machine learning models. One of the primary reasons machine learning applications fail is the lack of rich, diverse and clean datasets needed to build models. Datasets may have missing values, may not include enough samples for all use cases (for example, too few fraudulent transaction records to train a model) and may not be easily shareable due to privacy concerns. While there are many data cleansing techniques to fix data-related issues, and new and richer datasets can sometimes be acquired, the cost is at times prohibitive or the effort impractical, leading many institutions to abandon machine learning and go back to rule-based methods.
Synthetic datasets and simulations are used to enrich and augment existing datasets so that machine learning models can be trained on comprehensive samples. In addition, synthetic datasets can be used for comprehensive scenario analysis, filling in missing values and protecting the privacy of datasets when building models. The advent of deep learning has rekindled interest in using techniques like GANs and Encoder-Decoder architectures for financial synthetic data generation.
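As a rough illustration of the idea of augmenting a small dataset, the sketch below fits a simple Gaussian to a handful of observed values and samples new ones from it. This is only a minimal stand-in for the generative techniques named above (it is not a GAN or an Encoder-Decoder model), and the transaction amounts and the function names `fit_gaussian` and `sample_synthetic` are hypothetical, introduced here for illustration.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the observed values."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the observed values."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical observed transaction amounts (illustrative only).
real_amounts = [12.5, 40.0, 33.2, 18.9, 27.4, 51.3, 22.8]
synthetic_amounts = sample_synthetic(real_amounts, n=1000)
```

A single Gaussian ignores skew, fat tails and relationships between columns, which is one reason richer generative models are of interest for financial data.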
In this workshop, we will discuss the state of the art in synthetic data generation and illustrate the various techniques and methods available. Through examples using QuSynthesize & QuSandbox, we will demonstrate how these techniques can be realized in practice.
1. Synthetic Data Generation for Machine Learning
2020 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
Sri.Krishnamurthy@qusandbox.com
www.quantuniversity.com
03/05/2020
Boston, MA
2. Speaker bio
• Quant, Data Science & ML practitioner
• Prior experience at MathWorks, Citigroup and Endeca, and 25+ financial services and energy customers
• Columnist for the Wilmott Magazine
• Author of forthcoming book “Financial Modeling: A case study approach” published by Wiley
• Teaches Data Science/AI at Northeastern University, Boston
• Reviewer: Journal of Asset Management
Sri Krishnamurthy
Founder and CEO
QuantUniversity
3. About QuantUniversity
• Boston-based Data Science, Quant Finance and Machine Learning training and consulting advisory
• Trained more than 1000 students in Quantitative methods, Data Science, ML and Big Data Technologies
• Building a platform for operationalizing AI and Machine Learning in the Enterprise
4. Agenda
1. Challenges with Real Datasets
2. Synthetic Dataset generation tools
▫ Proprietary
▫ Open Source
– Faker
– Data Synthesizer
– SDV
– Synthpop
– GANs
3. Demos
▫ Data Synthesizer
▫ Sales Data Generator
▫ VIX Data Generator
7. Challenges with real datasets
Coverage
• It may not be feasible to get samples for all categories
• Lighting conditions
• Modifications (glasses/no glasses, moustache/no moustache, etc.)
• Positions
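One common way to address coverage gaps such as lighting conditions and positions is to derive variants from the images that are available. The sketch below is a minimal illustration under assumptions: the image is a tiny grayscale grid stored as nested lists, and `adjust_brightness`, `flip_horizontal` and `augment` are hypothetical helper names, not functions from any tool mentioned in this deck.

```python
def adjust_brightness(image, factor):
    """Scale pixel intensities by factor, clipping to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def flip_horizontal(image):
    """Mirror the image left to right, a crude proxy for a change of position."""
    return [row[::-1] for row in image]


def augment(image, factors=(0.5, 1.0, 1.5)):
    """Return each brightness variant followed by its mirrored copy."""
    variants = []
    for factor in factors:
        lit = adjust_brightness(image, factor)
        variants.append(lit)
        variants.append(flip_horizontal(lit))
    return variants


# Hypothetical 2x2 grayscale image (illustrative only).
image = [[10, 200], [120, 30]]
variants = augment(image)
```

Each source image yields six variants here; real pipelines would typically use an image library and a wider set of transformations.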
8. Challenges with real datasets
Not all scenarios have played out
• Stress scenarios
• What-if scenarios
Figure ref: http://www.actuaries.org/CTTEES_SOLV/Documents/StressTestingPaper.pdf
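A what-if or stress scenario that never occurred in the historical record can be approximated by simulation. The sketch below is one simple approach under assumptions: it bootstraps a price path by resampling historical returns and optionally adds a one-off shock on a chosen day. The return values, the shock size and the function name `simulate_path` are hypothetical and chosen only for illustration.

```python
import random


def simulate_path(start_price, hist_returns, horizon,
                  shock_day=None, shock=0.0, rng=None):
    """Bootstrap a price path; add `shock` to the return drawn on `shock_day`."""
    if rng is None:
        rng = random.Random(0)
    prices = [start_price]
    for day in range(horizon):
        r = rng.choice(hist_returns)  # resample a historical return
        if day == shock_day:
            r += shock  # one-off stress applied on the chosen day
        prices.append(prices[-1] * (1.0 + r))
    return prices


# Hypothetical daily returns (illustrative only).
hist_returns = [0.01, -0.005, 0.002, -0.012, 0.007]

# Same seed for both runs, so the paths differ only because of the shock.
baseline = simulate_path(100.0, hist_returns, horizon=10,
                         rng=random.Random(42))
stressed = simulate_path(100.0, hist_returns, horizon=10,
                         shock_day=3, shock=-0.20, rng=random.Random(42))
```

Using the same seed for both runs isolates the effect of the shock, which makes the stressed path directly comparable with the baseline.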
9. Challenges with real datasets
Missing values
• Missing at random
• Missing sequences
• Need data to fill frames
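For gaps inside a sequence, a simple baseline is to fill the missing points from their observed neighbors. The sketch below uses linear interpolation and is only one possible approach; the helper name `interpolate_missing` and the sample series are hypothetical, and missing values are assumed to be represented as `None`.

```python
def interpolate_missing(series):
    """Fill None gaps linearly between the nearest observed neighbors.

    Gaps at the start or end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            span = series[right] - series[left]
            filled[i] = series[left] + span * (i - left) / (right - left)
    return filled


# Hypothetical series with a missing start, a missing run and a missing end.
observed = [None, 10.0, None, None, 16.0, None]
filled = interpolate_missing(observed)
```

Interpolation smooths over whatever happened inside the gap, so longer missing sequences may call for a model that generates plausible variation rather than a straight line.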
10. Challenges with real datasets
• Access
▫ Hard to find
▫ Rare class problems
▫ Privacy concerns making it difficult to share
11. Challenges with real datasets
Imbalanced
• Need more samples of the rare class
• Need proxies for data points that were not observed or recorded
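One way to obtain more samples of a rare class is to interpolate between the rare samples that do exist. The sketch below is a simplified, SMOTE-style illustration: it interpolates between randomly chosen pairs of minority samples rather than between k-nearest neighbors, so it is not the full SMOTE algorithm. The fraud records and the function name `oversample_minority` are hypothetical.

```python
import random


def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic points on segments between random minority pairs."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority samples
        t = rng.random()                # position along the segment a -> b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical fraud records: [amount, number of attempts] (illustrative only).
fraud = [[900.0, 1.0], [1200.0, 3.0], [1500.0, 2.0]]
new_fraud = oversample_minority(fraud, n_new=5)
```

Every synthetic point lies between two observed rare-class samples, so this adds density within the observed range but does not produce proxies for regions that were never observed.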
14. Proprietary Tools
Company: Core Technology
• Tonic.ai: All-in-one platform for data anonymization, subsetting, and synthesis, integrated with databases (Hadoop, Oracle, MySQL, MS SQL Server, MongoDB, Amazon Aurora/Redshift, and Google BigQuery). Uses Condenser and Masquerade.
• Mostly.ai: Tabular data using generative deep neural networks (no image data)
• CVEDIA: Sensor modeling and algorithm training. Handles images using SynCity as a custom pocket laboratory to generate highly entropic scenes, conditions, and metadata. Enables real-time Hardware-In-the-Loop (HWIL), Human-In-the-Loop (HITL) or Software-In-the-Loop (SIL) simulations even with complex sensor configurations.
• Deep Vision Data: Image creation; synthetic training data
• Synthesis.ai: The data generation platform for computer vision
27. If you want to be a part of the QuSandbox private beta, contact us:
info@qusandbox
28. Upcoming events by QuantUniversity
1. Model Governance in the Age of Data Science and AI
▫ GFMI Course, March 9th, 10th, New York, NY
2. Synthetic VIX data generation using deep learning techniques
▫ QWAFAFEW meeting - March 17th, 2020, Boston MA
3. Using synthetic data for ML in Finance
▫ 2nd Annual Machine Learning in Quantitative Finance – April 1st, 2020, New York, NY
4. Tackling the biggest limitations of ML
▫ 2nd Annual Machine Learning in Quantitative Finance – April 1st, 2020, New York, NY
5. Foundations of Machine learning and AI for Financial Professionals
▫ 8-week Online course offered in partnership with PRMIA – May 12th – June 30th, 2020, Online
6. A Master Class on AI and Machine Learning for Financial Professionals
▫ Invited session at the 73rd CFA Annual Conference – May 17th, 2020, Atlanta, GA
29. Sri Krishnamurthy, CFA, CAP
Founder and Chief Data Scientist
sri@quantuniversity.com
srikrishnamurthy
www.QuantUniversity.com
www.analyticscertificate.com
www.qusandbox.com
Information, data and drawings embodied in this presentation are strictly the property of QuantUniversity LLC and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.