Trumania, a realistic scenario-based data-generator
Svend Vanderveken
Leuven Data Science meetup - January 2018
About us
Real Impact Analytics
• Data analytics solutions for telecommunication operators
• https://realimpactanalytics.com
• We’re hiring :)
Gautier Krings
• Co-founder of Jetpack.AI
• http://jetpack.ai
Svend Vanderveken
• Freelance Data Engineer
• @svend_x4f
• https://sv3nd.github.io
With some awesome contributions from:
● Thoralf Gutierrez
● Milan van der Meer
● Floran Hachez
The problem
Data engineers and data scientists
need realistic test datasets
to validate the behaviour of data-processing applications
Why such datasets are hard to come by:
● using existing data is often not allowed
● we need a great diversity of datasets to validate many
situations
Existing solutions
Schema-based approach
# Ted Dunning’s Log Synth
# https://github.com/tdunning/log-synth
[
  {"name":"id", "class":"id"},
  {"name":"name", "class":"name", "type":"first_last"},
  {"name":"gender", "class":"string",
   "dist":{"MALE":0.5, "FEMALE":0.5, "OTHER":0.02}},
  {"name":"address", "class":"address"},
  {"name":"visit", "class":"date", "format":"MM/dd/yyyy",
   "start":"01/31/1995", "end":"02/07/1999"}
]
● sufficient for many use cases
=> if you can, use that: it’s the simplest and the fastest
● caveat:
○ columns are often uncorrelated & dataset has no internal
structure
○ little/no use of empirical distributions
○ hard to manipulate in terms of cause and consequence
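The first caveat can be illustrated with a minimal sketch of a schema-based generator. This is not Log Synth's implementation: the `SCHEMA` dictionary, the `age` column and the `generate_rows` helper are hypothetical names introduced here. Each column has its own sampler and knows nothing about the others, so no correlation between fields can emerge.

```python
import random

# Hypothetical schema, loosely modelled on the Log Synth example above.
# Each column maps to an independent sampler taking (rng, row_index).
SCHEMA = {
    "id": lambda rng, i: i,
    "gender": lambda rng, i: rng.choices(
        ["MALE", "FEMALE", "OTHER"], weights=[0.5, 0.5, 0.02])[0],
    "age": lambda rng, i: rng.randint(18, 90),  # made-up column and range
}


def generate_rows(schema, n, seed=0):
    """Draw n rows, sampling every column independently of the others."""
    rng = random.Random(seed)
    return [{name: sampler(rng, i) for name, sampler in schema.items()}
            for i in range(n)]


rows = generate_rows(SCHEMA, 5)
for row in rows:
    print(row)
```

Because every sampler only sees the random source and the row index, expressing something like "older customers call less" would require stepping outside the schema.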
Learning-based approaches
● fit a multivariate model to production data
● sample data from it
SDGen:
github.com/iostackproject/SDGen
Synthetic Data Vault:
dspace.mit.edu/handle/1721.1/109616
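As a rough sketch of the fit-then-sample idea, the snippet below fits a bivariate normal model to a handful of made-up (age, monthly spend) pairs and samples new pairs from it. It is not how SDGen or the Synthetic Data Vault work internally; the data, the function names and the choice of a Gaussian model are all assumptions made for illustration.

```python
import math
import random


def fit_bivariate_normal(pairs):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, std_x, std_y, cov / (std_x * std_y)


def sample(model, n, seed=0):
    """Draw n correlated (x, y) pairs from the fitted model."""
    mean_x, mean_y, std_x, std_y, rho = model
    rng = random.Random(seed)
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        y = mean_y + std_y * (rho * z1 + residual * z2)
        out.append((x, y))
    return out


# Made-up stand-in for production data: (age, monthly spend).
production = [(20, 10.0), (30, 18.0), (40, 22.0), (50, 35.0), (60, 38.0)]
model = fit_bivariate_normal(production)
synthetic = sample(model, 5000)
print(model)
print(synthetic[:3])
```

The synthetic pairs keep the correlation seen in the input, which is what the independent-column approach cannot do.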
Scenario/simulation based:
• Koen de Jonge’s telco traffic simulator
• cf. the MLGeek meetup of 26 Oct 2016
• github.com/botkop/botkop-telcotraffic-simulator
Benchmark-based: TPC-DS
Trumania
realistic
scenario-based
python 3 library
facts & dimensions
Trumania circus
[Diagram labels: Population, Story, Logs]
Trumania population
• Typically static / dimensional data (can be dynamic too)
• Similar approach to schema-based
• Correlated fields if necessary
# `circus` is an existing Circus instance; import path assumed from the trumania repo.
from trumania.core.random_generators import (
    FakerGenerator, NumpyRandomGenerator, SequencialGenerator)

person = circus.create_population(
    name="person", size=10000,
    ids_gen=SequencialGenerator(prefix="PERSON_"))
person.create_attribute(
    name="name", init_gen=FakerGenerator(method="name"))
person.create_attribute(
    name="age",
    init_gen=NumpyRandomGenerator(method="normal", loc=35, scale=5))
person.create_attribute(
    name="account_usage",
    init_gen=NumpyRandomGenerator(method="exponential", scale=2))
Trumania generators
• Common interface for all random aspects of a Circus
• Essentially a thin wrapper around
• numpy
• faker
• empirical distribution
• ...bring your own distro
• Can be transformed and chained
beta_generator = NumpyRandomGenerator(method="beta", a=3, b=7)
# .map() returns a new generator that applies the function to each value
age_generator = beta_generator.map(lambda s: (s * 60) + 10)
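The "transformed and chained" idea can be sketched in plain Python. The `Generator` class below is a hypothetical stand-in written for this example, not Trumania's own class: it wraps a source of random values and lets `.map()` build a new generator from an existing one.

```python
import random


class Generator:
    """Hypothetical stand-in for a Trumania-style generator (not the library's class)."""

    def __init__(self, draw):
        self._draw = draw  # zero-argument callable returning one random value

    def generate(self, size):
        """Return a list of `size` freshly drawn values."""
        return [self._draw() for _ in range(size)]

    def map(self, f):
        """Return a new generator whose values are f applied to this one's values."""
        return Generator(lambda: f(self._draw()))


rng = random.Random(42)
beta_generator = Generator(lambda: rng.betavariate(3, 7))      # values in [0, 1]
age_generator = beta_generator.map(lambda s: (s * 60) + 10)    # values in [10, 70]
rounded_age_generator = age_generator.map(round)               # maps can be chained

ages = rounded_age_generator.generate(size=1000)
print(min(ages), max(ages))
```

Since `.map()` returns another generator, any consumer that accepts a generator can take the transformed one without knowing how it was built.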
Trumania population: real data too
Handy to combine real and random data inside a circus
distributors = population.load_from("/data/real_distributors.csv")
Trumania relationships
• relations among populations
• shops per geographical zone,
• social networks,
• …
• dynamic or static
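A weighted relationship of this kind might be pictured as follows. The `Relationship` class and its `add`, `remove` and `select_one` methods are hypothetical and written for this example only; they are not the Trumania API.

```python
import random
from collections import defaultdict


class Relationship:
    """Hypothetical sketch of a weighted, directed relationship between ids."""

    def __init__(self, seed=0):
        self._edges = defaultdict(dict)  # from_id -> {to_id: weight}
        self._rng = random.Random(seed)

    def add(self, from_id, to_id, weight=1.0):
        self._edges[from_id][to_id] = weight

    def remove(self, from_id, to_id):
        # A dynamic relationship can lose (or gain) edges over time.
        self._edges[from_id].pop(to_id, None)

    def select_one(self, from_id):
        """Pick one related id at random, proportionally to the edge weights."""
        targets = self._edges.get(from_id)
        if not targets:
            return None
        ids = list(targets)
        return self._rng.choices(ids, weights=[targets[i] for i in ids])[0]


friends = Relationship(seed=1)
friends.add("PERSON_1", "PERSON_2", weight=9)  # close friend, called often
friends.add("PERSON_1", "PERSON_3", weight=1)  # acquaintance, called rarely

picks = [friends.select_one("PERSON_1") for _ in range(1000)]
print(picks.count("PERSON_2"), picks.count("PERSON_3"))
```

Heavier edges are selected more often, which is one way to make generated events reflect an underlying social structure.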
Trumania stories
• Executing a story produces the events
• Sequence of random or deterministic operations
• Made of:
• generators
• random traversal of weighted relationships
• population’s attribute lookups
• update of the Circus state
duration_gen = ...

# outputs a time series with:
# PERSON_ID, CALLER_NAME, DURATION, CALLEE_ID, CALLEE_NAME, TIME
call_story.set_operations(
    person_population.ops.lookup(
        actor_id_field="PERSON_ID",
        select={"NAME": "CALLER_NAME"}),

    duration_gen.ops.generate(named_as="DURATION"),

    person_population.get_relationship("friends").ops.select_one(
        from_field="PERSON_ID", named_as="CALLEE_ID"),

    person_population.ops.lookup(
        actor_id_field="CALLEE_ID",
        select={"NAME": "CALLEE_NAME"}),

    clock.ops.timestamp(named_as="TIME")
)
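To make the "sequence of operations" idea concrete, here is a plain-Python sketch of the same call story. It does not use Trumania: the data, the operation helpers and `run_story` are all hypothetical names, and the exponential duration is an assumption. Each operation takes a row and returns it with one more field, and the story is just the ordered list of operations.

```python
import random
from datetime import datetime

rng = random.Random(7)

# Made-up population attributes and "friends" relationship, for illustration only.
NAMES = {"PERSON_1": "Alice", "PERSON_2": "Bob", "PERSON_3": "Carol"}
FRIENDS = {
    "PERSON_1": ["PERSON_2", "PERSON_3"],
    "PERSON_2": ["PERSON_1"],
    "PERSON_3": ["PERSON_1"],
}


def lookup_name(id_field, named_as):
    """Operation: copy the name of row[id_field] into row[named_as]."""
    def op(row):
        return {**row, named_as: NAMES[row[id_field]]}
    return op


def generate_duration(named_as, mean_seconds=120):
    """Operation: draw a call duration from an exponential distribution."""
    def op(row):
        return {**row, named_as: round(rng.expovariate(1 / mean_seconds))}
    return op


def select_friend(from_field, named_as):
    """Operation: pick one friend of row[from_field] at random."""
    def op(row):
        return {**row, named_as: rng.choice(FRIENDS[row[from_field]])}
    return op


def timestamp(now, named_as):
    """Operation: stamp the row with the current simulated time."""
    def op(row):
        return {**row, named_as: now.isoformat()}
    return op


def run_story(operations, active_ids):
    """Apply every operation, in order, to one row per active person."""
    logs = []
    for person_id in active_ids:
        row = {"PERSON_ID": person_id}
        for op in operations:
            row = op(row)
        logs.append(row)
    return logs


operations = [
    lookup_name("PERSON_ID", "CALLER_NAME"),
    generate_duration("DURATION"),
    select_friend("PERSON_ID", "CALLEE_ID"),
    lookup_name("CALLEE_ID", "CALLEE_NAME"),
    timestamp(datetime(2018, 1, 1, 9, 0), "TIME"),
]
logs = run_story(operations, ["PERSON_1", "PERSON_2", "PERSON_3"])
for row in logs:
    print(row)
```

The order of operations matters: `CALLEE_NAME` can only be looked up once `CALLEE_ID` has been selected, which is how cause and consequence enter the generated data.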
More Trumania
• … and time profiles
• … and a circus persistence mechanism
• … and circus state updates
• ...
Trumania caveats
Some possible improvements:
• performance: python, pandas
• more I/O options (it's all local CSV for now)
• it’s a young tool ;)
Trumania open source
The project is open source as of today!
Code and scenario examples: github.com/RealImpactAnalytics/trumania
Documentation: realimpactanalytics.github.io/trumania
Slack: trumania.slack.com
Clone it, try it, let us know what you think!

More Related Content

What's hot

Dual and multiple relationships in professional ethics
Dual and multiple relationships in professional ethicsDual and multiple relationships in professional ethics
Dual and multiple relationships in professional ethics
jerristephenson
 
Diminished Responsibility
Diminished ResponsibilityDiminished Responsibility
Diminished Responsibility
Miss Hart
 
Discrimination
DiscriminationDiscrimination
Discrimination
Chelsea Griffin
 
Forms of corruption
Forms of corruption Forms of corruption
Forms of corruption
Etica Lab
 
Cyber Crime and a Case Study
Cyber Crime and a Case StudyCyber Crime and a Case Study
Cyber Crime and a Case Study
Pratham Jaiswal
 
Delinquency dimensions of homelessness in ibadan metropolis oyo state nigeria
Delinquency dimensions of homelessness in ibadan metropolis oyo state nigeriaDelinquency dimensions of homelessness in ibadan metropolis oyo state nigeria
Delinquency dimensions of homelessness in ibadan metropolis oyo state nigeria
Nuhu Bamalli Polytechnic Zaria
 
Kaun banega crorepati 6 game
Kaun banega crorepati 6 gameKaun banega crorepati 6 game
Kaun banega crorepati 6 game
Nayi Ngo
 
factors of OB in movie Taare Zameen Par
factors of OB in movie Taare Zameen Parfactors of OB in movie Taare Zameen Par
factors of OB in movie Taare Zameen Par
Shubham Agrawal
 
Carol gilligan s moral development theory (psychology topic)
Carol gilligan s moral development theory (psychology topic)Carol gilligan s moral development theory (psychology topic)
Carol gilligan s moral development theory (psychology topic)
rehm dc
 
Information Technology Act
Information Technology ActInformation Technology Act
Information Technology Actmaruhope
 
Glass ceiling presentation
Glass ceiling presentationGlass ceiling presentation
Glass ceiling presentationguestc43e9e
 
Credit Risk Evaluation Model
Credit Risk Evaluation ModelCredit Risk Evaluation Model
Credit Risk Evaluation Model
Mihai Enescu
 
Linkedin Answers
Linkedin AnswersLinkedin Answers
Linkedin Answersguestb29f5
 
Cyber security and prevention in Bangladesh
Cyber security and prevention in BangladeshCyber security and prevention in Bangladesh
Cyber security and prevention in Bangladesh
Rabita Rejwana
 
taare zameen par movie's motto.
taare zameen par movie's motto.taare zameen par movie's motto.
taare zameen par movie's motto.
Ashish Yadav
 
Ncert india rivers
Ncert india riversNcert india rivers
Ncert india rivers
Venu Gopal Kallem
 
Discrimination in employment
Discrimination in employmentDiscrimination in employment
Discrimination in employment
Nazir Fahim
 
Cyber Law and Information Technology Act 2000 with case studies
Cyber Law and Information Technology Act 2000 with case studiesCyber Law and Information Technology Act 2000 with case studies
Cyber Law and Information Technology Act 2000 with case studies
Sneha J Chouhan
 

What's hot (20)

Dual and multiple relationships in professional ethics
Dual and multiple relationships in professional ethicsDual and multiple relationships in professional ethics
Dual and multiple relationships in professional ethics
 
Diminished Responsibility
Diminished ResponsibilityDiminished Responsibility
Diminished Responsibility
 
Discrimination
DiscriminationDiscrimination
Discrimination
 
Sack s sentence completion test report
Sack s sentence completion test reportSack s sentence completion test report
Sack s sentence completion test report
 
Law and psychology
Law and psychologyLaw and psychology
Law and psychology
 
Forms of corruption
Forms of corruption Forms of corruption
Forms of corruption
 
Cyber Crime and a Case Study
Cyber Crime and a Case StudyCyber Crime and a Case Study
Cyber Crime and a Case Study
 
Delinquency dimensions of homelessness in ibadan metropolis oyo state nigeria
Delinquency dimensions of homelessness in ibadan metropolis oyo state nigeriaDelinquency dimensions of homelessness in ibadan metropolis oyo state nigeria
Delinquency dimensions of homelessness in ibadan metropolis oyo state nigeria
 
Kaun banega crorepati 6 game
Kaun banega crorepati 6 gameKaun banega crorepati 6 game
Kaun banega crorepati 6 game
 
factors of OB in movie Taare Zameen Par
factors of OB in movie Taare Zameen Parfactors of OB in movie Taare Zameen Par
factors of OB in movie Taare Zameen Par
 
Carol gilligan s moral development theory (psychology topic)
Carol gilligan s moral development theory (psychology topic)Carol gilligan s moral development theory (psychology topic)
Carol gilligan s moral development theory (psychology topic)
 
Information Technology Act
Information Technology ActInformation Technology Act
Information Technology Act
 
Glass ceiling presentation
Glass ceiling presentationGlass ceiling presentation
Glass ceiling presentation
 
Credit Risk Evaluation Model
Credit Risk Evaluation ModelCredit Risk Evaluation Model
Credit Risk Evaluation Model
 
Linkedin Answers
Linkedin AnswersLinkedin Answers
Linkedin Answers
 
Cyber security and prevention in Bangladesh
Cyber security and prevention in BangladeshCyber security and prevention in Bangladesh
Cyber security and prevention in Bangladesh
 
taare zameen par movie's motto.
taare zameen par movie's motto.taare zameen par movie's motto.
taare zameen par movie's motto.
 
Ncert india rivers
Ncert india riversNcert india rivers
Ncert india rivers
 
Discrimination in employment
Discrimination in employmentDiscrimination in employment
Discrimination in employment
 
Cyber Law and Information Technology Act 2000 with case studies
Cyber Law and Information Technology Act 2000 with case studiesCyber Law and Information Technology Act 2000 with case studies
Cyber Law and Information Technology Act 2000 with case studies
 

Similar to Trumania , a realistic scenario-based data-generator

Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
Radek Maciaszek
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
Paul Groth
 
Autodiscovery or The long tail of open data
Autodiscovery or The long tail of open dataAutodiscovery or The long tail of open data
Autodiscovery or The long tail of open data
Connected Data World
 
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE - ATT&CKcon
 
Predictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine IntelligencePredictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine Intelligence
Numenta
 
Sci computing using python
Sci computing using pythonSci computing using python
Sci computing using python
Ashok Govindarajan
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
Travis Oliphant
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
Mihai Criveti
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
CS, NcState
 
Threat Hunting with Elastic at SpectorOps: Welcome to HELK
Threat Hunting with Elastic at SpectorOps: Welcome to HELKThreat Hunting with Elastic at SpectorOps: Welcome to HELK
Threat Hunting with Elastic at SpectorOps: Welcome to HELK
Elasticsearch
 
Python ml
Python mlPython ml
Python ml
Shubham Sharma
 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
Databricks
 
How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...
Alluxio, Inc.
 
Big&open data challenges for smartcity-PIC2014 Shanghai
Big&open data challenges for smartcity-PIC2014 ShanghaiBig&open data challenges for smartcity-PIC2014 Shanghai
Big&open data challenges for smartcity-PIC2014 Shanghai
Victoria López
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream Processing
Guido Schmutz
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
Guido Schmutz
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
Jan Wiegelmann
 
Automating to Augment Testing
Automating to Augment TestingAutomating to Augment Testing
Automating to Augment Testing
Alan Richardson
 

Similar to Trumania , a realistic scenario-based data-generator (20)

Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Autodiscovery or The long tail of open data
Autodiscovery or The long tail of open dataAutodiscovery or The long tail of open data
Autodiscovery or The long tail of open data
 
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
MITRE ATT&CKcon 2018: Hunters ATT&CKing with the Data, Roberto Rodriguez, Spe...
 
Predictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine IntelligencePredictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine Intelligence
 
Sci computing using python
Sci computing using pythonSci computing using python
Sci computing using python
 
Python for Data Science with Anaconda
Python for Data Science with AnacondaPython for Data Science with Anaconda
Python for Data Science with Anaconda
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Future se oct15
Future se oct15Future se oct15
Future se oct15
 
Threat Hunting with Elastic at SpectorOps: Welcome to HELK
Threat Hunting with Elastic at SpectorOps: Welcome to HELKThreat Hunting with Elastic at SpectorOps: Welcome to HELK
Threat Hunting with Elastic at SpectorOps: Welcome to HELK
 
Python ml
Python mlPython ml
Python ml
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
 
How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...How to teach your data scientist to leverage an analytics cluster with Presto...
How to teach your data scientist to leverage an analytics cluster with Presto...
 
Big&open data challenges for smartcity-PIC2014 Shanghai
Big&open data challenges for smartcity-PIC2014 ShanghaiBig&open data challenges for smartcity-PIC2014 Shanghai
Big&open data challenges for smartcity-PIC2014 Shanghai
 
Oracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream ProcessingOracle Stream Analytics - Simplifying Stream Processing
Oracle Stream Analytics - Simplifying Stream Processing
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
Automating to Augment Testing
Automating to Augment TestingAutomating to Augment Testing
Automating to Augment Testing
 

More from Data Science Leuven

Distributed Deep Learning Using Java on the Client and in the Cloud
Distributed Deep Learning Using Java on the Client and in the CloudDistributed Deep Learning Using Java on the Client and in the Cloud
Distributed Deep Learning Using Java on the Client and in the Cloud
Data Science Leuven
 
Statbel and big data
Statbel and big dataStatbel and big data
Statbel and big data
Data Science Leuven
 
Learning from positive and unlabeled data
Learning from positive and unlabeled dataLearning from positive and unlabeled data
Learning from positive and unlabeled data
Data Science Leuven
 
Lighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersLighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris Peeters
Data Science Leuven
 
Recommender systems for job search - Michael Reusens
Recommender systems for job search - Michael ReusensRecommender systems for job search - Michael Reusens
Recommender systems for job search - Michael Reusens
Data Science Leuven
 
VITO WatchItGrow - Jeroen Dries
VITO WatchItGrow - Jeroen DriesVITO WatchItGrow - Jeroen Dries
VITO WatchItGrow - Jeroen Dries
Data Science Leuven
 
How to build a search engine in 2 days
How to build a search engine in 2 daysHow to build a search engine in 2 days
How to build a search engine in 2 days
Data Science Leuven
 
Uplift models
Uplift modelsUplift models
Uplift models
Data Science Leuven
 
Value from health data
Value from health dataValue from health data
Value from health data
Data Science Leuven
 
Computing power and algorithms? In people we trust
Computing power and algorithms? In people we trustComputing power and algorithms? In people we trust
Computing power and algorithms? In people we trust
Data Science Leuven
 
Recommender systems, optimizing least squares or user experience
Recommender systems, optimizing least squares or user experienceRecommender systems, optimizing least squares or user experience
Recommender systems, optimizing least squares or user experience
Data Science Leuven
 
Replicability and questionable research practices
Replicability and questionable research practicesReplicability and questionable research practices
Replicability and questionable research practices
Data Science Leuven
 
Predicting Eurosong with Google Predicting Eurosong with Google and data visu...
Predicting Eurosong with Google Predicting Eurosong with Google and data visu...Predicting Eurosong with Google Predicting Eurosong with Google and data visu...
Predicting Eurosong with Google Predicting Eurosong with Google and data visu...
Data Science Leuven
 
Storytelling for impactful predictive models - Gert De Geyter
Storytelling for impactful predictive models - Gert De GeyterStorytelling for impactful predictive models - Gert De Geyter
Storytelling for impactful predictive models - Gert De Geyter
Data Science Leuven
 
Lessons from driving analytics projects
Lessons from driving analytics projectsLessons from driving analytics projects
Lessons from driving analytics projects
Data Science Leuven
 
Geospatial visual analytics
Geospatial visual analyticsGeospatial visual analytics
Geospatial visual analytics
Data Science Leuven
 
Open-Source Data Science Crossing The Chasm
Open-Source Data Science Crossing The ChasmOpen-Source Data Science Crossing The Chasm
Open-Source Data Science Crossing The Chasm
Data Science Leuven
 
Probabilistic machine learning for optimization and solving complex
Probabilistic machine learning for optimization and solving complexProbabilistic machine learning for optimization and solving complex
Probabilistic machine learning for optimization and solving complex
Data Science Leuven
 
Closing
ClosingClosing
Welcome
WelcomeWelcome

More from Data Science Leuven (20)

Distributed Deep Learning Using Java on the Client and in the Cloud
Distributed Deep Learning Using Java on the Client and in the CloudDistributed Deep Learning Using Java on the Client and in the Cloud
Distributed Deep Learning Using Java on the Client and in the Cloud
 
Statbel and big data
Statbel and big dataStatbel and big data
Statbel and big data
 
Learning from positive and unlabeled data
Learning from positive and unlabeled dataLearning from positive and unlabeled data
Learning from positive and unlabeled data
 
Lighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersLighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris Peeters
 
Recommender systems for job search - Michael Reusens
Recommender systems for job search - Michael ReusensRecommender systems for job search - Michael Reusens
Recommender systems for job search - Michael Reusens
 
VITO WatchItGrow - Jeroen Dries
VITO WatchItGrow - Jeroen DriesVITO WatchItGrow - Jeroen Dries
VITO WatchItGrow - Jeroen Dries
 
How to build a search engine in 2 days
How to build a search engine in 2 daysHow to build a search engine in 2 days
How to build a search engine in 2 days
 
Uplift models
Uplift modelsUplift models
Uplift models
 
Value from health data
Value from health dataValue from health data
Value from health data
 
Computing power and algorithms? In people we trust
Computing power and algorithms? In people we trustComputing power and algorithms? In people we trust
Computing power and algorithms? In people we trust
 
Recommender systems, optimizing least squares or user experience
Recommender systems, optimizing least squares or user experienceRecommender systems, optimizing least squares or user experience
Recommender systems, optimizing least squares or user experience
 
Replicability and questionable research practices
Replicability and questionable research practicesReplicability and questionable research practices
Replicability and questionable research practices
 
Predicting Eurosong with Google Predicting Eurosong with Google and data visu...
Predicting Eurosong with Google Predicting Eurosong with Google and data visu...Predicting Eurosong with Google Predicting Eurosong with Google and data visu...
Predicting Eurosong with Google Predicting Eurosong with Google and data visu...
 
Storytelling for impactful predictive models - Gert De Geyter
Storytelling for impactful predictive models - Gert De GeyterStorytelling for impactful predictive models - Gert De Geyter
Storytelling for impactful predictive models - Gert De Geyter
 
Lessons from driving analytics projects
Lessons from driving analytics projectsLessons from driving analytics projects
Lessons from driving analytics projects
 
Geospatial visual analytics
Geospatial visual analyticsGeospatial visual analytics
Geospatial visual analytics
 
Open-Source Data Science Crossing The Chasm
Open-Source Data Science Crossing The ChasmOpen-Source Data Science Crossing The Chasm
Open-Source Data Science Crossing The Chasm
 
Probabilistic machine learning for optimization and solving complex
Probabilistic machine learning for optimization and solving complexProbabilistic machine learning for optimization and solving complex
Probabilistic machine learning for optimization and solving complex
 
Closing
ClosingClosing
Closing
 
Welcome
WelcomeWelcome
Welcome
 

Recently uploaded

tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 

Recently uploaded (20)

tapal brand analysis PPT slide for comptetive data
Trumania, a realistic scenario-based data-generator

  • 1. Trumania, a realistic scenario-based data-generator
    Svend Vanderveken
    Leuven Data Science meetup - January 2018
  • 2. About us
    Real Impact Analytics
      • Data analytics solutions for telecommunication operators
      • https://realimpactanalytics.com
      • We’re hiring :)
    Gautier Krings
      • Co-founder of Jetpack.AI
      • http://jetpack.ai
    Svend Vanderveken
      • Freelance Data Engineer
      • @svend_x4f
      • https://sv3nd.github.io
    With some awesome contributions from:
      ● Thoralf Gutierrez
      ● Milan van der Meer
      ● Floran Hachez
  • 3. The problem
    Data engineers and data scientists need realistic test datasets to validate the behaviour of data-processing applications.
  • 4. The problem
    Why such datasets are hard to come by:
      ● using existing data is often not allowed
      ● we need a great diversity of datasets to validate many situations
  • 6. Existing solutions
    Schema-based approach
    # Ted Dunning’s Log Synth
    # https://github.com/tdunning/log-synth
    [
      {"name":"id", "class":"id"},
      {"name":"name", "class":"name", "type":"first_last"},
      {"name":"gender", "class":"string",
       "dist":{"MALE":0.5, "FEMALE":0.5, "OTHER":0.02}},
      {"name":"address", "class":"address"},
      {"name":"visit", "class":"date", "format":"MM/dd/yyyy",
       "start":"01/31/1995", "end":"02/07/1999"}
    ]
  • 7. Existing solutions
    Schema-based approach
      ● sufficient for many use cases
        => if you can, use that: it’s the simplest and the fastest
      ● caveats:
        ○ columns are often uncorrelated & dataset has no internal structure
        ○ little/no use of empirical distributions
        ○ hard to manipulate in terms of cause and consequence
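To make the "columns are often uncorrelated" caveat concrete, here is a minimal sketch of a schema-driven generator in plain Python. It is not Log Synth: the `SCHEMA` layout, the column choices and the `generate_rows` helper are invented for illustration. Every column has its own independent sampler, so nothing ties, say, `age` to `account_usage`.

```python
import random

# Hypothetical schema: one independent sampler per column.
# Each sampler receives the shared random source and the row index.
SCHEMA = {
    "id": lambda rng, i: f"PERSON_{i:06d}",
    # Relative weights, as in the slide's "dist" entry (they need not sum to 1).
    "gender": lambda rng, i: rng.choices(
        ["MALE", "FEMALE", "OTHER"], weights=[0.5, 0.5, 0.02])[0],
    "age": lambda rng, i: max(0, round(rng.gauss(35, 5))),
    # expovariate takes the rate, i.e. 1 / mean.
    "account_usage": lambda rng, i: rng.expovariate(1 / 2),
}


def generate_rows(schema, size, seed=0):
    """Draw `size` rows; every column is sampled independently of the others."""
    rng = random.Random(seed)
    return [
        {name: sampler(rng, i) for name, sampler in schema.items()}
        for i in range(size)
    ]


rows = generate_rows(SCHEMA, size=1000)
```

Because each sampler only sees the random source and the row index, any dependency between columns would have to be bolted on by hand, which is the limitation the slide points at.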
  • 8. Existing solutions
    Learning-based approaches
      ● fit a multivariate model to production data
      ● sample data from it
    SDGen: github.com/iostackproject/SDGen
    Synthetic Data Vault: dspace.mit.edu/handle/1721.1/109616
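The two steps above (fit, then sample) can be sketched with the simplest possible multivariate model, a two-column Gaussian. This is only an illustration of the idea and not how SDGen or the Synthetic Data Vault work; the "production" data below is itself simulated, and all function names are made up.

```python
import math
import random


def fit_bivariate_gaussian(pairs):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
    return {
        "mean": (mean_x, mean_y),
        "std": (math.sqrt(var_x), math.sqrt(var_y)),
        "rho": cov / math.sqrt(var_x * var_y),
    }


def sample(model, size, seed=0):
    """Draw synthetic pairs that reproduce the fitted correlation."""
    rng = random.Random(seed)
    (mean_x, mean_y), (std_x, std_y), rho = model["mean"], model["std"], model["rho"]
    residual = math.sqrt(max(0.0, 1 - rho ** 2))
    out = []
    for _ in range(size):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        y = mean_y + std_y * (rho * z1 + residual * z2)
        out.append((x, y))
    return out


# Stand-in for production data: usage grows with age, plus noise.
source_rng = random.Random(1)
production = []
for _ in range(5000):
    age = source_rng.gauss(35, 5)
    production.append((age, 0.1 * age + source_rng.gauss(0, 0.5)))

model = fit_bivariate_gaussian(production)
synthetic = sample(model, size=5000)
refit = fit_bivariate_gaussian(synthetic)
```

Comparing `model["rho"]` with `refit["rho"]` shows that the synthetic sample keeps the correlation between the two columns, which independent per-column samplers would lose.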
  • 9. Existing solutions
    Scenario/simulation-based:
      • Koen de Jonge's Telcotraffic simulator
      • cf. MLGeek meetup of 26 October 2016
      • github.com/botkop/botkop-telcotraffic-simulator
    Benchmark-based: TPC-DS
  • 12. Trumania population
      • Typically static / dimensional data (can be dynamic too)
      • Similar approach to schema-based
      • Correlated fields if necessary
  • 13.
    # a population of 10,000 persons, with ids prefixed by "PERSON_"
    person = circus.create_population(
        name="person", size=10000,
        ids_gen=SequencialGenerator(prefix="PERSON_"))
    # one attribute per call, each initialised from its own generator
    person.create_attribute(
        name="name",
        init_gen=FakerGenerator(method="name"))
    person.create_attribute(
        name="age",
        init_gen=NumpyRandomGenerator(method="normal", loc=35, scale=5))
    person.create_attribute(
        name="account_usage",
        init_gen=NumpyRandomGenerator(method="exponential", scale=2))
  • 14. Trumania generators
      • Common interface for all random aspects of a Circus
      • Essentially a thin wrapper around
        • numpy
        • faker
        • empirical distribution
        • ...bring your own distro
      • Can be transformed and chained
  • 15.
    beta_generator = NumpyRandomGenerator(method="beta", a=3, b=7)
    # .map() derives a new generator: beta samples in [0, 1] rescaled to ages in [10, 70]
    age_generator = beta_generator.map(lambda s: (s * 60) + 10)
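As a rough illustration of "can be transformed and chained", the sketch below builds a tiny generator wrapper with a `map` method using only the standard library. `SketchGenerator` is a hypothetical stand-in written for this example; it is not Trumania's `NumpyRandomGenerator` and makes no claim about how that class is implemented.

```python
import random


class SketchGenerator:
    """Hypothetical generator: wraps a function returning `size` random values."""

    def __init__(self, sample_fn):
        self._sample_fn = sample_fn

    def generate(self, size):
        return self._sample_fn(size)

    def map(self, f):
        """Return a new generator whose values are f(value) of this one."""
        return SketchGenerator(lambda size: [f(v) for v in self._sample_fn(size)])


rng = random.Random(42)

# Beta(3, 7) samples lie in [0, 1] and are skewed towards low values.
beta_generator = SketchGenerator(
    lambda size: [rng.betavariate(3, 7) for _ in range(size)])

# Chained transformations: rescale to [10, 70], then round to whole years.
age_generator = beta_generator.map(lambda s: (s * 60) + 10).map(round)

ages = age_generator.generate(size=1000)
```

Each `map` call leaves the original generator untouched and returns a new one, so the same base distribution can feed several derived generators.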
  • 16. Trumania population: real data too
    Handy to combine real and random data inside a circus
    distributors = population.load_from("/data/real_distributors.csv")
  • 17. Trumania relationships
      • relations among populations
        • shops per geographical zone,
        • social networks,
        • …
      • dynamic or static
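One way to picture a weighted relationship that can change over time is a mapping from each "from" id to its weighted targets, with a random pick proportional to weight. The class below is a hypothetical sketch for illustration only; its name, methods and storage layout are assumptions and do not describe Trumania's internals.

```python
import random
from collections import defaultdict


class SketchRelationship:
    """Hypothetical weighted relationship between ids of two populations."""

    def __init__(self, seed=0):
        self._edges = defaultdict(dict)  # from_id -> {to_id: weight}
        self._rng = random.Random(seed)

    def add(self, from_id, to_id, weight=1.0):
        self._edges[from_id][to_id] = weight

    def remove(self, from_id, to_id):
        # Relationships can be dynamic: edges may disappear during a simulation.
        self._edges[from_id].pop(to_id, None)

    def select_one(self, from_id):
        """Pick one related id at random, proportionally to the edge weights."""
        targets = self._edges.get(from_id)
        if not targets:
            return None
        return self._rng.choices(list(targets), weights=list(targets.values()))[0]


friends = SketchRelationship(seed=3)
friends.add("PERSON_1", "PERSON_2", weight=9)  # close friend, called often
friends.add("PERSON_1", "PERSON_3", weight=1)  # acquaintance, called rarely

picks = [friends.select_one("PERSON_1") for _ in range(1000)]
```

With these weights, roughly nine picks out of ten land on `PERSON_2`, which is how a social network can make some contacts far more likely than others.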
  • 18. Trumania stories
      • Executing a story produces the events
      • Sequence of random or deterministic operations
      • Made of:
        • generators
        • random traversal of weighted relationships
        • population’s attribute lookups
        • update of the Circus state
  • 19.
    duration_gen = ...
    # outputs a time series with:
    # PERSON_ID, CALLER_NAME, DURATION, CALLEE_ID, CALLEE_NAME, TIME
    call_story.set_operations(
        person_population.ops.lookup(
            actor_id_field="PERSON_ID",
            select={"NAME": "CALLER_NAME"}),
        duration_gen.ops.generate(named_as="DURATION"),
        person_population.get_relationship("friends").ops.select_one(
            from_field="PERSON_ID",
            named_as="CALLEE_ID"),
        person_population.ops.lookup(
            actor_id_field="CALLEE_ID",
            select={"NAME": "CALLEE_NAME"}),
        clock.ops.timestamp(named_as="TIME")
    )
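To show what "a sequence of operations producing events" can mean in practice, here is a self-contained toy that mirrors the shape of the call story above: each operation adds one field to a row, and running the chain once per active person yields one event each. Everything in it (`NAMES`, `FRIENDS`, `run_story` and the operation factories) is a hypothetical simplification, not Trumania code.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(7)

# Toy state standing in for a population attribute and a relationship.
NAMES = {"PERSON_1": "Alice", "PERSON_2": "Bob", "PERSON_3": "Carol"}
FRIENDS = {"PERSON_1": ["PERSON_2", "PERSON_3"],
           "PERSON_2": ["PERSON_1"],
           "PERSON_3": ["PERSON_1"]}


def lookup(id_field, named_as):
    """Operation: copy the name of the person in `id_field` into `named_as`."""
    def op(row):
        row[named_as] = NAMES[row[id_field]]
        return row
    return op


def generate_duration(named_as):
    """Operation: draw a call duration in seconds (mean of about 120)."""
    def op(row):
        row[named_as] = round(rng.expovariate(1 / 120))
        return row
    return op


def select_friend(from_field, named_as):
    """Operation: pick one friend of the person in `from_field`."""
    def op(row):
        row[named_as] = rng.choice(FRIENDS[row[from_field]])
        return row
    return op


def timestamp(now, named_as):
    """Operation: stamp the event somewhere within the hour starting at `now`."""
    def op(row):
        row[named_as] = now + timedelta(seconds=rng.randrange(3600))
        return row
    return op


def run_story(operations, active_ids):
    """Apply every operation, in order, to one fresh row per active person."""
    rows = []
    for person_id in active_ids:
        row = {"PERSON_ID": person_id}
        for op in operations:
            row = op(row)
        rows.append(row)
    return rows


call_operations = [
    lookup("PERSON_ID", "CALLER_NAME"),
    generate_duration("DURATION"),
    select_friend("PERSON_ID", "CALLEE_ID"),
    lookup("CALLEE_ID", "CALLEE_NAME"),
    timestamp(datetime(2018, 1, 1, 9, 0), "TIME"),
]
events = run_story(call_operations, active_ids=["PERSON_1", "PERSON_2", "PERSON_3"])
```

The order of operations matters: `CALLEE_NAME` can only be looked up once `CALLEE_ID` has been selected, which is the cause-and-consequence structure that a flat schema cannot express.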
  • 20. More Trumania
      • … and time profiles
      • … and a circus persistence mechanism
      • … and circus state updates
      • ...
  • 21. Trumania caveats
    Some possible improvements:
      • performance: python, pandas
      • more I/O options (it's all local CSV for now)
      • it’s a young tool ;)
  • 22. Trumania open source
    The project is open source as of today!
    Code and scenario examples: github.com/RealImpactAnalytics/trumania
    Documentation: realimpactanalytics.github.io/trumania
    Slack: trumania.slack.com
    Clone it, try it, let us know what you think!
  • 23. Brussels Office: 5, Place du Champ de Mars, 1050 Brussels, Belgium
    Cape Town Office: 34 Somerset Road, 8005, Green Point, Cape Town, South Africa
    São Paulo Office: 93, Rua Doutor Andrade Pertence, Vila Olímpia, São Paulo, Brazil
    Luxembourg Office: 2 - L 2314, Place de Paris, Luxembourg, Grand-Duchy of Luxembourg
    Follow us: www.realimpactanalytics.com
  • 24. Legal notices and disclaimer
    All rights reserved. No part of this document may be reproduced, utilized, stored in a retrieval system, or transmitted in any form or by any means without the prior written permission of Real Impact Analytics. The information, including any analyses, numbers, images, and pricing data contained in this document are non-binding and for discussion purposes only. As such, they are subject to adjustments and/or modifications at the sole discretion of Real Impact Analytics. Any agreement is subject to the signature of a definitive final contract between Real Impact Analytics and the recipient and the acceptance by the Recipient of Real Impact Analytics’ terms and conditions.