Trumania, a realistic scenario-based data-generator
Svend Vanderveken
Leuven Data Science meetup - January 2018

About us
Real Impact Analytics
• Data analytics solutions for telecommunication operators
• https://realimpactanalytics.com
• We’re hiring :)
Gautier Krings
• Co-founder of Jetpack.AI
• http://jetpack.ai
Svend Vanderveken
• Freelance Data Engineer
• @svend_x4f
• https://sv3nd.github.io
With some awesome contributions from:
● Thoralf Gutierrez
● Milan van der Meer
● Floran Hachez

The problem
Data engineers and data scientists
need realistic test datasets
to validate the behaviour of data-processing applications

The problem
Why such datasets are hard to come by:
● using existing data is often not allowed
● we need a great diversity of datasets to validate many situations

Existing solutions
Schema-based approach
# Ted Dunning’s Log Synth
# https://github.com/tdunning/log-synth
[
{"name":"id", "class":"id"},
{"name":"name", "class":"name", "type":"first_last"},
{"name":"gender", "class":"string",
"dist":{"MALE":0.5, "FEMALE":0.5, "OTHER":0.02}},
{"name":"address", "class":"address"},
{"name":"visit", "class":"date", "format":"MM/dd/yyyy",
"start":"01/31/1995", "end":"02/07/1999"}
]

Existing solutions
Schema-based approach
● sufficient for many use cases
=> if you can, use that: it’s the simplest and the fastest
● caveat:
○ columns are often uncorrelated & the dataset has no internal structure
○ little/no use of empirical distributions
○ hard to manipulate in terms of cause and consequence
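
To illustrate the first caveat, here is a minimal sketch of what a schema-based generator does. It is not Log Synth's implementation: the `SCHEMA` layout, the `kind` values and `generate_rows` are hypothetical names made up for this example. Each column is drawn on its own, so nothing ties one column's value to another's.

```python
import random

# Hypothetical schema: each column is described in isolation.
SCHEMA = [
    {"name": "id", "kind": "sequence"},
    {"name": "gender", "kind": "choice",
     "dist": {"MALE": 0.5, "FEMALE": 0.5, "OTHER": 0.02}},
    {"name": "age", "kind": "uniform_int", "low": 18, "high": 90},
]


def generate_rows(schema, n, seed=0):
    """Draw every column independently from its own specification."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        row = {}
        for col in schema:
            if col["kind"] == "sequence":
                row[col["name"]] = i
            elif col["kind"] == "choice":
                values = list(col["dist"])
                # Treated as relative weights, so they need not sum to 1.
                weights = list(col["dist"].values())
                row[col["name"]] = rng.choices(values, weights=weights)[0]
            elif col["kind"] == "uniform_int":
                row[col["name"]] = rng.randint(col["low"], col["high"])
        rows.append(row)
    return rows


rows = generate_rows(SCHEMA, 5)
for row in rows:
    print(row)
```

Because `age` and `gender` are sampled separately, any correlation between them in the output is accidental, which is the limitation the slide points at.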

Existing solutions
Learning-based approaches
● fit a multivariate model to production data
● sample data from it
SDGen:
github.com/iostackproject/SDGen
Synthetic Data Vault:
dspace.mit.edu/handle/1721.1/109616
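
The "fit, then sample" idea can be sketched with the simplest possible multivariate model, a two-dimensional Gaussian. This is only an illustration under stated assumptions: it is not how SDGen or the Synthetic Data Vault work internally, the `observed` data is made up to stand in for production data, and the function names are hypothetical.

```python
import math
import random


def fit_gaussian_2d(points):
    """Estimate the mean and covariance of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def sample_gaussian_2d(mean, cov, n, seed=0):
    """Sample correlated pairs using the Cholesky factor of the covariance."""
    rng = random.Random(seed)
    (mx, my), (sxx, sxy, syy) = mean, cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(syy - l21 ** 2, 0.0))
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        samples.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return samples


# Made-up stand-in for production data: (age, monthly spend), correlated.
data_rng = random.Random(42)
observed = []
for _ in range(500):
    age = data_rng.gauss(35, 5)
    observed.append((age, 2.0 * age + data_rng.gauss(0, 3)))

mean, cov = fit_gaussian_2d(observed)
synthetic = sample_gaussian_2d(mean, cov, 1000, seed=1)
```

The synthetic pairs keep the correlation between the two columns without copying any observed row, which is the appeal of this family of approaches.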

Existing solutions
Scenario/simulation based:
• Koen de Jonge's telco traffic simulator
• cf. MLGeek meetup of 26 Oct 2016
• github.com/botkop/botkop-telcotraffic-simulator
Benchmark-based: TPC-DS

Trumania
realistic
scenario-based
python 3 library
facts & dimensions

Trumania circus
[Diagram: a circus, with boxes labelled Population (x3), Story (x2) and Logs]

Trumania population
• Typically static / dimensional data (can be dynamic too)
• Similar approach to schema-based
• Correlated fields if necessary

from trumania.core.random_generators import (
    SequencialGenerator, FakerGenerator, NumpyRandomGenerator)

# `circus` is assumed to be an existing Circus instance
person = circus.create_population(
    name="person", size=10000,
    ids_gen=SequencialGenerator(prefix="PERSON_"))
person.create_attribute(
    name="name",
    init_gen=FakerGenerator(method="name"))
person.create_attribute(
    name="age",
    init_gen=NumpyRandomGenerator(method="normal", loc=35, scale=5))
person.create_attribute(
    name="account_usage",
    init_gen=NumpyRandomGenerator(method="exponential", scale=2))

Trumania generators
• Common interface for all random aspects of a Circus
• Essentially a thin wrapper around
• numpy
• faker
• empirical distribution
• ...bring your own distro
• Can be transformed and chained
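
The "transformed and chained" point can be pictured with a small stand-alone sketch. `SketchGenerator` is a hypothetical class written for this illustration with the standard library only; it is not Trumania's generator interface, it just mimics the idea of a thin wrapper that exposes `map`.

```python
import random


class SketchGenerator:
    """Hypothetical stand-in: wraps a function that returns one random value."""

    def __init__(self, draw):
        self._draw = draw

    def generate(self, size):
        # Produce `size` independent draws.
        return [self._draw() for _ in range(size)]

    def map(self, f):
        # Chaining: the new generator draws from this one, then applies f.
        return SketchGenerator(lambda: f(self._draw()))


rng = random.Random(0)

# Beta(3, 7) samples lie in [0, 1]; rescale to [10, 70] and round to whole years.
beta_gen = SketchGenerator(lambda: rng.betavariate(3, 7))
age_gen = beta_gen.map(lambda s: s * 60 + 10).map(round)

ages = age_gen.generate(5)
print(ages)
```

Since every transformation returns another generator, any source of randomness (numpy, faker, an empirical distribution) can sit behind the same small interface.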

beta_gen = NumpyRandomGenerator(method="beta", a=3, b=7)
# .map() transforms each sampled value: beta samples in [0, 1] become ages in [10, 70]
age_generator = beta_gen.map(lambda s: (s * 60) + 10)

Trumania population: real data too
Handy to combine real and random data inside a circus
distributors = population.load_from("/data/real_distributors.csv")

Trumania relationships
• relations among populations
• shops per geographical zones,
• social networks,
• …
• dynamic or static
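
A weighted, updatable relationship can be sketched as a dictionary of weighted edges. `SketchRelationship` and its methods are hypothetical names for this illustration, not Trumania's relationship class; the ids are made up.

```python
import random
from collections import defaultdict


class SketchRelationship:
    """Hypothetical weighted relationship between ids of two populations."""

    def __init__(self, seed=0):
        self._edges = defaultdict(dict)  # from_id -> {to_id: weight}
        self._rng = random.Random(seed)

    def add(self, from_id, to_id, weight=1.0):
        self._edges[from_id][to_id] = weight

    def remove(self, from_id, to_id):
        # Dynamic relationships: edges can be dropped while the simulation runs.
        self._edges[from_id].pop(to_id, None)

    def select_one(self, from_id):
        """Pick one related id at random, proportionally to the edge weights."""
        targets = self._edges.get(from_id)
        if not targets:
            return None
        ids = list(targets)
        weights = [targets[i] for i in ids]
        return self._rng.choices(ids, weights=weights)[0]


friends = SketchRelationship()
friends.add("PERSON_1", "PERSON_2", weight=5.0)  # close friend, picked more often
friends.add("PERSON_1", "PERSON_3", weight=1.0)

callee = friends.select_one("PERSON_1")
print(callee)
```

The same structure covers "shops per geographical zone" by using zone ids on one side and shop ids on the other.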

Trumania stories
• Executing a story produces the events
• Sequence of random or deterministic operations
• Made of:
• generators
• random traversal of weighted relationships
• population’s attribute lookups
• update of the Circus state

duration_gen = ...
# outputs a time series with:
# PERSON_ID, CALLER_NAME, DURATION, CALLEE_ID, CALLEE_NAME, TIME
call_story.set_operations(
    person_population.ops.lookup(
        actor_id_field="PERSON_ID",
        select={"NAME": "CALLER_NAME"}),
    duration_gen.ops.generate(named_as="DURATION"),
    person_population.get_relationship("friends").ops.select_one(
        from_field="PERSON_ID", named_as="CALLEE_ID"),
    person_population.ops.lookup(
        actor_id_field="CALLEE_ID",
        select={"NAME": "CALLEE_NAME"}),
    clock.ops.timestamp(named_as="TIME")
)
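
To make the "sequence of operations" idea concrete, here is a stand-alone sketch in plain Python. It does not use Trumania and does not reflect its internals (which, per the caveats slide, are pandas-based): every function name, the `names` and `friends` data and the integer `clock` are hypothetical. Each operation takes a log row, adds a field and passes the row on.

```python
import random

rng = random.Random(0)

# Made-up circus state: a population attribute and a "friends" relationship.
names = {"PERSON_1": "Alice", "PERSON_2": "Bob", "PERSON_3": "Carol"}
friends = {"PERSON_1": ["PERSON_2", "PERSON_3"],
           "PERSON_2": ["PERSON_1"],
           "PERSON_3": ["PERSON_1"]}
clock = {"now": 0}


def lookup(id_field, named_as):
    def op(row):
        row[named_as] = names[row[id_field]]  # attribute lookup
        return row
    return op


def generate_duration(named_as):
    def op(row):
        # Exponential call duration with a mean of 60 seconds.
        row[named_as] = round(rng.expovariate(1 / 60.0), 1)
        return row
    return op


def select_one(from_field, named_as):
    def op(row):
        row[named_as] = rng.choice(friends[row[from_field]])  # relationship traversal
        return row
    return op


def timestamp(named_as):
    def op(row):
        row[named_as] = clock["now"]
        return row
    return op


def run_story(operations, active_ids):
    """Apply each operation in order to one row per active person."""
    logs = []
    for person_id in active_ids:
        row = {"PERSON_ID": person_id}
        for op in operations:
            row = op(row)
        logs.append(row)
    clock["now"] += 1  # the circus state moves forward after each step
    return logs


call_story = [
    lookup("PERSON_ID", "CALLER_NAME"),
    generate_duration("DURATION"),
    select_one("PERSON_ID", "CALLEE_ID"),
    lookup("CALLEE_ID", "CALLEE_NAME"),
    timestamp("TIME"),
]

logs = run_story(call_story, ["PERSON_1", "PERSON_2"])
for row in logs:
    print(row)
```

Because the callee is picked from the caller's friends and the names come from the population, the columns of the resulting log are correlated by construction rather than sampled independently.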

More Trumania
• … and time profiles
• … and a circus persistence mechanism
• … and circus state updates
• ...

Trumania caveats
Some possible improvements:
• performance: python, pandas
• more I/O options (it's all local CSV for now)
• it’s a young tool ;)

Trumania open source
The project is open source as of today!
Code and scenario examples: github.com/RealImpactAnalytics/trumania
Documentation: realimpactanalytics.github.io/trumania
Slack: trumania.slack.com
Clone it, try it, let us know what you think!
