Data Generation with PROSPECT: a Probability Specification Tool

Alan Ismaiel, Ivan Ruchkin, Jason Shu, Oleg Sokolsky, Insup Lee
University of Pennsylvania Computer and Information Science Department
Winter Simulation Conference 2021
December 14th, 2021

Motivating Scenario: Autonomous Car
• An engineering team is building an autonomous cleaning vehicle
• Team intends to simulate the vehicle in desired conditions:
• Time of day is determined by the cleaning schedule
• Lane occupancy is determined by the parking ticket history
• Obstacle detection rate differs by time of day
How should they simulate the conditions?

Motivating Scenario: Network Latency
• Network monitor estimates latency based on N latest ping delays
• Need simple synthetic data to test the monitor
• Goal: quickly generate a simple dataset of
• Observed ping delays
• Underlying network latencies
• Desirable properties of the dataset:
• Low latency on average
• Ping delays change occasionally over time
• High latency sometimes leads to high ping delays
How to generate this dataset?

Automated Data Generation
• Increasingly important: testing complex systems, deep learning
• Obtaining real data often infeasible or impractical
• Many information sources: requirements, common-sense constraints,
intuition, known statistics
Current data generation tools:
• Tailored to a specific model
• Imperative sampling
• Little support for arbitrary constraints
• High complexity

Problem
Declaratively specify and automatically sample a discrete temporal distribution
under known constraints:
● Algebraic constraints on marginal/joint/conditional probabilities
● Conditional/unconditional independence
● Temporal relations
Intractable in general

Approach Overview
1) Define tractable cases with a shared underlying model
2) The user specifies a distribution in our high-level declarative language
3) The specification is translated into polynomial equations
4) The system of polynomial equations is solved algebraically
5) If the solution defines a unique distribution, we sample it

Discrete Time Markov Chains (DTMCs)
DTMC: A discrete-time stochastic process that satisfies the Markov property:
the conditional probabilities of future states depend only on the present
state, not on the earlier history.
Arbitrary DTMCs are difficult to specify
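Written out for a process X_0, X_1, ... over a finite set of states, the Markov property reads as follows. The symbols X_t, s, and s' are illustrative notation chosen for this sketch rather than notation from the slides.

```latex
P(X_{t+1} = s' \mid X_t = s,\ X_{t-1} = s_{t-1},\ \dots,\ X_0 = s_0)
  = P(X_{t+1} = s' \mid X_t = s)
```

Even with this simplification, a general DTMC needs one transition probability for every combination of current state, next state, and time step, which is one way to see why arbitrary DTMCs are laborious to write down by hand.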

Three Case Types (I)
Static Case: Time is irrelevant, sampling is conducted i.i.d.
Time-Invariant Case: Sampling is not independent, but the temporal
distributions don’t change over time.
Time-Variant Case: Sampling is not independent, and the temporal
distributions change over time.
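The difference between the three cases can be sketched with a small sampler. This is a minimal illustration using only the Python standard library, not PROSPECT's implementation; the function names, the "low"/"high" latency states, and all probabilities are hypothetical values chosen for the example.

```python
import random


def sample_static(dist, n, rng):
    """Static case: draw n i.i.d. samples from a fixed distribution."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=n)


def sample_chain(initial, transition_at, n, rng):
    """Temporal cases: draw a trace of length n from a Markov chain.

    transition_at(t) returns {state: {next_state: probability}} for step t.
    A constant function gives the time-invariant case; a function that
    depends on t gives the time-variant case.
    """
    if n <= 0:
        return []
    states = list(initial)
    current = rng.choices(states, weights=[initial[s] for s in states])[0]
    trace = [current]
    for t in range(1, n):
        row = transition_at(t)[current]
        next_states = list(row)
        current = rng.choices(next_states, weights=[row[s] for s in next_states])[0]
        trace.append(current)
    return trace


# Hypothetical numbers for a two-state latency model.
STATIC = {"low": 0.9, "high": 0.1}
CALM = {"low": {"low": 0.95, "high": 0.05}, "high": {"low": 0.5, "high": 0.5}}
BUSY = {"low": {"low": 0.6, "high": 0.4}, "high": {"low": 0.2, "high": 0.8}}


def time_invariant(t):
    return CALM  # same transition probabilities at every step


def time_variant(t):
    return BUSY if t >= 50 else CALM  # probabilities change at step 50


rng = random.Random(0)
iid_samples = sample_static(STATIC, 100, rng)
invariant_trace = sample_chain(STATIC, time_invariant, 100, rng)
variant_trace = sample_chain(STATIC, time_variant, 100, rng)
```

In the static case each sample ignores its predecessor, while in both temporal cases the next state is drawn from the row selected by the current state; only the time-variant case lets that row depend on the step index.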

Specification Language: Scenario

Specification: Case and Variables
[DSL listing shown on slide]

Specification: Independence
[DSL listing shown on slide]

Specification: Probability Constraints
[DSL listing shown on slide]

Parameterizing Specifications
Goal: Parameterize all the probability specifications into algebraic equations
We define O-Parameters to represent the probabilities of elementary events
over the user-defined sample space.
Every syntax element can be expressed with O-Parameters:
• Parameterize conditional and unconditional event probabilities
• Parameterize conditional and unconditional independence
• Parameterize the stationary assumption (time invariant case)
• Parameterize recursive probability specifications (time variant case)
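As an illustration, consider a static case with two binary variables X and Y. The symbol names o_{xy} and the numeric values below are assumptions made for this sketch, not notation or data from the slides. Each statement becomes a polynomial equation over the four O-Parameters:

```latex
\begin{align*}
o_{xy} &= P(X = x,\ Y = y), \quad x, y \in \{0, 1\} \\
o_{00} + o_{01} + o_{10} + o_{11} &= 1
  && \text{(normalization)} \\
o_{10} + o_{11} &= 0.3
  && (P(X = 1) = 0.3) \\
o_{11} &= 0.2\,(o_{10} + o_{11})
  && (P(Y = 1 \mid X = 1) = 0.2) \\
o_{11} &= (o_{10} + o_{11})(o_{01} + o_{11})
  && (X \text{ independent of } Y)
\end{align*}
```

The conditional probability is multiplied through by its denominator, and independence introduces a product of two sums, so the resulting system is polynomial rather than linear. For these particular values the system has a single solution: o_{11} = 0.06, o_{10} = 0.24, o_{01} = 0.14, o_{00} = 0.56.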

Motivating Scenario Parameters (1)

Solving a System of Equations
Goal: Find unique distribution parameters satisfying the equations in step 3
Relies on Buchberger's Algorithm and Cylindrical Algebraic Decomposition,
solving algorithms that are guaranteed to terminate.
Our implementation, based on Wolfram Mathematica, automatically picks an
appropriate solving algorithm and returns solutions over the complex numbers.
• We only consider solutions that define valid probability distributions
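The validity filter can be sketched as a check on each candidate solution. This is a minimal Python illustration under assumed conventions, not the Mathematica-based implementation: a candidate is represented as a dictionary from elementary events to (possibly complex) values, and the names `is_valid_distribution` and `keep_valid` are hypothetical.

```python
def is_valid_distribution(candidate, tol=1e-9):
    """Return True if every value is real, lies in [0, 1], and all sum to 1."""
    values = [complex(v) for v in candidate.values()]
    if any(abs(v.imag) > tol for v in values):
        return False  # genuinely complex solution
    reals = [v.real for v in values]
    if any(r < -tol or r > 1 + tol for r in reals):
        return False  # real, but not a probability
    return abs(sum(reals) - 1.0) <= tol


def keep_valid(candidates, tol=1e-9):
    """Filter a list of candidate solutions down to valid distributions."""
    return [c for c in candidates if is_valid_distribution(c, tol)]


# Hypothetical solver output: one valid distribution, two invalid roots.
candidates = [
    {"00": 0.56, "01": 0.14, "10": 0.24, "11": 0.06},
    {"00": 1.5, "01": -0.5, "10": 0.0, "11": 0.0},
    {"00": 0.5 + 0.1j, "01": 0.5 - 0.1j, "10": 0.0, "11": 0.0},
]
valid = keep_valid(candidates)
```

The tolerance guards against floating-point round-off; a symbolic solver would make the same checks exactly.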

Solving for Unique Solutions
Solving algorithm returns a set S of probability distributions.
● |S| = 1: the distribution is unique
● |S| > 1: the distribution is underspecified (not enough constraints for a
unique distribution)
● |S| = 0: the distribution is overspecified (conflicting constraints, no
distribution possible)
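A toy example shows all three outcomes. The sketch below assumes two independent events A and B with known P(A and B) and P(A or B); the constraint values and function names are hypothetical, and the closed-form quadratic stands in for the general algebraic solver. Independence gives P(A)P(B) = P(A and B), and inclusion-exclusion gives P(A) + P(B) = P(A or B) + P(A and B), so P(A) and P(B) are the two roots of one quadratic.

```python
import cmath

TOL = 1e-9


def solve_independent_pair(p_and, p_or):
    """Return all valid (P(A), P(B)) pairs for independent events A and B."""
    s = p_or + p_and  # P(A) + P(B)
    disc = cmath.sqrt(s * s - 4 * p_and)  # may be complex
    roots = [(s + disc) / 2, (s - disc) / 2]
    solutions = set()
    for a, b in [(roots[0], roots[1]), (roots[1], roots[0])]:
        is_real = abs(a.imag) < TOL and abs(b.imag) < TOL
        in_range = all(-TOL <= r.real <= 1 + TOL for r in (a, b))
        if is_real and in_range:
            # Rounding merges duplicates that differ only by round-off.
            solutions.add((round(a.real, 9), round(b.real, 9)))
    return sorted(solutions)


def classify(solutions):
    """Map the size of the solution set S to the three outcomes."""
    if len(solutions) == 1:
        return "unique"
    if len(solutions) > 1:
        return "underspecified"
    return "overspecified"


unique_case = classify(solve_independent_pair(0.25, 0.75))  # P(A) = P(B) = 0.5
under_case = classify(solve_independent_pair(0.06, 0.44))   # (0.2, 0.3) or (0.3, 0.2)
over_case = classify(solve_independent_pair(0.5, 0.6))      # only complex roots
```

In the underspecified example the constraints cannot tell which of the two events has probability 0.2 and which has 0.3; one more constraint, such as a value for P(A), would make the distribution unique.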

Solving the Motivating Scenario

PROSPECT
PROSPECT is a software tool that takes an input written in the specification
language and outputs generated data.
https://prospect.precise.seas.upenn.edu
https://github.com/bisc/prospect

PROSPECT (Pre-Recorded Demo)

Evaluation
Goal: Compare the required manual effort, length of code/specification, and
accuracy of data generation between the PROSPECT approach and a probabilistic
programming language (PPL) baseline
For each scenario, we wrote two data generation programs in the PPL Pyro
v1.5.1:
1. Accurate solution: correctly interprets the specifications and assumptions,
manually inferring the intended distribution
2. Naive solution: demonstrates plausible errors by ignoring implicit
dependencies between variables
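The kind of error a naive program can make is easy to reproduce with exact arithmetic. The sketch below is plain Python rather than Pyro, and it is not one of the evaluated programs; the variables X and Y and all probabilities are hypothetical. It compares the joint probability computed from the conditional with the one obtained by treating Y as independent of X.

```python
from fractions import Fraction

# Hypothetical specification: Y depends on X.
P_X1 = Fraction(3, 10)                 # P(X = 1)
P_Y1_GIVEN_X = {1: Fraction(8, 10),    # P(Y = 1 | X = 1)
                0: Fraction(1, 10)}    # P(Y = 1 | X = 0)


def marginal_y1():
    """P(Y = 1) by the law of total probability."""
    return P_X1 * P_Y1_GIVEN_X[1] + (1 - P_X1) * P_Y1_GIVEN_X[0]


def accurate_joint():
    """P(X = 1, Y = 1) respecting the dependency of Y on X."""
    return P_X1 * P_Y1_GIVEN_X[1]


def naive_joint():
    """P(X = 1, Y = 1) if Y is wrongly sampled independently of X."""
    return P_X1 * marginal_y1()


accurate = accurate_joint()  # 6/25  = 0.24
naive = naive_joint()        # 93/1000 = 0.093
```

Both programs reproduce the marginals of X and Y, so the mistake only shows up when joint or conditional frequencies are compared against the specification.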

Evaluation: Length of Code/Spec
PROSPECT specifications were substantially more succinct than
probabilistic programs, achieving 2-3x reduction of line count

Evaluation: Sampling Accuracy
[Figure panels: Naive Baseline, Accurate Baseline, PROSPECT]
The accurate baseline and PROSPECT were statistically
indistinguishable on a full sample of 10,000 points; both
obtained more accurate results than the naive baseline

Future Work
• Syntax extensions for broader sampling settings:
• Continuous parametric distributions
• Probabilities constrained by variable values
• Semantic extensions for under-specified distributions:
• Resolving ambiguity with meta-models
• Tuning to available data
• Tool extensions for usability:
• Conditional termination of sampling
• When over-specified, return the minimal conflicting sub-spec

Conclusion
Our contributions:
1. A specification language for discrete distributions
2. An algebraic inference approach for distributions from the specifications
3. A software tool PROSPECT that implements the language and interface
4. An evaluation of PROSPECT on 3 case studies
We believe this approach can be used for simulation, probabilistic reasoning,
design and analysis, and other tasks that require probabilistic specifications.
https://prospect.precise.seas.upenn.edu

Related Works
DSLs: SESSL and NEDL build deterministic designs with predefined patterns
• PROSPECT samples potentially complex random designs and complements this work
Simulators: CARLA, Udacity, X-Plane, AirSim
• Focus on specific discrete domains, whereas PROSPECT operates at any level of abstraction
Graphical Models: Markov/Bayesian Networks
• PROSPECT is a domain-agnostic approach that can create graphical models
PPLs: Pyro, Scenic
• Stronger focus on inferring a model given a program and dataset, whereas PROSPECT relies on explicit
declarative specifications
Copulas: ARTA, NORTA, VARTA; Stochastic Programming: SAMPL
• Focus on continuous rather than discrete distributions and require knowledge/data to choose models;
PROSPECT does not