Legal policy simulation is an important decision-support tool in domains such as taxation. The primary goal of legal policy simulation is predicting how changes in the law affect measures of interest, e.g., revenue. Currently, legal policies are simulated via a combination of spreadsheets and software code. This poses a validation challenge, both because of the complexity involved and because legal experts lack the expertise to understand software code. A further challenge is that representative data for simulation may be unavailable, thus necessitating a data generator.
We develop a framework for legal policy simulation that is aimed at addressing these challenges. The framework uses models for specifying both legal policies and the probabilistic characteristics of the underlying population. We devise an automated algorithm for simulation data generation. We evaluate our framework through a case study on Luxembourg's Tax Law.
Models15
1. A Model-Based Framework for Probabilistic Simulation of Legal Policies
Ghanem Soltana, Nicolas Sannier, Mehrdad Sabetzadeh, and Lionel Briand
SnT Centre for Security, Reliability and Trust
University of Luxembourg, Luxembourg
2. How did this work come about?
• Collaboration with Government of Luxembourg
- CTIE: Government’s IT Centre
- ACD: Tax Administration Department
• New tax system under development
• Develop tailored solutions for decision-support and software verification
3. Context
Using UML for Modeling Procedural Legal Rules: Approach and a Study of Luxembourg’s Tax Law
Ghanem Soltana, Elizabeta Fourneret, Morayo Adedjouma, Mehrdad Sabetzadeh, and Lionel Briand
SnT Centre for Security, Reliability and Trust, University of Luxembourg
{firstname.lastname}@uni.lu
Abstract. Many laws, e.g., those concerning taxes and social benefits, need to be operationalized and implemented into public administration procedures and eGovernment applications. Where such operationalization is warranted, the legal frameworks that interpret the underlying […]
5. Objectives
• Simulating the impact of legal policy changes
• Enabling simulation even when simulation data is not available
[Figure: models of legal policies are simulated, with simulation data as input; the models can (optionally) also generate the simulation data. The output is the impact of legal policy changes on variables of interest. Example plots: percentage of households and contribution to revenue by gross annual income (in Euros); percentage of households by annual income taxes due (in Euros), before and after the change.]
6. Legal policy simulation in practice
Some existing simulation tools focused on taxation and social security:
• ASSERT: Assessing the effects of reforms in taxation
• SYSIFF: A micro-simulation model for the French tax system
• POLIMOD: A national static tax-benefit model for the UK
• EUROMOD: European benefit-tax model and social integration
9. Limitations of current simulation frameworks
• Legal policies are hard to validate
• Single-purpose models
• Unusable when simulation data is not available
10. Desiderata
• Legal policies should be captured in a precise and yet easy-to-understand manner
• Automated simulation/analysis should be possible even when data is not available
11. Working assumptions
• Legal policies are from prescriptive laws
- Taxation and social benefits
• No change in human behavior due to legal policy modifications
12. Our policy simulation framework
[Figure: framework overview]
Step 1. Model legal policies: relevant legal texts → domain model and policy models
Step 2. Annotate domain model with probabilities: domain model → annotated domain model (stereotypes <<s>>, <<p>>, <<m>>)
Step 3. Generate simulation data (only if simulation data is not available): annotated domain model → generated simulation data
Step 4. Perform simulation: policy models and simulation data (existing or generated) → simulation results
13. Legal policy models [Soltana et al., 2014]
• A legal policy model captures the procedure envisaged by law for performing a certain activity
• Notation: Extended Activity Diagrams (ADs)
• Facilitates communication between legal and IT experts
• ADs are expressive, visual, precise, and executable
14. Art. 105bis […] The commuting expenses deduction is defined as a
function over the distance between the principal towns of the
municipalities of a taxpayer's home and his place of work.
The distance is measured in units of distance expressing the kilometric
distance between [principal] towns. A ministerial regulation provides
these distances.
The amount of the deduction is calculated as follows:
• If the distance exceeds 4 units but is less than 30 units, the deduction
is 99€ per unit of distance.
• The first 4 units are not taken into account and the deduction for a
distance exceeding 30 units is limited to 2,574€.
* Translation from French text
Excerpt from the income tax law
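The rule in this excerpt can be written down as a small function. The following is only a sketch of one reading of the translated text, not the policy model used in the framework: it assumes that the 99€ rate applies to each unit beyond the first 4, which is consistent with the stated cap since 99 × (30 − 4) = 2,574. The function and constant names are illustrative.

```python
# One possible reading of Art. 105bis (assumption: units beyond the
# first 4 are counted at 99 euros each, up to 30 units in total).
RATE_PER_UNIT = 99     # euros per counted unit of distance
EXEMPT_UNITS = 4       # the first 4 units are not taken into account
CAP_UNITS = 30         # units beyond 30 do not increase the deduction
MAX_DEDUCTION = 2574   # euros, the limit stated in the excerpt


def commuting_deduction(distance_units: int) -> int:
    """Return the commuting expenses deduction in euros."""
    if distance_units <= EXEMPT_UNITS:
        return 0
    counted_units = min(distance_units, CAP_UNITS) - EXEMPT_UNITS
    # The explicit cap is redundant under this reading, but it keeps
    # the limit from the legal text visible in the code.
    return min(RATE_PER_UNIT * counted_units, MAX_DEDUCTION)


if __name__ == "__main__":
    for units in (3, 10, 30, 45):
        print(units, commuting_deduction(units))
```

Under this reading, a taxpayer at 10 units gets 6 counted units, and any distance of 30 units or more yields the capped amount.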
18. Simulation framework overview
[Figure: the framework overview diagram from slide 12, repeated]
19. Related work on instance generation
• Exhaustive search:
- UML2CSP [Cabot et al., 2014]
- Alloy [Jackson, 2009]
• Non-exhaustive techniques:
- Metaheuristic-search [Ali et al., 2013]
- Predefined patterns [Gogolla et al., 2005]
- Mutation analysis [Di Nardo et al., 2015]
- Configurable random generation [Hartmann et al., 2014]
20. Limitations in existing work
Existing techniques cannot generate data that is suitable for our analysis needs
Limitations: representativeness, scalability
21. Our solution to generate simulation data
Random generation (addresses scalability) guided by a profile for capturing the probabilistic characteristics of the real population (addresses representativeness)
22. Relative frequencies
* Source: STATEC, Luxembourg
60% of income types are Employment, 20% are Pension,
and the remaining 20% are Other
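As an illustration of how such relative frequencies can drive random generation, here is a minimal sketch that draws income types with the 60/20/20 split quoted above. It uses only the Python standard library; the function names and the fixed seed are assumptions made for the example and are not part of the framework.

```python
import random
from collections import Counter

# Relative frequencies from the slide (source: STATEC, Luxembourg).
INCOME_TYPE_FREQUENCIES = {"Employment": 0.60, "Pension": 0.20, "Other": 0.20}


def sample_income_types(n, seed=0):
    """Draw n income types according to the relative frequencies."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    labels = list(INCOME_TYPE_FREQUENCIES)
    weights = list(INCOME_TYPE_FREQUENCIES.values())
    return rng.choices(labels, weights=weights, k=n)


def relative_frequencies(values):
    """Observed share of each income type in a generated sample."""
    counts = Counter(values)
    return {label: counts[label] / len(values)
            for label in INCOME_TYPE_FREQUENCIES}


if __name__ == "__main__":
    sample = sample_income_types(5000)
    print(relative_frequencies(sample))
```

The observed shares only approach the specified ones as the sample grows, which is the representativeness question examined in RQ2.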
27. Consistency constraints
The sound application of the profile’s stereotypes is enforced by several consistency constraints:
• Completeness of the probabilistic information
• Well-formedness of the probabilistic information
• Mutually exclusive application of certain stereotypes
28. Simulation framework overview
[Figure: the framework overview diagram from slide 12, repeated]
29. Fully automated data generation
[Figure: data generation process]
Inputs: policy models (set), annotated domain model, simulation unit (class), sample size
Step 1. Slice domain model → slice model
Step 2. Identify traversal order → traversal order
Step 3. Classify path segments → segments classification
Step 4. Instantiate slice model → simulation data (instance of slice model)
30. Simulation framework overview
[Figure: the framework overview diagram from slide 12, repeated]
31. Simulation process
[Figure: simulation process]
• Generate simulation code: Activity Diagram(s) (legal rule) → simulation code
• Run simulator: simulation code, simulation data, and domain model → simulation results
• Visualize and analyze results: simulation results → feedback
Input: original and modified sets of legal policies
33. Research questions
• RQ1: Do data generation and simulation run in reasonable time?
• RQ2: Does our data generator produce data that is consistent with the specified characteristics of the population?
• RQ3: Are the results of different data generation runs consistent (up to random variation)?
34. Case study
• Models for personal income taxes were created (domain model + policy models)
• Six representative policy models were selected (out of 18 policy models)
• All models used in this evaluation were validated by legal experts
35. Probabilistic information
15 distributions (from census and synthesized data) were used to specify the characteristics of Luxembourg’s population. Source: STATEC, Luxembourg

Statistic          Description
Age                Distribution of taxpayers by age
Income type        Relative distribution of different income types (employment, agriculture, business and trade, etc.)
Income range       Distribution of the annual income ranges for taxpayers
Invalidity rate    Percentage of invalid taxpayers
Invalidity type    Relative distribution of different invalidity types
Residence status   Relative distribution of resident versus non-resident taxpayers
…
36. RQ1: Do data generation and simulation run in reasonable time?
Results for the generator
[Figure: execution time (in minutes, 0 to 30) versus number of generated tax cases (0 to 10k), one curve per cumulative set of policy models: ID; ID + CIS; ID + CIS + PE; ID + CIS + PE + FD; ID + CIS + PE + FD + LD; ID + CIS + PE + FD + LD + CIP]
- Deduction for invalidity (ID)
- Credit for salaried workers (CIS)
- Deduction for permanent expenses (PE)
- Deduction for commuting expenses (FD)
- Deduction for long-term debts (LD)
- Credits for pensioners (CIP)
37. RQ1: Do data generation and simulation run in reasonable time?
Results for the simulator
- Deduction for invalidity (ID)
- Credit for salaried workers (CIS)
- Deduction for permanent expenses (PE)
- Deduction for commuting expenses (FD)
- Deduction for long-term debts (LD)
- Credits for pensioners (CIP)
38. RQ2: Does our data generator produce data that is consistent with the specified characteristics?
Generated sample starts to be representative for a size above 2000 units
39. RQ3: Are the results of different data generation runs consistent?
• 5 samples of 5000 tax cases
• Pairwise comparison of the generated samples using the Kolmogorov-Smirnov test
No counter-evidence that the samples come from different
populations
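A pairwise comparison of this kind can be sketched as follows. This is a standard-library illustration of the two-sample Kolmogorov-Smirnov test, not the tooling used in the study: it computes the statistic directly and compares it to the usual large-sample critical value at the 5% level (coefficient 1.358). The synthetic normal samples and all function names are assumptions made for the example.

```python
import math
import random
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical_value(n, m, c_alpha=1.358):
    """Large-sample critical value; 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


def pairwise_rejections(samples):
    """Index pairs whose samples look like different populations."""
    rejected = []
    for (i, a), (j, b) in combinations(enumerate(samples), 2):
        if ks_statistic(a, b) > ks_critical_value(len(a), len(b)):
            rejected.append((i, j))
    return rejected


if __name__ == "__main__":
    rng = random.Random(1)
    # Five synthetic samples of 5000 values, standing in for one
    # generated quantity (e.g. income) across five generator runs.
    samples = [[rng.gauss(50000, 15000) for _ in range(5000)]
               for _ in range(5)]
    print(pairwise_rejections(samples))
```

With five samples there are ten pairs to compare; an empty result list corresponds to the slide's conclusion that no counter-evidence was found.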
40. Ongoing work
• Decision-support for the Government’s actual tax reforms
• Evaluating the accuracy of the simulation results
[Figure: percentage of taxpayers per tax class (1, 1.a, 2), before and after the change]
[Figure: percentage of households by annual decrease / increase in taxes due (in Euros), split into less taxes to pay and more taxes to pay]
41. Summary
• Model-based simulation framework for legal policies
• A profile for expressing probabilistic characteristics of a population
• An automated stochastic data generator
• Preliminary evaluation of scalability, representativeness, and reproducibility is promising
• Applied to assess actual tax reforms
43. Model sizes
• The domain model has: 64 classes, 43 generalizations, 344 attributes, and 53 associations
• The six policy models have an average of 35 elements
The work that I am about to present is a continuation of the work that we started and presented at MODELS last year.
In particular, what we tried to do is to use models as oracles for what the actual system does.
This is what I described last year, to verify that the system behaves properly and is compliant with the law; what I am presenting today is another facet of the project.
The first facet of the project focuses on testing.
The second utilizes the same models for a very different purpose.
One major concern when developing such a system is ensuring compliance with the underlying laws.
Focus on the simulation rather than the compliance use case
Second track
Models of legal policies
Previous
Simulating the impact of policy changes
Enabling realistic and yet practical simulation even when sample data is not available
Descriptive model + procedural model
Show that we get easily confused and lost
Challenges and desired characteristics
Inheritance laws.
These are very much aligned with what needs to be improved in these frameworks
Deontic modalities
Permissions, obligations, prohibitions
We assume that there is no behavioral change in response to the law modification.
When the law changes, the taxpayers might change their behavior.
The normal flow.
We faced this issue with ACD.
Step one was elaborated in previous work. I will basically do the bare minimum to give a feel for what these policy models are, and then focus on the main contributions of the current work, which are illustrated through steps 2 to 4.
One of the goals that I have mentioned is to make the specification as intuitive as possible.
This is something that we have dealt with in previous work.
Let me just give you an example.
This is one of the various deductions that you have.
In Luxembourg, as in many other countries, you can claim a deduction for your daily commute from your home to your workplace.
The amount of the deduction is determined based on your commute distance.
This is the main artifact used by the simulator.
You can think of this as a database that one would query to retrieve the appropriate input to.
And that domain model would be the basis for describing your data samples.
Which leads us to the next step, which deals with
Mention the source of the data
They are aimed at testing and system configuration.
By this I do not mean that all techniques are both not scalable and not representative, but we were unable to find any technique that is both scalable and provides representative samples.
Exhaustive if they finish.
In testing, you are looking for pathological situations, boundary cases where something can go wrong. This is different: here the data needs to be aligned with what the real population is.
Get back to the domain model
Feed it
Statec
The previous annotations are also used to define the cardinalities between objects from the domain model
As I already mentioned over the previous slide, we have simulation as the main target for automation
There are two main directions that we are pursuing with regards to simulation.
Simulating the behavior of software systems.
Simulation of legal decisions
Traceability and generation
In the interest of time.
We have the domain model annotated with probabilistic annotations as we saw
We have the policy models
This is in some sense the root or the starting point
The size of the sample
We have the size of the sample
There are many technical details that I will not go through.
But here is what happens at a high level of abstraction.
Essentially, this is what happens.
We first figure out what the parts are
Obviously, if we want to simulate
This is done basically to keep the data generator targeted and efficient.
Then we have to figure out in which order the classes and attributes should be instantiated.
We call an association traversed in a given direction a path segment.
This classification ensures that the generator will not fall into an infinite loop caused by cyclic association paths in the slice model.
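To make the idea concrete, here is a simplified sketch of deriving a traversal order and classifying path segments so that cycles are never followed twice. It is not the algorithm from the paper: it assumes a plain breadth-first traversal from the simulation unit, a two-way classification ("follow" versus "skip") invented for this example, and hypothetical class names.

```python
from collections import deque


def classify_segments(associations, root):
    """Return (traversal order, segments to follow, segments to skip).

    associations: list of (class_a, class_b) pairs; each association
    can be traversed in both directions, giving two path segments.
    root: the simulation unit class where instantiation starts.
    """
    neighbours = {}
    for a, b in associations:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    order, visited = [root], {root}
    follow, skip = [], []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for target in neighbours.get(current, []):
            if target not in visited:
                # First time this class is reached: instantiate it
                # by following this segment.
                visited.add(target)
                order.append(target)
                follow.append((current, target))
                queue.append(target)
            else:
                # Following this segment would revisit a class,
                # so a generator should not recurse through it.
                skip.append((current, target))
    return order, follow, skip


if __name__ == "__main__":
    # Hypothetical slice model containing a cycle.
    slice_model = [("TaxPayer", "Income"),
                   ("Income", "Expense"),
                   ("Expense", "TaxPayer")]
    print(classify_segments(slice_model, "TaxPayer"))
```

Because every class is visited once and skipped segments are never followed, the traversal terminates even when the associations form a cycle.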
Fails
From time to time we generate inconsistent objects
We use OCL imperatively
Our framework additionally supports result differencing, meaning that the user can provide an original and a modified set of policies, subject both sets to the same simulation data, and compare the simulation results to quantify the impact. This type of analysis does not add any new conceptual element to our framework and is thus not further discussed.
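Result differencing as described here can be sketched in a few lines. The example below is illustrative only: the "original" policy follows one reading of the commuting deduction excerpt, the "modified" policy uses an invented rate of 110€ per unit purely to have something to compare, and the tax cases and function names are hypothetical.

```python
def original_policy(case):
    """Commuting deduction under one reading of the current rule."""
    return 99 * max(0, min(case["distance_units"], 30) - 4)


def modified_policy(case):
    """Hypothetical reform: the 110 euro rate is invented."""
    return 110 * max(0, min(case["distance_units"], 30) - 4)


def simulate(policy, tax_cases):
    """Apply one policy to every tax case in the simulation data."""
    return [policy(case) for case in tax_cases]


def difference(original, modified, tax_cases):
    """Per-case change when both policies see the same data."""
    before = simulate(original, tax_cases)
    after = simulate(modified, tax_cases)
    return [a - b for b, a in zip(before, after)]


if __name__ == "__main__":
    cases = [{"distance_units": d} for d in (3, 10, 40)]
    deltas = difference(original_policy, modified_policy, cases)
    print(deltas, sum(deltas))
```

Running both policy sets over the same cases is what makes the per-case differences attributable to the policy change rather than to the data.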
In Luxembourg they have a tax card that contains all the relevant tax information like deductions.
Mention what we mean by validation
All the sources were real.
Random order
In the slopse
The simulation code
These models query the instance model for inputs via OCL queries.
And obviously the way that you define the queries has an impact on scalability.
This is an important observation: when we investigated, the reason why the trend is not perfectly linear was that we were using allInstances in some of the queries.
These get less efficient as the instance model grows.
But still
Sentence about the Euclidean distance
Pairs of histograms.
0 they are the same.
1 they are significantly different
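A distance of this kind between a specified and a generated histogram can be sketched as below. The notes do not say how the distance is scaled to the 0 to 1 range, so this example assumes that both histograms are first normalized to relative frequencies and that the Euclidean distance is divided by √2, its maximum for two such vectors. Names and sample numbers are illustrative.

```python
import math


def normalise(counts):
    """Turn raw bin counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def histogram_distance(counts_a, counts_b):
    """Scaled Euclidean distance: 0 = identical, 1 = disjoint.

    Assumption: dividing by sqrt(2) maps the distance onto [0, 1].
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must use the same bins")
    p, q = normalise(counts_a), normalise(counts_b)
    raw = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
    return raw / math.sqrt(2)


if __name__ == "__main__":
    specified = [60, 20, 20]       # Employment, Pension, Other
    generated = [2950, 1040, 1010]  # made-up counts from a sample
    print(histogram_distance(specified, generated))
```

Comparing each generated histogram against its specified distribution at increasing sample sizes gives a simple way to see when the distance levels off.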
Ideally we would like to have very large number
As you see past 200 thous, we get
The test determines whether the generated quantities of different samples are likely to be derived from the same population.
We compared the generated samples pairwise.
Details are in the paper.
The results show no counter-evidence.
Make a transition before going to future work
The government is discussing a big change.
They are thinking of removing benefits that married couples get just by being taxed jointly.
Take home messages.
Our context was this.
But the technology is quite generic.
The work was motivated by a very specific and contextual problem.
But the profile and the data generator that we have built for addressing this specific problem have many useful features that we believe can be used in other contexts and for other types of simulation, such as system simulation.