Process Mining for ERP Systems

Erik Nooijen,
Boudewijn v. Dongen, Dirk Fahland

Process Discovery

event log → process discovery algorithm → process model

Example event log:
c1: A B C D E
c2: A C B D E
c3: A F D E
…

Assumptions:
• case = sequence of events of this case
• cases are isolated: event A in c1 happens only in c1 (and not in c2)
• cases of the same process
• one unique case id
• each event associated to exactly one case id
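
As an illustration of what a discovery algorithm reads from such a log, the sketch below counts the directly-follows relation over the three example cases. The slides do not name a specific discovery algorithm; directly-follows counting is used here only as a common first step, and the names `log` and `directly_follows` are illustrative.

```python
from collections import Counter

# Event log from the slide: each case is the sequence of its events.
log = {
    "c1": ["A", "B", "C", "D", "E"],
    "c2": ["A", "C", "B", "D", "E"],
    "c3": ["A", "F", "D", "E"],
}

def directly_follows(log):
    """Count how often activity x is immediately followed by y within one case."""
    counts = Counter()
    for trace in log.values():
        # Pairs are formed inside a single case only: cases are isolated.
        for x, y in zip(trace, trace[1:]):
            counts[(x, y)] += 1
    return counts

df = directly_follows(log)
```

Because every event belongs to exactly one case, no pair ever spans two cases. This is the assumption that breaks down for ERP data.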

PAGE 1

Typical Process in an ERP System

[Figure: build-to-order manufacturer. Alice orders product X and Bob orders product Y from the manufacturer; the manufacturer orders the required materials (Material A, Material B, Material C) from the suppliers ACME Inc. and Mega Corp.]

PAGE 2

n-to-m relations → database

database → process discovery algorithm → process model

ProductOrder (id attributes, data attributes, time-stamp attributes)
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

Customer
cust.  address  …
Alice  …        …
Bob    …        …

OrderedMaterial (relations)
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

MaterialOrder (id attributes, data attributes, time-stamp attributes)
moID  suppl.  …  completed    sent         received
mo3   ACME    …  30-08 13:15  30-08 14:15  01-09 9:05
mo4   MEGA    …  30-08 13:17  30-08 16:12  01-09 10:13
PAGE 3

Process Discovery for ERP Systems

database → process discovery algorithm → process model

Schema (class diagram):
Customer: cust, …
ProductOrder: poID, cust, created, processed, built, shipped
OrderedMat.: poID, moID, type, added
MaterialOrder: moID, supplier, completed, sent, received
Customer (1) to ProductOrder (0..*)
ProductOrder (1) to OrderedMat. (1..*)
MaterialOrder (1) to OrderedMat. (1..*)

reality: data in a relational DB
• events stored as time-stamped attributes in tables
• multiple primary keys → multiple notions of case
• tables are related → one event related to multiple cases
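
The point that one event relates to multiple cases can be made concrete with the OrderedMaterial rows shown earlier. This is a minimal sketch: the tuple layout and the helper `group_events` are assumptions for illustration, not part of the presented tool.

```python
from collections import defaultdict

# Rows of the OrderedMaterial table from the slides: (poID, moID, type, added).
ordered_material = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]

def group_events(rows, key_index):
    """Group the 'added' events by one key column, i.e. by one notion of case."""
    cases = defaultdict(list)
    for row in rows:
        cases[row[key_index]].append(("added", row[3]))
    return dict(cases)

by_product_order = group_events(ordered_material, 0)   # case id = poID
by_material_order = group_events(ordered_material, 1)  # case id = moID
```

Each "added" event shows up once under a product order and once under a material order, so no single case id isolates it.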

PAGE 4


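Outline

[Figure: approach overview. The database is decomposed by primary keys into one log per artifact (log f. PO, log f. MO); discovery on each log yields one model per artifact (model f. PO, model f. MO); the models are related by primary/foreign-key relations to form the process model.]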
PAGE 6

Find Artifact Schemas

[Figure: approach overview. log f. PO, log f. MO → discovery → model f. PO, model f. MO; related by primary/foreign-key relations → process model.]
PAGE 7

Step 0: discover database schema

→ documented schema vs. actual schema
→ identify
• column types (esp. time-stamped columns)
• primary keys
• foreign keys
→ various (non-trivial) techniques available
→ key discovery is NP-complete in the size of the table(s)
→ result: discovered database schema [figure]
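
To give a feel for why key discovery is expensive, here is a brute-force sketch that tests column combinations for uniqueness. It is not the technique used in the presented tool (the slides only say that various techniques exist); `candidate_keys` and `max_size` are hypothetical names, and the sample rows are the OrderedMaterial rows from the earlier slide.

```python
from itertools import combinations

def candidate_keys(columns, rows, max_size=2):
    """Return minimal column combinations whose values are unique across all rows."""
    keys = []  # each key is a tuple of column indexes
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(columns)), size):
            # A superset of an already found key is unique too, but not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(rows):
                keys.append(combo)
    return [tuple(columns[i] for i in k) for k in keys]

columns = ["poID", "moID", "type", "added"]
rows = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]
keys = candidate_keys(columns, rows)
```

On these four rows the search reports `added` and `(poID, type)` as keys next to `(poID, moID)`. Such accidental keys on small data are one reason the discovered schema still needs checking, and the number of combinations to test grows quickly with the number of columns.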

PAGE 8

Step 1: decompose schema into processes

= schema summarization

find:
1. sets of corresponding tables
2. links between those

[Figure: tables grouped into ProductOrder and MaterialOrder]

PAGE 9

Automatic Schema Summarization

= group similar tables through clustering
→ define a distance between any 2 tables
• by relations
• by information content
→ tables that are close to each other → same cluster
→ # of clusters: user input

PAGE 10


1. structural distance between tables

fanout ~ avg. # of child records related to the same parent record

Parent table with key column A and records 1, 2; three example child tables with columns (A, B):

child records (A, B)       fanout
(1,X) (2,Y)                1
(1,X) (1,Y)                1 = (2+0)/2
(1,X) (1,Y) (2,Z) (2,U)    2

PAGE 11


1. structural distance between tables

fanout ~ avg. # of child records related to the same parent record
matched fraction ~ 1 / (fraction of records in parent with matching child record)

Parent table with key column A and records 1, 2; three example child tables with columns (A, B):

child records (A, B)       fanout   m.fr
(1,X) (2,Y)                1        1
(1,X) (1,Y)                1        2 = 1 / (1/2)
(1,X) (1,Y) (2,Z) (2,U)    2        1
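
The two structural measures can be sketched directly from their definitions on the slide. The function names, and returning infinity when no parent record has a child, are assumptions made for this example.

```python
from collections import Counter

def fanout(parent_keys, child_refs):
    """Average number of child records per parent record (parents without children count as 0)."""
    counts = Counter(child_refs)
    return sum(counts[key] for key in parent_keys) / len(parent_keys)

def matched_fraction(parent_keys, child_refs):
    """1 / (fraction of parent records that have at least one matching child record)."""
    referenced = set(child_refs)
    matched = sum(1 for key in parent_keys if key in referenced)
    if matched == 0:
        return float("inf")  # assumption: no parent record is matched at all
    return len(parent_keys) / matched

parent = [1, 2]                      # key column A of the parent table
examples = {
    "one child each": [1, 2],        # child rows (1,X) (2,Y)
    "two children of 1": [1, 1],     # child rows (1,X) (1,Y)
    "two children each": [1, 1, 2, 2],
}
results = {name: (fanout(parent, refs), matched_fraction(parent, refs))
           for name, refs in examples.items()}
```

The second example shows why both measures are needed: its fanout equals that of the first example, and only the matched fraction reveals that half of the parent records have no child.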

PAGE 12

Grouping by Clustering

1. structural distance
2. information distance
   importance of each table = entropy (is maximal if all records are different)
   distance: 2 tables with high entropies → large distance
3. weighted distance by structure + information
4. k-means clustering: k clusters based on weighted distance

most important table of cluster = table with least distance to all
→ key attribute of the cluster
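
A minimal sketch of the entropy-based importance, assuming plain Shannon entropy over the frequencies of identical records. The slides do not give the exact formula, so `table_entropy` is an illustrative stand-in rather than the tool's definition.

```python
import math
from collections import Counter

def table_entropy(records):
    """Shannon entropy (in bits) of the distribution of identical records in a table."""
    counts = Counter(records)
    total = len(records)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Four different records: entropy is maximal for a table of this size (log2(4) = 2 bits).
all_different = [("po1", "Alice"), ("po2", "Bob"), ("po3", "Carol"), ("po4", "Dave")]
# Four identical records: no information, entropy 0.
all_equal = [("po1", "Alice")] * 4

h_max = table_entropy(all_different)
h_min = table_entropy(all_equal)
```

Records must be hashable (tuples here). The records other than po1 and po2 are made-up sample values.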
PAGE 13

Artifact Schema → Artifact Log

[Figure: approach overview. log f. PO, log f. MO → discovery → model f. PO, model f. MO; related by primary/foreign-key relations → process model.]
PAGE 14

Log Extraction

cluster = set of related tables + primary key of most important table
primary key of most important table → case id

log f. PO (case id: poID)

poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

po1: all rows with poID = po1
po2: all rows with poID = po2

PAGE 15

Log Extraction

time-stamped attribute → event

(log f. PO: ProductOrder and OrderedMaterial rows as above, case id poID)

po1: (created, poID=po1, time=30-08 9:22, …)

PAGE 16

Log Extraction

related attributes → event attributes

(log f. PO: ProductOrder and OrderedMaterial rows as above, case id poID)

po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)

PAGE 17

Log Extraction

(log f. PO: ProductOrder and OrderedMaterial rows as above, case id poID)

po1: (processed, poID=po1, time=30-08 13:12, …)

PAGE 18

Log Extraction

(log f. PO: ProductOrder and OrderedMaterial rows as above, case id poID)

po1: (processed, poID=po1, time=30-08 13:12, …)
po1: (added, poID=po1, time=30-08 13:13, moID=mo3, …)

moID=mo3 refers to artifact “MaterialOrder”
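
The extraction rule (every time-stamped attribute becomes an event, the remaining attributes of the row become event attributes) can be sketched as follows. The dictionary layout, the day-month reading of the time-stamps, and the helpers `parse_ts` and `extract_log` are assumptions for illustration, not the prototype's implementation.

```python
# Rows of the cluster for artifact PO, taken from the slides.
product_order = [
    {"poID": "po1", "cust": "Alice", "created": "30-08 9:22", "processed": "30-08 13:12",
     "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust": "Bob", "created": "30-08 10:15", "processed": "30-08 13:14",
     "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]
TIMESTAMP_COLUMNS = {"created", "processed", "built", "shipped", "added"}

def parse_ts(text):
    """Turn 'DD-MM H:MM' into a sortable (month, day, hour, minute) tuple (assumed format)."""
    date, clock = text.split()
    day, month = date.split("-")
    hour, minute = clock.split(":")
    return (int(month), int(day), int(hour), int(minute))

def extract_log(tables, case_column):
    """Build one trace per case id: one event per time-stamped attribute value."""
    log = {}
    for rows in tables:
        for row in rows:
            attributes = {k: v for k, v in row.items() if k not in TIMESTAMP_COLUMNS}
            for column, value in row.items():
                if column in TIMESTAMP_COLUMNS and value is not None:
                    event = {"activity": column, "time": value, **attributes}
                    log.setdefault(row[case_column], []).append(event)
    for trace in log.values():
        trace.sort(key=lambda event: parse_ts(event["time"]))
    return log

po_log = extract_log([product_order, ordered_material], case_column="poID")
po1_activities = [event["activity"] for event in po_log["po1"]]
```

The "added" events keep their `moID` attribute, which is the reference to the MaterialOrder artifact noted on the slide.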
PAGE 19

Outline

[Figure: approach overview. log f. order, log f. quote → discovery → model f. order, model f. quote; compose by primary/foreign-key relations → process model.]
PAGE 20

Resulting Model(s)

Product Order: created → processed → added (1..*) → built → shipped
Material Order: added (1..*) → completed → sent → received

Example event: (added, poID=po1, …, moID=mo3)
PAGE 21

Implementation & Evaluation

→ prototype tool
• input: relational database (via JDBC), .csv tables
• steps
  − discover database schema (types, keys, relations)
  − discover artifact schema
    − by k-means clustering
    − by user picking tables
  − extract logs → ProM

PAGE 22

Evaluation: SAP System of Sligro

→ > 300 tables, > 40 GiB of data
→ schema extraction
• time-stamp attributes: 15 hrs
• primary keys: 4 hrs
• foreign keys: 5 hrs (single col.) / 6 days (double col.)
→ clustering
• entropies: 17 hrs
• table distances: 5 hrs
• clustering: a few seconds
• ~20 different artifacts found
• largest: 47 tables, 869 columns
→ log extraction: extract 1000 traces of > 246,000 events
• query database: 1 hr
• write log file: 32 hrs

PAGE 23

Sligro: Artikel lifecycle model

PAGE 24

Open issues

→ performance
• key discovery: NP-complete in R (# of records)
• foreign key discovery: NP-complete in R²
• problem is in the “hard part” of NP
• → sampling of data, domain knowledge, semi-automatic
→ requires good database structure
• proper relations, proper keys
• otherwise wrong clusters are formed
• events don’t get right attributes
• → semi-automatic approach
→ events shared by multiple cases… working on it…
PAGE 25
