2. Background
Jamie and Debbie got married. A couple of weeks after the wedding,
they left England and started travelling around the world. Jamie and
Debbie stayed in touch with their family and friends via email. However,
after about 2 years, Debbie’s mother started to worry about her.
She contacted the police and said:
Question: What do you think is the problem?
3. Their opinion
“We think that the emails we receive from Debbie are not written by
Debbie.” Her family thought that the wording of the emails was not the type of
wording that their daughter would use.
Question:
Would your family notice if you did not write your own emails?
Why do you think so?
5. Datasets
The detectives at Nottingham police station collected the emails that the
mother had received from Debbie’s email address.
The emails were divided into two datasets:
• Dataset 1: Emails received from Debbie before marriage
• Dataset 2: Emails received from Debbie after marriage
The detectives also felt that the style of writing was not the same in Dataset 1
and 2. They then collected emails sent by Jamie to his family. This is the third
dataset.
• Dataset 3: Emails received from Jamie
6. Dataset names
The police now have three datasets.
• Dataset 1: Emails received from Debbie before marriage
• Dataset 2: Emails received from Debbie after marriage
• Dataset 3: Emails received from Jamie
The investigators also used another dataset containing thousands of emails
written by different people.
• Dataset 4: Large collection of emails from many different senders
Question: Decide which datasets are questioned, known or reference.
8. Datasets
If the language features in two datasets are similar, the same person may have
written them.
If the language features in two datasets are NOT similar, the same person may
NOT have written them.
So, we need to discover if:
• The features in the questioned dataset are similar to those in a known corpus
• The features in the questioned dataset are different from those in a known corpus
Question: How can we do this?
9. Answers
Deep learning
• This will probably work, but it is difficult to explain in court
Statistical analysis
• This may also work, but it is also difficult to explain in court
Habits of language (idiosyncratic language - 口癖)
• This works and is easy to explain.
Question: How can we systematically identify idiosyncratic language?
10. Is your language idiosyncratic?
• THIN: Thin, Slim, Slender, Skeletal, Skinny, Emaciated
• FAT: Fat, Plump, Tubby, Chubby, Podgy, Overweight, Obese
• DELICIOUS: Delicious, Tasty, Yummy, Flavorsome, Delectable, Scrumptious
• I: 俺, 僕, 私, 儂, 自分, うち, あたし
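One way to look for this kind of idiosyncratic word choice systematically is to count which variant from each group a writer actually uses. The following is a minimal Python sketch, not the method used in the case: the function name `preferred_variants` and the sample emails are invented for illustration, and only the English groups are covered because Japanese text would first need a word segmenter.

```python
import re
from collections import Counter

# Variant groups copied from the slide (English only).
SYNONYM_SETS = {
    "THIN": ["thin", "slim", "slender", "skeletal", "skinny", "emaciated"],
    "FAT": ["fat", "plump", "tubby", "chubby", "podgy", "overweight", "obese"],
    "DELICIOUS": ["delicious", "tasty", "yummy", "flavorsome",
                  "delectable", "scrumptious"],
}

def preferred_variants(texts):
    """For each group, count how often the writer uses each variant."""
    tokens = Counter()
    for text in texts:
        # Lowercase and split into simple alphabetic word tokens.
        tokens.update(re.findall(r"[a-z]+", text.lower()))
    return {
        concept: Counter({w: tokens[w] for w in variants if tokens[w] > 0})
        for concept, variants in SYNONYM_SETS.items()
    }

# Invented sample emails (hypothetical, not from the case).
sample_emails = [
    "The curry was yummy and the waiter was so skinny!",
    "Another yummy dinner tonight. Feeling a bit chubby after all this food.",
]
profile = preferred_variants(sample_emails)
for concept, counts in profile.items():
    print(concept, dict(counts))
```

A writer who always picks the same variant (here "yummy" rather than "delicious" or "tasty") shows a habit that can be compared across datasets.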
19. Expert system
When preparing evidence for this case, the linguist had to:
• Analyze keywords in the Questioned dataset
• Analyze keywords in each Known dataset separately
• Create a table in Excel to compare the keywords
• Identify keywords that occur in both Questioned and Known datasets
An expert system is needed to streamline this process and reduce the chance
of human error.
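The four manual steps above could be automated roughly as follows. This is a sketch under stated assumptions, not the actual expert system: the slides do not say how keywords are calculated, so the code assumes a log-likelihood keyness score against a reference corpus, with 3.84 as the cut-off (the chi-square critical value for p < 0.05 with one degree of freedom). The function names and all of the toy emails are invented.

```python
import math
import re
from collections import Counter

def word_counts(emails):
    """Count lowercase word tokens across a list of email texts."""
    counts = Counter()
    for email in emails:
        counts.update(re.findall(r"[a-z']+", email.lower()))
    return counts

def keywords(target, reference, threshold=3.84):
    """Words over-represented in target relative to reference.

    Assumption: keyness is the log-likelihood (G2) statistic.
    Both corpora must be non-empty.
    """
    n_t, n_r = sum(target.values()), sum(reference.values())
    result = {}
    for word, a in target.items():
        b = reference.get(word, 0)
        # Expected counts if the word were equally frequent in both corpora.
        e_t = n_t * (a + b) / (n_t + n_r)
        e_r = n_r * (a + b) / (n_t + n_r)
        g2 = 2 * (a * math.log(a / e_t) + (b * math.log(b / e_r) if b else 0.0))
        if g2 >= threshold and a / n_t > b / n_r:
            result[word] = g2
    return result

def shared_keywords(questioned, known_sets, reference):
    """Replace the Excel table: keywords found in both Questioned and each Known."""
    ref_counts = word_counts(reference)
    q_keys = keywords(word_counts(questioned), ref_counts)
    table = {}
    for name, emails in known_sets.items():
        k_keys = keywords(word_counts(emails), ref_counts)
        table[name] = sorted(set(q_keys) & set(k_keys))
    return table

# Invented toy data (hypothetical, not from the case).
reference = ["the weather is nice and we are fine"] * 50
questioned = ["we are fine the beach is well nice innit"] * 5
known_sets = {
    "Known A": ["the food is well nice innit"] * 5,
    "Known B": ["the weather is lovely and we are fine"] * 5,
}
table = shared_keywords(questioned, known_sets, reference)
print(table)  # {'Known A': ['innit', 'well'], 'Known B': []}
```

In this toy example the questioned emails share the keywords "innit" and "well" with Known A and nothing with Known B. Running every dataset through the same functions avoids the copy-and-paste mistakes that can creep into a hand-built spreadsheet.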