Starbuck.pptx

Background
Jamie and Debbie got married. A couple of weeks after the wedding,
they left England and started travelling around the world. Jamie and
Debbie stayed in touch with their family and friends via email. However,
after about 2 years, Debbie's mother started to worry about her.
She contacted the police and said:
Question: What do you think is the problem?

Their opinion
“We think that the emails we receive from Debbie are not written by
Debbie.” The family thought that the wording of the emails was not the
type of wording that their daughter would use.
Question:
Would your family notice if you did not write your own emails?
Why do you think so?

Datasets
The detectives at Nottingham police station collected the emails that the
mother had received from Debbie’s email address.
The emails were divided into two datasets:
• Dataset 1: Emails received from Debbie before marriage
• Dataset 2: Emails received from Debbie after marriage
The detectives also felt that the style of writing was not the same in Datasets 1
and 2. They then collected emails sent by Jamie to his family. This is the third
dataset.
• Dataset 3: Emails received from Jamie

Dataset names
The police now have three datasets.
• Dataset 1: Emails received from Debbie before marriage
• Dataset 2: Emails received from Debbie after marriage
• Dataset 3: Emails received from Jamie
The investigators also used another dataset containing thousands of emails
written by different people.
• Dataset 4: Large collection of emails from many different senders
Question: Decide which datasets are questioned, known or reference.

Datasets
• Questioned (3,000 words)
• Known 1: Debbie (28,000 words)
• Known 2: Jamie (6,000 words)
• Reference (1 million words)

Datasets
If the language features in two datasets are similar, the same person may have
written them.
If the language features in two datasets are NOT similar, the same person may
NOT have written them.
So, we need to discover whether:
• The features in the questioned dataset are similar to a known corpus
• The features in the questioned dataset are different to a known corpus
Question: How can we do this?

Answers
Deep learning
• This will probably work, but it is difficult to explain in court
Statistical analysis
• This may also work, but it is also difficult to explain in court
Habits of language (idiosyncratic language - 口癖)
• This works and is easy to explain.
Question: How can we systematically identify idiosyncratic language?

Is your language idiosyncratic?
• THIN: Thin, Slim, Slender, Skeletal, Skinny, Emaciated
• FAT: Fat, Plump, Tubby, Chubby, Podgy, Overweight, Obese
• DELICIOUS: Delicious, Tasty, Yummy, Flavorsome, Delectable, Scrumptious
• I: 俺, 僕, 私, 儂, 自分, うち, あたし

Datasets
• Questioned
• Known 1: Debbie
• Known 2: Jamie
• Reference
Three common language features are used:
1. word frequency,
2. frequency of words that are used more than expected (keywords), and
3. patterns following those keywords

Datasets
Question: How do we identify word frequency?
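
One possible way to count word frequency is sketched below. The tokenizer (lowercasing, then keeping runs of letters and apostrophes), the function names and the sample sentence are assumptions made for illustration; the slides do not say which tool or rules were used in the case.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed rule: lowercase, then keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    # Count how often each word occurs in one dataset.
    return Counter(tokenize(text))

# Made-up sample text, not taken from the case emails.
sample = "I was sat at home. I was sat there for awhile."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Running the same function on each dataset gives one frequency list per dataset, which can then be compared.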

Datasets
Question: How do we identify frequency of words that are used more
than expected (keywords)?
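
A minimal sketch of one way to find keywords follows. It assumes the log-likelihood keyness measure that is common in corpus linguistics; the slides do not name the statistic used in the case. The function names, the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) and the counts are illustrative assumptions.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    # a, b: counts of the word in the study and reference corpora.
    # c, d: total word counts of the study and reference corpora.
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study, reference, threshold=3.84):
    # Keep words used relatively more often in the study corpus
    # whose log-likelihood score reaches the assumed threshold.
    c = sum(study.values())
    d = sum(reference.values())
    result = {}
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                result[word] = score
    return result

# Made-up counts for illustration only.
study = Counter({"awhile": 3, "the": 97})
reference = Counter({"awhile": 1, "the": 9999})
kw = keywords(study, reference)
print(sorted(kw))
```

Here the reference corpus plays the role of "expected" usage: a word is a keyword only if the dataset uses it clearly more often than the reference does.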

Datasets
Question: How do we identify patterns following keywords?
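
One way to look at patterns following a keyword is to collect the words that come directly after each occurrence. The sketch below assumes single-word keywords, a window of two following words, and the same simple tokenizer idea as a plain regular expression; these choices and the sample text are illustrative, not the method used in the case.

```python
import re
from collections import Counter

def following_patterns(text, keyword, width=2):
    # Collect the `width` words that directly follow each occurrence
    # of `keyword`. Sentence boundaries are ignored in this sketch.
    tokens = re.findall(r"[a-z']+", text.lower())
    patterns = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            following = tokens[i + 1 : i + 1 + width]
            if len(following) == width:  # skip occurrences too near the end
                patterns[" ".join(following)] += 1
    return patterns

# Made-up sample text, not taken from the case emails.
text = "I was sat at home. She was sat at work. He was late."
result = following_patterns(text, "was")
print(result.most_common())
```

A pattern that repeats after the same keyword in two datasets is a stronger sign of shared habit than the keyword alone.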

Datasets
Questioned | Known 1: Debbie | Known 2: Jamie | Reference
Keywords and counts shown on the slide:
• Awhile 3, Buisiness 2, was sat 1
• Awhile 2, Buisiness 4, was sat 1
Question: If these keywords were discovered, what would you conclude?

Jamie Starbuck
Check the result online by searching for “Jamie Starbuck”

Expert system
When preparing evidence for this case, the linguist had to:
• Analyze keywords in the Questioned dataset
• Analyze keywords in each Known dataset separately
• Create a table in Excel to compare the keywords
• Identify keywords that occur in both Questioned and Known datasets
An expert system is needed to streamline this process, and remove the chance
of human error.
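
The comparison step listed above could be automated along these lines. This is only a sketch: the function name, the data layout (one keyword-to-count mapping per dataset) and the placeholder words and author labels are hypothetical, and the slides do not describe how the actual expert system works.

```python
def shared_keyword_table(questioned, known):
    # questioned: dict mapping keyword -> count in the Questioned dataset.
    # known: dict mapping dataset name -> dict of keyword -> count.
    # Returns, per Known dataset, the keywords that also occur in the
    # Questioned dataset, with both counts side by side.
    table = {}
    for name, counts in known.items():
        shared = sorted(set(questioned) & set(counts))
        table[name] = [(kw, questioned[kw], counts[kw]) for kw in shared]
    return table

# Placeholder keywords and labels, invented for illustration.
questioned = {"wordx": 3, "wordy": 2, "wordz": 1}
known = {
    "Author A": {"wordq": 5, "wordz": 1},
    "Author B": {"wordx": 2, "wordy": 4},
}
table = shared_keyword_table(questioned, known)
for name, rows in table.items():
    print(name, rows)
```

Building the table in code replaces the manual spreadsheet step, so the same comparison runs the same way every time.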
