Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online


  1. Detecting Deception in Writing Style
     Sadia Afroz, Michael Brennan and Rachel Greenstadt.
     Privacy, Security and Automation Lab, Drexel University

  2. Overview
     •  Authorship recognition
     •  Authorship recognition in adversarial environment
     •  Deception detection
     •  Experiments on different datasets

  3. Authorship recognition
     Who wrote the document?

  4. Authorship recognition
     Stylometry:
     –  An authorship recognition system based solely on writing style.
     –  Not handwriting
     –  Only linguistic style: word choice, sentence length, parts-of-speech usage, …
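The kinds of linguistic features listed above can be illustrated with a minimal sketch. This is not the feature set used in the talk: the function name `extract_features`, the short `FUNCTION_WORDS` list, and the sample sentence are all assumptions made for illustration, and parts-of-speech usage is left out because it needs a tagger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption, not from the slides).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def extract_features(text):
    """Return a dict of simple writing-style features for one document."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("document has no words or sentences")
    counts = Counter(words)
    features = {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unique_word_ratio": len(counts) / len(words),  # vocabulary richness
    }
    # Relative frequency of each function word (a proxy for word choice).
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / len(words)
    return features


sample = "The road was long. It was cold and the sky was grey."
feats = extract_features(sample)
print(feats["avg_sentence_length"])  # 12 words over 2 sentences
```

None of these features depends on what the text is about, only on how it is written, which is the point of "based solely on writing style".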

  5. Why does it work?
     •  Everybody has learned language differently

  6. How regular authorship recognition works
     [Diagram labels: Extract features; Machine Learning System]

  7. [Diagram labels: Document of unknown authorship; Extract features; Machine Learning System; Determine authorship]
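The two diagrams (train on documents of known authorship, then attribute a document of unknown authorship) can be sketched roughly as follows. The classifier here is a nearest-centroid rule standing in for the "Machine Learning System"; it is an assumption chosen for brevity, not one of the methods evaluated in the talk, and the helper names and example texts are hypothetical.

```python
import math
import re


def features(text):
    """Tiny illustrative feature vector: (avg sentence length, avg word length)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return (len(words) / len(sentences), sum(map(len, words)) / len(words))


def train(labelled_docs):
    """Average the feature vectors of each author's known documents."""
    sums = {}
    for author, text in labelled_docs:
        vec = features(text)
        total, n = sums.get(author, ((0.0, 0.0), 0))
        sums[author] = (tuple(t + v for t, v in zip(total, vec)), n + 1)
    return {a: tuple(t / n for t in total) for a, (total, n) in sums.items()}


def attribute(model, unknown_text):
    """Return the author whose average style is closest to the unknown document."""
    vec = features(unknown_text)
    return min(model, key=lambda a: math.dist(model[a], vec))


# Made-up training texts for two hypothetical authors.
known = [
    ("alice", "I ran. I fell. I got up. I ran on."),
    ("bob", "Notwithstanding considerable meteorological uncertainty, "
            "the expedition proceeded northward without hesitation."),
]
model = train(known)
print(attribute(model, "We sat. We ate. We left."))
```

The sketch always returns one of the known authors, which mirrors the setting on the slides: a set of suspects with writing samples for each.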

  8. Assumptions
     •  Writing style is invariant.
        –  It’s like a fingerprint, you can’t really change it.
        –  Authorship recognition can identify you if there are sufficient writing samples and a set of suspects.

  9. Adversarial Attacks
     •  Imitation or framing attack
        –  Where one author imitates another author
        –  Participants were asked to imitate Cormac McCarthy in writing about their day
     •  Obfuscation attack
        –  Where an author hides his regular style
        –  Participants were asked to write about their neighborhood in a different style
     M. Brennan and R. Greenstadt. Practical attacks against authorship recognition techniques. In Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA, 2009.

  10. Accuracy in detecting authorship of regular documents
      More than 80% accurate in detecting authorship with 40 authors in regular documents.
      [Line chart: accuracy (0 to 1) against number of authors (5 to 40). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

  11. Accuracy in detecting authorship of obfuscated documents
      Accuracy is less than random chance in obfuscated documents.
      [Line chart: accuracy (0 to 1) against number of authors (5 to 40). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

  12. Accuracy in detecting authorship of imitated documents
      Accuracy is less than random chance in imitated documents.
      [Line chart: accuracy (0 to 1) against number of authors (5 to 40). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]
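The "Random" series in these charts is the accuracy a guesser would reach by picking an author at random. A small sketch of how accuracy and that chance level relate, assuming every candidate author is equally likely (the slides do not state how the baseline was computed); the function names and the toy predictions are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted author matches the true author."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def chance_level(num_authors):
    """Expected accuracy of uniform random guessing among num_authors candidates."""
    return 1.0 / num_authors


# Chance drops quickly as the suspect set grows.
for n in (5, 10, 20, 40):
    print(n, chance_level(n))

# Toy example: 3 of 4 documents attributed correctly.
print(accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"]))
```

With 40 authors the chance level under this assumption is 2.5%, so an accuracy above 80% on regular documents is far from guessing, while an accuracy below chance on obfuscated or imitated documents means the attack is actively misleading the classifier.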

  13. Can we detect Stylistic Deception?
      [Diagram labels: Regular; Imitated; Obfuscated]

  14. Extended-Brennan-Greenstadt Corpus
      •  56 authors
         –  12 of the participants are from Drexel University
         –  The rest are paid workers from Amazon Mechanical Turk
      •  Three kinds of writing samples
         –  Regular writings (5000 words)
         –  Imitated writing
            •  A 500-word article describing a day
            •  Imitate Cormac McCarthy from ‘The Road’
         –  Obfuscated writing
            •  A 500-word article describing neighborhood
            •  Hide own writing style

  15. Detecting stylistic deception is possible
      [Bar chart: scale 0 to 100, grouped by Regular, Imitation and Obfuscation. Series: Writeprint, SVM; Lying-detection, J48; 9-feature set, J48. Bar labels as extracted: 98, 94.5, 89.5, 95.7, 85, 75.3, 59.9, 48, 43.]

  16. Feature Changes in Imitation and Obfuscation
      [Bar chart: change per feature, axis from -80 to 100. Series: Imitation, Obfuscation. Features: Personal pronoun, Sentence count, Particle, Short Words, Verb, Unique words, Adverb, Existential there, Average syllables per word, Average word length, Adjective, Cardinal number, Gunning-Fog readability index, Average sentence length.]
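One way to read a chart like this is as the relative change of each feature between an author's regular writing and their deceptive writing. The sketch below assumes the axis is a percent change, which the extracted slide text does not confirm; the function name and all numbers are made up for illustration and are not values from the talk.

```python
def percent_change(regular, deceptive):
    """Percent change of each feature from regular to deceptive writing."""
    changes = {}
    for name, base in regular.items():
        if base == 0:
            continue  # percent change is undefined for a zero baseline
        changes[name] = 100.0 * (deceptive[name] - base) / base
    return changes


# Made-up feature values (assumptions, not data from the slides).
regular = {"avg_sentence_length": 20.0, "personal_pronoun": 50.0}
imitation = {"avg_sentence_length": 10.0, "personal_pronoun": 75.0}

changes = percent_change(regular, imitation)
print(changes)
```

A negative value means the author used the feature less when being deceptive; a positive value means they used it more.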

  17. Problem with the dataset: Topic Similarity
      •  All the deceptive documents were on the same topic.
      •  Non-content-specific features have the same effect as content-specific features.

  18. Hemingway-Faulkner Imitation Corpus
      •  Articles from the International Imitation Hemingway Contest (2000-2005)
      •  Articles from the Faux Faulkner Contest (2001-2005)
      •  Original excerpts of Ernest Hemingway and William Faulkner

  19. Deception detection is possible even when the topic is not similar
      •  81.2% accurate in detecting imitated documents.

  20. Long term deception: A Gay Girl In Damascus
      [Images: Thomas MacMaster; fake picture of Amina Arraf.]
      –  Original author was a 40-year old American citizen, Thomas MacMaster.
      –  Pretended to be a Syrian gay woman, Amina Arraf.
      –  The author worked for at least 5 years to create a new style.

  21. Long term deception is hard to detect
      •  None of the blog posts were found to be deceptive.
      •  But regular authorship recognition can help.
      •  We tried to attribute authorship of the blog posts using Thomas (as himself), Thomas (as Amina), Britta (Thomas’s wife).

  22. Long term deception
      Authorship recognition of the blog posts
      [Chart labels as extracted: Thomas MacMaster; Amina Arraf; Britta (Thomas’s wife). Values as extracted: 54%, 43%, 3%.]

  23. Future work
      •  Intrusion detection
      •  Social spam detection
      •  Identifying quality discourse

  24. Two Tools
      •  JStylo: Authorship Recognition Analysis Tool.
      •  Anonymouth: Authorship Recognition Evasion Tool.
      •  Free, Open Source. (GNU GPL)
      •  Alpha releases available today at https://psal.cs.drexel.edu
         –  Migrating to GitHub soon.

  25. Privacy, Security and Automation Lab (https://psal.cs.drexel.edu)
      •  Faculty
         –  Dr. Rachel Greenstadt
      •  Graduate Students
         –  Sadia Afroz (Deception Detection Lead)
         –  Diamond Bishop
         –  Michael Brennan
         –  Aylin Caliskan
         –  Ariel Stolerman (JStylo Lead Developer)
      •  Undergraduate Students
         –  Pavan Kantharaju
         –  Andrew McDonald (Anonymouth Lead Developer)

