Sadia Afroz: Detecting Hoaxes, Frauds, and Deception in Writing Style Online
 


    • Detecting Deception in Writing Style
Sadia Afroz, Michael Brennan and Rachel Greenstadt.
Privacy, Security and Automation Lab
Drexel University

    • Overview
•  Authorship recognition
•  Authorship recognition in adversarial environment
•  Deception detection
•  Experiments on different datasets

    • Authorship recognition
Who wrote the document?

    • Authorship recognition
Stylometry:
 –  An authorship recognition system based solely on writing style.
 –  Not handwriting
 –  Only linguistic style: word choice, sentence length, parts-of-speech usage, …
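A minimal sketch of the kind of linguistic features named above (word choice and sentence length). The function name `style_features` and the short `FUNCTION_WORDS` list are illustrative assumptions, not the feature sets used in the talk; parts-of-speech usage is left out because it needs a tagger.

```python
import re
from collections import Counter

# Small illustrative function-word list (an assumption); real feature sets are far larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")

def style_features(text):
    """Return a dict of simple writing-style features for one text sample."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    counts = Counter(words)
    features = {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "unique_word_ratio": len(counts) / len(words),
        "short_word_ratio": sum(1 for w in words if len(w) <= 3) / len(words),
    }
    # Word choice: relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / len(words)
    return features
```

Calling `style_features("The cat sat. The dog ran to the cat.")` yields an average sentence length of 4.5 words, since the sample has nine words in two sentences.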

    • Why does it work?
•  Everybody has learned language differently

    • How regular authorship recognition works
Extract features → Machine Learning System

    • Document of unknown authorship → Extract features → Machine Learning System → Determine authorship
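The pipeline on the two slides above can be sketched as follows, under stated assumptions: three toy features stand in for a real feature set, and a nearest-centroid rule stands in for the machine learning system (the talk's experiments mention NN and SVM classifiers, which this is not). The names `extract`, `train` and `attribute` are hypothetical.

```python
import math
import re

def extract(text):
    """Map a document to a small numeric feature vector (illustrative features)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [
        len(words) / max(len(sentences), 1),              # average sentence length
        sum(len(w) for w in words) / max(len(words), 1),  # average word length
        len(set(words)) / max(len(words), 1),             # unique-word ratio
    ]

def train(samples_by_author):
    """Average each author's feature vectors into one centroid per author."""
    centroids = {}
    for author, texts in samples_by_author.items():
        vectors = [extract(t) for t in texts]
        centroids[author] = [sum(col) / len(col) for col in zip(*vectors)]
    return centroids

def attribute(text, centroids):
    """Return the author whose centroid is nearest to the document's features."""
    vec = extract(text)
    return min(centroids, key=lambda author: math.dist(vec, centroids[author]))
```

Training takes a dict from author name to a list of known writing samples; `attribute` then picks one of those authors for a document of unknown authorship. The features are left unscaled for brevity, so the feature with the largest range dominates the distance.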

    • Assumptions
•  Writing style is invariant.
 –  It’s like a fingerprint, you can’t really change it.
 –  Authorship recognition can identify you if there are sufficient writing samples and a set of suspects.

    • Adversarial Attacks
•  Imitation or framing attack
 –  Where one author imitates another author
 –  Participants were asked to imitate Cormac McCarthy in writing about their day
•  Obfuscation attack
 –  Where an author hides his regular style
 –  Participants were asked to write about their neighborhood in a different style
M. Brennan and R. Greenstadt. Practical attacks against authorship recognition techniques. In Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence (IAAI), Pasadena, CA, 2009.

    • Accuracy in detecting authorship of regular documents
More than 80% accurate in detecting authorship with 40 authors in regular documents
[Figure: accuracy (0 to 1) against number of authors (5 to 40). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]
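For scale, the Random series can be compared with a uniform guess. This is a sketch assuming every candidate author is equally likely, so a random guess is right with probability 1/N; the transcript does not say how the chart's Random line was computed, and `random_chance_accuracy` is a hypothetical helper.

```python
def random_chance_accuracy(num_authors):
    """Expected accuracy of a uniform random guess among num_authors candidates."""
    if num_authors < 1:
        raise ValueError("num_authors must be at least 1")
    return 1.0 / num_authors

# Baseline at a few points on the chart's horizontal axis.
baselines = {n: random_chance_accuracy(n) for n in (5, 10, 20, 40)}
```

Under that assumption the baseline for 40 authors is 0.025, against the more than 80% reported on the slide.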

    • Accuracy in detecting authorship of Obfuscated documents
Accuracy is less than random chance in obfuscated documents
[Figure: accuracy (0 to 1) against number of authors (5 to 40). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

    • Accuracy in detecting authorship of Imitated documents
Accuracy is less than random chance in imitated documents
[Figure: accuracy (0 to 1) against number of authors (5 to 40). Series: 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), Random.]

    • Can we detect Stylistic Deception?
Imitated / Regular / Obfuscated

    • Extended-Brennan-Greenstadt Corpus
•  56 authors
 –  12 of the participants are from Drexel University
 –  The rest are paid workers from Amazon Mechanical Turk
•  Three kinds of writing samples
 –  Regular writings (5000 words)
 –  Imitated writing
 •  A 500-word article describing a day
 •  Imitate Cormac McCarthy from ‘The Road’
 –  Obfuscated writing
 •  A 500-word article describing neighborhood
 •  Hide own writing style

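    • Detecting stylistic deception is possible
[Figure: bar chart, vertical axis 0 to 100, grouped by Regular, Imitation, Obfuscation. Series: Writeprint, SVM; Lying-detection, J48; 9-feature set, J48. Bar labels in transcript order (their assignment to bars is not recoverable): 98, 94.5, 89.5, 95.7, 85, 75.3, 59.9, 48, 43.]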

    • Feature Changes in Imitation and Obfuscation
[Figure: horizontal bar chart, axis from -80 to 100, with bars for Imitation and Obfuscation per feature. Features: personal pronoun, sentence count, particle, short words, verb, unique words, adverb, existential there, average syllables per word, average word length, adjective, cardinal number, Gunning-Fog readability index, average sentence length.]
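One way to arrive at a per-feature change like the one charted above is sketched here, assuming the axis is a percent change relative to the author's regular writing (the transcript does not confirm the unit). The helper `percent_change` and the numbers in the usage note are hypothetical.

```python
def percent_change(regular, deceptive):
    """Percent change of each shared feature, relative to its regular value.

    Both arguments map feature names to numeric values. Features missing
    from `deceptive` or with a zero regular value are skipped, because the
    relative change is undefined for them.
    """
    changes = {}
    for name, base in regular.items():
        if name in deceptive and base != 0:
            changes[name] = 100.0 * (deceptive[name] - base) / base
    return changes
```

With made-up values, `percent_change({"avg_sentence_length": 20.0}, {"avg_sentence_length": 10.0})` gives a change of -50.0 for that feature, meaning sentences became half as long.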

    • Problem with the dataset: Topic Similarity
•  All the deceptive documents were of the same topic.
•  Non-content-specific features have the same effect as content-specific features.
[Figure: chart not recoverable from the transcript.]

    • Hemingway-Faulkner Imitation Corpus
•  Articles from the International Imitation Hemingway Contest (2000-2005)
•  Articles from the Faux Faulkner Contest (2001-2005)
•  Original excerpts of Ernest Hemingway and William Faulkner

    • Deception detection is possible even when the topic is not similar
•  81.2% accurate in detecting imitated documents.

    • Long term deception: A Gay Girl In Damascus
[Images: Thomas MacMaster; fake picture of Amina Arraf.]
–  Original author was a 40-year-old American citizen, Thomas MacMaster.
–  Pretended to be a Syrian gay woman, Amina Arraf.
–  The author worked for at least 5 years to create a new style.

    • Long term deception is hard to detect
•  None of the blog posts were found to be deceptive.
•  But regular authorship recognition can help.
•  We tried to attribute authorship of the blog posts using Thomas (as himself), Thomas (as Amina), and Britta (Thomas’s wife).

    • Long term deception
Authorship recognition of the blog posts
Thomas MacMaster: 54%
Amina Arraf: 43%
Britta (Thomas’s wife): 3%
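Percentages like these can come from tallying a classifier's per-post predictions. A minimal sketch, assuming one predicted label per blog post; `attribution_shares` and the labels in the usage note are hypothetical, and the talk does not say how many posts were classified.

```python
from collections import Counter

def attribution_shares(predicted_labels):
    """Return the percentage of posts attributed to each candidate label."""
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {label: 100.0 * n / total for label, n in counts.items()}
```

For four hypothetical predictions `["x", "x", "y", "x"]`, the result attributes 75.0 percent of the posts to `x` and 25.0 percent to `y`.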

    • Future work
•  Intrusion detection
•  Social spam detection
•  Identifying quality discourse

    • Two Tools
•  JStylo: Authorship Recognition Analysis Tool.
•  Anonymouth: Authorship Recognition Evasion Tool.
•  Free, Open Source. (GNU GPL)
•  Alpha releases available today at https://psal.cs.drexel.edu
 –  Migrating to GitHub soon.

    • Privacy, Security and Automation Lab (https://psal.cs.drexel.edu)
•  Faculty
 –  Dr. Rachel Greenstadt
•  Graduate Students
 –  Sadia Afroz (Deception Detection Lead)
 –  Diamond Bishop
 –  Michael Brennan
 –  Aylin Caliskan
 –  Ariel Stolerman (JStylo Lead Developer)
•  Undergraduate Students
 –  Pavan Kantharaju
 –  Andrew McDonald (Anonymouth Lead Developer)