The language of reasons
Tyler Schnoebelen
Conspiracy, complaints,
and fraud
2
1) Showing how computational linguistics solves business problems
2) Identifying markers of fraud using language data
For company-internal fraud/compliance investigators
For government/regulatory/consumer advocacy
3) Detecting and using rationalization and reason-giving
The importance of emotion
The case of because in
Consumer complaints
Conspiracy forum posts
Hi! Welcome to the slides for this talk—also
check out the Notes. Basically this talk is about:
3
Fraud
4
5
The Association of Certified Fraud Examiners
looked at 1,483 fraud cases reported in 2014
They estimate global fraud loss is at least 5%
of revenue for companies
Estimate of losses to fraud, worldwide
$3.7 trillion
6
7
8
Financial statement fraud is much more
expensive
9
By industry
10
For the big dollars, look to the top
11
What are the red flags for fraudsters?
12
Losses are rarely recovered
13
Detecting deception
14
Prior work tends to be “word lists” or
experiments
L&Z used 29,663 transcribed quarterly
earnings calls
16,577 CEO Q&A responses
14,462 CFO Q&A responses
L&Z keep track of when quarterly financial
statements were later restated (so at the time of the
original call, executives knew something was amiss)
Depending on strictness of restatement, 14%,
7% or 5% of the calls had deception in them.
Larcker & Zakolyukina (2010)
15
Larcker & Zakolyukina (2010)
Marker | CEOs | CFOs
References to general knowledge (you know) | more | more
Non-extreme positive emotion words | fewer | fewer
References to shareholder value/value creation | fewer | fewer
Self-references | fewer |
3rd person plural/impersonal pronouns | more | fewer
Extreme negative emotion words | fewer |
Extreme positive emotion words | more |
Certainty words | fewer | more
Hesitation words | fewer | more
16
Text analytics
17
Text analytics
18
Linguistics:
Scientific study
of language
Machine Learning:
Automatically train
computers to make
human-like decisions
● Compliance monitoring
● Enterprise search
● E-Communications surveillance
● Technology assisted review
● Sentiment analysis
● Deception detection
● Text summarization
Natural Language Processing:
Enable machines to automatically derive
meaning from natural language input
Fraud and compliance in digital
communications
19
Early case
assessment
Relevancy
filtering
Risk
Scoring
Key entities
Strategic communications
Spam, Newsletters
Near de-duplication
Fraud diamond
Sentiment
Personal communications
Investigation stage | Models | Data volume
100%
30%
10%
< 1%
Top-down vs. bottom-up text analytics
20
“Bribe” …
“Tea money”
“Facilitation
payment” “Backhander”
Top-down (Search)
● Rules-based
● False positives/negatives
● Brittle
Bottom-up (Discovery)
● Statistical
● Highly accurate
● Adaptive
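The top-down approach can be sketched in a few lines. This is a minimal illustration, assuming a hand-built term list: the four terms come from the slide, while the sample messages and the function name `keyword_hits` are made up here. It shows why rule lists tend to be brittle: a paraphrase or a related word form slips through.

```python
import re

# Terms from the slide; a real investigation would need a much longer list.
BRIBERY_TERMS = ["bribe", "tea money", "facilitation payment", "backhander"]

# One case-insensitive pattern with word boundaries around each term.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in BRIBERY_TERMS) + r")\b",
    re.IGNORECASE,
)


def keyword_hits(message: str) -> list[str]:
    """Return the listed terms found in a message, lowercased."""
    return [m.group(0).lower() for m in PATTERN.finditer(message)]


# Hypothetical messages, invented for illustration.
messages = [
    "He asked for tea money before he would sign.",
    "Let's sweeten the deal for the inspector.",  # paraphrase: no listed term
    "The bribery statute has changed.",           # "bribery" is not "bribe"
]
hits = [keyword_hits(m) for m in messages]
```

Only the first message is flagged. The second is arguably the most interesting one, which is the kind of gap a bottom-up statistical model is meant to close.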
21
Comparing rules vs. machine learning
22
High accuracy on complex task after only 1 day of work
Project goal: Uncover key documents relevant to Energy Regulation
out of 200,000 messages that matched raw keywords
23
Flexible Ontology
23
Develop rich ontology for investigative analytics
and insights at scale
Cline
Cline
1
Client’s Questions Known Areas of Interest
Pressure
Rationalization
Names
Opportunity
Capability
Emotions
Topic Modeling
Themes
?
2
21
3
24
Data gets smarter and more accurate through
adaptive system
Adaptive System Structured Data Reports
Action
• Annotation suggestions
• Document priority
• Shortest path for coverage
• Error detection
Machine
Learning
Optimization
Prediction
Engine
Human
Review
4 5 6
6
Idibon’s models drive more accurate, scalable
investigations of fraud
25
1. Identify indicative language
• Identify and extract indicators correlated with fraud
• Gather data from disparate structured, unstructured, public,
and private data sources
2. Model fraud within the organization
• Score and rank individual custodians by likelihood of fraud
• Summarize indicators of fraud by department or scheme
3. Scale across people and clients
• Model fraud using documents from multiple custodians
• Build replicable models for different client types
4. Monitor and track risk
• Model on-going risks in client interactions
• Track known liability or non-compliance issues
Detecting fraud requires a variety of models
Strategic Communications: Automatically identify important communications
based on the language used in emails with a BCC recipient
Fraud Triangle and Fraud Diamond: Identify messages containing indicators
of Motive, Opportunity, Rationalization and Capability to risk-rank actors and
their communications
Key Entities: Discover people, places, organizations, and other entities
mentioned in communications to uncover hidden relationships
Personal Messages: Flag messages that are intimate in nature and that may
contain evidence of illicit behavior or collusion
Sentiment Analysis: Categorize communications as positive, neutral, or
negative
Taboo Words and Obscenity: Identify emotionally charged language that may
reflect behaviors and events of interest
enron report merger
(Corporate
communications about
mergers that you
probably DON’T care
about)
27
Find needles in haystacks: quickly hone in on
relevant areas of the data
legal f&j citizens
“I also find the advance ethical
waiver language repugnant, but
could agree to it if the other
modifications mentioned could be
made.”
employees enron
bankruptcy
“Michelle, here is a
suggested revision to
Section 3.4 B … If a
terminated employee who
is entitled to receive a
severance benefit … the
severance benefit payable
under the Plan shall be
reduced and offset”
time good back
(Lots of irrelevant stuff about
home, weekends, Thanksgiving,
etc.)
Sentiment analysis and automatic topic
discovery reveal significant communications
28
Negative: Antitrust issues, M&A, Insider Trading
Positive: Product Releases, Employment
29
The Fraud Triangle (and
briefly, the Fraud Diamond)
30
The Fraud Triangle
Rationalization
Opportunity
FRAUD
SCORE
Pressure
31
The Fraud Triangle
Rationalization
Opportunity
FRAUD
SCORE
Pressure
Pressure:
Incentives,
wants, needs
(e.g., gambling
debts)
32
The Fraud Triangle
Rationalization
Opportunity
FRAUD
SCORE
Pressure
Opportunity:
Weaknesses in
the system that
allow fraud to
happen
Image Placeholder
33
Capability
(Makes it a Fraud Diamond)
Personal traits and abilities
• Effective lying
• Immunity to stress
• Intelligence
• Confidence
34
35
36
But let’s return to the peaky point
Rationalization
Opportunity
FRAUD
SCORE
Pressure
Rationalization:
Committing
fraud is worth
the risk
37
38
When and how do people
give reasons?
39
Because because because because
40
41
Conditions:
• “Excuse me, I have (5 or 20) pages. May I use the Xerox machine?” (no-
because)
• “Excuse me, I have (5 or 20) pages. May I use the Xerox machine, because
I’m in a rush?” (because)
• “Excuse me, I have (5 or 20) pages. May I use the Xerox machine, because
I have to make copies?” (because-empty)
The idea here is that the because-empty clause offers no information.
For 5 pages: because = because-empty >> no-because
Though when stakes are higher (20 pages): because > because-empty >
no-because
Langer, Blank and Chanowitz (1978)
42
• “Given” information comes before “new”—so usually people say “such
and such happened because of X” rather than “Because of X, such and
such happened”
• Given: what’s been said already, inferable, familiar, expected
• Easier to process new information when it’s framed
• See Chafe (1984) and lots of others
• “causal clauses are primarily used to back up a previous statement that
the hearer may not accept or may not find convincing” (Diessel 2006)
• Conversation analysts find becauses offered by either speaker right
before a disagreement
• In English speech, they are surrounded by pauses, hesitations, excuses,
mitigations, indirectness, partial agreement, polarity reversals (see Ford &
Mori 1994)
Quick lit review
43
Two main coherence relations: cause-consequence and argument-claim
Causality and Subjectivity are key
Consider:
The sun was shining CONNECTIVE the temperature rose quickly
Causality
The neighbors’ lights are out CONNECTIVE they are not at home
Subjectivity
Some languages use different connectives
Sanders (2003)
Causality Subjectivity
Dutch doordat want
French parce que puisque
German weil denn
44
Children learning English learn things in this order (Bloom et al 1980):
Additive < Temporal < Causal < Adversative
and < and then < because < so < but
That is, causal connectives are seen as more complex (see also Piaget
1924/1969, Katz & Brent 1968, Clark 2003, Evers-Vermeul 2005)
BUT causally connected information is remembered better
And causal relations are read faster
Reading time decreases when causality increases
More Sanders (2003)
45
Digression! A new
construction!
46
47
24k “because X” tweets
48
Because X is mostly playful but has strong
affective underpinnings
49
Because and emotions
In soap operas, guess what the word most
associated with because is?
In the British Parliament, one of the words most
associated with because…
Affect and emotion are bound up in
discussions of reasoning and cognition
• Damasio (1994)
• Kahneman (2003)
• Matthews and Wells (1994)
• Zajonc (1980)
• Loewenstein et al. (2001)
• LeDoux (1998)
Reasoning needs emotions
“A sophisticated well-being monitor and
guidance system that serves both attention-
regulatory and motivational functions” (Smith
and Kirby 2000: 90).
What are emotions?
54
The need to convey and assess feelings,
moods, dispositions, and attitudes is as critical
as describing events.
We don’t just need to know predications, we
need to know affective orientation to the
predication.
(See the appendix for lots of ways that other
languages encode emotional information)
Emotions are expressed in language
55
Consumer complaints about
banks and credit agencies
56
An act or practice is unfair when:
(1) It causes or is likely to cause substantial injury to consumers;
(2) The injury is not reasonably avoidable by consumers;
(3) The injury is not outweighed by countervailing benefits to consumers or to
competition.
An act or practice is deceptive when
(1) The act or practice misleads or is likely to mislead the consumer;
(2) The consumer’s interpretation is reasonable under the circumstances;
(3) The misleading act or practice is material.
UDAAP (Unfair, Deceptive, or Abusive Acts or
Practices)
57
Whoa.
58
59
Consumers detect fraud, too
Data source: Consumer Financial Protection Bureau
21,206 consumer narratives
About banks and credit agencies
25% have the word “because” in them
(Limiting this study to because; also worth looking at
are becuase, cuz, since, therefore etc.)
Companies/governments want to detect fraud
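A rough sketch of how the share of narratives containing "because" could be counted. The regular expression, function names, and sample texts are assumptions for illustration (the first sample sentence echoes an example used later in the deck; the rest are invented). As in the study, spelling variants such as "becuase" and "cuz" are not matched.

```python
import re

# Whole-word, case-insensitive match for "because" only.
BECAUSE = re.compile(r"\bbecause\b", re.IGNORECASE)


def count_because(text: str) -> int:
    """Number of times "because" appears in one narrative."""
    return len(BECAUSE.findall(text))


def share_with_because(narratives: list[str]) -> float:
    """Fraction of narratives that contain at least one "because"."""
    if not narratives:
        return 0.0
    with_because = sum(1 for n in narratives if count_because(n) > 0)
    return with_because / len(narratives)


# Hypothetical narratives for illustration.
sample = [
    "I am near tears because I don't know what to do.",
    "They never responded to my letter.",
    "BECAUSE of the error, and because nobody called back, I filed this.",
    "The account was closed without notice.",
]
share = share_with_because(sample)
```

The same `count_because` helper also gives the number of becauses per complaint.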
60
Complaints with-because are much longer
Median word count: 244 for because-narratives vs. 106 for no-because narratives
61
Becauses per complaint
Single "because": 68%
Two becauses: 21%
Three or more: 11%
62
Becauses happen much more in:
• Bank account or service
• Mortgage
And less often (proportionally) in:
• Credit reporting
• Debt collection
The categories most/least because-y
63
Result: We strongly suggest someone look into Citimortgage’s business
practices,
Cause: because at best they are completely incompetent, and at worst
they are committing acts of fraud
Both Result-Cause and Cause-Result can happen
But as in most studies, Result-Cause accounts for the vast majority (here
~95%)
Structure of becauses
64
“Verifiable if you just had a transcript”
Objective-Result / Objective-Cause
They said I owed $10,000
because I didn’t pay my bill for 3 months
“Not-verifiable even if you had a transcript”
Subjective-Result / Subjective-Cause
I am near tears
because I don’t know what to do
Krippendorff’s alpha (inter-annotator agreement): 0.85
That’s very good agreement
Highest for Objective-Cause
Lowest for Objective-Result (exactly what is the scope)
All easily distinguishable—collapsing categories does not result in
higher alpha value
3 annotators, 4 annotation types
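For readers who have not met Krippendorff's alpha, here is a minimal sketch of the nominal-data version, computed from a coincidence matrix as alpha = 1 - D_o / D_e. The function name and the toy labels are assumptions for illustration; this is not the code or data behind the 0.85 reported above.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per annotated item, holding the labels that the
    annotators gave it (a missing annotation is simply left out).
    """
    # Coincidence matrix: every ordered pair of labels from different
    # annotators on the same item adds 1 / (m - 1), m = labels on the item.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    disagreement = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one category was ever used
    return 1 - disagreement / expected


# Toy data: two annotators, four clauses, one disagreement.
toy = [["obj", "obj"], ["obj", "subj"], ["subj", "subj"], ["subj", "subj"]]
alpha = krippendorff_alpha_nominal(toy)
```

Alpha is 1 for perfect agreement and 0 when agreement is what chance alone would produce, which is why 0.85 counts as very good.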
65
The Idibon team! Thanks to Jason and Nick
66
40% are Subjective-Result + Subjective-Cause
33% are Objective-Result + Objective-Cause
17% are Objective-Result + Subjective-Cause
10% are Subjective-Result + Objective-Cause
A preference for matching types
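One way to check the preference for matching types is to compare the observed share of matching pairs with what independent choices would give. The sketch below uses only the four percentages on this slide; the independence baseline is an added assumption, not something the slide reports.

```python
# Shares from the slide, keyed by (result type, cause type).
observed = {
    ("Subjective", "Subjective"): 0.40,
    ("Objective", "Objective"): 0.33,
    ("Objective", "Subjective"): 0.17,
    ("Subjective", "Objective"): 0.10,
}

# Marginal shares for results and for causes.
result_share = {}
cause_share = {}
for (result, cause), p in observed.items():
    result_share[result] = result_share.get(result, 0.0) + p
    cause_share[cause] = cause_share.get(cause, 0.0) + p

# Share of pairs whose result and cause have the same type.
observed_match = sum(p for (r, c), p in observed.items() if r == c)

# Share expected if result type and cause type were chosen independently.
expected_match = sum(result_share[t] * cause_share[t] for t in result_share)
```

Results split evenly (50% subjective, 50% objective) and causes lean subjective (57% vs. 43%), so independence would predict about 50% matching pairs; the observed 73% is well above that.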
67
If you talk about your home, you aren’t
objective
Subjective-Causes vs. Objective-Causes
68
There’s really no difference between
Subjective Results and Objective
Results
There’s also no difference between
Subjective Results and Objective
Causes
Each of these tends to have a median of
about 66 characters
But Subjective Causes are quite a bit
different—a median of 84 characters
(significant, p = 0.009303 by Wilcoxon
test)
Affective information gets length
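The length comparison can be illustrated with a bare-bones Wilcoxon rank-sum (Mann-Whitney U) test. This sketch uses the normal approximation with average ranks for ties and no tie or continuity correction, so it will not reproduce the p-value above exactly. The function name and the character counts are invented for illustration.

```python
import math


def rank_sum_test(xs, ys):
    """Return (U statistic for xs, two-sided p-value by normal approximation)."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, vals in enumerate((xs, ys)) for v in vals)

    # Sum the ranks of the xs values, giving tied values their average rank.
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += average_rank * sum(
            1 for k in range(i, j) if pooled[k][1] == 0
        )
        i = j

    n1, n2 = len(xs), len(ys)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical clause lengths in characters.
subjective_causes = [84, 90, 101, 120]
objective_causes = [60, 62, 66, 70]
u_stat, p_value = rank_sum_test(subjective_causes, objective_causes)
```

With samples this small an exact test would be preferable; the approximation is shown only because it fits in a few lines.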
69
because I found they have dealt fraudulently
with many, many consumers
because the matter has not been handled in
accordance with the law
BECAUSE NOW SPRINGLEAF FINANCIAL
WOULD NOT WORK WITH THE NEW
TRUSTEE OF THE TRUST
because Nationstar has dragged its feet in the
face of its SIGNIFICANT error
Some examples of Subjective-Causes
70
• Breakdown in process (repeated attempts, again, for more than, once
again, over and over, again and again)
• Unresponsiveness (nothing happened, did not respond)
• Misrepresentation (deceived, lied, misled, scam, told me that)
• Omission (did not tell me, failed to reveal, failed to bring to my
attention)
• Emotion (my fear is that, i am angry that, frustrating)
• Subjective terms (patiently, unfair, not fair, unreasonable, struggling,
sickening, absurd, allowed to do this, tedious)
• Dialogue acts (request, deny, thank, complain, refuse, accept)
• Mortgage processes (refinance, modification, refer, appeal, assistance)
Concepts in the cause and result clauses
71
72
Fraud: How do people treat
their companies
Complaints: How do
companies treat their
consumers?
Now: How do people treat
each other?
Finding healthy communities (supportive)
And unhealthy ones (toxic)
78
Basically all of Reddit, Jan - May 2015
266m posts
96k forums (“subreddits”)
Most popular:
• /r/AskReddit (21m posts)
• /r/leagueoflegends (5m)
• /r/funny (4m)
• /r/pics (3m)
• /r/nfl (3m)
• /r/nba (3m)
Data details
79
Distribution of “because” across subreddits
Median % of posts with because across
subreddits with 50k+ posts (758 subreddits): 5.44%
Top quartile: 7.25%
Bottom quartile: 3.95%
80
/r/changemyview (21%)
/r/DebateAChristian (19%)
/r/PurplePillDebate (18%)
/r/DebateReligion (17%)
/r/AgainstGamerGate (17%)
/r/truegaming (17%)
/r/DebateAnAtheist (17%)
/r/philosophy (16%)
/r/raisedbynarcissists (16%)
/r/PoliticalDiscussion (16%)
/r/listentothis (15%)
/r/relationship_advice (15%)
/r/relationships (15%)
/r/Anxiety (14%)
/r/ADHD (14%)
Examples of most-because-y subreddits
81
/r/podemos (0%)
/r/newsokur (0%)
/r/sweden (0%)
/r/gonewild (1%)
/r/randomsuperpowers (1%)
/r/ACTrade (1%)
/r/GlobalOffensiveTrade (1%)
/r/millionairemakers (1%)
/r/SVExchange (1%)
/r/PercyJacksonRP (1%)
/r/YamkuHighSchool (1%)
/r/XMenRP (2%)
/r/hardwareswap (2%)
/r/rwbyRP (2%)
/r/thebutton (2%)
Examples of the least because-y
82
83
84
85
This presentation draws on
insights from Jana Thompson,
one of our NLP Engineers, and
Charissa Plattner, one of our
summer interns
Co-conspirators!
86
385k posts
30k have “because” (7.81%)
Posts with “because” tend to
score higher for
“controversiality”
They are also significantly
longer (p < 2.2e-16 by
Wilcoxon rank sum test)
/r/conspiracy
87
Counting "deleted" and "AutoModerator" as
real users, there are 32,024 different
users who posted in /r/conspiracy from Jan-May
2015.
1,064 of them have 50 or more posts.
The median % of posts with because is 7.19%
• Top quartile: 11.43%
• Bottom quartile: 4.02%
A view of authors
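The per-author view could be computed along these lines. It is a sketch under assumptions: posts arrive as (author, text) pairs, the 50-post threshold is a parameter, and the function names are hypothetical. Quartiles use the standard library's inclusive method, which may differ slightly from whatever tool produced the figures above.

```python
import re
from collections import defaultdict
from statistics import quantiles

BECAUSE = re.compile(r"\bbecause\b", re.IGNORECASE)


def because_rates(posts, min_posts=50):
    """Map each author with at least min_posts posts to the share of
    their posts that contain "because".

    posts: iterable of (author, text) pairs.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for author, text in posts:
        total[author] += 1
        if BECAUSE.search(text):
            hits[author] += 1
    return {a: hits[a] / n for a, n in total.items() if n >= min_posts}


def rate_quartiles(rates):
    """Return (bottom quartile, median, top quartile) of a list of rates."""
    q1, median, q3 = quantiles(rates, n=4, method="inclusive")
    return q1, median, q3


# Tiny hypothetical sample; the threshold is lowered so it has an effect.
posts = (
    [("a", "x because y")] * 2 + [("a", "no reason given")] * 2
    + [("b", "Because.")] + [("b", "nothing here")] * 3
    + [("c", "because")]
)
rates = because_rates(posts, min_posts=4)
```

Author "c" is dropped for having too few posts, which is the point of the threshold: a single post would otherwise yield a rate of 0% or 100%.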
88
Those who pay decent rent are doing so because they've been living in a
rent controlled area for a LONG time.
• This is preceded by a paragraph all about rent prices
• All Caps Evaluative
So, because it's minor at first that would possibly embolden them? You
can't be serious...
• So vs. oh, the importance of questions and rhetoric
• Preposed because (given/new)
Slaves? Are we literally whipped bloody when we don't do as master says
(or just because he wants to).
• Adversative: ends with, “Do you have any clue what slavery really is?”
Some examples from big-because users
89
There are 384,839 posts in this time frame.
They roll up to 222,818 "parent_id" threads.
For threads that have 50+ posts (there are only
144 of them), the median % of posts with
"because" is 5.61%.
• Top quartile: 8.14%
• Bottom quartile: 3.33%
For threads that have 15-49 posts (1,181 of
them), the median % of posts with "because" is
5.88%.
• Top quartile: 10.53%
• Bottom quartile: 0%
A view of threads
90
91
92
• JFK (head autopsy paper wound jfk)
• 9/11 buildings (building collapse steel fire wtc)
• aliens (humans earth life evolution aliens)
• 9/11 (9 11 bin laden attacks)
• space (earth moon nasa gravity apollo)
They avoid…
• media (conspiracy media news government propaganda)
• US politics (law vote obama federal president congress)
• More JFK (don't kennedy)
• moderation (reddit post comments mods banned)
• family/harm (children school kids mother abuse)
Where do authors who like because go?
93
The because-irrific authors use a median 901
characters per post
The least-because-y use 615 characters per
post
Within because posts…
94
Are because users just wordy?
Or is it that because users hang out in threads
where there’s just a lot more because?
Answer: Basically some topics are just wordier
than some others (see next two slides about
length)
What is driving length?
95
Length of posts by topic/author disposition
(longest)
Rank | Everyone | Prolific becausers | Because avoiders
1 | JFK (2089 char) | 9/11 (2747 char) | More JFK (2157 char)
2 | 9/11 (1834 char) | JFK (2464 char) | aliens (1321 char)
3 | More JFK (1784 char) | More JFK (2130 char) | reality (1270 char)
4 | 9/11 buildings (1489 char) | 9/11 buildings (1962 char) | 9/11 (1113 char)
5 | aliens (1313 char) | aliens (1800 char) | religion (917 char)
96
Length of posts by topic/author disposition
(shortest)
Rank | Everyone | Prolific becausers | Because avoiders
25 | criticism (534 char) | moderation (695 char) | climate change (392 char)
24 | moderation (564 char) | criticism (744 char) | moderation (439 char)
23 | media (653 char) | meta-conspiracy (816 char) | criticism (440 char)
22 | meta-conspiracy (666 char) | media (900 char) | race (459 char)
21 | race (721 char) | internet (913 char) | food/health (489 char)
97
7.8% of posts in /r/conspiracy have “because”
16,069 of the posts in /r/conspiracy have
language around fraud
If “because” were independent of fraud language, we’d expect
about 1,255 posts to have both “because” and fraud/etc.
Instead we find 3,491 (21.7% of the fraud-language posts).
What about claims about fraud, illegality,
bamboozlement, etc?
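The arithmetic behind the expected count can be spelled out. The sketch assumes the baseline is simple independence: the overall "because" rate (7.81%, from the /r/conspiracy slide) applied to the fraud-language posts. The variable names are illustrative.

```python
because_rate = 0.0781         # share of all /r/conspiracy posts with "because"
fraud_posts = 16_069          # posts with language around fraud
observed_both = 3_491         # posts with both "because" and fraud language

# Under independence, fraud posts would contain "because" at the overall rate.
expected_both = because_rate * fraud_posts

# How many times more often the two co-occur than independence predicts.
lift = observed_both / expected_both

# Share of fraud-language posts that actually contain "because".
because_rate_in_fraud_posts = observed_both / fraud_posts
```

The expected count rounds to 1,255, so "because" shows up in fraud-language posts close to 2.8 times as often as the baseline predicts. Some of that gap may simply reflect post length, since longer posts have more room for both.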
98
Wrapping up
99
1) Showing how computational linguistics solves business problems
2) Identifying markers of fraud using language data
For company-internal fraud/compliance investigators
For government/regulatory/consumer advocacy
3) Detecting and using rationalization and reason-giving
The importance of emotion
The case of because
Your thoughts on next steps?
Reviewing where we’ve been
100
There are links between rationalization and because usage that can help
with applications of the fraud diamond/triangle
The different ways people use/don’t use because can help us understand
the psychological state of fraudsters and the information of people who
may be encountering it
On because
101
Fraud and compliance in digital
communications
102
Early case
assessment
Relevancy
filtering
Risk
Scoring
Key entities
Strategic communications
Spam, Newsletters
Near de-duplication
Fraud diamond
Sentiment
Personal communications
Investigation stage | Models | Data volume
100%
30%
10%
< 1%
103
Processing millions of SMS in 12 African languages
Intent of sender
(i.e. report a problem,
ask a question or make
a suggestion)
Categorization
(i.e. orphans and
vulnerable children,
violence against
children, health,
nutrition)
Language
detection
(i.e. English, Acholi,
Karamojong, Luganda,
Nkole, Swahili, Lango)
Location
(i.e. village names)
105
Understand language data like never before
106
Thank you
@idibon.com
twitter.com/idibon
idibon.com
107
• Given-then-new information (result-then-cause in his small corpus, too)
• Given as what’s been said
• Inferable, familiar, expected
• New as unfamiliar, unexpected, unpredictable
• The rare times that because is initial, it acts as a guidepost for
information flow
• Like however, anyway, for example, on the other hand
• “A guidepost par excellence is ‘meanwhile, back at the ranch’.”
• People as orienting the information for upcoming clauses
• A more general strategy of giving a frame
• Third case (That in itself was scary, cause I never fainted before) is
sequential and meant to add to the first assertion
• An “afterthought”
Chafe (1984)
108
Ordering is about functional and cognitive pressures (draws on Hawkins
1994, 2004):
• Syntactic parsing
• Discourse pragmatics
• Semantics
Result-then-cause order violates iconicity of sequence, yet it is the
most attested
• “causal clauses are primarily used to back up a previous statement that
the hearer may not accept or may not find convincing” (Diessel 2006)
Diessel (2008)
109
Because occurs when agreement is at-issue (Ford 1993)
Instead of focusing on information flow, they focus on speaker interaction and
see it as occurring where there is actual/incipient disagreement
Thus, conversation analysts find becauses offered by either speaker right
before a dispreferred turn
In English, they are surrounded by pauses, hesitations, excuses, mitigations,
indirectness, partial agreement, polarity reversals
Ford and Mori (1994)
110
The real point of their paper is that there are two Japanese becauses, but
they function differently:
• datte: glossed as ‘no for the reason that’, is immediate and clear—strong
disagreement—it isn’t about getting information but about getting a
justification
• kara: more like English, shifts towards alignment; also used if a
reference is unclear, a term is unknown, or if the speaker is assuming
something of the recipient that they don’t actually know
If you want to give someone a datte response in English, you have to use
turn onset, stress, intensifiers, choice of evaluative language, directness of
disagreeing, and non-verbal expressions
Ford and Mori (1994), cont’d
111
John came back because he loved her.
One event causes another
John loved her, because he came back.
Illustrates the speaker’s reasoning, “epistemic”; English since, French puisque,
German denn
What are you doing tonight, because there’s a good movie on.
A “speech act”
Subjective relations are often derived from objective relations (see also
Traugott 1995)
Sweetser (1990)
Tongan
si’i and si’a
Different determiners express
sympathy to the DP they head
(Hendrick, 2005)
Navajo
=go
Emotional evaluation in
narrative (Mithun 2008)
Korean
Evidentials and psych
predicates
Non-evidential sentences are
more assertive/informational,
evidential sentences about the
speaker are more “expressive”
and “spontaneous” (Chung
2010)
East Caucasian
lgs
Case for emotion
experiencers ≠
perception
experiencers
Van den Berg (2005)
Thai thîi
Complementizer for verbs of emotion/evaluation (Singhapreecha, 2010)
For strangers on the phone, because is used
mostly for vices, holidays, money, travel, wars
117
1.4%
Top 3 categories in Nigeria
9.69%
17.68%
39.44%
Employment
U-report support
Health
122
Are becausers drawn to some topics more
than others?
O/E = posts observed in this topic / posts expected
Topic | O/E big becausers | O/E because-avoiders
JFK | 69 / 56 | 0 / 13
9/11 buildings | 408 / 366 | 44 / 86
media | 357 / 394 | 130 / 93
moderation | 442 / 489 | 154 / 114
aliens | 133 / 123 | 19 / 29
food/health | 231 / 214 | 33 / 51
More JFK | 38 / 41 | 13 / 10
internet | 103 / 112 | 35 / 26
vaccines | 263 / 247 | 43 / 59
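The observed/expected pairs are easier to compare as ratios. The sketch below copies the counts from the table and ranks topics by how over-represented each author group is; a ratio above 1 means more posts than expected. The dictionary and function names are illustrative.

```python
# (observed, expected) post counts per topic, copied from the table.
big_becausers = {
    "JFK": (69, 56), "9/11 buildings": (408, 366), "media": (357, 394),
    "moderation": (442, 489), "aliens": (133, 123), "food/health": (231, 214),
    "More JFK": (38, 41), "internet": (103, 112), "vaccines": (263, 247),
}
because_avoiders = {
    "JFK": (0, 13), "9/11 buildings": (44, 86), "media": (130, 93),
    "moderation": (154, 114), "aliens": (19, 29), "food/health": (33, 51),
    "More JFK": (13, 10), "internet": (35, 26), "vaccines": (43, 59),
}


def oe_ratios(table):
    """Observed-to-expected ratio for each topic."""
    return {topic: obs / exp for topic, (obs, exp) in table.items()}


def ranked_topics(table):
    """Topics sorted from most over-represented to most under-represented."""
    ratios = oe_ratios(table)
    return sorted(ratios, key=ratios.get, reverse=True)


becauser_ranking = ranked_topics(big_becausers)
avoider_ranking = ranked_topics(because_avoiders)
```

On these counts the two groups are close to mirror images: big becausers are most over-represented in JFK and 9/11 buildings, while because-avoiders are most over-represented in media and moderation and entirely absent from JFK.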
123
Basically the same list is top, except vaccines
pop up a few spots and aliens drop down a few
spots
• More JFK (don't kennedy)
• JFK (head autopsy paper wound jfk)
• 9/11 (9 11 bin laden attacks)
• vaccines (vaccines children disease autism
polio)
• 9/11 buildings (building collapse steel fire
wtc)
Let’s remove the authors who like because

indian evidence act.pdf.......very helpful for law student
 
EMPLOYMENT LAW AN OVERVIEW in Malawi.pptx
EMPLOYMENT LAW  AN OVERVIEW in Malawi.pptxEMPLOYMENT LAW  AN OVERVIEW in Malawi.pptx
EMPLOYMENT LAW AN OVERVIEW in Malawi.pptx
 
VIETNAM - DIRECT POWER PURCHASE AGREEMENTS (DPPA) - Latest development - What...
VIETNAM - DIRECT POWER PURCHASE AGREEMENTS (DPPA) - Latest development - What...VIETNAM - DIRECT POWER PURCHASE AGREEMENTS (DPPA) - Latest development - What...
VIETNAM - DIRECT POWER PURCHASE AGREEMENTS (DPPA) - Latest development - What...
 
Casa Tradicion v. Casa Azul Spirits (S.D. Tex. 2024)
Casa Tradicion v. Casa Azul Spirits (S.D. Tex. 2024)Casa Tradicion v. Casa Azul Spirits (S.D. Tex. 2024)
Casa Tradicion v. Casa Azul Spirits (S.D. Tex. 2024)
 
Solidarity and Taxation: the Ubuntu approach in South Africa
Solidarity and Taxation: the Ubuntu approach in South AfricaSolidarity and Taxation: the Ubuntu approach in South Africa
Solidarity and Taxation: the Ubuntu approach in South Africa
 
RIGHTS OF VICTIM EDITED PRESENTATION(SAIF JAVED).pptx
RIGHTS OF VICTIM EDITED PRESENTATION(SAIF JAVED).pptxRIGHTS OF VICTIM EDITED PRESENTATION(SAIF JAVED).pptx
RIGHTS OF VICTIM EDITED PRESENTATION(SAIF JAVED).pptx
 
PRECEDENT AS A SOURCE OF LAW (SAIF JAVED).pptx
PRECEDENT AS A SOURCE OF LAW (SAIF JAVED).pptxPRECEDENT AS A SOURCE OF LAW (SAIF JAVED).pptx
PRECEDENT AS A SOURCE OF LAW (SAIF JAVED).pptx
 
ALL EYES ON RAFAH BUT WHY Explain more.pdf
ALL EYES ON RAFAH BUT WHY Explain more.pdfALL EYES ON RAFAH BUT WHY Explain more.pdf
ALL EYES ON RAFAH BUT WHY Explain more.pdf
 
Abdul Hakim Shabazz Deposition Hearing in Federal Court
Abdul Hakim Shabazz Deposition Hearing in Federal CourtAbdul Hakim Shabazz Deposition Hearing in Federal Court
Abdul Hakim Shabazz Deposition Hearing in Federal Court
 
Types of Cybercrime and Its Impact on Society
Types of Cybercrime and Its Impact on SocietyTypes of Cybercrime and Its Impact on Society
Types of Cybercrime and Its Impact on Society
 
What Are the Strategies Offered by Cybercrime Law Firms?
What Are the Strategies Offered by Cybercrime Law Firms?What Are the Strategies Offered by Cybercrime Law Firms?
What Are the Strategies Offered by Cybercrime Law Firms?
 
DNA Testing in Civil and Criminal Matters.pptx
DNA Testing in Civil and Criminal Matters.pptxDNA Testing in Civil and Criminal Matters.pptx
DNA Testing in Civil and Criminal Matters.pptx
 
Charge and its essentials rules Under the CRPC, 1898
Charge and its essentials rules Under the CRPC, 1898Charge and its essentials rules Under the CRPC, 1898
Charge and its essentials rules Under the CRPC, 1898
 
Donald_J_Trump_katigoritirio_stormi_daniels.pdf
Donald_J_Trump_katigoritirio_stormi_daniels.pdfDonald_J_Trump_katigoritirio_stormi_daniels.pdf
Donald_J_Trump_katigoritirio_stormi_daniels.pdf
 

Conspiracy, complaints, and fraud: The language of reasons

  • 1. The language of reasons Tyler Schnoebelen Conspiracy, complaints, and fraud
  • 2. 2 1) Showing how computational linguistics solves business problems 2) Identifying markers of fraud using language data For company-internal fraud/compliance investigators For government/regulatory/consumer advocacy 3) Detecting and using rationalization and reason-giving The importance of emotion The case of because in Consumer complaints Conspiracy forum posts Hi! Welcome to the slides for this talk—also check out the Notes. Basically this talk is about:
  • 4. 4
  • 5. 5 The Association of Certified Fraud Examiners looked at 1,483 fraud cases reported in 2014 They estimate global fraud loss is at least 5% of revenue for companies Estimate of losses to fraud, worldwide $3.7 trillion
  • 6. 6
  • 7. 7
  • 8. 8 Financial statement fraud is much more expensive
  • 10. 10 For the big dollars, look to the top
  • 11. 11 What are the red flags for fraudsters?
  • 12. 12 Losses are rarely recovered
  • 14. 14 Prior work tends to be “word lists” or experiments L&Z used 29,663 transcribed quarterly earnings calls 16,577 CEO Q&A responses 14,462 CFO Q&A responses L&Z keep track of when quarterly financial statements were later restated (during first call they knew something was amiss) Depending on strictness of restatement, 14%, 7% or 5% of the calls had deception in them. Larcker & Zakolyukina (2010)
  • 15. 15 Larcker & Zakolyukina (2010) CEOs CFOs References to general knowledge (you know) more more Non-extreme positive emotion words fewer fewer References to shareholder value/value creation fewer fewer Self-references fewer 3rd person plural/impersonal pronouns more fewer Extreme negative emotion words fewer Extreme positive emotion words more Certainty words fewer more Hesitation words fewer more
  • 17. 17
  • 18. Text analytics 18 Linguistics: Scientific study of language Machine Learning: Automatically train computers to make human-like decisions ● Compliance monitoring ● Enterprise search ● E-Communications surveillance ● Technology assisted review ● Sentiment analysis ● Deception detection ● Text summarization Natural Language Processing: Enable machines to automatically derive meaning from natural language input
  • 19. Fraud and compliance in digital communications 19 Early case assessment Relevancy filtering Risk Scoring Key entities Strategic communications Spam, Newsletters Near de-duplication Fraud diamond Sentiment Personal communications Investigation stage Models100% Data volume 30% 10% < 1%
  • 20. Top-down vs. bottom-up text analytics 20 “Bribe” … “Tea money” “Facilitation payment” “Backhander” Top-down (Search) ● Rules-based ● False positives/negatives ● Brittle Bottom-up (Discovery) ● Statistical ● Highly accurate ● Adaptive
  • 21. 21
  • 22. Comparing rules vs. machine learning 22 High accuracy on complex task after only 1 day of work Project goal: Uncover key documents relevant to Energy Regulation out of 200,000 messages that matched raw keywords
  • 23. 23 Flexible Ontology 23 Develop rich ontology for investigative analytics and insights at scale Cline Cline 1 Client’s Questions Known Areas of Interest Pressure Rationalization Names Opportunity Capability Emotions Topic Modeling Themes ? 2 21 3
  • 24. 24 Data gets smarter and more accurate through adaptive system Adaptive System Structured Data Reports Action • Annotation suggestions • Document priority • Shortest path for coverage • Error detection Machine Learning Optimization Prediction Engine Human Review 4 5 6 6
  • 25. Idibon’s models drive more accurate, scalable investigations of fraud 25 Identify indicative language • Identify and extract indicators correlated with fraud • Gather data from disparate structured, unstructured, public, and private data sources Model fraud within the organization • Score and rank individual custodians by likelihood of fraud • Summarize indicators of fraud by department or scheme Scale across people and clients • Model fraud using documents from multiple custodians • Build replicable models for different client types Monitor and track risk • Model on-going risks in client interactions • Track known liability or non-compliance issues 1. 2. 3. 4.
  • 26. Detecting fraud requires a variety of models Strategic Communications: Automatically identify important communications based on the language used in emails with a BCC recipient Fraud Triangle and Fraud Diamond: Identify messages containing indicators of Motive, Opportunity, Rationalization and Capability to risk-rank actors and their communications Key Entities: Discover people, places, organizations, and other entities mentioned in communications to uncover hidden relationships Personal Messages: Flag messages that are intimate in nature and that may contain evidence of illicit behavior or collusion Sentiment Analysis: Categorize communications as positive, neutral, or negative Taboo Words and Obscenity: Identify emotionally charged language that may reflect behaviors and events of interest
  • 27. enron report merger (Corporate communications about mergers that you probably DON’T care about) 27 Find needles in haystacks: quickly hone in on relevant areas of the data legal f&j citizens “I also find the advance ethical waiver language repugnant, but could agree to it if the other modifications mentioned could be made.” employees enron bankruptcy “Michelle, here is a suggested revision to Section 3.4 B … If a terminated employee who is entitled to receive a severance benefit … the severance benefit payable under the Plan shall be reduced and offset” time good back (Lots of irrelevant stuff about home, weekends, Thanksgiving, etc.)
  • 28. Sentiment analysis and automatic topic discovery reveal significant communications 28 Negative: Antitrust issues, M&A, Insider Trading Positive: Product Releases, Employment
  • 29. 29 The Fraud Triangle (and briefly, the Fraud Diamond)
  • 33. Image Placeholder 33 Capability (Makes it a Fraud Diamond) Personal traits and abilities • Effective lying • Immunity to stress • Intelligence • Confidence
  • 34. 34
  • 35. 35
  • 36. 36 But let’s return to the peaky point Rationalization Opportunity FRAUD SCORE Pressure Rationalization: Committing fraud is worth the risk
  • 37. 37
  • 38. 38 When and how do people give reasons?
  • 40. 40
  • 41. 41 Conditions: • “Excuse me, I have (5 or 20) pages. May I use the Xerox machine?” (no-because) • “Excuse me, I have (5 or 20) pages. May I use the Xerox machine, because I’m in a rush?” (because) • “Excuse me, I have (5 or 20) pages. May I use the Xerox machine, because I have to make copies?” (because-empty) The idea here is that the because-empty clause offers no information. For 5 pages: because = because-empty >> no-because Though when stakes are higher (20 pages): because > because-empty > no-because Langer, Blank and Chanowitz (1978)
  • 42. 42 • “Given” information comes before “new”—so usually people say “such and such happened because of X” rather than “Because of X, such and such happened” • Given: what’s been said already, inferable, familiar, expected • Easier to process new information when it’s framed • See Chafe (1984) and lots of others • “causal clauses are primarily used to back up a previous statement that the hearer may not accept or may not find convincing” (Diessel 2006) • Conversation analysts find becauses offered by either speaker right before a disagreement • In English speech, they are surrounded by pauses, hesitations, excuses, mitigations, indirectness, partial agreement, polarity reversals (see Ford & Mori 1994) Quick lit review
  • 43. 43 Two main coherence relations: cause-consequence and argument-claim Causality and Subjectivity are key Consider: The sun was shining CONNECTIVE the temperature rose quickly (Causality) The neighbors’ lights are out CONNECTIVE they are not at home (Subjectivity) Some languages use different connectives (Causality / Subjectivity): Dutch doordat / want; French parce que / puisque; German weil / denn Sanders (2003)
  • 44. 44 Children learning English learn things in this order (Bloom et al 1980): Additive < Temporal < Causal < Adversative and < and then < because < so < but That is, causal connectives are seen as more complex (see also Piaget 1924/1969, Katz & Brent 1968, Clark 2003, Evers-Vermeul 2005) BUT causally connected information is remembered better And causal relations are read faster Reading time decreases when causality increases More Sanders (2003)
  • 46. 46
  • 48. 48 Because X is mostly playful but has strong affective underpinnings
  • 50. In soap operas, guess what the word most associated with because is?
  • 51. In the British Parliament, one of the words most associated with because…
  • 52. Affect and emotion are bound up in discussions of reasoning and cognition • Damasio (1994) • Kahneman (2003) • Matthews and Wells (1994) • Zajonc (1980) • Loewenstein et al. (2001) • LeDoux (1998) Reasoning needs emotions
  • 53. “A sophisticated well-being monitor and guidance system that serves both attention-regulatory and motivational functions” (Smith and Kirby 2000: 90). What are emotions?
  • 54. 54 The need to convey and assess feelings, moods, dispositions, and attitudes is as critical as describing events. We don’t just need to know predications, we need to know affective orientation to the predication. (See the appendix for lots of ways that other languages encode emotional information) Emotions are expressed in language
  • 55. 55 Consumer complaints about banks and credit agencies
  • 56. 56 An act or practice is unfair when: (1) It causes or is likely to cause substantial injury to consumers; (2) The injury is not reasonably avoidable by consumers; (3) The injury is not outweighed by countervailing benefits to consumers or to competition. An act or practice is deceptive when (1) The act or practice misleads or is likely to mislead the consumer; (2) The consumer’s interpretation is reasonable under the circumstances; (3) The misleading act or practice is material. UDAAP (Unfair, Deceptive, or Abusive Acts or Practices)
  • 58. 58
  • 59. 59 Consumers detect fraud, too Data source: Consumer Financial Protection Bureau 21,206 consumer narratives About banks and credit agencies 25% have the word “because” in it (Limiting this study to because; also worth looking at are becuase, cuz, since, therefore etc.) Companies/governments want to detect fraud
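The share of narratives containing "because" can be measured with a simple word-boundary match. The sketch below is a minimal illustration, not the pipeline used for the talk: the function name `because_share` and the sample narratives are made up for the example, and only the exact spelling "because" is counted, matching the limitation stated on the slide.

```python
import re

# Word-boundary match, case-insensitive, so "BECAUSE" counts but "becausers" does not.
BECAUSE = re.compile(r"\bbecause\b", re.IGNORECASE)

def because_share(narratives):
    """Return the fraction of narratives that contain at least one 'because'."""
    if not narratives:
        return 0.0
    hits = sum(1 for text in narratives if BECAUSE.search(text))
    return hits / len(narratives)

# Toy narratives invented for illustration (not CFPB data).
sample = [
    "I was charged a fee because the bank lost my payment.",
    "They never responded to my letters.",
    "BECAUSE of their error my credit score dropped.",
    "The account was closed without notice.",
]
print(because_share(sample))
```

Extending the pattern to variants such as "becuase", "cuz", or "since" would be a one-line change to the regular expression, at the cost of more false positives for words like "since".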
  • 60. 60 Complaints with-because are much longer 244 106 Because-narratives No-because narratives Median word count
  • 61. 61 Becauses per complaint 11% 21% 68% Three or more Two becauses Single "because"
  • 62. 62 Becauses happen much more in: • Bank account or service • Mortgage And less often (proportionally) in: • Credit reporting • Debt collection The categories most/least because-y
  • 63. 63 Result: We strongly suggest someone look into Citimortgage’s business practices, Cause: because at best they are completely incompetent, and at worst they are committing acts of fraud Both Result-Cause and Cause-Result can happen But as in most studies, Result-Cause accounts for the vast majority (here ~95%) Structure of becauses
  • 64. 64 “Verifiable if you just had a transcript” Objective-Result / Objective-Cause They said I owed $10,000 because I didn’t pay my bill for 3 months “Not-verifiable even if you had a transcript” Subjective-Result / Subjective-Cause I am near tears because I don’t know what to do Krippendorff’s alpha (inter-annotator agreement): 0.85 That’s very good agreement Highest for Objective-Cause Lowest for Objective-Result (exactly what is the scope) All easily distinguishable—collapsing categories does not result in higher alpha value 3 annotators, 4 annotation types
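For readers unfamiliar with the agreement statistic on this slide, here is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. The function name and the toy labels are assumptions for illustration; the talk does not say which implementation was used.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels assigned to one item.

    Items with fewer than two labels are ignored, as they carry no pairing information.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different annotators, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category was ever used; alpha is undefined
    return 1 - observed / expected

# Toy example: two annotators, four items, labels "O" (objective) and "S" (subjective).
toy = [["O", "O"], ["O", "S"], ["S", "S"], ["S", "S"]]
print(krippendorff_alpha_nominal(toy))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why 0.85 across three annotators and four label types is described as very good.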
  • 65. 65 The Idibon team! Thanks to Jason and Nick
  • 66. 66 40% are Subjective-Result + Subjective-Cause 33% are Objective-Result + Objective-Cause 17% are Objective-Result + Subjective-Cause 10% are Subjective-Result + Objective-Cause A preference for matching types
  • 67. 67 If you talk about your home, you aren’t objective Subjective-Causes vs. Objective-Causes
  • 68. 68 There’s really no difference between Subjective Results and Objective Results There’s also no difference between Subjective Results and Objective Causes Each of these tends to have a median of about 66 characters But Subjective Causes are quite a bit different—a median of 84 characters (significant, p = 0.009303 by Wilcoxon test) Affective information gets length
  • 69. 69 because I found they have dealt fraudulently with many, many consumers because the matter has not been handled in accordance with the law BECAUSE NOW SPRINGLEAF FINANCIAL WOULD NOT WORK WITH THE NEW TRUSTEE OF THE TRUST because Nationstar has dragged its feet in the face of its SIGNIFICANT error Some examples of Subjective-Causes
  • 70. 70 • Breakdown in process (repeated attempts, again, for more than, once again, over and over, again and again) • Unresponsiveness (nothing happened, did not respond) • Misrepresentation (deceived, lied, misled, scam, told me that) • Omission (did not tell me, failed to reveal, failed to bring to my attention) • Emotion (my fear is that, i am angry that, frustrating) • Subjective terms (patiently, unfair, not fair, unreasonable, struggling, sickening, absurd, allowed to do this, tedious) • Dialogue acts (request, deny, thank, complain, refuse, accept) • Mortgage processes (refinance, modification, refer, appeal, assistance) Concepts in the cause and result clauses
  • 71. 71
  • 72. 72 Fraud: How do people treat their companies Complaints: How do companies treat their consumers? Now: How do people treat each other?
  • 73.
  • 74.
  • 77.
  • 78. 78 Basically all of Reddit, Jan - May 2015 266m posts 96k forums (“subreddits”) Most popular: • /r/AskReddit (21m posts) • /r/leagueoflegends (5m) • /r/funny (4m) • /r/pics (3m) • /r/nfl (3m) • /r/nba (3m) Data details
  • 79. 79 Median % of posts with because across subreddits with 50k+ posts (758 subreddits) Top quartile Bottom quartile Distribution of “because” across subreddits 5.44% 7.25% 3.95%
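The median and quartile figures on this slide summarize one because-rate per subreddit, after dropping small forums. A minimal sketch of that summary follows; the function name, the toy counts, and the lowered threshold are assumptions for illustration, and the quartile method (Python's default "exclusive" method) may differ from the one used in the talk.

```python
import statistics

def because_quartiles(forums, min_posts):
    """forums: dict mapping forum name -> (posts_with_because, total_posts).

    Returns (bottom quartile, median, top quartile) of the per-forum
    percentage of posts containing "because", for forums with at least
    min_posts posts.
    """
    rates = [
        100 * hits / total
        for hits, total in forums.values()
        if total >= min_posts
    ]
    q1, q2, q3 = statistics.quantiles(rates, n=4)
    return q1, q2, q3

# Toy counts invented for illustration (not the Reddit data from the talk).
toy_forums = {
    "a": (1, 100), "b": (2, 100), "c": (4, 100), "d": (5, 100),
    "e": (7, 100), "f": (9, 100), "g": (15, 100),
    "tiny": (5, 10),  # excluded by the size threshold below
}
print(because_quartiles(toy_forums, min_posts=50))
```

Filtering by forum size first matters because a forum with only a handful of posts can show an extreme rate by chance.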
  • 80. 80 /r/changemyview (21%) /r/DebateAChristian (19%) /r/PurplePillDebate (18%) /r/DebateReligion (17%) /r/AgainstGamerGate (17%) /r/truegaming (17%) /r/DebateAnAtheist (17%) /r/philosophy (16%) /r/raisedbynarcissists (16%) /r/PoliticalDiscussion (16%) /r/listentothis (15%) /r/relationship_advice (15%) /r/relationships (15%) /r/Anxiety (14%) /r/ADHD (14%) Examples of most-because-y subreddits
  • 81. 81 /r/podemos (0%) /r/newsokur (0%) /r/sweden (0%) /r/gonewild (1%) /r/randomsuperpowers (1%) /r/ACTrade (1%) /r/GlobalOffensiveTrade (1%) /r/millionairemakers (1%) /r/SVExchange (1%) /r/PercyJacksonRP (1%) /r/YamkuHighSchool (1%) /r/XMenRP (2%) /r/hardwareswap (2%) /r/rwbyRP (2%) /r/thebutton (2%) Examples of the least because-y
  • 82. 82
  • 83. 83
  • 84. 84
  • 85. 85 This presentation is helped out by some insights by Jana Thompson one of our NLP Engineers and Charissa Plattner, one of our summer interns Co-conspirators!
  • 86. 86 385k posts 30k have “because” (7.81%) Posts with “because” tend to score higher for “controversiality” They are also significantly longer (p < 2.2e-16 by Wilcoxon rank sum test) /r/conspiracy
  • 87. 87 Counting "deleted" and "AutoModerator" as real users, then there are 32,024 different users who post in conspiracy from Jan-May 2015. 1,064 of them have 50 or more posts. The median % of posts with because is 7.19% • Top quartile: 11.43% • Bottom quartile: 4.02% A view of authors
  • 88. 88 Those who pay decent rent are doing so because they've been living in a rent controlled area for a LONG time. • This is preceded by a paragraph all about rent prices • All Caps Evaluative So, because it's minor at first that would possibly embolden them? You can't be serious... • So vs. oh, the importance of questions and rhetoric • Preposed because (given/new) Slaves? Are we literally whipped bloody when we don't do as master says (or just because he wants to). • Adversative: ends with, “Do you have any clue what slavery really is?” Some examples from big-because users
  • 89. 89 There are 384,839 posts in this time frame. They roll up to 222,818 "parent_id" threads. For threads that have 50+ posts (there are only 144 of them), the median % of posts with "because" is 5.61%. • Top quartile: 8.14% • Bottom quartile: 3.33% For threads that have 15-49 posts (1,181 of them), the median % of posts with "because" is 5.88%. • Top quartile: 10.53% • Bottom quartile: 0% A view of threads
  • 90. 90
  • 91. 91
  • 92. 92 • JFK (head autopsy paper wound jfk) • 9/11 buildings (building collapse steel fire wtc) • aliens (humans earth life evolution aliens) • 9/11 (9 11 bin laden attacks) • space (earth moon nasa gravity apollo) They avoid… • media (conspiracy media news government propaganda) • US politics (law vote obama federal president congress) • More JFK (don't kennedy) • moderation (reddit post comments mods banned) • family/harm (children school kids mother abuse) Where do authors who like because go?
  • 93. 93 The because-irrific authors use a median 901 characters per post The least-because-y use 615 characters per post Within because posts…
  • 94. 94 Are because users just wordy? Or is it that because users hang out in threads where there’s just a lot more because? Answer: Basically some topics are just wordier than some others (see next two slides about length) What is driving length?
  • 95. 95 Length of posts by topic/author disposition (longest)
      Rank | Everyone | Prolific becausers | Because avoiders
      1 | JFK (2089 char) | 9/11 (2747 char) | More JFK (2157 char)
      2 | 9/11 (1834 char) | JFK (2464 char) | aliens (1321 char)
      3 | More JFK (1784 char) | More JFK (2130 char) | reality (1270 char)
      4 | 9/11 buildings (1489 char) | 9/11 buildings (1962 char) | 9/11 (1113 char)
      5 | aliens (1313 char) | aliens (1800 char) | religion (917 char)
  • 96. 96 Length of posts by topic/author disposition (shortest)
      Rank | Everyone | Prolific becausers | Because avoiders
      25 | criticism (534 char) | moderation (695 char) | climate change (392 char)
      24 | moderation (564 char) | criticism (744 char) | moderation (439 char)
      23 | media (653 char) | meta-conspiracy (816 char) | criticism (440 char)
      22 | meta-conspiracy (666 char) | media (900 char) | race (459 char)
      21 | race (721 char) | internet (913 char) | food/health (489 char)
  • 97. 97 7.8% of posts in /r/conspiracy have “because” 16,069 of the posts in /r/conspiracy have language around fraud (21.7%) So we’d expect about 1,255 posts to have both “because” and fraud/etc. Instead we find 3,491. What about claims about fraud, illegality, bamboozlement, etc?
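The "expected about 1,255" figure follows from assuming that "because" and fraud language occur independently. The sketch below redoes that arithmetic with the numbers from the slides; the variable names are illustrative only.

```python
# Figures taken from the slides on /r/conspiracy.
because_rate = 0.0781   # share of all posts containing "because"
fraud_posts = 16_069    # posts with language around fraud
observed_both = 3_491   # posts containing both

# If the two were independent, fraud posts would contain "because"
# at the same rate as posts in general.
expected_both = because_rate * fraud_posts

# Ratio of observed to expected co-occurrence.
lift = observed_both / expected_both

# Share of fraud-language posts that also contain "because".
share_of_fraud_posts = observed_both / fraud_posts

print(round(expected_both), round(lift, 2), round(100 * share_of_fraud_posts, 1))
```

On these numbers, posts with fraud language contain "because" between two and three times as often as independence would predict. The ratio 3,491 / 16,069 comes to about 21.7%, which appears to be what the parenthetical on the slide refers to.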
  • 99. 99 1) Showing how computational linguistics solves business problems 2) Identifying markers of fraud using language data For company-internal fraud/compliance investigators For government/regulatory/consumer advocacy 3) Detecting and using rationalization and reason-giving The importance of emotion The case of because Your thoughts on next steps? Reviewing where we’ve been
  • 100. 100 There are links between rationalization and because usage that can help with applications of the fraud diamond/triangle The different ways people use/don’t use because can help us understand the psychological state of fraudsters and the information of people who may be encountering it On because
  • 101. 101
  • 102. Fraud and compliance in digital communications 102 Early case assessment Relevancy filtering Risk Scoring Key entities Strategic communications Spam, Newsletters Near de-duplication Fraud diamond Sentiment Personal communications Investigation stage Models100% Data volume 30% 10% < 1%
  • 103. 103
  • 104. Processing millions of SMS in 12 African languages Intent of sender (i.e. report a problem, ask a question or make a suggestion) Categorization (i.e. orphans and vulnerable children, violence against children, health, nutrition) Language detection (i.e. English, Acholi, Karamojong, Luganda, Nkole, Swahili, Lango) Location (i.e. village names)
  • 105. 105
  • 106. Understand language data like never before 106 Thank you @idibon.com twitter.com/idibon idibon.com
  • 107. 107 • Given-then-new information (result-then-cause in his small corpus, too) • Given as what’s been said • Inferable, familiar, expected • New as unfamiliar, unexpected, unpredictable • The rare times that because is initial, it acts as a guidepost for information flow • Like however, anyway, for example, on the other hand • “A guidepost par excellence is ‘meanwhile, back at the ranch’.” • People as orienting the information for upcoming clauses • A more general strategy of giving a frame • Third case (That in itself was scary, cause I never fainted before) is sequential and meant to add to the first assertion • An “afterthought” Chafe (1984)
  • 108. 108 Ordering is about functional and cognitive pressures (draws on Hawkins 1994, 2004): • Syntactic parsing • Discourse pragmatics • Semantics Result-then-cause order violates iconicity of sequence, yet it is the most attested • “causal clauses are primarily used to back up a previous statement that the hearer may not accept or may not find convincing” (Diessel 2006) Diessel (2008)
  • 109. 109 Because occurs when agreement is at-issue (Ford 1993) Instead of focusing on information flow, they focus on speaker interaction and see it as occurring where there is actual/incipient disagreement Thus, conversation analysts find becauses offered by either speaker right before a dispreferred turn In English, they are surrounded by pauses, hesitations, excuses, mitigations, indirectness, partial agreement, polarity reversals Ford and Mori (1994)
  • 110. 110 The real point of their paper is that there are two Japanese becauses, but they function differently: • datte: glossed as ‘no for the reason that’, is immediate and clear—strong disagreement—it isn’t about getting information but about getting a justification • kara: more like English, shifts towards alignment; also used if a reference is unclear, a term is unknown, or if the speaker is assuming something of the recipient that they don’t actually know If you want to give someone a datte response in English, you have to use turn onset, stress, intensifiers, choice of evaluative language, directness of disagreeing, and non-verbal expressions Ford and Mori (1994), cont’d
  • 111. 111 John came back because he loved her. One event causes another John loved her, because he came back. Illustrates the speaker’s reasoning, “epistemic”; English since, French puisque, German denn What are you doing tonight, because there’s a good movie on. A “speech act” Subjective relations are often derived from objective relations (see also Traugott 1995) Sweetser (1990)
  • 112. Tongan si’i and si’a Different determiners express sympathy to the DP they head (Hendrick, 2005)
  • 114. Korean Evidentials and psych predicates Non-evidential sentences are more assertive/informational, evidential sentences about the speaker are more “expressive” and “spontaneous” (Chung 2010)
  • 115. East Caucasian lgs Case for emotion experiencers ≠ perception experiencers Van den Berg (2005)
  • 116. Thai thîi Complementizer for verbs of emotion/evaluation (Singhapreecha, 2010)
  • 117. For strangers on the phone, because is used mostly for vices, holidays, money, travel, wars 117
  • 118.
  • 119. 1.4%
  • 120.
  • 121. Top 3 categories in Nigeria 9.69% 17.68% 39.44% Employment U-report support Health
  • 122. 122 Are becausers drawn to different topics more than others?
      Each cell is observed / expected (O/E) posts, e.g. for JFK: 69 posts by big becausers in this topic / 56 posts expected; 0 posts by because-avoiders in this topic / 13 posts expected
      Topic | O/E big becausers | O/E because-avoiders
      JFK | 69 / 56 | 0 / 13
      9/11 buildings | 408 / 366 | 44 / 86
      media | 357 / 394 | 130 / 93
      moderation | 442 / 489 | 154 / 114
      aliens | 133 / 123 | 19 / 29
      food/health | 231 / 214 | 33 / 51
      More JFK | 38 / 41 | 13 / 10
      internet | 103 / 112 | 35 / 26
      vaccines | 263 / 247 | 43 / 59
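The observed/expected pairs in this table can be turned into ratios to make the pattern easier to scan: a ratio above 1 means a group posts in a topic more than expected, below 1 means less. The sketch below uses the counts from the slide; the data structure and function name are illustrative assumptions, and how the expected counts were derived is not stated in the slides.

```python
# topic -> (observed, expected) for big becausers, then (observed, expected) for avoiders.
# Counts copied from the slide.
COUNTS = {
    "JFK":            ((69, 56),   (0, 13)),
    "9/11 buildings": ((408, 366), (44, 86)),
    "media":          ((357, 394), (130, 93)),
    "moderation":     ((442, 489), (154, 114)),
    "aliens":         ((133, 123), (19, 29)),
    "food/health":    ((231, 214), (33, 51)),
    "More JFK":       ((38, 41),   (13, 10)),
    "internet":       ((103, 112), (35, 26)),
    "vaccines":       ((263, 247), (43, 59)),
}

def oe_ratios(counts):
    """Return topic -> (O/E for big becausers, O/E for because-avoiders)."""
    return {
        topic: (big_obs / big_exp, avoid_obs / avoid_exp)
        for topic, ((big_obs, big_exp), (avoid_obs, avoid_exp)) in counts.items()
    }

ratios = oe_ratios(COUNTS)
# List topics from most to least over-represented among big becausers.
for topic, (big, avoid) in sorted(ratios.items(), key=lambda kv: -kv[1][0]):
    print(f"{topic:15s} big becausers {big:.2f}   avoiders {avoid:.2f}")
```

In every row of the table the two groups sit on opposite sides of 1: topics where big becausers are over-represented (JFK, 9/11 buildings, aliens) are ones avoiders under-post in, and the reverse holds for media, moderation, and internet.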
  • 123. 123 Basically the same list is top, except vaccines pop up a few spots and aliens drop down a few spots • More JFK (don't kennedy) • JFK (head autopsy paper wound jfk) • 9/11 (9 11 bin laden attacks) • vaccines (vaccines children disease autism polio) • 9/11 buildings (building collapse steel fire wtc) Let’s remove the authors who like because