TOWARDS A DICTIONARY OF THE FUTURE
COUNTS, COMPARISONS,
COLLOCATIONS, CONTESTATIONS
DICTIONARY OF THE FUTURE?
SOME OTHER PLACES TO CHECK OUT
• The Google Ngram Viewer helps you understand
trends across a bazillion books that Google has
digitized. It’s an amazing resource:
• So are the Corpus of Historical American English:
http://corpus.byu.edu/coha/ (COHA)
• And the Corpus of Contemporary American English:
http://corpus.byu.edu/coca/ (COCA)
TO COHA!
TO COCA!
TAKING CARE WITH COUNTS
• The counts in the last two slides are too small to be
anything more than interesting
• The next slide shows us tracking the collocates of
future
• Collocates are the words that appear near a given
word—one of the chief collocates of salt is pepper,
for example
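A minimal sketch of what "tracking collocates" can look like in code, assuming a simple window-based definition (words within a few tokens of the target). The function name `collocates`, the window size, and the sample sentences are all made up for illustration; they are not from the corpora above.

```python
from collections import Counter
import re

def collocates(text, node, window=4):
    """Count words that appear within `window` tokens of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Invented sample text, only to exercise the function.
sample = ("Pass the salt and pepper please. "
          "She added salt and pepper to the soup. "
          "The future of salt mining is uncertain.")
salt_collocates = collocates(sample, "salt", window=2)
print(salt_collocates.most_common(3))
```

Note that this sketch ignores sentence boundaries and ranks by raw counts, so function words like "and" score as high as "pepper". Real collocation work usually swaps raw counts for an association measure such as pointwise mutual information.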
COUNTS COUNT
DISCUSSIONS, DEMOCRACIES AND
DICTIONARIES
What’s going on in Urban Dictionary?
• Identity
• Play
• Politics
KEYWORDS
• What are the words that are most contested?
• How do they change?
• Who controls the future?
• Liberty vs. Freedom
JACK GRIEVE FINDING WOTY’S
• See also http://idibon.com/quantifying-word-year/
• P.S. In my ideal Dictionary of the Future, we understand the geography of how a word is used
MEANING IS IN THE USE
• “For a large class of
cases of the
employment of the
word ‘meaning’—
though not for all—
this word can be
explained in this way:
the meaning of a
word is its use in the
language” —
Wittgenstein,
Philosophical
Investigations
MEANING IN THE USE
• Tumblr moms use over 4 times as many [emoji] and [emoji] as Twitter peeps
• What are the collocates?
• Blue: his he him
• Purple: she’s she
• No pink heart option!
• See also http://www.washingtonpost.com/sf/opinions/2015/02/12/why-moms-love-emoji/ and
http://idibon.com/emomji-emoji-new-moms-use/
CO-OCCURRENCES MATTER (MOVIE
REVIEW RATINGS AND WORDS)
• The idea here is that if you’re writing a review and use the word wow, you’re being very positive
or very negative. You don’t say Wow, I have a balanced and neutral opinion on this very often.
• If you’re using however, however, you’re likely to be in the middle of your movie review rating or
travel summary—not at the very positive/negative extremes.
• See also http://web.stanford.edu/~cgpotts/manuscripts/potts-schwarz-exclamatives08.pdf and
http://web.stanford.edu/~cgpotts/papers/constant-davis-potts-schwarz-expressives.pdf
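One way to look for this pattern yourself is to compute, for each star rating, what share of all review tokens is the word you care about. The sketch below assumes reviews arrive as `(rating, text)` pairs; `rating_profile` and the toy reviews are invented for illustration and are not the data behind the papers linked above.

```python
from collections import Counter
import re

def rating_profile(reviews, word):
    """Return {rating: share of tokens at that rating that equal `word`}."""
    hits = Counter()
    totals = Counter()
    for rating, text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[rating] += len(tokens)
        hits[rating] += tokens.count(word)
    return {r: hits[r] / totals[r] for r in sorted(totals) if totals[r]}

# Invented toy reviews: (star rating, text).
toy_reviews = [
    (1, "wow this was awful"),
    (1, "wow just wow"),
    (3, "fine however the ending dragged"),
    (3, "decent acting however the plot was thin"),
    (5, "wow what a film"),
    (5, "loved it wow"),
]
wow = rating_profile(toy_reviews, "wow")
however = rating_profile(toy_reviews, "however")
for rating in wow:
    print(rating, round(wow[rating], 3), round(however[rating], 3))
```

Dividing by the total tokens at each rating matters: extreme ratings often have more (or longer) reviews, so raw counts alone would exaggerate the U-shape.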
FOUR CASE STUDIES
• Wholesomeness: http://idibon.com/wholesome-branding-campaign-effectiveness/
• Entrepreneur: http://idibon.com/entrepreneurs-french-spanish-english/
• Because X: http://idibon.com/innovating-innovation/
• #BlackLivesMatter: http://idibon.com/blacklivesmatter-events-change-conversations/
WHOLESOMENESS
HTTP://IDIBON.COM/WHOLESOME-BRANDING-CAMPAIGN-EFFECTIVENESS/
BRANDS LOVE WORDS
DEEP HISTORY
• The first uses of wholesome tended to be about
‘virtuous teachings’.
• In Wycliffe’s Bible way back in 1382:
The..holsum wordis of oure Lord Jhesu Crist. (1 Timothy 6:3)
(Modern versions treat wordis as ‘words’, ‘teachings’, or
‘instructions’.)
“WHOLESOME” [NOUN] OVER TIME
HOW ABOUT IN SOCIAL MEDIA?
• You have to deal with spam (11% of data in this
case; another 36% of data is “Wholesome Radio”,
which is probably irrelevant)
• In 2014 tweets:
• Food: 23% (but mostly not about Honey Maid)
• Humans: 23% (and how they can/should live; church-
related mentions are prominent)
• Entertainment: 13% (movies, TV)
• Now let’s compare this to 2011 tweet uses:
• Humans: 32%
• Entertainment: 12%
• Food: 9%
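The 2011 vs. 2014 comparison is easier to read as percentage-point changes. The sketch below uses only the three categories and figures listed on this slide; the slide does not say what base the percentages are computed over, so treat the differences as rough.

```python
# Category shares (in percent) as listed on the slide.
share_2011 = {"Humans": 32, "Entertainment": 12, "Food": 9}
share_2014 = {"Food": 23, "Humans": 23, "Entertainment": 13}

# Percentage-point change from 2011 to 2014 for each category.
change = {c: share_2014[c] - share_2011[c] for c in share_2011}

for category, delta in sorted(change.items(), key=lambda kv: -kv[1]):
    print(f"{category:<14}{share_2011[category]:>3}% -> "
          f"{share_2014[category]:>3}%  ({delta:+d} points)")
```

Food shows the biggest move, which is the shift a food brand would care about.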
WORDS ARE CONTESTED
MORE ON CONTESTED WORDS
• In the next slide, you’ll see an image from Monroe
et al. (2008)
• This work starts from something basic that we know:
Republicans and Democrats speak about the same
issue differently.
• In the next slide, they are showing methods that
can pull out how the parties speak about
abortion when they take the floor.
• The words at the top are the Democratic party
words; the ones at the bottom are the Republican
party words.
• http://languagelog.ldc.upenn.edu/myl/Monroe.pdf
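For a feel of how this kind of comparison can be computed, here is a sketch of a log-odds ratio with a Dirichlet prior, z-scored by its approximate variance. That statistic is one of the measures discussed in Monroe et al.; whether it is the exact one behind the figure on the next slide is an assumption on my part. The function name `log_odds_z`, the `prior_scale` value, and the tea/coffee counts are invented placeholders, not congressional speech data.

```python
import math
from collections import Counter

def log_odds_z(counts_a, counts_b, prior_scale=10.0):
    """z-scored log-odds ratio per word, with a Dirichlet prior
    whose shape comes from the pooled counts of both groups."""
    pooled = counts_a + counts_b
    total_pooled = sum(pooled.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    alpha_0 = prior_scale  # total prior mass (an arbitrary choice here)
    scores = {}
    for word, pooled_count in pooled.items():
        alpha_w = alpha_0 * pooled_count / total_pooled
        y_a, y_b = counts_a[word], counts_b[word]
        lo_a = math.log((y_a + alpha_w) / (n_a + alpha_0 - y_a - alpha_w))
        lo_b = math.log((y_b + alpha_w) / (n_b + alpha_0 - y_b - alpha_w))
        variance = 1.0 / (y_a + alpha_w) + 1.0 / (y_b + alpha_w)
        scores[word] = (lo_a - lo_b) / math.sqrt(variance)
    return scores

# Invented placeholder counts for two groups talking about the same topic.
group_a = Counter({"the": 50, "tea": 12, "kettle": 8, "coffee": 1})
group_b = Counter({"the": 50, "coffee": 12, "espresso": 8, "tea": 1})
scores = log_odds_z(group_a, group_b)
for word, z in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:<10}{z:+.2f}")
```

Positive scores lean toward group A and negative scores toward group B. Words both groups use equally (here "the") land at zero, which is the point: the prior and the variance term keep very common and very rare words from dominating the list.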
ENTREPRENEUR
HTTP://IDIBON.COM/ENTREPRENEURS-FRENCH-SPANISH-ENGLISH/
ENTREPRENEUR IN ENGLISH, FRENCH,
SPANISH
• Tycoon, mogul, industrialist
• A flavor of ‘ill-gotten gains’
• Entrepreneur doesn’t seem to have this, at least in English right now
• Collocates have to do with:
• Advice
• Success
• Investors
• Marketing
• Social (media/services/topics/techniques)
• Failure (especially fear-of)
• Lots of named entities (SXSW, Dubai, #KSA, Twitter, Google, LinkedIn,
Etsy)
• The people using entrepreneur identify themselves as
• Authors, speakers, writers, bloggers, strategists, (life) coaches,
consultants, moms, wives, husbands, fathers, food-lovers, music-lovers
KEY: GET COMPARISON SETS
Group/Context A
Group/Context B
INTERCONNECTED AXES OF
DIFFERENCE
• Genre (State of the Union addresses vs. Reddit comments)
• Time (1940s vs. the last ten years)
• Geography (hella vs. wicked)
• Traditional demographics (age, gender, education)
• Personal identity/style (nerd, goth, bro, mom)
BECAUSE X
HTTP://IDIBON.COM/INNOVATING-INNOVATION/
INNOVATIONS AND THEIR
COMMUNITIES
• Because X’ers disproportionately like:
• YouTube
• Tumblr
• One Direction (especially Harry)
• Justin Bieber
• Ariana Grande
• “bands”
• pizza
• sex
• cats
• books
• They are decidedly less likely to talk about:
• software
• basketball
• NASCAR
• business
• words associated with African-American Vernacular English
THE X IN BECAUSE X
Part of speech               Word counts ≥ 50
Noun (people, spoilers)      32.02%
Compressed clause (ilysm)    21.78%
Adjective (ugly, tired)      16.04%
Interjection (sweg, omg)     14.71%
Agreement (yeah, no)         12.97%
Pronoun (you, me)            2.45%
PART OF SPEECH TAGGERS ARE GOOD
• There’s even a pretty good POS tagger for Twitter
INNOVATIONS CLUMP
#BLACKLIVESMATTER
HTTP://IDIBON.COM/BLACKLIVESMATTER-EVENTS-CHANGE-CONVERSATIONS/
TOPIC MODELING
• In the previous sections, I’ve been noting what you can do when you have two or more comparison sets
• How is wholesome used in time x vs. time y vs. time z?
• What are the differences between English speakers talking about entrepreneurship vs. French speakers and Spanish speakers?
• How are people who use the innovative Because X construction different from people who don’t use it?
• In this section, we talk about topic modeling, which is a way to automatically identify clusters within a data set, even if you don’t have a comparison set.
• We’ll use this to explore conversations around #blacklivesmatter, but we’ll also see how these conversations shift before/after a particular moment in time
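To make "automatically identify clusters" a bit more concrete, here is a deliberately tiny topic model: a collapsed Gibbs sampler for LDA written with only the standard library. LDA is one common topic-modeling technique; the slides do not say which method or tool was used for the analyses here, so this is an assumption for illustration. The function `lda_gibbs`, its hyperparameters, and the four toy documents are all invented.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns word counts per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per topic, per doc
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # tokens per topic overall
    assignments = []
    for d, doc in enumerate(docs):                      # random starting topics
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]                   # take the token out
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [                             # P(topic | everything else)
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z                   # put it back under new topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word

# Invented toy documents, already tokenized.
docs = [
    "cat dog pet cat dog".split(),
    "dog pet cat pet".split(),
    "stock market trade stock".split(),
    "market trade stock market".split(),
]
topics = lda_gibbs(docs)
for k, counts in enumerate(topics):
    # Unary plus drops words whose count fell to zero.
    print(k, [w for w, _ in (+counts).most_common(3)])
```

Nobody told the sampler which documents belong together; it only sees which words co-occur. To study change over time, one option is to fit topics once and then compare how much of each topic appears before and after the moment of interest.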
TIME MATTERS
TOPICS (EVEN WHEN YOU DON’T HAVE
AN A PRIORI COMPARISON SET)
UNKNOWN UNKNOWNS
• In general, topic modeling is a way of addressing
the limits of our knowledge. If you’re asking a
question about data, you probably know
something about the data going in.
• But what we hear from people is that they are keenly aware
that they don’t know what they don’t know.
• Topic modeling is meant to help with that.
• In the next slides, another use of topic modeling:
identifying the themes of Martin Luther King Jr.’s
major speeches and sermons
• Topic modeling Dr. King’s major speeches and sermons gets these topics
• Which change over time
• See also http://idibon.com/topic-detection-mlk/