Words and More Words:
Challenges of Big (Text) Data
Edie Rasmussen
Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI SYMPOSIUM 2014
Big Data, Big Ideas for Smarter Communities
Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text
The Rise of Big Text Data
• Before there was Big Data, there were large
bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very
large (text) databases
• Swanson: “Undiscovered public knowledge”
(1987)
Current Text Sources
• Digitized Legacy Materials
– Google Books, HathiTrust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust
Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth
Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
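The steps on this slide can be sketched end to end in a few lines of Python. This is a minimal illustration only, not the method of any particular IR system: the stop list, the crude suffix-stripping `stem` function, and the names `preprocess`, `cosine`, and `rank` are all assumptions made for the example. A real system would use a full stop list and a proper stemmer such as Porter's.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    bags = [Counter(preprocess(d)) for d in docs]       # bag of words: order is discarded
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}

    def weigh(bag):
        # tf*idf weight; terms unseen in the collection are ignored
        return {t: tf * idf[t] for t, tf in bag.items() if t in idf}

    q_vec = weigh(Counter(preprocess(query)))
    scored = [(cosine(q_vec, weigh(bag)), i) for i, bag in enumerate(bags)]
    return sorted(scored, reverse=True)
```

Calling `rank("text database search", docs)` returns every document with a score rather than a yes/no match, which is the sense in which the result is a ranked list and the process is probabilistic rather than deterministic.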
Counting and the Rise of Culturomics
• “Culturomics is the application of high-throughput data collection and
analysis to the study of human culture”
• Database of >5 million digitized books (~4% of all books ever printed)
• Michel et al. (Science, 2011): “Quantitative
analysis of culture using millions of digitized
books”
• Google’s N-Gram Viewer
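The counting behind this kind of analysis can be sketched as follows. This is an illustrative approximation, not Google's implementation: `corpus` is a hypothetical iterable of (year, text) pairs standing in for dated, digitized books, and the function names are assumptions. Counts are divided by the year's total so that years with more printed text remain comparable.

```python
import re
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])


def yearly_frequency(corpus, phrase):
    """Relative frequency of `phrase` among same-length n-grams, per year.

    `corpus` is a hypothetical iterable of (year, text) pairs.
    """
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in corpus:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(ngrams(tokens, n))
        hits[year] += counts[target]
        totals[year] += sum(counts.values())
    # Years with no n-grams of this length are skipped to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}
```

Plotting the returned values against the year gives a curve of the kind the viewer displays for a single term.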
Using the N-Gram Viewer
[Figure: N-Gram Viewer plot of the terms typhoid, gout, HIV, and cholera, 1800 to 2000]
How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)
Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google
Books corpus to churn out gigabytes of
uninformative graphs and insignificant
conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”
Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
TM: Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find
subsequent instances
• Story segmentation, First story detection,
Clustering of like stories
• Of interest to news, business, and security analysts
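First story detection and clustering of like stories can be illustrated with a simple single-pass scheme: compare each incoming story with the topics seen so far, and open a new topic when nothing is similar enough. This is a sketch under stated assumptions, not a published TDT system. The raw term-count vectors, the `threshold` value of 0.3, and the names `bag`, `cosine`, and `track` are all choices made for the example.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words term counts for one story."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counters."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def track(stories, threshold=0.3):
    """Label each story with a topic index, in arrival order."""
    topics = []  # one aggregated bag of words per topic
    labels = []
    for text in stories:
        b = bag(text)
        scores = [cosine(b, t) for t in topics]
        best = max(range(len(topics)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            topics.append(b)  # first story of a new topic
            labels.append(len(topics) - 1)
        else:
            topics[best] += b  # fold the story into an existing story line
            labels.append(best)
    return labels
```

The threshold controls the trade-off: set it too low and unrelated stories merge into one topic, too high and every follow-up report is flagged as a new story.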
TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve Bayes, etc.) -> positive,
negative, neutral
• Involves Entity Extraction, NLP, sentiment
vocabularies
• Of interest to government and businesses
• See Stanford SA of movie reviews:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
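To make the classification framing concrete, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It is a teaching sketch only and is unrelated to the Stanford demo linked above; the class name `NaiveBayes`, the tokenizer, and the choice to skip words never seen in training are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class term counts
        self.class_counts = Counter(labels)      # documents per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokenize(text):
                if w in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on labeled examples with `NaiveBayes().fit(texts, labels)`, the model assigns each new text the label with the highest log probability. Sentiment vocabularies and entity extraction would enter as additional or better features.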
TM: Trends and Predictions
• Can Tweets and Search Logs be used to
predict the future?
• Google Flu Trends, Google Dengue Trends
– Correlated with Search Terms
• Network analysis of Tweets about the Arab Spring
• Assessing tone of global news data to predict
national stability, location of terrorists, etc.
(Leetaru)
• Predicting opinions (recommender systems)
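The "correlated with search terms" idea rests on a standard statistic: the Pearson correlation between two time series, for example weekly query counts for a term and weekly reported cases. The function below is a generic sketch of that calculation, not the method used by Google Flu Trends; the name `pearson` and its error handling are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

A value near 1 means the two series rise and fall together, which is what makes a search term a candidate signal for the quantity being tracked.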
TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
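A toy pipeline shows how these sources of evidence fit together. It is a deliberately simplified sketch and makes no claim about how any production question answering system works: the `ANSWER_PATTERNS` table, the term-overlap retrieval score, and the function names are all assumptions made for the example, and the regular expressions stand in for real entity extraction.

```python
import re
from collections import defaultdict

# Hypothetical mapping from question word to a pattern for the expected answer type.
ANSWER_PATTERNS = {
    "when": re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b"),        # a four-digit year
    "who": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),   # a multi-word capitalized name
}


def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def answer(question, passages):
    """Return (candidate, confidence) pairs, best first."""
    words = question.split()
    if not words:
        return []
    pattern = ANSWER_PATTERNS.get(words[0].lower())  # question type identification
    if pattern is None:
        return []
    q_terms = terms(question)
    scores = defaultdict(float)
    for passage in passages:  # retrieval of candidate text
        overlap = len(q_terms & terms(passage)) / len(q_terms)
        if overlap == 0:
            continue
        for candidate in pattern.findall(passage):  # entity extraction
            scores[candidate] += overlap  # evidence accumulates across passages
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(cand, score / total) for cand, score in ranked]  # normalized confidence
```

Each candidate is a hypothesis, its score is accumulated evidence, and the final list is the ranking. A candidate supported by several passages outranks one that appears only once.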
[Images: Hans Peter Luhn, 1952; Watson, 2011]
Structuring Research:
“Digging Into Data” Program
• Addresses: “how "big data" changes the research
landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information
professionals
• Requires international teams; funding from at
least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/
Thank you!