Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen

Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University

Published in: Education, Technology
Transcript of "Words and More Words: Challenges of Big Data by Prof. Edie Rasmussen"

  1. Words and More Words: Challenges of Big (Text) Data
     Edie Rasmussen
     Visiting Professor, Nanyang Technological University
     Professor, University of British Columbia
     WKWSCI SYMPOSIUM 2014
     Big Data, Big Ideas for Smarter Communities
  2. Outline
     • The Rise of Big Text Data
     • Challenges for Text Data
     • Research Opportunities
       – Counting and Culturomics
       – Extracting Meaning from Text
  3. The Rise of Big Text Data
     • Before there was Big Data, there were large bibliographic databases:
       – Dialog: ~180 scholarly databases
       – Lexis/Nexis: 5 billion documents (business/law/news)
       – Citation Indexes: > 40 million records
     • IR techniques designed for rapid access to very large (text) databases
     • Swanson: “Undiscovered public knowledge” (1987)
  4. Current Text Sources
     • Digitized Legacy Materials
       – Google Books, Hathi Trust (11 million volumes, 500 TB)
     • The Web
     • Search Logs (over 2 million queries per minute)
     • Wikipedia (~4.5 million English articles)
     • Blogs (The Blogosphere)
     • Twitter (The Twitterverse)
     • Test Collections
       – Smaller
       – Experimentally more robust
  5. Challenges of Text
     • Legacy Text/Digitization Costs
     • Quality (OCR Errors; Metadata Errors)
     • Availability (Access, Copyright, Privacy)
     • Reliability
       – Algorithmic dependencies
       – Creator trustworthiness
     • Authorship Issues (Identification, Authority)
     • Lack of Structure
     • Lack of Context
     • Ambiguity of human language
     • Breadth vs. Depth
  6. Processing Text
     • Tokenizing, stopping, stemming
     • Statistics of text: term values (tf*idf)
     • “Bag of Words” approach
     • Other evidence: network structures
     • Similarity calculations
     • Creating ranked lists
     • Note: Probabilistic rather than Deterministic
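
The processing steps on slide 6 can be illustrated with a minimal sketch. It is not taken from the talk: the stopword list, the toy documents, the choice of cosine similarity, and the function names are all assumptions made for illustration, and stemming is left out for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_vectors(docs):
    """Return one {term: tf*idf} bag-of-words dict per document, plus the idf table."""
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    idf = {term: math.log(n / df[term]) for term in df}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags], idf

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)

docs = [
    "Cholera outbreaks in the nineteenth century",
    "Digitized books and the study of culture",
    "Typhoid and cholera in digitized newspapers",
]
ranking = rank("cholera books", docs)
```

Here the rarer query term ("books") carries more weight than the more common one ("cholera"), so the second document ranks first. The output is a ranked list of scores rather than a yes/no match.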
  7. Counting and the Rise of Culturomics
     • “Culturomics is the application of high-throughput data collection and analysis to the study of human culture”
     • Database of >5 million digitized books (~4% of all books ever printed)
     • Michel et al. (Science, 2011): “Quantitative analysis of culture using millions of digitized books”
     • Google’s N-Gram Viewer
  8. Using the N-Gram Viewer
     [Figure: N-Gram Viewer chart of the terms typhoid, gout, HIV, and cholera; x-axis from 1800 to 2000]
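
The kind of counting behind an n-gram chart can be sketched roughly as follows. This is a hypothetical illustration, not Google's implementation: the function name, the whitespace tokenization, and the three-sentence corpus are assumptions.

```python
from collections import Counter

def yearly_relative_frequency(corpus, term):
    """corpus: iterable of (year, text) pairs.

    Returns {year: occurrences of term / total tokens that year}.
    """
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Made-up toy corpus for illustration only.
toy_corpus = [
    (1850, "cholera spread through the city"),
    (1850, "the doctor treated gout"),
    (1990, "research on hiv continued"),
]
freq = yearly_relative_frequency(toy_corpus, "cholera")
```

Dividing by the yearly token total matters because the number of books published changes over time. Raw counts would mostly reflect corpus growth rather than word usage.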
  9. How Far Will Counting Take Us?
     • Many limitations (e.g. incomplete data set)
     • Some surprisingly sophisticated analyses:
       – Size of English lexicon
       – Change in word usage (irregular verbs) over time
       – Cultural turnover (inventions)
       – The nature (duration) of fame
       – Patterns of censorship (“suppression index”)
  10. Critiques of Culturomics
     • “The death of theory”
     • “…second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions.” (Nunberg, 2011)
     • Books as a representation of human history
     • A “time sink”
  11. Social Media as Big Data
     • ‘Internet Minute’
       – 320+ new Twitter accounts
       – 100,000 new Tweets
       – 2+ million search queries
       – 6 new Wikipedia articles
       – 30 hours of video uploaded
     (Source: Intel http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
  12. TM: Topic Detection and Tracking
     • Tracking a story line over time
     • News wire input, identify new story, find subsequent instances
     • Story segmentation, First story detection, Clustering of like stories
     • Interesting to news, business, security analysts
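
First story detection, as listed on slide 12, can be approximated by flagging an incoming story as new when it is not similar enough to anything seen so far. The sketch below is one simple way to do that, assuming Jaccard overlap of word sets and an arbitrary threshold of 0.3. Neither choice comes from the talk, and production systems typically use weighted term vectors instead.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def detect_first_stories(stories, threshold=0.3):
    """Return one flag per story: True if it looks like a new story line."""
    seen = []
    flags = []
    for text in stories:
        words = set(text.lower().split())
        # New if no earlier story reaches the similarity threshold.
        flags.append(all(jaccard(words, prior) < threshold for prior in seen))
        seen.append(words)
    return flags

# Made-up news wire items for illustration.
wire = [
    "earthquake strikes coastal city",
    "earthquake strikes coastal city overnight",
    "central bank raises interest rates",
]
flags = detect_first_stories(wire)
```

The second item overlaps heavily with the first and is treated as a follow-up, while the third starts a new story line.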
  13. TM: Sentiment Analysis/Opinion Mining
     • Rich data from Blogs and Tweets
     • Basically a classification problem (SVM, Naïve Bayes, etc.) -> positive, negative, neutral
     • Involves Entity Extraction, NLP, sentiment vocabularies
     • Of interest to government and businesses
     • See Stanford SA of movie reviews: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
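
To make the "classification problem" framing on slide 13 concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. The class name, the five training sentences, and the whitespace tokenization are assumptions for illustration. It is unrelated to the Stanford demo linked above, which uses a different model.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter()            # label -> number of documents
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                if word in self.vocab:
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training data for illustration only.
clf = NaiveBayes()
clf.train([
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible plot hated it", "negative"),
    ("boring and terrible", "negative"),
    ("the film opens in paris", "neutral"),
])
label_a = clf.predict("great story")
label_b = clf.predict("terrible boring plot")
```

With so little training data the result is only a toy, but it shows the shape of the task: text in, one of the labels positive, negative, or neutral out.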
  14. TM: Trends and Predictions
     • Can Tweets and Search Logs be used to predict the future?
     • Google Flu Trends, Google Dengue Trends
       – Correlated with Search Terms
     • Network analysis on Tweets on Arab Spring
     • Assessing tone of global news data to predict national stability, location of terrorists, etc. (Leetaru)
     • Predicting opinions (recommender systems)
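
The "correlated with search terms" idea on slide 14 can be illustrated with a Pearson correlation between weekly query volume and reported cases. This is only a sketch of the basic statistic, not how Google Flu Trends was built, and the weekly figures below are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented numbers: weekly searches for a symptom term and reported cases.
weekly_queries = [120, 150, 310, 480, 390, 200]
weekly_cases = [10, 14, 30, 45, 41, 18]
r = pearson(weekly_queries, weekly_cases)
```

A high correlation on past data does not by itself make the query series a reliable predictor, which is why the slide poses prediction as a question.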
  15. TM: Question Answering
     • Combines multiple sources of evidence:
       – Question type identification
       – Information retrieval of candidate text
       – Natural language processing
       – Entity extraction
       – Hypothesis generation and scoring (confidence)
       – Ranking hypotheses
  16. [Image captions: Watson, 2011; Hans Peter Luhn, 1952; Watson, 2011]
  17. Structuring Research: “Digging Into Data” Program
     • Addresses: “how "big data" changes the research landscape for the humanities and social sciences”
     • 3 rounds of international research funding
     • Canada, US, UK, plus Netherlands
     • Team approach: scholars, scientists, information professionals
     • Requires international teams; funding from at least two countries
     • Wide range of datasets made available
     • http://www.diggingintodata.org/
  19. Thank you!