DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

Talk given at Digital Humanities 2012 (DH2012) in Hamburg, Germany on 18 July 2012.
Web site: http://www.scottishcorpus.ac.uk/corpus/diaview/
Video: http://lecture2go.uni-hamburg.de/konferenzen/-/k/13916
Abstract: http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/diaview-visualise-cultural-change-in-diachronic-corpora/

This paper will introduce and demonstrate DiaView, a new tool to investigate and visualise word usage in diachronic corpora. DiaView highlights cultural change over time by exposing salient lexical items from each decade or year, and providing them to the user in an effortless visualisation. This is made possible by examining large quantities of diachronic textual data, in this case the Google Books corpus (Michel et al. 2010) of one million English books. This paper will introduce the methods and technologies at its core, perform a demonstration of the tool and discuss further possibilities.

DiaView: visualise cultural change in diachronic corpora, David Beavan, UCLDH, DH2012

  1. DiaView: Visualise Cultural Change in Diachronic Corpora. David Beavan, UCL Centre for Digital Humanities, @DavidBeavan, www.scottishcorpus.ac.uk/corpus/diaview
  2. Google Books corpus / Ngram Viewer: http://books.google.com/ngrams/
  3. Google Books corpus
     • OCR quality variable, particularly poor in 1700s (difficulties with long-s: ſ)
     • Does not evenly sample across genres (data collection fairly opportunistic)
     • Chronological placement questionable (implicit metadata not always correct)
     • Very large data set (155 billion tokens)
  4. DiaView uses
     • English One Million corpus: “Books with low OCR quality were removed, and serials were removed.”
     • 1850 to present (avoids long-s)
     • 98 billion tokens (still very large)
     • Filter out very infrequently used words (or keep large sample of most frequently used)
  5. DiaView concept
     • Quick and easy to use
     • Aggregate and summarise data
     • Promote browsing and opportunistic discovery
     • Help identify cultural trends across time
     • Highlight salient or ‘interesting’ terms
     • Provide links to more in-depth analysis
     • Inspect corpus by decade or year
     • Ability to work with any corpus or dataset
  6. DiaView method: measuring salience. Proportion of term occurrences in entire corpus vs proportion of term occurrences in particular year
  7. Word ‘and’: 100 of 1000 words in entire corpus are ‘and’ = 10%
     • Year 1: 45 of 500 words = 9% = -10% relative to corpus proportion (10%)
     • Year 2: 55 of 500 words = 11% = +10% relative to corpus proportion (10%)
     Word ‘sausage’: 20 of 1000 words in entire corpus are ‘sausage’ = 2%
     • Year 1: 4 of 500 words = 0.8% = -60% relative to corpus proportion (2%)
     • Year 2: 16 of 500 words = 3.2% = +60% relative to corpus proportion (2%)
     Rank for salience by year, ignoring underuse (no negative percentages)
     • Year 1: -
     • Year 2: ‘sausage’ (+60%), ‘and’ (+10%)
  8. DiaView method
     • Word frequency alone does not dictate salience (extraordinary overuse does)
     • Traverse entire corpus by year/decade
     • Calculate salience for each type
     • Rank types according to salience
     • Apply visual style to word lists
     • Create links back to Ngram Viewer for in-depth analysis
  9. www.scottishcorpus.ac.uk/corpus/diaview
  10. www.scottishcorpus.ac.uk/corpus/diaview
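
The salience measure described in the slides (a term's proportion in one year compared with its proportion in the entire corpus, ranked by overuse only) can be sketched in a few lines of Python. This is a minimal illustration, not DiaView's actual implementation: the function name `rank_salient_terms`, the input layout (per-year counts plus per-year token totals) and the toy figures are assumptions made for this example, with the counts taken from the ‘and’ / ‘sausage’ slide.

```python
from collections import Counter


def rank_salient_terms(counts_by_year, totals_by_year):
    """Rank overused terms per year (hypothetical helper, not DiaView's code).

    counts_by_year: {year: {term: occurrences in that year}}
    totals_by_year: {year: total number of tokens in that year}
    Returns {year: [(term, percent change vs corpus proportion), ...]},
    sorted from most to least overused; underused terms are dropped.
    """
    # Corpus-wide counts and size are the sums over all years.
    corpus_counts = Counter()
    for counts in counts_by_year.values():
        corpus_counts.update(counts)
    corpus_total = sum(totals_by_year.values())

    ranked = {}
    for year, counts in counts_by_year.items():
        scores = []
        for term, n in counts.items():
            corpus_prop = corpus_counts[term] / corpus_total
            year_prop = n / totals_by_year[year]
            # Relative difference between year and corpus proportions.
            change = (year_prop / corpus_prop - 1) * 100
            if change > 0:  # ignore underuse, as on the slide
                scores.append((term, change))
        ranked[year] = sorted(scores, key=lambda item: item[1], reverse=True)
    return ranked


# Toy figures from the slide: two years of 500 tokens each.
counts = {
    "Year 1": {"and": 45, "sausage": 4},
    "Year 2": {"and": 55, "sausage": 16},
}
totals = {"Year 1": 500, "Year 2": 500}

ranked = rank_salient_terms(counts, totals)
for year, terms in ranked.items():
    print(year, [(term, round(change, 1)) for term, change in terms])
```

On these toy figures Year 1 has no overused terms, while Year 2 lists ‘sausage’ (about +60%) ahead of ‘and’ (about +10%), even though ‘and’ is far more frequent. This is the sense in which word frequency alone does not dictate salience.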