2010 Digital Humanities London - Dutch Republic of Letters


Digital Humanities 2010 London.
About the CKCC project: Dutch Republic of Letters. With Charles van den Heuvel.

Published in: Education
  • Slide 7: Comparison of Waterschyring with the model for scouring an inland harbour by means of pivot sluices, and the drainage in Note's map. Here we clearly see similarities. However, despite the strong similarities in the figure, uncertainties in the dating of the texts that Stevin did not publish make it hard to establish whether this work could have inspired Note. In any case, as we shall see, another work by Stevin played a larger role in Note's argumentation for his invention. Moreover, that work is explicitly cited by Beeckman in support of Note's defence. It is De Beghinselen des Waterwichts of 1586.
  • Catalogus Epistularum Neerlandicarum (CEN), or the Catalogue of letters in Dutch repositories. It is a relatively old database, already available via Telnet in the early 1990s, before the World Wide Web came into being. CEN is an exhaustive database of letters in the collections of five Dutch university libraries, the Royal Library, and four other important libraries. It contains more than 265,000 descriptions of approximately 1,000,000 letters, dating from 1600 until the present day (of which ca. 100,000 are from the 17th century). It supplies the following metadata: sender, recipient, place of sending, year, language, repository and shelf mark. The format in which this database will be made available to the project is to be negotiated with the owner, OCLC. Using this database will enable us to state what fraction of the total body of letters the selected letters represent. Moreover, it allows us to increase the density of the networks we are interested in, leading to unprecedented research opportunities.
  • How did knowledge circulate in the 17th-century Dutch Republic? How were elements of knowledge picked up by the learned community? How was this new knowledge processed, disseminated, theorized and ultimately accepted, or rejected? How can we combine and structure various sets of letters of 17th-century scholars in such a way that we can analyze the circulation of knowledge in an international context and follow the development of themes of interest in space and time? How can we make this information on knowledge production accessible to interdisciplinary research in the Humanities?
  • How can we combine and structure various sets of letters of 17th-century scholars and their correspondents in such a way that we can analyze and visualize the circulation and appropriation of knowledge in a wider international context and recognize the development of themes of interest and scholarly debates in space and time? How can we make this information on knowledge production accessible to interdisciplinary research in the Humanities? How can this information be enriched by annotation?
  • Letters are not uniformly available. Multilingual material and spelling variation. Automated/manual linking and tagging: much interpretation is needed to resolve references to names, dates, places, ideas and concepts; annotations are heterogeneous. How can visualizations, based on the data, be made informative for research?
    • Qualitative: Who is corresponding with or introducing whom? Can we distinguish circles and types of scholars? Where are they located, and where do they meet? Can we distinguish types of letters and rhetorical structures? Can we distinguish emerging themes and debates in these networks?
    • Quantitative: Number of correspondents. Frequency and duration of correspondence. Percentage of various languages and themes.
  • !NB mention distinction between keyword and concept extraction
  • WMatrix: good on a per letter basis; not so handy for the whole corpus
  • LDA is purely statistical. The input for LDA can be improved by stemming; NER can be improved by part-of-speech analysis. Concept extraction: LDA is for topic modelling. Keywords => compose topics => label them; topic modelling => concepts.
  • Topic modelling with Mallet: LDA (latent Dirichlet allocation) and relational topic modelling, with topics linked to senders and receivers of letters. Comment on dips and peaks: worth exploring the little guys! Why are they peaking? Next step: visualise the dynamics of topics in geography (in the manner of Buienradar, the Dutch rain radar).
  • The emphasis is on infrastructure: for CLARIN, also for Alfalab, for future computational humanities, and for scholars' letters (now also a CLARIN-NL project).
  • http://www.clarin.eu/external/index.php?page=activities&sub=2
  • see WP-2
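
The quantitative questions in the notes above (number of correspondents, frequency and duration of correspondence, percentage of languages) can be answered directly from letter metadata of the kind CEN supplies (sender, recipient, year, language). The following is a minimal sketch under stated assumptions, not the project's actual tooling: the records, the placeholder names ("Scholar A" and so on) and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; the field names follow the CEN metadata
# (sender, recipient, year, language). Names and years are placeholders.
LETTERS = [
    {"sender": "Scholar A", "recipient": "Scholar B", "year": 1630, "language": "latin"},
    {"sender": "Scholar B", "recipient": "Scholar A", "year": 1631, "language": "latin"},
    {"sender": "Scholar A", "recipient": "Scholar B", "year": 1635, "language": "french"},
    {"sender": "Scholar A", "recipient": "Scholar C", "year": 1640, "language": "dutch"},
]

def correspondents(letters, person):
    """Return everyone who exchanged at least one letter with `person`."""
    found = set()
    for letter in letters:
        if letter["sender"] == person:
            found.add(letter["recipient"])
        elif letter["recipient"] == person:
            found.add(letter["sender"])
    return found

def exchange_stats(letters):
    """Per unordered pair of scholars: letter count, first/last year, duration in years."""
    years_by_pair = defaultdict(list)
    for letter in letters:
        pair = frozenset((letter["sender"], letter["recipient"]))
        years_by_pair[pair].append(letter["year"])
    return {
        pair: {
            "letters": len(years),
            "first": min(years),
            "last": max(years),
            "duration": max(years) - min(years),
        }
        for pair, years in years_by_pair.items()
    }

def language_shares(letters):
    """Percentage of letters written in each language."""
    counts = Counter(letter["language"] for letter in letters)
    total = sum(counts.values())
    return {language: 100 * n / total for language, n in counts.items()}

if __name__ == "__main__":
    print(sorted(correspondents(LETTERS, "Scholar A")))
    for pair, stats in exchange_stats(LETTERS).items():
        print(sorted(pair), stats)
    print(language_shares(LETTERS))
```

Treating a pair of scholars as unordered merges both directions of an exchange into one correspondence; a directed variant would key on `(sender, recipient)` instead.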

    1. Letters, Ideas and Information Technology: Using digital corpora of letters to disclose the circulation of knowledge in the 17th century
       • Erik-Jan Bos, Univ. Utrecht, erik-jan.bos@phil.uu.nl
       • Charles van den Heuvel, VKS, charles.vandenheuvel@vks.knaw.nl
       • Dirk Roorda (that’s me), DANS, dirk.roorda@dans.knaw.nl
       • [Figure labels: scholarly communication @ 1650; scholarly communication @ 2050]
    2. http://ckcc.huygens.knaw.nl/
    3. [Diagram: relations around STEVIN across disciplines. Names shown: Nota, Beeckman, Cats, Huygens, Langeren. Relation types: direct - water; indirect - literature]
    4. Corpora of 17th century scholars
       • Constantijn Huygens
       • Christiaan Huygens
       • Grotius
       • Descartes
       • Swammerdam
       • Leeuwenhoek
       • Barlaeus
       • Spinoza
       • and more?
    5. Corpus               Number of letters  In possession?  Format        Metadata                 Normalized?
       Grotius              7946               Yes             TEI           In Interp element        Yes, DBNL codes
       Van Leeuwenhoek      337                Yes             TEI           In Interp element        Yes, DBNL codes
       Descartes            750                Yes             XML (no TEI)  other markup             No, plain text
       Barlaeus             1200               300 ready       Word          unknown                  unknown
       Swammerdam           80                 Yes             Word          unknown                  unknown
       Constantijn Huygens  7295               Yes             xml           Probably Interp element  DBNL codes
       Christiaan Huygens   2900?              Mid-2010        probably TEI  Probably Interp element  DBNL codes
    6. CEN - Metadata
       • Catalogus Epistularum Neerlandicarum
       • 265,000 descriptions of approximately 1,000,000 letters
       • from 1600 to now, of which 100,000 letters are from the 17th century
    7. Research Questions
       • History of science:
         • How did knowledge circulate in the 17th-century Dutch Republic?
       • Patterns in knowledge growth:
         • How can we visualise sets of letters that exhibit features of knowledge circulation?
       • Re-use:
         • How can we expose the sources, annotations, and resulting patterns to further research?
    8. Challenge
       • Traditional scholarship (East)
         • interpretation
         • close reading
         • solving puzzles
       • Computational methods (West)
         • dealing with patterns
         • gleaned from large quantities of texts
         • by automatic tools
       • East is east and West is west and ...
    9. Issues to deal with
       • making the sources uniformly available
         • well coded in TEI, access rights
       • overcoming the language barrier
         • (17th cent. varieties of French, Latin, Dutch)
       • named entity recognition & concepts
         • people, places, dates, concepts, instruments
         • mixture of interpretation and algorithms
       • creating useful visualisations
         • aiding exploration by historians of science
    10. ICT in Humanities Research
       • collaboratory
         • e-Laborate as starting point
       • algorithmic pipelines
         • from source material to visualisation
       • infrastructure
         • archiving results
         • re-using data
         • developing new algorithms
         • disseminating the methodology
    11. collaboratory
    12. pipelines
    13. pipelines (current)
       • language detection, using "Language Identification from Text Using N-gram Based Cumulative Frequency Addition" (Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert, 2004)
       • results [chart labels: latin, dutch, french, german]
    14. pipelines (current)
       • spelling normalisation
         • VARD (http://www.comp.lancs.ac.uk/~barona/vard2/)
         • with help from (http://www.dicollecte.org/home.php?prj=fr)
       • results
         • French: VARD works (after improvements), although designed for historical English
         • Dutch: still on the lookout for a combination of resources, tools, and dexterity
         • Latin: later
    15. pipelines (current)
    16. pipelines (current)
       • named entity recognition
         • known tools get 70%
         • search for optimal tools in the next stage
    17. pipelines (insights)
       • expect the most from statistical methods
       • language technology may boost results
       • it remains to be seen by how much
    18. Topic-Author-Time (source: Scott Weingart, UIA)
    19. infrastructure
    20. the project’s legacy
       • more than publications
         • curated sources, annotations, visualisations
       • more than algorithms
         • a framework for analysis of historical texts
       • more than a piece of historical research
         • data and (intermediate) results worthwhile to linguists, computer scientists, sociologists
       • more than a passive dataset
         • extensible, dynamic, interactive
    21. preserving the results
       • part of the CLARIN infrastructure
         • http://www.clarin.eu/
         • http://www.clarin.nl/
       • materials in a Trusted Digital Repository (DANS)
         • http://easy.dans.knaw.nl/dms
    22. working with CLARIN
       • CLARIN-EU
         • Outreach to humanities: use cases
         • CKCC one of 10 selected projects
         • received expert input for choice of language tools
       • CLARIN-NL
         • CKCC one of 10 initial projects in the Dutch national construction effort
         • support for applying language technology
    23. Adapting to CLARIN
       • Conforming to standards
       • CLARIN standards are in evolution
         • (and will remain evolvable)
       • Common MetaData Infrastructure
         • a registry of metadata components
         • defined by the community
         • with explicit semantics (http://www.isocat.org/)
       • Data in TEI (as export/import format)
    24. Trusted Digital Repository
       • materials
         • reliable (provenance metadata)
         • findable (CMDI metadata)
         • referable (persistent identifiers)
         • accessible (viewable in a web browser)
         • usable (downloadable)
       • sooner or later:
         • high-performance computing
         • Memento: a time-sensitive web interface to the dynamic contents of the collaboratory (http://arxiv.org/abs/0911.1112)
    25. http://www.clarin.eu/node/3073
    26. http://ckcc.huygens.knaw.nl/
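
The language detection step on slide 13 relies on n-gram based cumulative frequency addition. The toy sketch below illustrates the general idea only: each language gets a profile of normalised character n-gram frequencies, and a text is assigned to the language whose profile yields the highest sum over the text's n-grams. It is not the project's pipeline and does not reproduce every detail of the cited paper; the training sentences are invented modern-spelling placeholders, and the function names and n-gram sizes are assumptions.

```python
from collections import Counter

# Invented toy training sentences (assumption); a real profile would be built
# from a large sample of 17th-century letters in each language.
SAMPLES = {
    "latin": "quod ad litteras tuas attinet, gratias tibi ago maximas et rogo ut valeas",
    "dutch": "ik heb uw brief ontvangen en dank u zeer voor het boek dat gij mij gezonden hebt",
    "french": "j'ai reçu votre lettre et je vous remercie beaucoup pour le livre que vous m'avez envoyé",
}

def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams (sizes n_min..n_max) of the lower-cased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]

def train_profiles(samples):
    """Map each language to {n-gram: relative frequency within that language}."""
    profiles = {}
    for language, text in samples.items():
        counts = Counter(char_ngrams(text))
        total = sum(counts.values())
        profiles[language] = {gram: n / total for gram, n in counts.items()}
    return profiles

def identify(text, profiles):
    """Sum the profile frequencies of the text's n-grams; the highest sum wins."""
    grams = char_ngrams(text)
    scores = {
        language: sum(profile.get(gram, 0.0) for gram in grams)
        for language, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores

PROFILES = train_profiles(SAMPLES)

if __name__ == "__main__":
    for sentence in ("gratias tibi ago", "ik dank u voor uw brief",
                     "je vous remercie pour votre lettre"):
        language, _ = identify(sentence, PROFILES)
        print(sentence, "->", language)
```

Because the method only adds up looked-up frequencies, it needs no distance computation between ranked profiles; the cost is that short or mixed-language letters can be misassigned when the profiles are small.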