HIT project - Humanities Integration Technology

Enhancing Research in the Humanities through an Integrated Knowledge Management System.
Presentation of the project prototype at AHLIST 2012 (June)

  • HIT is an innovative management system for administering texts, analyzing information, and collating images or other sources in a comprehensive web portal, custom-made for CLA faculty and students. With HIT, one can search, in a unified way, information held in different public online repositories in the arts, humanities and social sciences (such as Anthropology, Communication, English, Spanish, History, Philosophy, Political Science, Sociology, and Visual and Performing Arts).
  • This project was developed in collaboration with Constantino Malagón Luque, associate professor of Artificial Intelligence in the Department of Computer Science at Nebrija University (Madrid, Spain); Justo Hidalgo, Vice President of Product Management and Consulting at Denodo Technologies and co-founder of 24symbols; and myself, Yonsoo Kim, professor of Spanish. What does a professor of Spanish have to do with technology and science?
  • Constantino and I co-founded a research team called MMEDIS (Medieval Medicine Documents Identification System), in which researchers from several disciplines work toward an automatic transcription program based on Artificial Intelligence. We have two essential reasons for carrying out the MMEDIS project. First, we aim to analyze how medicine shaped and affected the lives of medieval people. This interest stems from my research on Teresa de Cartagena, a converted Jewish nun who became deaf and wrote religious treatises. I intend to investigate the physical disabilities and diseases that inflicted pain on people in medieval Europe. Second, we plan to study and transcribe handwritten documents more efficiently than traditional paleographic transcription of manuscripts allows.
  • Originally composed in Latin by Gilbertus Anglicus (Gilbert the Englishman), the Compendium of Medicine was a primary text of the medical revolution in thirteenth-century Europe. Composed mainly of medicinal recipes, it offered advice on diagnosis, medicinal preparation, and prognosis. In the fifteenth century it was translated into Middle English to accommodate a widening audience for learning and medical "secrets." For example, Faye Marie Getz, in her book Healing and society in medieval England: a Middle English translation of the pharmaceutical writings of Gilbertus Anglicus, provides a critical edition of the Middle English text, with an extensive introduction to the learned, practical, and social components of medieval medicine and a summary of the text in modern English. With manuscripts of this type, once a transcription exists, people do not go back to the original, because of all the intensive work that the manuscript requires.
  • It is tedious work, and only specialists in paleography can read the manuscript. But the problem does not end here…
  • We also need to decode all the abbreviations in medieval medical documents. We have submitted an article based on our study of the handwriting recognition process.
  • Our transcription project will take some time to work efficiently. However, we realized that some websites on the internet already offer manually transcribed manuscripts. For example, the Hispanic Seminary has published 55 texts online: SPANISH MEDICAL TEXTS [55 texts / 2,642,403 tokens], prepared by Francisco Gago Jover, Mª Teresa Herrera and Mª Estela González de Fauve. All these texts are available, but you cannot access them if you do not know where to find them on the internet.
  • The significance and originality of the HIT project lie in showing that knowledge should be presented beyond two-dimensional spaces such as paper (an encyclopedia) or keyword-search websites (Wikipedia, Google, Yahoo, etc.). In the mashup, knowledge has to be obtainable in open-ended, exploratory ways that preserve its full complexity. The potential of this project is broad because its integrated knowledge system can be used for research or self-learning in any field. Instead of a mere input-output model, any search and reading will lead to contextualized and integrated learning.
  • (DO NOT READ) The significance of this project. The significance and originality of the HIT project lie in showing that knowledge should be presented beyond two-dimensional spaces such as paper (an encyclopedia) or keyword-search websites (Wikipedia, Google, Yahoo, etc.). In the mashup, knowledge has to be obtainable in open-ended, exploratory ways that preserve its full complexity. The potential of this project is broad because its integrated knowledge system can be used for research or self-learning in any field. Instead of a mere input-output model, any search and reading will lead to contextualized and integrated learning.
HIT can address the two major problems of contemporary digital humanities: overload of useless information and lack of textual context. The world of electronic communication is a world of textual overabundance in which the written texts on offer go far beyond the reader's ability to take advantage of them. Researchers have often denounced the uselessness of the overload of information on the web. Ideally, then, one should know where, why, and how to gather the most accurate and reliable texts on the internet. This is precisely what HIT will do by organizing and synthesizing data and texts, all of them available in one single search. In the HIT project, I will research and select information available on the internet and keep only the information that is needed and trustworthy. Furthermore, HIT will analyze not only external repositories but also internal repositories, such as the Purdue Library's databases and catalogs.
The other problem facing current digital humanities is that texts, content or information are usually provided without taking their context into account. Reading in front of the computer screen is generally a discontinuous process that seeks, using keywords or thematic headings, the fragment the reader wishes to find: an article in an electronic periodical, a passage in a book, or some information on a website. This is done without necessarily knowing the identity or coherence of the entire text from which the fragment was extracted. In a certain sense, one might say that in the digital world all textual entities are like databases that offer fragments, the reading of which in no way implies a perception of the work or the body of works from which they came. This explains the confusion of the contemporary reader. When we search for a keyword on the HIT platform, it will also make available the original source from which the fragment was extracted, including, for example, a location map, images, notes, and references.
The HIT project will contribute to innovation in the humanities in three key ways: (a) in the user interface, by producing a means by which users can interact with this integrated knowledge, as one can see below; (b) in allowing the integration of Purdue library databases (ComDisDome, Historical Abstracts with Full Text, ITER, JSTOR, MUSE, Patrologia Latina Database, etc.); (c) in the integration of valuable humanities content located on various external sources or repositories, to produce original and valuable knowledge. As a consequence, with the integrated knowledge management system, the text itself is presented with its context, that is, the humanistic knowledge that makes up the learning environment. Reading will consist of unfolding multiple and unique textual units onto the screen, units that will be created in accordance with each reader's focus or interest.
  • HIT can address the two major problems of contemporary digital humanities: overload of useless information and lack of textual context. The world of electronic communication is a world of textual overabundance in which the written texts on offer go far beyond the reader's ability to take advantage of them. Researchers have often denounced the uselessness of the overload of information on the web. Ideally, then, one should know where, why, and how to gather the most accurate and reliable texts on the internet. This is precisely what HIT will do by organizing and synthesizing data and texts, all of them available in one single search. What we are going to demonstrate today is only a PROOF OF CONCEPT. However, our initial project was based on these concepts. We researched and selected information available on the internet and kept only the information that is needed and trustworthy. We surveyed professors from different fields to find out which websites they consider most reliable. HIT analyzes not only external repositories but also internal repositories, such as the Purdue Library's databases and catalogs.
The other problem facing current digital humanities is that texts, content or information are usually provided without taking their context into account. Reading in front of the computer screen is generally a discontinuous process that seeks, using keywords or thematic headings, the fragment the reader wishes to find: an article in an electronic periodical, a passage in a book, or some information on a website. This is done without necessarily knowing the identity or coherence of the entire text from which the fragment was extracted. In a certain sense, one might say that in the digital world all textual entities are like databases that offer fragments, the reading of which in no way implies a perception of the work or the body of works from which they came. This explains the confusion of the contemporary reader.
When we search for a keyword on the HIT platform, it will also make available the original source from which the fragment was extracted, including, for example, a location map, images, notes, and references. This is my idea: to integrate all this information and make it flexible for everyone.
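The idea of returning a fragment together with its original source can be sketched as a small data structure. This is only an illustration under stated assumptions, not HIT's actual data model: the class names `SourceDocument` and `ContextualizedHit`, their fields, and the `search` function are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SourceDocument:
    """The full work a fragment comes from (hypothetical fields)."""
    title: str
    author: str
    repository: str
    text: str


@dataclass
class ContextualizedHit:
    """A keyword match that keeps a pointer back to its source and context."""
    fragment: str
    source: SourceDocument
    images: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)


def search(keyword: str, documents: list[SourceDocument],
           window: int = 40) -> list[ContextualizedHit]:
    """Return every occurrence of `keyword` with surrounding text and its source."""
    hits: list[ContextualizedHit] = []
    needle = keyword.lower()
    for doc in documents:
        haystack = doc.text.lower()
        start = haystack.find(needle)
        while start != -1:
            lo = max(0, start - window)
            hi = min(len(doc.text), start + len(keyword) + window)
            # The fragment never travels alone: the whole source goes with it.
            hits.append(ContextualizedHit(fragment=doc.text[lo:hi], source=doc))
            start = haystack.find(needle, start + 1)
    return hits
```

Because each hit carries its `source`, a reader who lands on a fragment can always reach the title, author, repository and full text of the work it came from; images, notes and references would be filled in from contextual sources.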
  • Constantino Malagón, Professor of Computer Engineering, Universidad Antonio de Nebrija, Spain; Justo Hidalgo, Vice-President, Product Management and Consulting at Denodo Technologies, and Co-Founder of 24symbols. Both have had to work hard to fulfil my request.
  • The function and development of the HIT web portal. The HIT project will be constructed according to the architecture image shown below. I will explain its four layers starting from the very bottom of the image.
    - Acquisition Layer: The different data sources that provide early modern age documents in digitized form, their transcriptions, plus any other useful internal or web-based external repositories, will be accessed by the Data Acquisition Layer, as shown at the bottom of the figure. One of the critical assets of this component is that the web data extraction module is capable of extracting web data in a structured manner, thereby converting the web into a “virtual database.”
    - Processing Layer: This platform makes it easier to combine, mash up and transform data from heterogeneous databases and sources. Specifically, the proposed architecture will be able to perform syntactic tasks (i.e. transformations and combinations based on the structure of the content extracted, such as unifying the names of authors based on whether we want a structure of the kind {surname, first_name} or {first_name surname}) and semantic tasks (i.e. transformations and combinations based on the meaning of the content extracted). From this layer on, Justo Hidalgo, from Denodo Technologies, will develop the software. The HIT interface will be built following the most relevant industry standards, such as JDBC, ODBC, SOAP/WSDL and REST, for both data access and publishing.
    - Categorization Layer: The categorization module, on top of the data combination layer, sorts information previously stored or delivered in real time, and assigns each piece of information to a set of categories.
    - Final View: Finally, a basic presentation layer is built to allow researchers to visualize the overall mashup and categorization results.
The platform is built as a series of components, following best practices in software engineering, which simplifies the development and integration of all the resources. This is shown in the following image.
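The syntactic task mentioned for the Processing Layer, unifying author names as {surname, first_name} or {first_name surname}, can be sketched as follows. This is a minimal illustration, not the Denodo implementation; the function names and the `style` values are assumptions, and the sketch naively treats the last word of an uninverted name as the surname, which is wrong for many names (for example compound Spanish surnames).

```python
def parse_author(raw: str) -> tuple[str, str]:
    """Split a raw author string into (first_name, surname).

    Only the two shapes named in the notes are handled:
    "surname, first_name" and "first_name surname".
    """
    raw = " ".join(raw.split())  # collapse repeated whitespace
    if "," in raw:
        surname, first_name = (part.strip() for part in raw.split(",", 1))
    else:
        # Naive assumption: the last word is the surname.
        first_name, _, surname = raw.rpartition(" ")
    return first_name, surname


def format_author(raw: str, style: str = "surname_first") -> str:
    """Render an author name in one agreed structure."""
    first_name, surname = parse_author(raw)
    if not first_name:  # single-word names are returned unchanged
        return surname
    if style == "surname_first":
        return f"{surname}, {first_name}"
    if style == "first_name_first":
        return f"{first_name} {surname}"
    raise ValueError(f"unknown style: {style!r}")
```

With every source passed through the same function, records that spell an author as "Gilbertus Anglicus" and as "Anglicus, Gilbertus" end up with one form and can be combined.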
  • In order to do that we need:
    - To extend the list of repositories. By repositories we mean two kinds of data sources:
        - Structured: for example, any database, which has tables, fields, records and values. This includes any sources from the Purdue Library. These are called core sources.
        - Unstructured or semi-structured: these include web pages or plain text files, for example Wikipedia. These are called context sources because they provide contextual information based on the author, the document or whatever we choose.
      We have the survey of databases frequently used by different faculty members at CLA. Our first step will be to develop the first rating categories (structured and unstructured) from the list (see attached file).
    - To extend the list of functionalities:
        - To develop the application for mobile devices: Android and Apple iOS.
        - To adapt the web design to the Purdue standards.
        - The results screen should be more interactive (as in iGoogle, you should be able to move the different panels and show or hide some of them). To do that, we have to develop the system using the very latest web technologies, such as HTML5.
    - The HIT mashup will be hosted at Purdue University with a domain name like http://cla.purdue.edu/hit
    - To secure the system. We need users to authenticate with their own Purdue account (user and password), using secure protocols such as HTTPS.
    - To develop a results caching module. This module will make our system faster.
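One way a results caching module could speed things up is by remembering recent answers per source and query, so that a repeated search does not contact the repository again. The sketch below is an assumption about how such a module might look, not the planned HIT design; `ResultCache`, its parameters and the five-minute default lifetime are hypothetical.

```python
import time
from typing import Callable


class ResultCache:
    """Keeps search results per (source, query) for a limited time."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._entries: dict[tuple[str, str], tuple[float, list]] = {}

    def get_or_fetch(self, source: str, query: str,
                     fetch: Callable[[str], list]) -> list:
        """Return cached results if still fresh, otherwise call `fetch`."""
        key = (source, query.strip().lower())
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        results = fetch(query)  # the slow call to the repository
        self._entries[key] = (now, results)
        return results
```

The lifetime is a trade-off: a longer one saves more repository calls, a shorter one shows newly added documents sooner.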
  • The HIT system is being developed jointly by some members of the MMEDIS team and the new HIT team members.

Transcript

  • 1. HIT Humanities Integration Technology: Enhancing research in the Humanities through an integrated knowledge management system
  • 2. Team
  • 3. Collaboration: Constantino Malagón, Associate Professor of Computer Engineering, Universidad Nebrija, Spain; Justo Hidalgo, Vice-President, Denodo Technologies, Co-Founder of 24symbols; Yonsoo Kim, Assistant Professor of Spanish, School of Languages & Cultures, Purdue University
  • 4. Collaboration: Javier Polanco, Developer (undergraduate student at Nebrija University; now Computer Engineer); Carlos Martínez, Web Designer (undergraduate student at Nebrija University); Eric Herrera, Website Testing (undergraduate student at Purdue University)
  • 5. HIT: Introduction and Objectives; Solution; Results; Conclusions; Future Work
  • 6. Introduction: Our first idea: help researchers in the Humanities. 1. Medieval documents (MMEDIS.com): first, transcription; then, search, access, context. 2. Finally, a web portal (HIT)
  • 7. Medieval Document: MSS 120; Author: Gilbertus Anglicus
  • 8. Medieval documents: Automatic transcription. Abbreviations in medieval medical documents; International Conference on Frontiers in Handwriting Recognition (ICFHR2012), main peer-reviewed conference
  • 9. Medieval documents: Search and access. Hispanic Seminary: El Corpus de Textos Médicos Españoles: http://www.hispanicseminary.org/t&c/med/index.htm Keyword: “medicina”
  • 10. Medieval documents: Context. Author; Dates; Related research
  • 11. HIT Web portal: Visualization tool. Expand the document type to any published, digitized format, not just medieval texts; expand the type and number of sources, databases and repositories; expand the contextual information
  • 12. Objectives: This implies extending our first idea to a more general and more flexible system
  • 13. Flexible!
  • 14. Difficult: But flexibility implies a greater degree of difficulty
  • 15. Solution: HIT
  • 16. HIT: Repositories; Data access; Data integration; Visualization; User interaction
  • 17. Repositories: Core (these sources provide the digitized documents); Contextual (these sources provide contextual information)
  • 18. Repositories: Core: JSTOR, Project Muse, MLA Bibliography, Patrología Latina. Contextual: Amazon, Google Books, Wikipedia
  • 19. Access Types: API (Application Programming Interface); Screen Scraping
  • 20. Access: API (Application Programming Interface). These sources provide a set of rules and programmatic “doors” that let us interact with them. Example: Amazon, Google Books. Amazon, give me all info you have about the book with ISBN=“XXXX”
  • 21. Access: Screen Scraping. We need to “scrape” the web page and create structure out of it. Example: Wikipedia
  • 22. Data integration: Data virtualization
  • 23. Data visualization: Web app enabled for: Browsers: Internet Explorer (several versions), Google Chrome, Firefox. Devices: desktop, mobile devices, tablets
  • 24. Interaction: Like iGoogle; set of personalized panels
  • 25. Architecture
  • 26. Virtualization
  • 27. Results
  • 28. Conclusions: The proof of concept has shown: how to access heterogeneous, web-based data sources; how to integrate those data pieces in a single data model
  • 29. Conclusions: The proof of concept has shown: how to execute search methods among those sources; how to visualize this info in a meaningful, useful way
  • 30. Future work: To expand the list of repositories; portal personalization; adapt HIT to all kinds of devices; addition of semantic capabilities
  • 31. Thanks!
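The two access types in the slides (API and screen scraping) and the conclusion about integrating data pieces in a single data model can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Record` fields, the payload keys `title` and `author`, and the choice of the first `<h1>` as a page's title do not describe any real API or the HIT prototype, and no network access is performed.

```python
from __future__ import annotations

from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Record:
    """Single data model shared by every source (hypothetical fields)."""
    title: str
    author: str
    source: str
    kind: str  # "core" or "contextual"


def from_api(payload: dict, source: str, kind: str = "contextual") -> Record:
    """API access: the answer is already structured, so we only map fields."""
    return Record(payload["title"], payload["author"], source, kind)


class _HeadingParser(HTMLParser):
    """Collects the text of the first <h1> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.heading = ""
        self._in_h1 = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._done = True

    def handle_data(self, data):
        if self._in_h1:
            self.heading += data


def from_scraped_page(html: str, source: str, kind: str = "contextual") -> Record:
    """Screen scraping: structure has to be created out of the page markup."""
    parser = _HeadingParser()
    parser.feed(html)
    return Record(parser.heading.strip(), "", source, kind)


def integrate(records: list[Record]) -> dict[str, list[Record]]:
    """Group records from all sources under one case-insensitive title key."""
    grouped: dict[str, list[Record]] = {}
    for record in records:
        grouped.setdefault(record.title.casefold(), []).append(record)
    return grouped
```

Once both kinds of source are mapped to the same `Record`, searching and visualization only need to know one shape, whichever way the data was obtained.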