Embedding Metadata and Other
      Semantics In Word-Processing
              Documents
                 Peter Sefton (University of Southern Queensland)
                   Ian Barnes (Australian National University)
                  Ron Ward (University of Southern Queensland)
             Jim Downing (University of Cambridge) (presenting)


                                         [breath]


The paper supporting this presentation provides important detail and can be obtained from
http://www.dspace.cam.ac.uk/handle/1810/206423
Agenda

Motivations

Axioms of choice

Interoperability is Hard

The approach

Examples (+ chemistry)

                           http://www.flickr.com/photos/forezt/524108228
Why is this interesting?


We want to move towards semantically-rich
documents for e-Research. In some disciplines 100%
of documents start life in a word processor.

Introducing real-world constraints yields an
interesting result.
Semantically Rich Documents


         Enable automation

         Prevent information loss

         Better discovery

         Improved presentation




Automation - zero-click upload, no filling in redundant forms, etc.
Information loss - rich data is otherwise reduced to tables and images.
Semantic information leads to richer alternatives for discovery and communication of
research.
Fully Supported Research - all the supporting data is delivered with the text.
Constraints
                    Work
                    in the
               real world,
                    today
                                                http://www.flickr.com/photos/amirjina/2281612876
The solution had to work in ICE - the Integrated Content Environment, a distributed authoring
system in production at USQ.

Therefore the approach is PRAGMATIC!
Real World

Metadata, semantics and data not easily distinguished

Document creation == Metadata creation

 Not separable activities

 Metadata is in the document

Documents have multiple, distributed authors
Tools and Formats

Microsoft Word [Adoption]

OpenOffice.org Writer
[Access]

ICE - Integrated Content
Environment

                 .doc, .docx, OOXML, ODF

                 HTML, PDF
The Difference Between
       Standards and Interoperability




This is the test that semantic solutions must pass to be useful in production - once
semantics are created, they must survive when the document is edited in the wild.
This is the simple subset of document interop we’re talking about, including only Word and
OOo Writer.

In the wild you can’t control what formats people use to save, or the software they use.

If any of these routes destroys semantics, then we’ve lost interoperability.

There are a lot of standards already involved in this space, but none of them on its own
delivers semantic data interoperability.
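The survival test described in these notes can be sketched as a set comparison. This is only an illustration: the function names, the route labels and the style lists below are hypothetical inputs made up for the example, not test results, and reading the style names out of a saved file is assumed to be done elsewhere.

```python
def lost_styles(before, after):
    """Return style names present before a save/open route but missing after it."""
    return sorted(set(before) - set(after))


def check_routes(original_styles, routes):
    """routes maps a route label to the style names found after that route."""
    return {label: lost_styles(original_styles, found)
            for label, found in routes.items()}


# Hypothetical inputs, for illustration only (not measured results).
original = ["p-meta-author", "p-meta-affiliation", "p-meta-abstract"]
routes = {
    "route A (hypothetical)": ["p-meta-author", "p-meta-affiliation", "p-meta-abstract"],
    "route B (hypothetical)": ["p-meta-author"],
}

report = check_routes(original, routes)
# Interoperability holds only if no route loses anything.
interoperable = all(not lost for lost in report.values())
print(report)
print("interoperable:", interoperable)
```

A single lossy route is enough to make the whole check fail, which mirrors the point made in the notes.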
Interoperability in Publishing




PDF - scholarly publishing now
HTML - the medium term future of scholarly publishing.

A converter is needed, since HTML and PDF creation in OOo Writer and MS Word produces pretty
poor results.
http://www.flickr.com/photos/druclimb/289636172


When you apply these interoperability constraints, the solution space gets very small.
<metaphor>Like walking along a ridge, keep it simple and take small steps. The paths off to
the side lead quickly to peril.</metaphor>
Approaches Ruled Out
MS Word “Smart Tags”

 No interop with OOo, but not necessarily a bad idea

MS Word foreign namespace XML encoding

 Expensive, no interop with OOo, lock-in issues

ODF 1.2 embedded semantics

 No Word equivalent in sight

Things that would destroy WYSIWYG such as using wiki
markup in the word processor.
Define A New Encoding
                       Standard?




Codifying a standard wouldn’t work unless vanilla word-processing software can be shown
not to destroy the information.

For delivering interoperability in this area, standards are not sufficient.
Microformats!




      http://www.flickr.com/photos/onion/2046003604
Encoding Microformats

         Tables: for, like, tabulating things

         Styles: The original extensible inline semantic
         mechanism for word processing and still working!

         Links

         Frames: fragile

         Bookmarks and fields: require lots of field testing, not
         all that reliable in an interop situation


The paper contains much more detail about the mechanism.
Styles




The style approach is:
 * Simple
 * Metadata schema agnostic
 * User extensible

It doesn’t /need/ any plugin / customized software to work.
Style: p-meta-author


         Style: p-meta-affiliation


              Style: p-meta-issued                             Style: p-meta-abstract




    Styles can be nested by placing inline styled text within styled paragraphs.
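The style names on this slide suggest how a converter might read metadata back out. A minimal sketch, assuming the names split into three hyphen-separated parts (kind, family, property) and that some earlier step has already produced (style name, text) pairs from the document. The function names and that input shape are assumptions for illustration, not the ICE implementation.

```python
from collections import defaultdict


def parse_style(style_name):
    """Split e.g. 'p-meta-author' into (kind, family, prop), or None if it does not fit."""
    parts = style_name.split("-", 2)
    if len(parts) != 3:
        return None
    return tuple(parts)


def collect_metadata(paragraphs):
    """paragraphs: iterable of (style_name, text) pairs; returns {prop: [values]}."""
    metadata = defaultdict(list)
    for style_name, text in paragraphs:
        parsed = parse_style(style_name)
        if parsed is None:
            continue  # ordinary styles such as 'Title' or 'p' carry no metadata here
        kind, family, prop = parsed
        if kind == "p" and family == "meta":
            metadata[prop].append(text.strip())
    return dict(metadata)


# Hypothetical document content, in reading order.
paragraphs = [
    ("Title", "Metadata in ICE documents"),
    ("p-meta-author", "Ian Barnes"),
    ("p-meta-affiliation", "ANU"),
    ("p-meta-author", "Peter Sefton"),
    ("p-meta-affiliation", "USQ"),
    ("p", "Body text that carries no metadata style."),
]
print(collect_metadata(paragraphs))
```

Because the property name is taken from the style name itself, a new style such as a hypothetical p-meta-keyword would be picked up without changing the code, which is one way to read "metadata schema agnostic" and "user extensible".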
{ 'title': ['Metadata in ICE documents'],
  'author': [{'name': 'Ian Barnes', 'affiliation': 'ANU'},
             {'name': 'Peter Sefton', 'affiliation': 'USQ'}]
}




Tables are also useful since the layout implies semantics
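As a rough sketch of how table layout can imply semantics: assume a table whose first row names the properties and whose remaining rows each describe one author. That layout and the function name are assumptions for illustration; the slide does not show the actual table.

```python
def table_to_records(rows):
    """Turn a header row plus body rows into a list of dicts, one per body row."""
    header, *body = rows
    keys = [cell.strip().lower() for cell in header]
    return [dict(zip(keys, (cell.strip() for cell in row))) for row in body]


# Hypothetical table content, read row by row from the document.
rows = [
    ["Name", "Affiliation"],
    ["Ian Barnes", "ANU"],
    ["Peter Sefton", "USQ"],
]

metadata = {
    "title": ["Metadata in ICE documents"],
    "author": table_to_records(rows),
}
print(metadata)
```

Each row keeps a name next to its affiliation, so the pairing does not have to be inferred from paragraph order.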
Toolbars




The toolbars are implemented for Word and Writer. They provide easy access to the common
microformat encoding styles and structures. They also contain macros for communicating
with the ICE system, and uploading the document to the Institutional Repo / publisher system
etc.
http://www.flickr.com/photos/jima/460348206


To make it even easier, templates can be used that include sample text in the relevant places
- all the user has to do is replace the sample text.
Dublin Core metadata can be extracted directly from the document.
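One possible shape for that extraction, sketched under assumptions: the metadata has already been collected into a dictionary of lists of strings, and it is written out using the common convention of Dublin Core meta elements in HTML. The key-to-element mapping and the function name are illustrative choices, not taken from ICE.

```python
from html import escape

# Assumed mapping from extracted keys to Dublin Core element names.
DC_MAP = {
    "title": "DC.title",
    "author": "DC.creator",
    "abstract": "DC.description",
    "issued": "DC.date",
}


def to_dc_meta(metadata):
    """Render {key: [values]} as HTML <meta> lines; unmapped keys are skipped."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for key, values in metadata.items():
        name = DC_MAP.get(key)
        if name is None:
            continue
        for value in values:
            lines.append('<meta name="%s" content="%s">'
                         % (name, escape(value, quote=True)))
    return lines


# Hypothetical extracted metadata.
dc_lines = to_dc_meta({
    "title": ["Metadata in ICE documents"],
    "author": ["Ian Barnes", "Peter Sefton"],
})
print("\n".join(dc_lines))
```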
RDF metadata using the ORE vocabulary can be extracted in the same way.
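A small sketch of what ORE-style RDF output could look like, written as N-Triples. The ORE terms used (ResourceMap, Aggregation, describes, aggregates) come from the OAI-ORE vocabulary; the example.org URIs, the function names and the choice of what gets aggregated are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def ore_triples(rem_uri, aggregation_uri, part_uris):
    """Build triples for a resource map that describes one aggregation."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    triples += [(aggregation_uri, ORE + "aggregates", part) for part in part_uris]
    return triples


def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI triples as N-Triples."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


# Hypothetical URIs for a document and two of its renditions.
triples = ore_triples(
    "http://example.org/doc/rem",
    "http://example.org/doc/aggregation",
    ["http://example.org/doc/paper.html", "http://example.org/doc/paper.pdf"],
)
print(to_ntriples(triples))
```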
ICE-TheOREM

        Semantics in chemistry thesis documents

        Structural elements, Chapters, Appendices etc

        Data (molecules, spectral data etc)

        Chemical entities in text




http://wwmm.ch.cam.ac.uk/trac/theorem/
Chemistry

                                               Style: p-exptl-compound

                                                  Link to data.

                                                 Style: p-exptl-compnum



                                                     Style: p-compound-name




This text is from a synthetic chemistry thesis.

It highlights the grey area between data and metadata - the compound name is data, but also
the subject of the document.
These screenshots are taken from the CML in ICE demo at
http://ice.usq.edu.au/presentations/demos/index.htm
FIN
                 Thank you.




http://www.flickr.com/photos/jaysun/367670007
ICE - Integrated Content Environment http://ice.usq.edu.au/
       Demos at http://ice.usq.edu.au/presentations/demos/
                   ICE-TheOREM. Tag: jisctheorem
              https://wwmm.ch.cam.ac.uk/trac/theorem
http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/theoremice.aspx
                           Peter Sefton
                       sefton@usq.edu.au
                       http://ptsefton.com/
                         Jim Downing
                      ojd20@cam.ac.uk
            http://wwmm.ch.cam.ac.uk/blogs/downing/
