SchemaCMD: An XML-based storage schema for the compilation of mixed-source CMD corpora
Cornelius Puschmann, University of Düsseldorf [email_address]
Towards a Reference Corpus of Web Genres, University of Birmingham, 27 July 2007
Contents of this presentation
A) Problems of classifying digital genres
Approaches to genre and text typology
A faceted classification scheme (Herring 2007)
Aspects of digital genres (figure labels: concrete, abstract)
Are blogs a genre?
B) Granularity and meta-data in blogs
Blog content syndication
A sample RSS 2.0 feed
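The feed shown on this slide did not survive extraction. As a stand-in, the following is a minimal sketch of how post-level metadata can be read from an RSS 2.0 feed. The feed content, the URLs and the function name `parse_rss` are invented for illustration; only the element names (`channel`, `item`, `title`, `link`, `pubDate`, `category`, `description`) come from the RSS 2.0 format itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, made up for this sketch (not the feed from the slide).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Corporate Blog</title>
    <link>http://blog.example.com/</link>
    <description>An invented feed used only for illustration</description>
    <item>
      <title>First post</title>
      <link>http://blog.example.com/first-post</link>
      <pubDate>Mon, 02 Jul 2007 09:00:00 GMT</pubDate>
      <category>news</category>
      <description>Text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.com/second-post</link>
      <pubDate>Mon, 09 Jul 2007 09:00:00 GMT</pubDate>
      <category>products</category>
      <category>news</category>
      <description>Text of the second post.</description>
    </item>
  </channel>
</rss>
"""


def parse_rss(xml_text):
    """Return one dict per <item>, keeping the post-level metadata."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    posts = []
    for item in channel.findall("item"):
        posts.append({
            "source": channel.findtext("title"),
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            # An item may carry zero or more <category> elements.
            "categories": [c.text for c in item.findall("category")],
            "text": item.findtext("description"),
        })
    return posts


if __name__ == "__main__":
    for post in parse_rss(SAMPLE_FEED):
        print(post["published"], post["title"], post["categories"])
```

Each `<item>` corresponds to one post, so the feed already delivers text and metadata at post granularity, without any scraping of the surrounding HTML page.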
C) An example: the corporate web log corpus
The corpus tool
MySQL data structure
- sources
- BNC top 100 words
- sub-types of corp. blogs
- blogs, press eds., ...
- n-grams (not computed due to cost)
- POS frequencies by post
- post data (via RSS/Atom)
- additional post statistics
- tokens (depends on types)
- types (string + POS)
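The relationship between some of these tables can be sketched as follows. This is not the schema of the actual corpus tool: all table and column names are assumptions, and the sketch uses Python's built-in `sqlite3` module instead of MySQL so that it runs without a database server. It only illustrates the two points the slide states, namely that a type is a string plus a POS tag and that tokens depend on types.

```python
import sqlite3

# Hypothetical schema; names and columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE sources (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL            -- e.g. blog, press release, editorial
);
CREATE TABLE posts (
    id        INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(id),
    title     TEXT,
    published TEXT,
    body      TEXT
);
CREATE TABLE types (
    id     INTEGER PRIMARY KEY,
    string TEXT NOT NULL,
    pos    TEXT NOT NULL,
    UNIQUE (string, pos)          -- a type is a string plus its POS tag
);
CREATE TABLE tokens (
    id       INTEGER PRIMARY KEY,
    post_id  INTEGER NOT NULL REFERENCES posts(id),
    type_id  INTEGER NOT NULL REFERENCES types(id),
    position INTEGER NOT NULL
);
"""


def add_token(conn, post_id, position, string, pos):
    """Store one token, creating its type first if it is new."""
    conn.execute(
        "INSERT OR IGNORE INTO types (string, pos) VALUES (?, ?)",
        (string, pos),
    )
    type_id = conn.execute(
        "SELECT id FROM types WHERE string = ? AND pos = ?",
        (string, pos),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO tokens (post_id, type_id, position) VALUES (?, ?, ?)",
        (post_id, type_id, position),
    )


def pos_frequencies(conn, post_id):
    """Count tokens per POS tag for one post."""
    rows = conn.execute(
        "SELECT t.pos, COUNT(*) FROM tokens k "
        "JOIN types t ON t.id = k.type_id "
        "WHERE k.post_id = ? GROUP BY t.pos",
        (post_id,),
    )
    return dict(rows.fetchall())


def build_demo():
    """Create an in-memory database holding one invented, pre-tagged post."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sources (id, name, kind) VALUES (1, 'Example Blog', 'blog')")
    conn.execute("INSERT INTO posts (id, source_id, title) VALUES (1, 1, 'Demo')")
    tagged = [("the", "DET"), ("blog", "NOUN"), ("reads", "VERB"),
              ("the", "DET"), ("feed", "NOUN")]
    for position, (string, pos) in enumerate(tagged):
        add_token(conn, 1, position, string, pos)
    return conn


if __name__ == "__main__":
    print(pos_frequencies(build_demo(), 1))
```

Storing each type once and pointing tokens at it keeps the token table narrow, and per-post POS frequencies reduce to a single grouped join.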
Corpus data
D) Variation and genre
Measuring text formality via f-scores
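The bullet points of this slide are lost, so the exact definition used in the talk is not recoverable here. The sketch below assumes that "f-score" refers to the formality measure proposed by Heylighen and Dewaele, which adds the percentage frequencies of nouns, adjectives, prepositions and articles, subtracts those of pronouns, verbs, adverbs and interjections, then adds 100 and halves the result. The tag names and the function `f_score` are illustrative choices, not taken from the corpus tool.

```python
from collections import Counter

# Word classes that raise the score and those that lower it, under the
# assumed Heylighen and Dewaele definition.
FORMAL = ("noun", "adjective", "preposition", "article")
DEICTIC = ("pronoun", "verb", "adverb", "interjection")


def f_score(pos_tags):
    """Compute the formality score for a sequence of coarse POS tags.

    Frequencies are percentages of all tokens, so the result lies
    between 0 and 100; tags outside the eight classes only enlarge
    the denominator.
    """
    if not pos_tags:
        raise ValueError("cannot score an empty text")
    counts = Counter(pos_tags)
    total = len(pos_tags)

    def pct(tag):
        return 100.0 * counts[tag] / total

    formal = sum(pct(tag) for tag in FORMAL)
    deictic = sum(pct(tag) for tag in DEICTIC)
    return (formal - deictic + 100.0) / 2.0


if __name__ == "__main__":
    # Invented tag sequences standing in for a noun-heavy and a
    # pronoun-heavy sentence.
    nominal = ["article", "noun", "verb", "article", "adjective", "noun"]
    personal = ["pronoun", "verb", "adverb", "pronoun", "verb", "noun"]
    print(round(f_score(nominal), 1), round(f_score(personal), 1))
```

Under this definition, noun-heavy text such as a press release scores high and pronoun- and verb-heavy text such as a personal blog entry scores low.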
Example: high f-score (press release)
Example: low f-score (blog entry)
F-score over time for two sources (light blue = Jonathan Schwartz, blog; dark blue = New York Times, editorials)
F-score and standard deviation for all sources (x-axis = stdev; y-axis = f-score; dot size = number of posts; point labels: editorials, press releases)
E) Individuated CL?
Observations
Thanks for listening!