Challenges in the linguistic exploitation of specialized
republishable web corpora
Adrien Barbaresi
Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)
RESAW conference 2015
Århus – June 10, 2015
Adrien Barbaresi (BBAW) Specialized republishable web corpora 2015-06-10 1 / 15
Outline
• Context
  • Specialized web corpora
  • Construction and availability
• Challenges
  • Metadata extraction
  • Quality assessment of content
  • Licensing and republishing
Context Specialized web corpora
Text corpora
Text collections
in German
gathered on the Web
used by linguists
available via a web interface
“Specialized” corpora
Definition
The corpora focus on a particular text genre or source.
Goal for linguists: better coverage of specific written text types and genres
not found in “traditional” corpora.
Construction
1. Discovery and download: web crawling techniques
2. Stored in a processed version: linguistic corpus
3. Standardized formats: interoperability within the research community
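As a rough illustration of the second step (turning a downloaded page into processable text), here is a minimal sketch that uses only the Python standard library. It is not the toolchain used for these corpora; the class and function names are hypothetical, and a real pipeline would also need boilerplate removal, tokenization and annotation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: return one line of text per text node."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Rede</h1><p>Guten Tag.</p></body></html>"
)
print(html_to_text(sample))
```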
Two cases of republishable corpora
“Standard” case: German political speeches
Chancellery | 1,831 speeches | 1998–2012
Presidency | 1,442 speeches | 1984–2012
https://adrien.barbaresi.eu/corpora/speeches/
“Borderline” case: German blogs under Creative Commons licenses
Blogs | 250,000 documents | approx. 100 million tokens
https://kaskade.dwds.de/dstar/blogs/
Context Construction and availability
File formats
1. Web archives (HTML; no WARC files to this day)
   ⇒ linguistic processing toolchain
2a. XML TEI format (https://tei-c.org)
2b. Browser-friendly HTML documents
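To give an idea of what a document in the XML TEI format might look like, here is a minimal sketch that builds a bare-bones TEI file with the Python standard library. The selection of header fields, the function name `make_tei` and the example values are assumptions made for illustration; the actual corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements in the default namespace (no prefix).
ET.register_namespace("", TEI_NS)


def q(tag):
    """Return the namespace-qualified name of a TEI element."""
    return "{%s}%s" % (TEI_NS, tag)


def make_tei(title, source_url, date, paragraphs):
    """Hypothetical helper: wrap plain paragraphs in a minimal TEI skeleton."""
    tei = ET.Element(q("TEI"))

    # Header: title, publication statement, source description.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    # Placeholder text: a real file would state availability and license.
    ET.SubElement(pub_stmt, q("p")).text = "License statement goes here"
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    bibl = ET.SubElement(source_desc, q("bibl"))
    ET.SubElement(bibl, q("ref"), target=source_url).text = source_url
    ET.SubElement(bibl, q("date"), when=date).text = date

    # Body: one <p> element per paragraph.
    text = ET.SubElement(tei, q("text"))
    body = ET.SubElement(text, q("body"))
    for para in paragraphs:
        ET.SubElement(body, q("p")).text = para

    return ET.tostring(tei, encoding="unicode")


xml_doc = make_tei(
    "Rede",                        # example title (made up)
    "https://example.org/rede",    # example source URL (made up)
    "2012-03-05",
    ["Erster Absatz.", "Zweiter Absatz."],
)
print(xml_doc)
```

Keeping the source URL and the date in the header is what later allows queries to be restricted or grouped by source and date.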
Interface to the political speeches: static HTML documents
Interface to the blogs: querying architecture @DWDS
Challenges Metadata extraction
Data quality
Even small or rare mistakes, for instance in date encoding, may cause the
application to be disregarded or discarded by researchers in the humanities.
Potentially erroneous metadata in “one size fits all” web corpora may
undermine the relevance of web texts for linguistic purposes.
→ “Hi-Fi” web corpora promote web sources and the modernization of
research methodology
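One simple safeguard against date errors of this kind is to reject any extracted date that does not parse or that falls outside a plausible range. The sketch below assumes that dates have already been normalized to ISO 8601 strings (YYYY-MM-DD); the function name and the chosen bounds are hypothetical.

```python
from datetime import date


def parse_publication_date(raw, earliest, latest):
    """Return a date if `raw` is a valid ISO date within bounds, else None."""
    try:
        parsed = date.fromisoformat(raw.strip())
    except ValueError:
        # Not a valid calendar date in ISO format.
        return None
    if not (earliest <= parsed <= latest):
        # Syntactically valid but implausible for this corpus.
        return None
    return parsed


# Assumed bounds for illustration only.
EARLIEST = date(1995, 1, 1)
LATEST = date(2015, 6, 10)

for raw in ["2012-03-05", "2012-13-05", "2031-01-01", "05.03.2012"]:
    print(raw, "->", parse_publication_date(raw, EARLIEST, LATEST))
```

Returning `None` rather than guessing keeps doubtful documents out of date-based queries, at the cost of some coverage.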
Examples: quality of metadata
Figure: Relative frequency of the lemma “Google” in the blog corpus, classified by date
Figure: Relative frequency of the lemma “Zuckerberg” in the blog corpus, classified by date
Querying and plotting software (DDC & DiaCollo): Bryan Jurish (BBAW)
http://odo.dwds.de/~moocow/software/
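The quantity plotted in these figures can be approximated as follows. This is a toy sketch, not the DDC or DiaCollo implementation: the input format (pairs of year and lemma list), the function name and the sample data are assumptions.

```python
from collections import Counter


def relative_frequency_by_year(docs, lemma, per=1_000_000):
    """Return {year: occurrences of `lemma` per `per` tokens}."""
    hits = Counter()
    totals = Counter()
    for year, lemmas in docs:
        totals[year] += len(lemmas)
        hits[year] += sum(1 for item in lemmas if item == lemma)
    return {
        year: per * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Made-up documents: (year from the metadata, lemmatized tokens).
docs = [
    (2008, ["Google", "sein", "gut", "Suche"]),
    (2008, ["Blog", "Google"]),
    (2012, ["Zuckerberg", "Facebook", "kaufen", "Instagram"]),
]

print(relative_frequency_by_year(docs, "Google", per=1000))
```

Because both the hit counts and the totals are grouped by the year taken from the metadata, a document with a wrong date moves its tokens to the wrong year, which is how metadata errors distort such curves.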
Challenges Quality assessment of content
Example: text quality (query: “document” in blog corpus)
Challenges Licensing and republishing
Last but not least: License issues
Different countries, different laws (public domain in the USA, political
speeches in Germany etc.)
To be sure: check content and licenses
Manual content checks for the blogs
2727 blog candidates
1766 blogs (65 %) can be used without restriction, since all their textual
content qualifies for archiving:
• The website has at least some content
• It is a blog
• It is mostly written in German
• It is under a CC license
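The four criteria amount to a conjunction of manual yes/no judgments per blog, and the reported share follows from the two counts. The sketch below only illustrates this logic; the record type `BlogCheck` and its field names are hypothetical, and the checks themselves were done by hand.

```python
from dataclasses import dataclass


@dataclass
class BlogCheck:
    """Result of one manual check (hypothetical record layout)."""
    has_content: bool
    is_blog: bool
    mostly_german: bool
    cc_licensed: bool

    def usable(self):
        # A blog qualifies only if all four criteria hold.
        return all(
            (self.has_content, self.is_blog, self.mostly_german, self.cc_licensed)
        )


def usable_share(checks):
    """Fraction of checked candidates that meet every criterion."""
    checks = list(checks)
    if not checks:
        return 0.0
    return sum(1 for check in checks if check.usable()) / len(checks)


# Made-up examples of individual judgments.
examples = [
    BlogCheck(True, True, True, True),
    BlogCheck(True, True, False, True),   # not mostly German
    BlogCheck(True, False, True, True),   # not a blog
    BlogCheck(True, True, True, True),
]
print(usable_share(examples))

# Figures from the slide: 1766 usable blogs out of 2727 candidates.
CANDIDATES, USABLE = 2727, 1766
print(f"{USABLE / CANDIDATES:.1%}")
```

With the figures from the slide, the share comes to about 64.8 %, which rounds to the 65 % reported.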
CC license terms (blog corpus)
Most frequent license types:
652 BY-NC-SA
532 BY-NC-ND
351 BY-SA
282 BY
129 BY-NC
58 BY-ND
Remarks
• In theory, a CC license cannot be replaced by another one once
the content has been published
• The use of *-ND (NoDerivatives) licenses might be a problem
• Differences between countries are not supposed to be a concern
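To gauge how much of the material the *-ND remark concerns, the counts listed on the slide can be grouped by license element. The shares below are computed over the six listed license types only, since the slide gives just the most frequent ones; the function name is hypothetical.

```python
# Counts as listed on the slide.
LICENSE_COUNTS = {
    "BY-NC-SA": 652,
    "BY-NC-ND": 532,
    "BY-SA": 351,
    "BY": 282,
    "BY-NC": 129,
    "BY-ND": 58,
}


def count_with_element(counts, element):
    """Sum the counts of all licenses that contain a given element."""
    return sum(n for name, n in counts.items() if element in name.split("-"))


total = sum(LICENSE_COUNTS.values())
for element in ("ND", "NC", "SA"):
    matching = count_with_element(LICENSE_COUNTS, element)
    print(f"{element}: {matching} of {total} ({matching / total:.1%})")
```

Among the listed types, 590 of 2004 entries (about 29 %) carry the ND element.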
Thank you for your attention
barbaresi@bbaw.de
@adbarbaresi
http://purl.org/adrien-barbaresi
Document under CC BY-SA 4.0 license