Talk at RESAW conference 2015 on web crawling and web corpus construction. Challenges to tackle: metadata extraction, quality assessment, and licensing.
Much work flows into ensuring the “scientificity” of web texts and making the texts not only available but also citable in a scholarly sense.
Challenges in the linguistic exploitation of specialized republishable web corpora
1. Challenges in the linguistic exploitation of specialized
republishable web corpora
Adrien Barbaresi
Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)
RESAW conference 2015
Århus – June 10, 2015
Adrien Barbaresi (BBAW) Specialized republishable web corpora 2015-06-10 1 / 15
2. Outline
• Context
• Specialized web corpora
• Construction and availability
• Challenges
• Metadata extraction
• Quality assessment of content
• Licensing and republishing
3. Context Specialized web corpora
Text corpora
Text collections
in German
gathered on the Web
used by linguists
available via a web interface
4. Context Specialized web corpora
“Specialized” corpora
Definition
The corpora focus on a particular text genre or source.
Goal for linguists: better coverage of specific written text types and genres
not found in “traditional” corpora.
Construction
1 Discovery and download: web crawling techniques
2 Stored in a processed version: linguistic corpus
3 Standardized formats: interoperability within the research community
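The second construction step (turning downloaded pages into a processed version) can be illustrated with a minimal sketch. It assumes nothing about the actual toolchain used for these corpora: `TextExtractor` and `html_to_text` are hypothetical names, and the approach (standard-library HTML parsing, skipping `script` and `style`) is only one simple way to get from raw HTML to text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style elements (illustrative only)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A real pipeline would additionally have to remove navigation, comments, and other boilerplate, which this sketch does not attempt.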
6. Context Specialized web corpora
Two cases of republishable corpora
“Standard” case: German political speeches
Chancellery | 1,831 speeches | 1998–2012
Presidency | 1,442 speeches | 1984–2012
https://adrien.barbaresi.eu/corpora/speeches/
“Borderline” case: German blogs under Creative Commons licenses
Blogs | 250,000 documents | ~100 million tokens
https://kaskade.dwds.de/dstar/blogs/
7. Context Construction and availability
File formats
1 Web archives (HTML, no WARC to this day)
⇒ linguistic processing toolchain
2a XML TEI format (https://tei-c.org)
2b Browser-friendly HTML documents
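To make the XML TEI target format more concrete, here is a minimal sketch that wraps a title, a publisher, a date, and a list of paragraphs in a bare TEI skeleton. The function name `to_tei` and the choice of header fields are assumptions made for illustration; the output is not validated against the TEI schema and does not reproduce the encoding actually used for the corpora.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def to_tei(title, publisher, date_iso, paragraphs):
    """Serialize one document as a minimal TEI-like XML string (illustrative only)."""
    # Serialize the TEI namespace as the default namespace.
    ET.register_namespace("", TEI_NS)

    def q(tag):
        return f"{{{TEI_NS}}}{tag}"

    tei = ET.Element(q("TEI"))
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))

    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title

    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("publisher")).text = publisher
    # The machine-readable date goes into the "when" attribute.
    ET.SubElement(pub_stmt, q("date"), {"when": date_iso}).text = date_iso

    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = "Downloaded from the Web"

    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, q("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")
```

Keeping the date in an attribute rather than only in running text is what later allows queries and plots by date.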
8. Context Construction and availability
Interface to the political speeches: static HTML documents
9. Context Construction and availability
Interface to the blogs: querying architecture @DWDS
10. Challenges Metadata extraction
Data quality
Even small or rare mistakes, for instance in date encoding, may lead
researchers in the humanities to disregard or discard the application.
Potentially erroneous metadata in “one size fits all” web corpora may
undermine the relevance of web texts for linguistic purposes.
→ “Hi-Fi” web corpora promote web sources and modernization of
research methodology
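One way to guard against the date-encoding mistakes mentioned above is to normalize every extracted date and reject implausible values instead of storing them. The sketch below is an assumption-laden illustration: the function name `normalize_date`, the accepted input formats, and the plausibility bounds are all hypothetical and would have to be adapted to the sources at hand.

```python
from datetime import date, datetime

# Input formats accepted by this sketch (an assumption, not an exhaustive list).
FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def normalize_date(raw, earliest=date(1984, 1, 1), latest=date(2015, 6, 10)):
    """Return an ISO 8601 date string, or None if the value is unusable.

    The default bounds are illustrative; a date outside them is treated
    as an extraction error rather than silently kept.
    """
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # wrong format or impossible date, try the next format
        if earliest <= parsed <= latest:
            return parsed.isoformat()
        return None  # parsable but implausible
    return None
```

Returning `None` rather than a guessed date keeps missing metadata visible, so that a document without a reliable date can be excluded from diachronic queries.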
11. Challenges Metadata extraction
Examples: quality of metadata
Figure: Relative frequency of lemma
“Google” in the blog corpus, classified
by date
Figure: Relative frequency of lemma
“Zuckerberg” in the blog corpus,
classified by date
Querying and plotting software (DDC & DiaCollo): Bryan Jurish (BBAW)
http://odo.dwds.de/~moocow/software/
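The kind of figure described above (relative frequency of a lemma by date) depends directly on date metadata: a wrong date moves tokens into the wrong bin. The following sketch shows the underlying arithmetic only. It is not the DDC or DiaCollo implementation; the function name `relative_frequency_by_year` and the input layout (pairs of year and lemma list) are assumptions.

```python
from collections import Counter


def relative_frequency_by_year(docs, lemma, per=1_000_000):
    """Return {year: occurrences of `lemma` per `per` tokens}.

    `docs` is an iterable of (year, list_of_lemmas) pairs (hypothetical layout).
    """
    hits = Counter()
    totals = Counter()
    for year, lemmas in docs:
        totals[year] += len(lemmas)
        hits[year] += sum(1 for item in lemmas if item == lemma)
    # Years without any tokens are left out to avoid dividing by zero.
    return {
        year: hits[year] * per / totals[year]
        for year in sorted(totals)
        if totals[year]
    }
```

Normalizing by the number of tokens per year matters because the amount of text collected usually differs from year to year.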
12. Challenges Quality assessment of content
Example: text quality (query: “document” in blog corpus)
13. Challenges Licensing and republishing
Last but not least: License issues
Different countries, different laws (public domain in the USA, political
speeches in Germany etc.)
To be sure: check content and licenses
14. Challenges Licensing and republishing
Manual content checks for the blogs
2727 blog candidates
1766 blogs can be used without restriction (65 %), since all the textual
content qualifies for archiving:
• The website has at least some content
• It is a blog
• Mostly written in German
• Under CC license
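The four criteria above can be recorded as one boolean per check, which makes the share of usable blogs easy to recompute. This is only a sketch of such bookkeeping: the class `BlogCheck`, its field names, and `usable_share` are hypothetical and do not describe how the manual checks were actually stored.

```python
from dataclasses import dataclass


@dataclass
class BlogCheck:
    """Result of a manual check of one blog candidate (hypothetical record layout)."""

    url: str
    has_content: bool
    is_blog: bool
    mostly_german: bool
    cc_licensed: bool

    def usable(self):
        # A blog qualifies only if all four criteria are met.
        return all(
            (self.has_content, self.is_blog, self.mostly_german, self.cc_licensed)
        )


def usable_share(checks):
    """Return the fraction of candidates that pass every check."""
    checks = list(checks)
    if not checks:
        return 0.0
    return sum(1 for check in checks if check.usable()) / len(checks)
```

With the figures given above, 1766 usable blogs out of 2727 candidates correspond to a share of about 0.648, which is the reported 65 %.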
16. Challenges Licensing and republishing
CC license terms (blog corpus)
Most frequent license types:
652 BY-NC-SA
532 BY-NC-ND
351 BY-SA
282 BY
129 BY-NC
58 BY-ND
Remarks
• In theory, a CC license cannot be replaced by another one once
the content has been published under it
• The usage of *-ND might be a problem
• Differences between countries are not supposed to be a concern
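To see how much of the material the *-ND remark concerns, the counts listed above can be tallied by license clause. The sketch below uses only the six figures from the list; the names `LICENSE_COUNTS` and `count_with_clause` are illustrative, and the resulting shares are relative to these six license types, not to the whole corpus.

```python
# Counts of the most frequent license types, as listed above.
LICENSE_COUNTS = {
    "BY-NC-SA": 652,
    "BY-NC-ND": 532,
    "BY-SA": 351,
    "BY": 282,
    "BY-NC": 129,
    "BY-ND": 58,
}


def count_with_clause(counts, clause):
    """Sum the counts of all license types that contain the given clause."""
    return sum(
        number for name, number in counts.items() if clause in name.split("-")
    )


total = sum(LICENSE_COUNTS.values())
no_derivatives = count_with_clause(LICENSE_COUNTS, "ND")
non_commercial = count_with_clause(LICENSE_COUNTS, "NC")

print(f"ND: {no_derivatives} of {total} ({no_derivatives / total:.1%})")
print(f"NC: {non_commercial} of {total} ({non_commercial / total:.1%})")
```

Splitting the license name on hyphens avoids accidental substring matches when checking for a clause.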
17. Challenges Licensing and republishing
Thank you for your attention
barbaresi@bbaw.de
@adbarbaresi
http://purl.org/adrien-barbaresi
Document under CC BY-SA 4.0 license