SlideShare a Scribd company logo
1 of 14
Download to read offline
Processing Extensive Text Data 
at the Leipzig Corpora Collection 
Dirk Goldhahn 
MLODE 2014, Leipzig 
Natural Language Processing Group 
Institute of Computer Science 
University of Leipzig
Leipzig Corpora Collection 
2 
● Project that offers 
● Corpus-based monolingual full form dictionaries 
● Started as project „Deutscher Wortschatz“ 
● Accessible at http://corpora.uni-leipzig.de 
● Web interface 
● Web services 
● Integration into CLARIN 
● Export as rdf (Sebastian Hellmann)
Available Resources (Online) 
3 
● Corpora in >220 languages 
● Text material 
● Word frequencies 
● Word co-occurrences 
● Semantic maps (strongest word co-occurrences) 
● Topics 
● Subject areas (partially) 
● Morphological information (partially) 
● POS-tagged sentences (partially) 
● Further semantic and grammatical information
Example 
4
Example 
5
Number of Languages over Time 
6
World Map Interface 
7
Data Stock 
●Number of languages: >300, >1000 including religious texts 
●Number of sentences (overall): ~4 billion 
●Biggest language: English (1.1 billion) 
8
Text Acquisition Methods 
9 
●Generic web crawler (Heritrix) 
● Whole TLDs 
● News (abyzNewsLinks) 
●RSS feeds (daily) 
→ http://wortschatz.uni-leipzig.de/wort-des-tages/ 
●Distributed text crawling (FindLinks) 
●Bootstrapping web corpora 
●Text dumps (Wikipedia, Watchtower) 
●Parallel corpora (Bible)
Corpus Creation 
10 
●Processing steps: 
● Stripping 
● Language Separation (about 500 languages) 
● Sentence Segmentation 
● Cleaning 
● Sentence Scrambling (copyright reasons) 
● Tokenization 
● Database Creation 
● Statistical Evaluation
Statistical Evaluation 
11 
●Enrichment of corpora with statistical annotations 
● Word frequencies 
● Co-occurrence frequencies and significance (based on left or right 
neighbours or whole sentences) 
● Topics, POS 
●Creation of further corpora statistics 
● >200 statistical properties based on letter, word, sentence, sources level 
etc. 
● e.g.: For automatic quality assurance using typical distributions 
Hindi newspaper 2010 % 
0 50 100 150 200 250 
1 
0,8 
0,6 
0,4 
0,2 
0 
0 50 100 150 200 250 
6 
5 
4 
3 
2 
1 
0 
Sundanese Wikipedia 2007 
%
LCC as Linked Data 
12 
●LD starts with data and metadata 
● Data + metadata for hundreds of languages 
● All data processed and stored identically, 
including metadata 
● Still at the beginning of turning into LD
LCC as Linked Data 
13 
●Monolingual Metadata: 
● Frequencies, word co-occurrences, NER, POS-tags, topics 
● Integration of external sources (DBPedia, FreeBase, Wordnet) 
● Still problems to solve (linking, disambiguation, ...) 
→ New Web Interface 
●Multilingual Metadata: 
● Crosslingual word co-occurrences (parallel text) 
● External sources (DBPedia...)
14 
●Data available via: 
● Web interface 
(http://corpora.informatik.uni-leipzig.de) 
● Download 
(http://corpora.informatik.uni-leipzig.de/download.html) 
● Webservices 
(http://wortschatz.uni-leipzig.de/Webservices/) 
Thank you!

More Related Content

What's hot

What's hot (20)

Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
 
Linking knowledge spaces
Linking knowledge spacesLinking knowledge spaces
Linking knowledge spaces
 
Linked Open Data: A simple how-to
Linked Open Data: A simple how-toLinked Open Data: A simple how-to
Linked Open Data: A simple how-to
 
Normalizing Data for Migrations
Normalizing Data for MigrationsNormalizing Data for Migrations
Normalizing Data for Migrations
 
XML: A New Standard for Data
XML: A New Standard for DataXML: A New Standard for Data
XML: A New Standard for Data
 
Resident good: NoSQL
Resident good: NoSQLResident good: NoSQL
Resident good: NoSQL
 
Are Linked Datasets fit for Open-domain Question Answering? A Quality Assessment
Are Linked Datasets fit for Open-domain Question Answering? A Quality AssessmentAre Linked Datasets fit for Open-domain Question Answering? A Quality Assessment
Are Linked Datasets fit for Open-domain Question Answering? A Quality Assessment
 
A Higher-Order Data Flow Model for Heterogeneous Big Data
A Higher-Order Data Flow Model for Heterogeneous Big DataA Higher-Order Data Flow Model for Heterogeneous Big Data
A Higher-Order Data Flow Model for Heterogeneous Big Data
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 
Comparison with storing data using NoSQL(CouchDB) and a relational database.
Comparison with storing data using NoSQL(CouchDB) and a relational database.Comparison with storing data using NoSQL(CouchDB) and a relational database.
Comparison with storing data using NoSQL(CouchDB) and a relational database.
 
Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic Cross lingual information retrieval across 100 languages - Andrej Muhic
Cross lingual information retrieval across 100 languages - Andrej Muhic
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
DBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in DublinDBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in Dublin
 
Open Data Node - Platform and Methodology - 2015-May
Open Data Node - Platform and Methodology - 2015-MayOpen Data Node - Platform and Methodology - 2015-May
Open Data Node - Platform and Methodology - 2015-May
 
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
 
NoSql evaluation
NoSql evaluationNoSql evaluation
NoSql evaluation
 
DBpedia Viewer - LDOW 2014
DBpedia Viewer - LDOW 2014DBpedia Viewer - LDOW 2014
DBpedia Viewer - LDOW 2014
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 
On a web of data streams
On a web of data streamsOn a web of data streams
On a web of data streams
 
Kalp Corporate MongoDB Tutorials
Kalp Corporate MongoDB TutorialsKalp Corporate MongoDB Tutorials
Kalp Corporate MongoDB Tutorials
 

Similar to Dirk Goldhahn: Introduction to the German Wortschatz Project

The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
Sören Auer
 

Similar to Dirk Goldhahn: Introduction to the German Wortschatz Project (20)

Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
 
Translating software with SDL Passolo
Translating software with SDL PassoloTranslating software with SDL Passolo
Translating software with SDL Passolo
 
Introduction to SDL Passolo
Introduction to SDL PassoloIntroduction to SDL Passolo
Introduction to SDL Passolo
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
Lemon at-mlw3
Lemon at-mlw3Lemon at-mlw3
Lemon at-mlw3
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Introduction to HDF 3.0
Introduction to HDF 3.0Introduction to HDF 3.0
Introduction to HDF 3.0
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
 
Standardisation Challenges in Data Management Lifecycle
Standardisation Challenges in Data Management LifecycleStandardisation Challenges in Data Management Lifecycle
Standardisation Challenges in Data Management Lifecycle
 
The Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New TechnologiesThe Standards Mosaic Opening the Way to New Technologies
The Standards Mosaic Opening the Way to New Technologies
 
Php packages
Php packagesPhp packages
Php packages
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...
 
Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...Apertium: a unique free/open-source MT system for related languages [but not ...
Apertium: a unique free/open-source MT system for related languages [but not ...
 
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
Philippe Langlais - 2017 - Users and Data: The Two Neglected Children of Bili...
 
Oc wg-nif-20130711
Oc wg-nif-20130711Oc wg-nif-20130711
Oc wg-nif-20130711
 
5 Programming Languages for Databases to Learn In 2023.pptx
5 Programming Languages for Databases to Learn In 2023.pptx5 Programming Languages for Databases to Learn In 2023.pptx
5 Programming Languages for Databases to Learn In 2023.pptx
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
LODStats (Presentation for KESW2013 System Demo)
LODStats (Presentation for KESW2013 System Demo)LODStats (Presentation for KESW2013 System Demo)
LODStats (Presentation for KESW2013 System Demo)
 
How community software supports language documentation and data analysis
How community software supports language documentation and data analysisHow community software supports language documentation and data analysis
How community software supports language documentation and data analysis
 

More from mbruemmer

Oscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media AgenciesOscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media Agencies
mbruemmer
 
Alessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELIAlessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELI
mbruemmer
 
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked DataMark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
mbruemmer
 
Ilan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic ResourcesIlan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic Resources
mbruemmer
 
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysisLemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
mbruemmer
 

More from mbruemmer (7)

Oscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media AgenciesOscar Muñoz: Content Analytics for Media Agencies
Oscar Muñoz: Content Analytics for Media Agencies
 
Alessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELIAlessio Bosca: Linked Data for Content Analytics in CELI
Alessio Bosca: Linked Data for Content Analytics in CELI
 
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
Marc Egger: Text Analytics for Brand Research -Non-reactive Concept Mapping t...
 
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked DataMark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data
 
Ilan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic ResourcesIlan Kernerman: Generating Multilingual Lexicographic Resources
Ilan Kernerman: Generating Multilingual Lexicographic Resources
 
Tatiana Gornostay: Language Meets Knowledge in Digital Content Management
Tatiana Gornostay: Language Meets Knowledge in Digital Content ManagementTatiana Gornostay: Language Meets Knowledge in Digital Content Management
Tatiana Gornostay: Language Meets Knowledge in Digital Content Management
 
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysisLemon-aid: using Lemon to aid quantitative historical linguistic analysis
Lemon-aid: using Lemon to aid quantitative historical linguistic analysis
 

Recently uploaded

Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Recently uploaded (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Dirk Goldhahn: Introduction to the German Wortschatz Project

  • 1. Processing Extensive Text Data at the Leipzig Corpora Collection Dirk Goldhahn MLODE 2014, Leipzig Natural Language Processing Group Institute of Computer Science University of Leipzig
  • 2. Leipzig Corpora Collection 2 ● Project that offers ● Corpus-based monolingual full form dictionaries ● Started as project „Deutscher Wortschatz“ ● Accessible at http://corpora.uni-leipzig.de ● Web interface ● Web services ● Integration into CLARIN ● Export as rdf (Sebastian Hellmann)
  • 3. Available Resources (Online) 3 ● Corpora in >220 languages ● Text material ● Word frequencies ● Word co-occurrences ● Semantic maps (strongest word co-occurrences) ● Topics ● Subject areas (partially) ● Morphological information (partially) ● POS-tagged sentences (partially) ● Further semantic and grammatical information
  • 6. Number of Languages over Time 6
  • 8. Data Stock ●Number of languages: >300, >1000 including religious texts ●Number of sentences (overall): ~4 billion ●Biggest language: English (1.1 billion) 8
  • 9. Text Acquisition Methods 9 ●Generic web crawler (Heritrix) ● Whole TLDs ● News (abyzNewsLinks) ●RSS feeds (daily) → http://wortschatz.uni-leipzig.de/wort-des-tages/ ●Distributed text crawling (FindLinks) ●Bootstrapping web corpora ●Text dumps (Wikipedia, Watchtower) ●Parallel corpora (Bible)
  • 10. Corpus Creation 10 ●Processing steps: ● Stripping ● Language Separation (about 500 languages) ● Sentence Segmentation ● Cleaning ● Sentence Scrambling (copyright reasons) ● Tokenization ● Database Creation ● Statistical Evaluation
  • 11. Statistical Evaluation 11 ●Enrichment of corpora with statistical annotations ● Word frequencies ● Co-occurrence frequencies and significance (based on left or right neighbours or whole sentences) ● Topics, POS ●Creation of further corpora statistics ● >200 statistical properties based on letter, word, sentence, sources level etc. ● e.g.: For automatic quality assurance using typical distributions Hindi newspaper 2010 % 0 50 100 150 200 250 1 0,8 0,6 0,4 0,2 0 0 50 100 150 200 250 6 5 4 3 2 1 0 Sundanese Wikipedia 2007 %
  • 12. LCC as Linked Data 12 ●LD starts with data and metadata ● Data + metadata for hundreds of languages ● All data processed and stored identically, including metadata ● Still at the beginning of turning into LD
  • 13. LCC as Linked Data 13 ●Monolingual Metadata: ● Frequencies, word co-occurrences, NER, POS-tags, topics ● Integration of external sources (DBPedia, FreeBase, Wordnet) ● Still problems to solve (linking, disambiguation, ...) → New Web Interface ●Multilingual Metadata: ● Crosslingual word co-occurrences (parallel text) ● External sources (DBPedia...)
  • 14. 14 ●Data available via: ● Web interface (http://corpora.informatik.uni-leipzig.de) ● Download (http://corpora.informatik.uni-leipzig.de/download.html) ● Webservices (http://wortschatz.uni-leipzig.de/Webservices/) Thank you!