2312 PACLIC
- Development of Korean NLP
- Largely follows classical computational
linguistics approaches
- NLP pipeline to downstream tasks
- Why & how Korean NLP studies have
become popular (at least domestically)
- Challenging syntax and morphology
(pro-drop, agglutinative)
- Writing system based on neither the Latin
alphabet nor Kanji
- Motivation for model and corpus studies
of one's own language
Korean NLP Paradigm
Past and Ongoing Projects
- KAIST & NIKL (Corpora & treebanks)
- ExoBrain by ETRI (CL and QA)
- AI HUB by NIA (Nationwide datahub)
- LDC (Dataset catalogs)
- Open corpora by individuals &
organizations in academia/industry
(Downstream tasks & Benchmarks)
Revisiting Korean
Corpus Studies through
Technological Advances
Won Ik Cho
Sangwhan Moon
Youngsook Song
Seoul National University
Tokyo Institute of Technology
Sionic AI Inc.
Diachronic Overview of (Open) Korean Corpora (1990s – 2023)
• Large-scale, raw text to small, specific, annotated text
- From raw & annotated corpora and treebanks (syntactic area) to downstream
tasks related to language understanding, and on to benchmarks (sets of tasks)
• Token-annotated classification to document-annotated
classification and span/generation
- Model development tends to move from models prioritizing categorical prediction
(statistical models and perceptrons) to ones for translation/generation (PLMs, LLMs)
- Model capacity becomes larger (lowering computation budget)
• Written or spoken (web) text to texts in various areas
- Abundant data provided by the web domain (e.g., social media) and
increasing demand from areas beyond linguistics (e.g., finance, law)
Trend of Korean Corpus Studies
- Large-scale corpora + Treebanks (1990s-2000s; Usually driven by institutes)
- Downstream tasks of NLP pipeline and parallel corpora (2000s-)
- Sentiment analysis, QA, entailment, and similarity datasets (2010s-)
- Dialogue studies, offensive language, societal bias and fairness (2020s-)
- Benchmark studies for PLMs + Pretraining corpora (2020s-)
- Benchmarks for LLM evaluation (2022-)
Our Repository for
Open Korean Corpora
Studies