SlideShare a Scribd company logo
- Development of Korean NLP
- Largely follows classical computaional
linguistics approaches
- NLP pipeline to downstream tasks
- Why & how Korean NLP studies have
become popular (at least domestic)
- Challenging syntax and morphology
(pro-drop, agglutinative)
- Non latin alphabet nor Kanji-oriented
writing system
- Motivation of model and corpus studies
of own language
Korean NLP Paradigm
Past and Ongoing Projects
- KAIST & NIKL (Corpora & treebanks)
- ExoBrain by ETRI (CL and QA)
- AI HUB by NIA (Nationwide datahub)
- LDC (Dataset catalogs)
- Open corpora by individuals &
organizations in academia/industry
(Downstream tasks & Benchmarks)
Revisiting Korean
Corpus Studies through
Technological Advances
Won Ik Cho
Sangwhan Moon
Youngsook Song
Seoul National University
Tokyo Institute of Technology
Sionic AI Inc.
Diachronic Overview of (Open) Korean Corpora (1990s – 2023)
• Large-scale, raw text to small, specific, annotated text
- Raw & annotated corpora, Treebanks (of syntactic area) to Downstream
tasks related to language understanding, and to Benchmarks (set of tasks)
• Token-annotated classification to document-annotated
classification and span/generation
- Model development tends to develop from categorical priority (statistical
models and perceptrons) to ones for translation/generation (PLMs, LLMs)
- Capacity of model becomes larger (lowering computation budget)
• Written or spoken (web) text to texts in various areas
- Flourishing data provide from web domain (e.g., social media) and
increasing demand from areas beyond linguistics (e.g., finance, law)
Trend of Korean Corpus Studies
- Large-scale corpora + Treebanks (1990s-2000s; Usually driven by institutes)
- Downstream tasks of NLP pipeline and parallel corpora (2000s-)
- Sentiment analysis, QA, entailment, and similarity datasets (2010s-)
- Dialogue studies, offensive language, societal bias and fairness (2020s-)
- Benchmark studies for PLMs + Pretraining corpora (2020s-)
- Benchmarks for LLM evaluation (2022-)
Our Repository for
Open Korean Corpora
Studies 

More Related Content

Similar to 2312 PACLIC

OOR Architecture - Towards a Network of Linked Ontology Repositories
OOR Architecture - Towards a Network of Linked Ontology RepositoriesOOR Architecture - Towards a Network of Linked Ontology Repositories
OOR Architecture - Towards a Network of Linked Ontology RepositoriesKim Viljanen
 
Update From OCLC Research May 2008
Update From OCLC Research May 2008Update From OCLC Research May 2008
Update From OCLC Research May 2008Nancy Elkington
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
 
Innovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLPInnovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLPariadnenetwork
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
 
Introduction about AAT-Taiwan Project
Introduction about AAT-Taiwan ProjectIntroduction about AAT-Taiwan Project
Introduction about AAT-Taiwan ProjectAAT Taiwan
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSujit Pal
 
NJVR: The NanJing Vocabulary Repository
NJVR: The NanJing Vocabulary RepositoryNJVR: The NanJing Vocabulary Repository
NJVR: The NanJing Vocabulary RepositoryGong Cheng
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)Vladimir Alexiev, PhD, PMP
 
Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Brian Ulicny
 
Ontology Engineering for Systems Engineering
Ontology Engineering for Systems EngineeringOntology Engineering for Systems Engineering
Ontology Engineering for Systems EngineeringAnatoly Levenchuk
 
A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)
A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)
A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)Marcia Zeng
 
Sinmin Literature Review Presentation
Sinmin Literature Review PresentationSinmin Literature Review Presentation
Sinmin Literature Review PresentationChamila Wijayarathna
 
김문형-Curriculum Vitae
김문형-Curriculum Vitae김문형-Curriculum Vitae
김문형-Curriculum VitaeMunhyong Kim
 

Similar to 2312 PACLIC (20)

Ontologies Fmi 042010
Ontologies Fmi 042010Ontologies Fmi 042010
Ontologies Fmi 042010
 
OOR Architecture - Towards a Network of Linked Ontology Repositories
OOR Architecture - Towards a Network of Linked Ontology RepositoriesOOR Architecture - Towards a Network of Linked Ontology Repositories
OOR Architecture - Towards a Network of Linked Ontology Repositories
 
The State of #NLProc
The State of #NLProcThe State of #NLProc
The State of #NLProc
 
Update From OCLC Research May 2008
Update From OCLC Research May 2008Update From OCLC Research May 2008
Update From OCLC Research May 2008
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
BT02.pptx
BT02.pptxBT02.pptx
BT02.pptx
 
Innovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLPInnovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLP
 
Cwn aat talk
Cwn aat talkCwn aat talk
Cwn aat talk
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Introduction about AAT-Taiwan Project
Introduction about AAT-Taiwan ProjectIntroduction about AAT-Taiwan Project
Introduction about AAT-Taiwan Project
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 
Using pedagogic corpora in ELT
Using pedagogic corpora in ELTUsing pedagogic corpora in ELT
Using pedagogic corpora in ELT
 
NJVR: The NanJing Vocabulary Repository
NJVR: The NanJing Vocabulary RepositoryNJVR: The NanJing Vocabulary Repository
NJVR: The NanJing Vocabulary Repository
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)
 
Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13Grosof haley-talk-semtech2013-ver6-10-13
Grosof haley-talk-semtech2013-ver6-10-13
 
Ontology Engineering for Systems Engineering
Ontology Engineering for Systems EngineeringOntology Engineering for Systems Engineering
Ontology Engineering for Systems Engineering
 
A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)
A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)
A Metadata Application Profile for KOS Vocabulary Registries (KOS-AP)
 
Sinmin Literature Review Presentation
Sinmin Literature Review PresentationSinmin Literature Review Presentation
Sinmin Literature Review Presentation
 
1910 JK27
1910 JK271910 JK27
1910 JK27
 
김문형-Curriculum Vitae
김문형-Curriculum Vitae김문형-Curriculum Vitae
김문형-Curriculum Vitae
 

More from WarNik Chow

2206 FAccT_inperson
2206 FAccT_inperson2206 FAccT_inperson
2206 FAccT_inpersonWarNik Chow
 
2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech datasetWarNik Chow
 
2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2eWarNik Chow
 
2102 Redone seminar
2102 Redone seminar2102 Redone seminar
2102 Redone seminarWarNik Chow
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH WarNik Chow
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categoriesWarNik Chow
 
2010 HCLT Hate Speech
2010 HCLT Hate Speech2010 HCLT Hate Speech
2010 HCLT Hate SpeechWarNik Chow
 
2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLPWarNik Chow
 

More from WarNik Chow (20)

2311 EAAMO
2311 EAAMO2311 EAAMO
2311 EAAMO
 
2211 HCOMP
2211 HCOMP2211 HCOMP
2211 HCOMP
 
2211 APSIPA
2211 APSIPA2211 APSIPA
2211 APSIPA
 
2211 AACL
2211 AACL2211 AACL
2211 AACL
 
2210 CODI
2210 CODI2210 CODI
2210 CODI
 
2206 FAccT_inperson
2206 FAccT_inperson2206 FAccT_inperson
2206 FAccT_inperson
 
2206 Modupop!
2206 Modupop!2206 Modupop!
2206 Modupop!
 
2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset
 
2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e
 
2106 PRSLLS
2106 PRSLLS2106 PRSLLS
2106 PRSLLS
 
2106 JWLLP
2106 JWLLP2106 JWLLP
2106 JWLLP
 
2106 ACM DIS
2106 ACM DIS2106 ACM DIS
2106 ACM DIS
 
2104 Talk @SSU
2104 Talk @SSU2104 Talk @SSU
2104 Talk @SSU
 
2103 ACM FAccT
2103 ACM FAccT2103 ACM FAccT
2103 ACM FAccT
 
2102 Redone seminar
2102 Redone seminar2102 Redone seminar
2102 Redone seminar
 
2011 NLP-OSS
2011 NLP-OSS2011 NLP-OSS
2011 NLP-OSS
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
 
2010 HCLT Hate Speech
2010 HCLT Hate Speech2010 HCLT Hate Speech
2010 HCLT Hate Speech
 
2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP
 

Recently uploaded

Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyRafigAliyev2
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxbenishzehra469
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like BitcoinDOT TECH
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理cyebo
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonPayment Village
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxNidaFaviankaNawawi
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 

Recently uploaded (20)

Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
 
how can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoinhow can i exchange pi coins for others currency like Bitcoin
how can i exchange pi coins for others currency like Bitcoin
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 

2312 PACLIC

  • 1. - Development of Korean NLP - Largely follows classical computaional linguistics approaches - NLP pipeline to downstream tasks - Why & how Korean NLP studies have become popular (at least domestic) - Challenging syntax and morphology (pro-drop, agglutinative) - Non latin alphabet nor Kanji-oriented writing system - Motivation of model and corpus studies of own language Korean NLP Paradigm Past and Ongoing Projects - KAIST & NIKL (Corpora & treebanks) - ExoBrain by ETRI (CL and QA) - AI HUB by NIA (Nationwide datahub) - LDC (Dataset catalogs) - Open corpora by individuals & organizations in academia/industry (Downstream tasks & Benchmarks) Revisiting Korean Corpus Studies through Technological Advances Won Ik Cho Sangwhan Moon Youngsook Song Seoul National University Tokyo Institute of Technology Sionic AI Inc. Diachronic Overview of (Open) Korean Corpora (1990s – 2023) • Large-scale, raw text to small, specific, annotated text - Raw & annotated corpora, Treebanks (of syntactic area) to Downstream tasks related to language understanding, and to Benchmarks (set of tasks) • Token-annotated classification to document-annotated classification and span/generation - Model development tends to develop from categorical priority (statistical models and perceptrons) to ones for translation/generation (PLMs, LLMs) - Capacity of model becomes larger (lowering computation budget) • Written or spoken (web) text to texts in various areas - Flourishing data provide from web domain (e.g., social media) and increasing demand from areas beyond linguistics (e.g., finance, law) Trend of Korean Corpus Studies - Large-scale corpora + Treebanks (1990s-2000s; Usually driven by institutes) - Downstream tasks of NLP pipeline and parallel corpora (2000s-) - Sentiment analysis, QA, entailment, and similarity datasets (2010s-) - Dialogue studies, offensive language, societal bias and fairness (2020s-) - Benchmark studies for PLMs + Pretraining corpora (2020s-) - Benchmarks for LLM evaluation (2022-) Our Repository for Open Korean Corpora Studies 