
[Naver Clova AI] KorQuAD v1.0 Participation Review

2019/09/05
LGCNS AI Tech Talk for NLU (feat. KorQuAD)
- 이동준, Naver Clova AI LaRva Team
- KorQuAD v1.0 Participation Review



  1. KorQuAD v1.0 Participation Review (Clova AI LaRva Team)
  2. Contents: 1. LaRva Team 2. How We Came to Participate 3. LaRva Factory 4. Methods That Proved Effective!
  3. LaRva Team (members in Korean alphabetical order): 김규완, 김민정, 김성동, 김한주, 이동준. Focus areas: LM-based text generation; LM and transfer learning; dialogue models; NLU; training big NLP models; distributed training; CLaF (Clova Language Framework).
  4. How We Came to Participate
  5. How We Came to Participate: Downstream tasks
  6. How We Came to Participate: Downstream tasks → Question Answering Tasks!!
  7. LaRva Factory: Data Collection → Build vocab / Preprocessing → Pre-training (on-the-fly preprocessing, various training settings, half-precision training, distributed training) → Downstream tasks: single-sentence classification, sentence-pair classification, question answering, sequence tagging, ...
  8-9. LaRva Factory Pipeline: 1. Every K steps, upload a checkpoint to the storage server. 2. Via a trigger, run all downstream tasks (x N: run fine-tuning, report best score). 3. Update the score on the benchmark server! 4. When a good model comes out… let's get a submission ready!! ("A good KorQuAD model came out~ F1: 94.21")
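The pipeline steps above can be sketched as a callback loop; this is an illustration only, where `save_fn` and `benchmark_fn` are hypothetical stand-ins for the storage-server upload and the triggered downstream fine-tuning suite:

```python
def pretrain_with_benchmark(total_steps, k, save_fn, benchmark_fn):
    """Sketch of the slide's pipeline: every K pre-training steps,
    save a checkpoint and trigger the downstream benchmark suite,
    tracking the best reported score."""
    best = float("-inf")
    for step in range(1, total_steps + 1):
        # ... one pre-training step would happen here ...
        if step % k == 0:
            ckpt = save_fn(step)        # 1. upload checkpoint to storage server
            score = benchmark_fn(ckpt)  # 2. trigger all downstream fine-tuning tasks
            best = max(best, score)     # 3. update the benchmark server
    return best
```

In the real setup the benchmark runs asynchronously on separate fine-tuning workers; the loop here only shows the trigger-every-K-steps logic.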
  10-11. Methods That Proved Effective! LaRva+: N-gram masking (https://arxiv.org/pdf/1904.09223.pdf) → for Korean: space-level (eojeol) masking.
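A minimal sketch of what space-level (eojeol) masking can look like: whole space-delimited units are masked together instead of independent subword tokens, as in the ERNIE-style n-gram masking the slide cites. This is an illustration under simplified assumptions (string-level masking, one [MASK] per unit), not the LaRva team's actual code; `eojeol_mask` is a hypothetical helper:

```python
import random

def eojeol_mask(sentence, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Whole-eojeol masking: hide every piece of a randomly chosen
    space-delimited Korean unit (eojeol) at once, so the model must
    recover the whole unit rather than a single subword."""
    rng = random.Random(seed)
    masked, labels = [], []
    for unit in sentence.split():  # eojeol = space-delimited unit
        if rng.random() < mask_prob:
            masked.append(mask_token)  # the whole unit is hidden
            labels.append(unit)        # prediction target
        else:
            masked.append(unit)
            labels.append(None)        # not masked, no loss here
    return " ".join(masked), labels
```

In a real pipeline the chosen eojeol would be masked after subword tokenization, with all of its pieces replaced together.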
  12-13. Methods That Proved Effective! LaRva-Large: model size. [Plot: performance curves for LaRva-KOR LARGE vs. LaRva-KOR BASE vs. LaRva-KOR SMALL vs. Multilingual (BASE)]
  14-15. Methods That Proved Effective! Fine-tuning: data augmentation (https://arxiv.org/pdf/1810.04805.pdf): train on additional KorQuAD-format datasets (1 epoch) alongside the KorQuAD v1.0 train set. About +0.5 F1 over using the original data alone!
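Mechanically, this kind of augmentation amounts to concatenating extra KorQuAD-format data with the official train set before fine-tuning. A rough sketch, assuming SQuAD/KorQuAD v1-style JSON files with a top-level `data` list; `merge_squad_style` is a hypothetical helper, not part of CLaF:

```python
import json

def merge_squad_style(files, out_path):
    """Concatenate several SQuAD/KorQuAD-format JSON files into one
    training file, e.g. to add one epoch's worth of auxiliary QA
    data to the KorQuAD v1.0 train set."""
    merged = {"version": "augmented", "data": []}
    for path in files:
        with open(path, encoding="utf-8") as f:
            merged["data"].extend(json.load(f)["data"])
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged["data"])  # number of articles after merging
```

`ensure_ascii=False` keeps Korean text readable in the merged file instead of escaping it to \uXXXX sequences.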
  16-17. Methods That Proved Effective! CLaF Tokenizer (https://github.com/naver/claf): splitting passages at fixed offsets with doc_stride: 64 produced [UNK] tokens; switching to SentTokenizer removed them (No [UNK]!). About +0.5 F1, with a smaller dev set / test set gap!
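For context, the standard BERT-QA treatment of long passages is a sliding window of overlapping chunks; the slide's point is that cutting at fixed offsets can break pieces and yield [UNK], which a sentence-aware splitter (CLaF's SentTokenizer) avoids. A sketch of the generic windowing only, assuming the convention where `doc_stride` is the number of tokens shared by adjacent windows:

```python
def sliding_windows(tokens, max_len=384, doc_stride=64):
    """Split a long tokenized passage into overlapping windows of at
    most max_len tokens, advancing by (max_len - doc_stride) so each
    pair of adjacent windows overlaps by doc_stride tokens."""
    if len(tokens) <= max_len:
        return [tokens]  # short passage: a single window suffices
    windows, start = [], 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
        start += max_len - doc_stride
    return windows
```

A sentence-aware variant would instead pack whole sentences into each window, so no subword is ever cut at a window boundary.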
  18-19. Methods That Proved Effective! Fine-tuning with NSML AutoML: typically +0.5 to +1 F1!
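NSML's AutoML searches fine-tuning hyperparameters automatically. As a stand-in illustration only (this is not NSML's API), a simple random search over a typical BERT fine-tuning grid, where `train_eval_fn` is a hypothetical user-supplied function that fine-tunes with a config and returns dev F1:

```python
import random

def random_search(train_eval_fn, n_trials=8, seed=0):
    """Toy hyperparameter search: sample fine-tuning configs from a
    small grid and keep the one with the best dev score."""
    rng = random.Random(seed)
    space = {  # a common BERT fine-tuning search space
        "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
        "batch_size": [16, 32],
        "epochs": [2, 3, 4],
    }
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = train_eval_fn(cfg)  # fine-tune and evaluate on dev
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```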
  20. Thank you!
