
A Framework for Automatic Question Answering in Indian Languages

Ritwik Mishra (PhD19xxx)
27th July 2022



The distribution of research efforts in the field of Natural Language Processing (NLP) has not been uniform across natural languages: there is a significant gap between the development of NLP tools for Indic languages (Indic-NLP) and for European languages. We aim to explore different directions for developing an automatic question answering system for Indic languages. We built an FAQ-retrieval-based chatbot for healthcare workers and young mothers in India. It supported Hindi in either the Devanagari or the Roman script. We observed that, if our FAQ database contains a question similar to the query asked by the user, the chatbot finds a relevant Question-Answer (QnA) pair among its top-3 suggestions 70% of the time. We also observed that the chatbot's performance depends on the diversity of the FAQ database. Since database creation requires substantial manual effort, we decided to explore other ways to curate knowledge from raw text, irrespective of domain.

We developed an Open Information Extraction (OIE) tool for Indic languages. During preprocessing, the text is chunked with our fine-tuned chunker, and a phrase-level dependency tree is constructed from the predicted chunks. To generate triples, various rules were handcrafted using the dependency relations of Indic languages. Our method performed better than other multilingual OIE tools in both manual and automatic evaluations. However, the contextual embeddings used in this work do not take the syntactic structure of a sentence into consideration. Hence, we devised an architecture that uses the dependency tree of a sentence to compute Dependency-aware Transformer (DaT) embeddings. Since a dependency tree is also a graph, we used a Graph Convolutional Network (GCN) to incorporate the dependency information into the contextual embeddings, thus producing DaT embeddings. We used a hate-speech detection task to evaluate the effectiveness of DaT embeddings. Our future plan is to evaluate the applicability of DaT embeddings to the task of chunking. Moreover, our broader aim is to develop an end-to-end pronoun resolution model to improve the quality of both the triples and the DaT embeddings. We also aim to explore the applicability of all our work to the problem of long-context question answering.


A Framework for Automatic Question Answering in Indian Languages

  1. A Framework for Automatic Question Answering in Indian Languages. Ritwik Mishra (PhD19xxx), 27th July 2022
  2. Comprehensive Panel Members: Dr Rajiv Ratn Shah (PhD advisor, IIIT-Delhi); Prof Ponnurangam Kumaraguru (PhD advisor, IIIT-Hyderabad); Prof Radhika Mamidi (external expert, IIIT-Hyderabad); Dr Arun Balaji Buduru (internal expert, IIIT-Delhi); Dr Raghava Mutharaju (internal expert, IIIT-Delhi)
  3. What is Automatic Question Answering (QnA)? A human/user asks a question, and a computer produces an answer.
  4. Outline ● Motivation ● Thesis Statement ● Contributions ○ FAQ retrieval ○ Open Information Extraction ○ Better Contextual Embeddings ● Future Plans ○ Pronoun Resolution ○ Long-context Question Answering
  5. Thesis Overview (diagram relating our work): Question Answering, Open Information Extraction (OIE), Better Contextual Embeddings, Pronoun Resolution.
  6. Why languages? ● Human intelligence is very much attributed to our ability to express complex ideas through language. ● With the advent of the text modality (i.e., writing), our ability to store and spread information increased dramatically.
  7. Why Indian Languages? ● Despite being spoken by billions of people, Indic languages are considered low-resource due to the lack of annotated resources. “Hindi is... an underdog. It has enough resources and everything looks promising for Hindi.” - Dr. Monojit Choudhury, MSR India, ML India Podcast, Aug 2020
  8. Cases where Indic-QnA works well
  9. Cases where it fails
  10. Thesis Statement ● Explore the possibility of developing a framework for Automated Question-Answering (QnA) in Indian languages with the help of multiple supporting tasks, such as Open Information Extraction (OIE) and Pronoun Resolution, and improved contextual embeddings.
  11. Thesis Statement (visualized): a diagram linking Question-Answering, Retrieval-based QnA, Open Information Extraction, Pronoun resolution, Chunking, and Improved Contextual Embeddings.
  12. Question-Answering (section marker on the thesis-overview diagram)
  13. Types of QnA (taxonomy): by Domain (Open-Domain, Closed-Domain); by Context (Short, Long, No context); by Answer Type (Span-based (MRC), Sentence-based (AS2)); by Discourse (Conversational, Memory-less).
  14. Methodologies in QnA ● Rule-based approach ● Generative approach ● Retrieval-based approach
  15. Methodologies in QnA ● Rule-based approach ● Generative approach ● Retrieval-based approach
  16. Methodologies in QnA ● Rule-based approach ● Generative approach ● Retrieval-based approach
  17. Retrieval-based (section marker on the thesis-overview diagram)
  18. FAQ-retrieval-based QnA (diagram): for a user query (q), return the top-k Question-Answer pairs (Q_i, A_i) from the FAQ database, where Q_i is most similar to q.
  19. FAQ-retrieval-based QnA, internal view (diagram): the user query (q) and the FAQ database feed a Sentence Similarity Calculator (SSC), which returns the FAQ database with each row (Q, A) scored by its similarity to q.
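One concrete way to realize the SSC and the top-k lookup is sketched below. This is illustrative only: TF-IDF cosine similarity stands in for the deck's COS method, and the FAQ rows are hypothetical placeholders, not entries from the actual database.

    # Minimal FAQ retriever sketch: score every stored question Q_i against
    # the user query q and return the k highest-scoring QA pairs.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    faq = [  # hypothetical (Q, A) rows
        ("What should be done if a baby's navel is swollen?", "Usually harmless; ..."),
        ("How often should a newborn be fed?", "Roughly every 2-3 hours; ..."),
    ]
    questions = [q for q, _ in faq]
    vectorizer = TfidfVectorizer().fit(questions)
    q_matrix = vectorizer.transform(questions)

    def top_k(query, k=3):
        """Return the k FAQ (Q, A, score) rows most similar to the query."""
        scores = cosine_similarity(vectorizer.transform([query]), q_matrix)[0]
        ranked = sorted(zip(faq, scores), key=lambda pair: pair[1], reverse=True)
        return [(q, a, s) for (q, a), s in ranked[:k]]

    print(top_k("my baby's navel looks swollen"))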
  20. Example (user query and suggestions were in Hindi; translated here). User query (q): “A baby's navel is swollen; what all should be done to fix it?” Chatbot suggested: Q1) “What to do if a baby's navel suddenly swells?” A1) “A baby's navel does not swell suddenly; usually this is seen when the baby's stomach bloats a little …” Q2) “What to do if a 5-month-old baby's navel is swollen?” A2) “In a 5-month-old, a swollen navel is usually due to muscle weakness; there is no cause for worry …” Q3) “A newborn's navel can be swollen; once it heals, does the baby afterwards develop gas problems?” A3) “This is a misconception; a swollen navel in a baby is not a danger sign …”
  21. How to evaluate? (diagram): for four different queries q1-q4, the user decides the relevancy of each of the top-k Question-Answer pairs (Q, A) suggested by the chatbot.
  22. Evaluation Metrics: 1. Success Rate (SR), 2. Precision at k (P@k), 3. Mean Average Precision (mAP), 4. Mean Reciprocal Rank (MRR), 5. normalized Discounted Cumulative Gain (nDCG). Success Rate is the percentage of user queries for which the chatbot suggested at least one relevant QA pair among its top-k suggestions.
  23. Evaluation Metrics (contd.): Precision at k is the proportion of recommended items in the top-k set that are relevant.
  24. Evaluation Metrics (contd.): mAP: for each user query q, compute the precision at the rank of each relevant item and average these per query; then average over all queries.
  25. Evaluation Metrics (contd.): 1. SR, 2. P@k, 3. mAP, 4. MRR, 5. nDCG (illustration on slide)
  26. Evaluation Metrics (contd.): 1. SR, 2. P@k, 3. mAP, 4. MRR, 5. nDCG (illustration on slide)
  27. Evaluation Metrics (contd.): 1. SR, 2. P@k, 3. mAP, 4. MRR, 5. nDCG (illustration on slide)
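Since slides 22-27 only name the metrics, here is a hedged sketch of how all five can be computed for top-k suggestions, assuming each query is reduced to a binary relevance list over its k results; the sample lists are invented for illustration.

    import math

    def success_rate(rels):
        """% of queries with at least one relevant suggestion in the top k."""
        return 100.0 * sum(1 for r in rels if any(r)) / len(rels)

    def precision_at_k(rel):
        """Fraction of the k suggestions that are relevant."""
        return sum(rel) / len(rel)

    def average_precision(rel):
        """Precision computed at each relevant rank, then averaged."""
        hits, total = 0, 0.0
        for i, r in enumerate(rel, start=1):
            if r:
                hits += 1
                total += hits / i
        return total / hits if hits else 0.0

    def reciprocal_rank(rel):
        """1 / rank of the first relevant suggestion (0 if none)."""
        return next((1.0 / i for i, r in enumerate(rel, start=1) if r), 0.0)

    def ndcg(rel):
        """DCG of the ranking divided by the DCG of the ideal ranking."""
        dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rel, start=1))
        ideal = sum(r / math.log2(i + 1)
                    for i, r in enumerate(sorted(rel, reverse=True), start=1))
        return dcg / ideal if ideal else 0.0

    # Four queries, k = 3; 1 marks a suggestion the user judged relevant.
    rels = [[0, 1, 1], [1, 0, 0], [0, 0, 0], [1, 1, 0]]
    print(success_rate(rels))                                   # SR
    print(sum(precision_at_k(r) for r in rels) / len(rels))     # mean P@3
    print(sum(average_precision(r) for r in rels) / len(rels))  # mAP
    print(sum(reciprocal_rank(r) for r in rels) / len(rels))    # MRR
    print(sum(ndcg(r) for r in rels) / len(rels))               # mean nDCG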
  28. Selected Literature ● Daniel, Jeanne E., et al. "Towards automating healthcare question answering in a noisy multilingual low-resource setting." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. ○ Non-public healthcare dataset in African languages ○ 230K QA pairs → 150K Qs with 126 As ○ Hence 126 clusters of Qs ○ Predict a cluster for q ● Bhagat, Pranav, Sachin Kumar Prajapati, and Aaditeshwar Seth. "Initial Lessons from Building an IVR-based Automated Question-Answering System." Proceedings of the 2020 International Conference on Information and Communication Technologies and Development. 2020. ○ Transcription-based ○ 516 Qs with 90 As ○ User had to specify a broad category for q ● Sakata, Wataru, et al. "FAQ retrieval using query-question similarity and BERT-based query-answer relevance." Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019. ○ Used a StackExchange dataset (719 QA and 1K q) and a scraped localGovFAQ dataset (1.7K QA and 900 q). ○ Q-q similarity works better than QA-q or A-q similarity.
  29. Selected Literature (contd.) Research Gap: There is a need for automated solutions that provide healthcare-related information to end-users in remote parts of India.
  30. The proposed FAQ-chatbot (flow diagram): given a user query (q), the chatbot finds the top-k most similar QA pairs in the FAQ database and asks the user whether the suggested Q is similar to q. If yes, it responds with the corresponding A; if no and fewer than k suggestions have been shown (count < k), it offers the next suggestion; otherwise it responds that q cannot be answered.
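The flow on slide 30 amounts to a short interactive loop. A minimal sketch follows, reusing the top_k retriever from the earlier snippet and an ask_user yes/no callback; both are assumptions for illustration, not the deployed chatbot code.

    # Offer the top-k suggestions one at a time; stop at the first Q the
    # user accepts, otherwise report that the query cannot be answered.
    def chat_once(query, retriever, ask_user, k=3):
        for Q, A, _score in retriever(query, k):
            if ask_user(f"Is your question similar to: {Q}?"):  # yes/no
                return A                 # respond with the stored answer
        return "Sorry, your question cannot be answered."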
  31. Different SSC techniques: Dependency Tree Pruning (DTP) method and Cosine Similarity (COS) method (training NOT needed); Sentence-pair Classifier (SPC) method (training needed).
  32. Evaluation data ● On-field testing was not feasible due to COVID. ● We collected 336 queries from healthcare workers. ● We manually annotated each of them and found complete/partial matching Qs. (Diagram: of the 336 healthcare-worker queries, 270 form the hold-out test set; for each query q, we manually found similar questions Qs from the FAQ database.)
  33. Results 1/2.
Table 1. Comparison of all three primary approaches on the hold-out test set for top-3 suggestions. Ensemble (ℇ) is obtained by taking the best-performing model from each primary approach (underlined on the original slide).
Metric | DTP  | DTPq-e | SPC  | SPC+A | SPCq-e | COS  | COSq-e | ℇ
mAP    | 30.5 | 35.1   | 39.4 | 21.1  | 39.1   | 26.5 | 27.9   | 45.3
MRR    | 42.6 | 48.5   | 54.6 | 42.2  | 54.2   | 38.7 | 41.0   | 61.6
SR     | 27.1 | 59.6   | 66.2 | 49.6  | 64.4   | 47.7 | 51.1   | 70.3
nDCG   | 55.5 | 51.2   | 57.1 | 43.9  | 56.5   | 40.8 | 43.3   | 62.5
P@3    | 27.1 | 30.0   | 34.6 | 34.6  | 34.6   | 22.7 | 23.9   | 34.6
  34. Results 2/2.
Table 2. Results of the ablation study on the Ensemble method (ℇ). We observed that each approach plays an important role in the performance of the developed chatbot.
Metric | ℇ-COS | ℇ-DTP | ℇ-SPC | ℇ
mAP    | 40.9  | 40.8  | 30.3  | 45.3
MRR    | 56.2  | 56.5  | 43.8  | 61.6
SR     | 66.2  | 66.2  | 51.1  | 70.3
nDCG   | 58.4  | 58.2  | 45.5  | 62.5
P@3    | 34.6  | 34.6  | 23.9  | 34.6
  35. Limitations ● Retrieval-based approaches are limited to extracting information from an indexed database, which has to be curated manually. ● To retrieve information from a dump of raw sentences, a knowledge graph has to be constructed from the unstructured text. ● OIE tools could be used to extract facts from raw sentences in different domains.
  36. Open Information Extraction (section marker on the thesis-overview diagram)
  37. Open Information Extraction (OIE) ● Extract facts from raw sentences. ○ Use triples to represent the facts. ○ The format of a triple is <head, relation, tail>. ● Example ○ “PM Modi to visit UAE in Jan marking 50 yrs of diplomatic ties.” ■ One possible meaningful triple: <PM Modi, to visit, UAE>
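To make the <head, relation, tail> format concrete, here is a toy extraction over an English dependency parse with spaCy. This single nsubj/obj rule is not IndIE's rule set (which works on phrase-level chunks for Indic languages); it only illustrates the shape of the output. The small English model is assumed to be installed.

    # Toy <head, relation, tail> extraction: one verb-centred rule.
    import spacy
    from collections import namedtuple

    Triple = namedtuple("Triple", ["head", "relation", "tail"])
    nlp = spacy.load("en_core_web_sm")   # requires the small English model

    def naive_triples(sentence):
        doc = nlp(sentence)
        triples = []
        for tok in doc:
            if tok.pos_ == "VERB":
                subj = [c for c in tok.children if c.dep_ == "nsubj"]
                obj = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
                if subj and obj:
                    triples.append(Triple(subj[0].text, tok.text, obj[0].text))
        return triples

    print(naive_triples("Ram ate an apple."))
    # [Triple(head='Ram', relation='ate', tail='apple')]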
  38. Why is OIE important? ● It can extract triples from large amounts of text in an unsupervised manner. ● It serves as an initial step in building or augmenting a knowledge graph from unstructured text. ● OIE tools have been used in downstream applications such as Question-Answering, Text Summarization, and Entity Linking.
  39. OIE in Indian languages needs special attention ● Take an English sentence: “Ram ate an apple.” ○ A sensible triple would be <Ram, ate, an apple>. ○ A nonsensical triple would be <an apple, ate, Ram>. ● Now look at a Hindi sentence with the same meaning: “राम ने सेब खाया”. Because the case marker “ने” keeps the subject explicit regardless of word order, both of these triples would be sensible: <राम ने, खाया, सेब> and <सेब, खाया, राम ने>.
  40. How are generated triples evaluated? ● Manual annotations along five criteria: Valid, Well-formed, Reasonable, Concrete, True; triples are shown to annotators in a tabular representation and an image representation.
  41. How are generated triples evaluated? ● Manual annotations ● Automatic evaluations
  42. Hindi-BenchIE (example entry). sent_id: 1. Sentence: वह ऑस्ट्रेलिया के पहले प्रधान मंत्री के रूप में कार्यरत थे और ऑस्ट्रेलिया के उच्च न्यायालय के संस्थापक न्यायाधीश बने । (He served as the first Prime Minister of Australia and became a founding justice of the High Court of Australia.)
------ Cluster 1 ------
वह --> कार्यरत थे --> [ऑस्ट्रेलिया के]{a} [पहले] प्रधान मंत्री के रूप में (He --> served --> as [the] [first] Prime Minister [of Australia]{a})
वह --> बने --> [ऑस्ट्रेलिया के उच्च न्यायालय के]{b} [संस्थापक] न्यायाधीश (He --> became --> [founding] justice [of the High Court of Australia]{b})
{a} ऑस्ट्रेलिया के --> property --> [पहले] प्रधान मंत्री के रूप में |OR| वह --> [पहले] प्रधान मंत्री के रूप में कार्यरत थे --> ऑस्ट्रेलिया के ({a} of Australia --> property --> as [first] Prime Minister |OR| He --> served as [the] [first] Prime Minister --> of Australia)
{b} [ऑस्ट्रेलिया के]{c} उच्च न्यायालय के --> property --> [संस्थापक] न्यायाधीश |OR| वह --> [संस्थापक] न्यायाधीश बने --> [ऑस्ट्रेलिया के]{c} उच्च न्यायालय के ({b} of High Court [of Australia]{c} --> property --> [founding] justice |OR| He --> became [founding] justice --> of High Court [of Australia]{c})
{c} ऑस्ट्रेलिया के --> property --> उच्च न्यायालय के ({c} of Australia --> property --> of High Court)
------ Cluster 2 ------
वह --> [पहले] प्रधान मंत्री के रूप में कार्यरत थे --> ऑस्ट्रेलिया के (He --> served as [the] [first] Prime Minister --> of Australia)
वह --> [संस्थापक] न्यायाधीश बने --> [ऑस्ट्रेलिया के]{a} उच्च न्यायालय के (He --> became [founding] justice --> of High Court [of Australia]{a})
{a} ऑस्ट्रेलिया के --> property --> उच्च न्यायालय के ({a} of Australia --> property --> of High Court)
  43. Selected Literature ● Faruqui, Manaal, and Shankar Kumar. "Multilingual Open Relation Extraction Using Cross-lingual Projection." Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015. ○ Supported Hindi, but it was translation-based ○ Did not release code, but released triples ● Rao, Pattabhi RK, and Sobha Lalitha Devi. "EventXtract-IL: Event Extraction from Newswires and Social Media Text in Indian Languages @ FIRE 2018 - An Overview." FIRE (Working Notes) (2018): 282-290. ○ Event-specific and domain-specific IE ● Ro, Youngbin, Yukyung Lee, and Pilsung Kang. "Multi^2OIE: Multilingual Open Information Extraction Based on Multi-Head Attention with BERT." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. ○ Modelled triple extraction as a sequence labelling problem over BERT embeddings (end-to-end) ○ Identify relations first, then their head and tail
  44. Selected Literature (contd.) Research Gap: (1) The field of Open Information Extraction (OIE) from unstructured text in Indic languages has not been explored much; moreover, the effectiveness of existing multilingual OIE techniques has to be evaluated on Indic languages. (2) There is a scarcity of annotated resources for the automatic evaluation of automatically generated triples in Indic languages. (3) Construction of a knowledge graph from triples extracted from unstructured text in Indian languages needs to be explored as well.
  45. IndIE: our Indic OIE tool (pipeline diagram). Input: raw text. (1) Shallow parsing with an off-the-shelf sentence segmenter and dependency parser by Stanford. (2) Sentences are passed to (a) an XLM-RoBERTa model fine-tuned on chunk-annotated data. (3) The predicted chunk tags and the dependency tree feed (b) creation of the Merged-phrases Dependency Tree (MDT). (4) The MDT is passed to (c) triple extraction, a greedy algorithm based on hand-crafted rules. (5) Result accumulation. Output: sentence-segmented text, triples for each sentence, and execution time.
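The MDT step can be pictured as collapsing each predicted chunk into a single node. The sketch below assumes a simple merging convention (a chunk's head is the chunk containing whichever head token escapes it); IndIE's exact algorithm may differ.

    # Sketch: collapse a token-level dependency tree into a phrase-level one
    # using predicted chunk spans.
    def merged_dependency_tree(heads, chunks):
        """heads: 1-based head index per token (0 = root).
        chunks: (start, end) token spans, 0-based, end exclusive."""
        tok2chunk = {}
        for cid, (s, e) in enumerate(chunks):
            for t in range(s, e):
                tok2chunk[t] = cid
        phrase_heads = {}
        for cid, (s, e) in enumerate(chunks):
            phrase_heads[cid] = -1                  # default: root phrase
            for t in range(s, e):
                h = heads[t] - 1                    # to 0-based; -1 = root
                if h >= 0 and tok2chunk[h] != cid:  # head escapes the chunk
                    phrase_heads[cid] = tok2chunk[h]
                    break
        return phrase_heads

    # "Ram ate an apple": heads point to "ate"; chunks: [Ram] [ate] [an apple]
    print(merged_dependency_tree([2, 0, 4, 2], [(0, 1), (1, 2), (2, 4)]))
    # {0: 1, 1: -1, 2: 1}  (both phrases attach to the verb chunk)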
  46. Chunker (section marker on the thesis-overview diagram)
  47. Chunker Implementation 1/2 (diagram): text → transformer tokenizer → tokens → token IDs → pretrained transformer → embeddings.
  48. Chunker Implementation 2/2 (diagram): the pretrained transformer produces (a) initial sub-word embeddings; our approach (b) takes their average per word to obtain (c) final embeddings, which a feed-forward layer maps to token-level predictions; the loss is calculated against the ground truth.
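A hedged sketch of the "taking average" step from slide 48: sub-word embeddings are pooled back to one vector per word before the feed-forward chunk classifier. The base model name is an assumption (the deck mentions a fine-tuned XLM-RoBERTa).

    # Pool sub-word embeddings to one vector per word (mean pooling).
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModel.from_pretrained("xlm-roberta-base")

    words = ["राम", "ने", "सेब", "खाया"]             # one pre-split sentence
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (num_subwords, dim)

    word_ids = enc.word_ids()                        # sub-word -> word index
    word_vecs = []
    for w in range(len(words)):
        idx = [i for i, wid in enumerate(word_ids) if wid == w]
        word_vecs.append(hidden[idx].mean(dim=0))    # average the sub-words

    word_matrix = torch.stack(word_vecs)             # (num_words, dim) -> FFN
    print(word_matrix.shape)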
  49. Results: Chunker.
Table 3. A comparison of three approaches for handling sub-word token embeddings in the chunking task. Four random seeds were used to calculate the mean and standard deviation. All experiments were run on the combined TDIL and UD data. Numbers outside round brackets are accuracy; numbers inside are the macro average.
Classification layers | Common approach in literature | A different approach | Our approach
1 | 82±10 (50±20)  | 89±0.5 (62±1.0) | 91±0.0 (65±0.5)
2 | 86±1.8 (51±6.2) | 89±0.5 (54±7.4) | 90±0.5 (54±4.5)
3 | 79±14 (43±13)  | 82±11 (41±12)  | 90±0.5 (48±2.2)
Table 4. Accuracy of the (fine-tuned) XLM chunker vs. a CRF chunker on languages removed from the training set; sentences from the given language appear only in the test set.
Model | Hindi | English | Urdu | Nepali | Gujarati | Bengali
XLM   | 78%   | 60%     | 84%  | 65%    | 56%      | 66%
CRF   | 67%   | 56%     | 71%  | 58%    | 53%      | 53%
  50. Results: Triple Evaluation (manual).
Table 5. Percentage of sentences having no/most/all information in the image representation of their generated triples; the method generating the most triples with ‘All information’ is considered the best.
Image options    | ArgOE | M&K | Multi2OIE | PredPatt | IndIE
No information   | 17%   | 28% | 71%       | 5%       | 4%
Most information | 22%   | 44% | 24%       | 29%      | 17%
All information  | 61%   | 28% | 5%        | 66%      | 79%
Table 6. Number of triples extracted by each OIE method for 106 Hindi sentences.
#Triples    | ArgOE | M&K | Multi2OIE | PredPatt | IndIE
Total       | 45    | 180 | 50        | 40       | 272
Valid       | 38    | 142 | 10        | 39       | 252
Well-formed | 32    | 75  | 9         | 36       | 240
Reasonable  | 26    | 69  | 6         | 32       | 227
Concrete    | 21    | 53  | 4         | 25       | 158
True        | 19    | 51  | 4         | 25       | 152
  51. Results: Triple Evaluation (automatic).
Table 7. Performance of different OIE methods on the Hindi-BenchIE golden set of 75 Hindi sentences; IndIE outperforms the other methods on all three metrics.
          | ArgOE | M&K  | Multi2OIE | PredPatt | IndIE
Precision | 0.26  | 0.14 | 0.12      | 0.37     | 0.62
Recall    | 0.06  | 0.16 | 0.03      | 0.08     | 0.62
F1-score  | 0.10  | 0.15 | 0.05      | 0.14     | 0.62
  52. Limitations ● Small size of the evaluation datasets. ● Contextual embeddings (from transformers) are known not to take the syntactic structure of a sentence into account. ● As a result, it has been observed that syntactic information is either absent from transformer embeddings or not utilized while making predictions. ● Therefore, there is a need to explore incorporating syntactic (dependency) information into transformer embeddings.
  53. Improve Contextual Embeddings (section marker on the thesis-overview diagram)
  54. Dependency-aware Transformer (DaT) Embedding (diagram): tokens → transformer tokenizer → token IDs → pretrained transformer → (a) initial embeddings → (b) taking the average → (c) final embeddings.
  55. DaT Embedding: Graph. A 5-node example graph with its adjacency matrix A and degree matrix D:
A (rows = nodes 1-5):
  1: 0 1 1 0 0
  2: 0 0 0 1 1
  3: 1 0 0 0 0
  4: 0 1 0 0 0
  5: 0 1 0 0 0
D = diag(2, 3, 1, 1, 1)
  56. DaT Embedding: Graph Convolution Network (GCN). The same A and D from slide 55 feed the GCN.
  57. DaT Embedding: GCN message passing (illustrated on slide)
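One GCN message-passing step over the 5-node graph from slides 55-56 can be written as H' = ReLU(D^{-1} A H W). The sketch below uses that row normalization without self-loops purely for illustration; the deck's GCN variants (slide 60) differ in exactly these choices.

    # One GCN propagation step: aggregate neighbour features, then ReLU.
    import numpy as np

    A = np.array([[0, 1, 1, 0, 0],     # adjacency matrix from slide 55
                  [0, 0, 0, 1, 1],
                  [1, 0, 0, 0, 0],
                  [0, 1, 0, 0, 0],
                  [0, 1, 0, 0, 0]], dtype=float)
    D_inv = np.diag(1.0 / np.array([2, 3, 1, 1, 1]))   # inverse degree matrix

    rng = np.random.default_rng(0)
    H = rng.normal(size=(5, 8))        # node features (e.g. token embeddings)
    W = rng.normal(size=(8, 8))        # learnable weight matrix

    H_next = np.maximum(0.0, D_inv @ A @ H @ W)        # H' = ReLU(D^-1 A H W)
    print(H_next.shape)                                # (5, 8)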
  58. Selected Literature ● Jie, Zhanming, Aldrian Obaja Muis, and Wei Lu. "Efficient dependency-guided named entity recognition." Thirty-First AAAI Conference on Artificial Intelligence. 2017. ● Marcheggiani, Diego, and Ivan Titov. "Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. ● Zhang, Yuhao, Peng Qi, and Christopher D. Manning. "Graph Convolution over Pruned Dependency Trees Improves Relation Extraction." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.
  59. Selected Literature (contd.) Research Gap: Incorporating the dependency structure of a sentence to generate dependency-aware contextual embeddings is an under-explored field in Indic-NLP.
  60. GCN variants: 0) no GCN; 1), 2.1), 2.2), 3.1), … (the variant equations appear as images on the slide).
  61. Evaluation Dataset ● Small datasets ○ Fearspeech | 4,782 samples | labels: 'fear speech': 1, 'normal': 0 ○ Codalab | 5,728 | 'hate': 1, 'offensive': 1, 'defamation': 1, 'fake': 1, 'non-hostile': 0 ○ HASOC | 4,665 | 'non-hostile': 0, 'fake': 0, …, 'offensive': 1, 'hate': 1 ● Big dataset ○ IIITD abusive language data | 700K samples (multilingual), 30K Hindi | 'not abusive': 0, 'abusive': 1
  62. DaT Embedding Results.
Table 8. Validation accuracy on the Small dataset (mean ± std, with min-max range in brackets).
     | Epoch 1              | Epoch 2              | Epoch 3              | Epoch 4
v0   | 71.2 ± 2.9 (67.7-74.6) | 75.9 ± 1.2 (74.1-77.5) | 77.4 ± 0.6 (76.8-78.1) | 79.0 ± 0.7 (78.2-79.8)
v1   | 72.4 ± 1.9 (69.4-74.0) | 76.6 ± 1.2 (74.9-78.0) | 78.8 ± 1.0 (77.1-79.6) | 80.3 ± 1.1 (79.3-81.8)
v2.2 | 74.2 ± 1.1 (72.5-75.3) | 77.2 ± 0.9 (75.9-78.3) | 79.3 ± 0.8 (78.4-80.2) | 80.5 ± 0.7 (79.5-81.2)
Table 9. Testing accuracy on the Small dataset.
     | Epoch 1              | Epoch 2              | Epoch 3              | Epoch 4
v0   | 71.2 ± 2.5 (67.3-73.9) | 76.2 ± 1.2 (74.6-77.4) | 77.9 ± 0.6 (77.2-78.8) | 79.8 ± 0.6 (78.8-80.3)
v1   | 72.6 ± 1.6 (70.2-74.0) | 76.5 ± 1.0 (75.5-77.8) | 78.5 ± 1.1 (76.9-79.7) | 79.9 ± 1.1 (79.0-81.8)
v2.2 | 74.2 ± 1.3 (72.1-75.2) | 76.6 ± 1.0 (75.5-77.8) | 79.0 ± 1.0 (77.9-80.3) | 80.0 ± 1.2 (78.9-82.1)
  63. Limitation (figure): dependency trees for two Hindi sentences, “राम अयोध्या लौटे ।” (Ram returned to Ayodhya.) and “वह राजा बने ।” (He became king.), with relations nsubj, obl, xcomp, punct, and root. When the two sentences are batched together, the combined adjacency matrix is block-diagonal, and [PAD] tokens connect only to themselves, so no information flows across sentence boundaries.
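The block-diagonal structure in the slide-63 figure can be reproduced directly: batching two parsed sentences stacks their adjacency matrices on the diagonal, and [PAD] rows get only self-loops, so the GCN passes no messages across sentences. Edge symmetry and self-loops are assumptions made for this illustration.

    # Combined adjacency for a batch of two 4-token Hindi sentences plus
    # two [PAD] tokens: information cannot flow between the two blocks.
    import numpy as np
    from scipy.linalg import block_diag

    # Symmetric adjacency with self-loops; token order follows the figure.
    A_sent1 = np.array([[1, 0, 1, 0],   # राम -> लौटे (nsubj)
                        [0, 1, 1, 0],   # अयोध्या -> लौटे (obl)
                        [1, 1, 1, 1],   # लौटे (root)
                        [0, 0, 1, 1]])  # । -> लौटे (punct)
    A_sent2 = A_sent1.copy()            # वह/राजा/बने/। has the same shape
    A_pad = np.eye(2, dtype=int)        # [PAD] tokens self-connect only

    A_batch = block_diag(A_sent1, A_sent2, A_pad)
    print(A_batch)                      # 10x10, block-diagonal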
  64. Future Plans ● End-to-End Pronoun Resolution in Hindi ● Indic-SpanBERT ● Long-context Question Answering
  65. Pronoun Resolution (section marker on the thesis-overview diagram)
  66. Pronoun Resolution (example sentences) ● Every speaker had to present his paper. ● Barack Obama visited India. Modiji came to receive Obama. ● He was brave. He was Ashoka the great. ● If you want to hire them, then all candidates must be treated nicely. ● Ram went to the forest with his wife. ● Krishna was an avatar of Vishnu. ● CEO of Reliance, Mukesh Ambani, inaugurated the SOM building.
  67. Dataset: Mujadia, Vandan, Palash Gupta, and Dipti Misra Sharma. "Coreference Annotation Scheme and Relation Types for Hindi." Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016. (Table 10: corpus details; Table 11: distribution of co-referential entities; both shown as images on the slide.)
  68. Coref chain (annotated example, figure): each mention carries a unique mention ID and a unique chain ID; tokens are tagged 0 (intermediate token of a mention) or 1 (end token), and modifiers attach to the mention they modify. (Source cited on slide.)
  69. Selected Literature ● Lee, Kenton, et al. "End-to-end Neural Coreference Resolution." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. ● Joshi, Mandar, et al. "SpanBERT: Improving pre-training by representing and predicting spans." Transactions of the Association for Computational Linguistics 8 (2020): 64-77. ● Dakwale, Praveen, Vandan Mujadia, and Dipti Misra Sharma. "A hybrid approach for anaphora resolution in Hindi." Proceedings of the Sixth International Joint Conference on Natural Language Processing. 2013. ● Devi, Sobha Lalitha, Vijay Sundar Ram, and Pattabhi RK Rao. "A generic anaphora resolution engine for Indian languages." Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014. ● Sikdar, Utpal Kumar, Asif Ekbal, and Sriparna Saha. "A generalized framework for anaphora resolution in Indian languages." Knowledge-Based Systems 109 (2016): 147-159.
  70. Selected Literature (contd.) Research Gap: (1) A publicly available tool for coreference resolution in Indic languages needs to be built using state-of-the-art coreference resolution techniques. (2) Pretraining of transformer models with different subword tokenization methods and pretraining tasks needs to be investigated extensively.
  71. Publications 1. Mishra, Ritwik, Simranjeet Singh, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and Pushpak Bhattacharyya. "IndIE: A Multilingual Open Information Extraction Tool For Indic Languages". 2022. [Submitted to a special issue of TALLIP] 2. Mishra, Ritwik, Simranjeet Singh, Jasmeet Kaur, Rajiv Ratn Shah, and Pushpendra Singh. "Exploring the Use of Chatbots for Supporting Maternal and Child Health in Resource-constrained Environments". 2022. [Draft ready; under internal review] Miscellaneous ● Mishra, Ritwik, Ponnurangam Kumaraguru, Rajiv Ratn Shah, Aanshul Sadaria, Shashank Srikanth, Kanay Gupta, Himanshu Bhatia, and Pratik Jain. "Analyzing traffic violations through e-challan system in metropolitan cities (workshop paper)." In 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM), pp. 485-493. IEEE, 2020.
  72. Acknowledgements ● I would like to express my gratitude to the pillars group members for their valuable guidance. ● I would like to thank Simranjeet, Samarth, Ajeet, and Jasmeet for being diligent co-authors. ● I would also like to thank Prof Pushpak Bhattacharyya and the CFILT lab members for providing me great insights and hosting me at IIT Bombay under the Anveshan Setu program. ● I would also like to thank the University Grants Commission (UGC) for funding my PhD program through the Junior Research Fellowship (JRF) / Senior Research Fellowship (SRF).
  73. Thank You
