A Framework for Automatic Question Answering in Indian Languages
About This Presentation
The distribution of research efforts in the field of Natural Language
Processing (NLP) has not been uniform across natural languages. There is a
significant gap between the development of NLP tools for Indic languages
(Indic-NLP) and for European languages. We aim to explore different
directions to develop an automatic question answering system for Indic
languages. We built an FAQ-retrieval based chatbot for healthcare workers
and young mothers in India. It supported the Hindi language in either
Devanagari or Roman script. We observed that, if our FAQ database contains
a question similar to the query asked by the user, the chatbot finds a
relevant Question-Answer (QnA) pair among its top-3 suggestions 70% of the
time. We also observed that the chatbot's performance depends on the
diversity of the FAQ database. Since database creation requires substantial
manual effort, we decided to explore other ways to curate knowledge from
raw text irrespective of domain.
We developed an Open Information Extraction (OIE) tool for Indic
languages. During preprocessing, the text is chunked with our fine-tuned
chunker, and a phrase-level dependency tree is constructed from the
predicted chunks. To generate triples, various rules were handcrafted using
the dependency relations in Indic languages. Our method performed better
than other multilingual OIE tools on both manual and automatic evaluations.
The contextual embeddings used in this work do not take the syntactic
structure of a sentence into consideration. Hence, we devised an
architecture that uses the dependency tree of the sentence to calculate
Dependency-aware Transformer (DaT) embeddings. Since a dependency tree is
also a graph, we used a Graph Convolution Network (GCN) to incorporate the
dependency information into the contextual embeddings, thus producing DaT
embeddings. We used a hate-speech detection task to evaluate the
effectiveness of DaT embeddings. Our future plan is to evaluate the
applicability of DaT embeddings for the task of chunking. The broader aim
is to develop an end-to-end pronoun resolution model to improve the quality
of both triples and DaT embeddings. We also aim to explore the
applicability of all our work to the problem of long-context question
answering.
Slide Content
A Framework for Automatic Question
Answering in Indian Languages
Ritwik Mishra (PhD19xxx)
27th July 2022
Comprehensive Panel Members
Dr Rajiv Ratn Shah
PhD advisor
IIIT-Delhi
Prof Ponnurangam Kumaraguru
PhD advisor
IIIT-Hyderabad
Prof Radhika Mamidi
External expert
IIIT-Hyderabad
Dr Arun Balaji Buduru
Internal expert
IIIT-Delhi
Dr Raghava Mutharaju
Internal expert
IIIT-Delhi
2
What is Automatic Question Answering (QnA)?
Human/user asks a question, and a computer produces an answer.
3
Thesis Overview
Our Work: Question Answering, Open Information Extraction (OIE), Better
Contextual Embeddings, and Pronoun Resolution.
5
Why languages?
●Human intelligence is largely
attributed to our ability to express
complex ideas through language.
●With the advent of the text modality
(i.e., writing), our ability to store
and spread information increased
dramatically.
6
Why Indian Languages?
●Despite being spoken by billions of people, Indic languages
are considered low-resource due to the lack of annotated
resources.
“Hindi is... an underdog. It has enough resources and everything looks
promising for Hindi.”
- Dr. Monojit Choudhury, MSR India, ML India Podcast, Aug 2020
7
Cases where Indic-QnA works well
8
Cases where it fails
9
Thesis Statement
●Explore the possibility of developing a framework for Automated
Question-Answering (QnA) in Indian languages with the help of multiple
supporting tasks, like Open Information Extraction (OIE) and Pronoun
Resolution, and improved contextual embeddings.
10
Thesis Statement (visualized)
11
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
Chunking
Improve Contextual Embeddings
Question-Answering
12
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
Chunking
Improve Contextual Embeddings
Types of QnA
QnA systems can be categorized along four axes:
●Domain: Open-Domain vs Closed-Domain
●Context: Short, Long, or No context
●Answer Type: Span-based (MRC) vs Sentence-based (AS2)
●Discourse: Conversational vs Memory-less
13
Methodologies in QnA
●Rule based approach
●Generative approach
●Retrieval based approach
14
Methodologies in QnA
●Rule based approach
●Generative approach
●Retrieval based approach
15
Methodologies in QnA
16
●Rule based approach
●Generative approach
●Retrieval based approach
Retrieval-based
17
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
Chunking
Improve Contextual Embeddings
FAQ retrieval based QnA
User query (q) → Top-k Question-Answer pairs (QAk): the pairs (Qi, Ai)
whose question Qi is most similar to the user query (q).
18
FAQ retrieval based QnA (internal view)
The user query (q) and the FAQ database (rows of #, Q, A) are passed to a
Sentence Similarity Calculator (SSC), which returns the FAQ database with
each row carrying its similarity score with q (#, Q, A, score).
19
Example
User query (q) : बच्चे की नाभि फूली हो है ठीक करने के लिए क्या क्या करें?
(The child's navel is swollen; what all should be done to fix it?)
Chatbot suggested:
Q1) बच्चे की अचानक नाभि फूल जाए तो क्या करें?
(What to do if the child's navel suddenly swells?)
A1) बच्चे की नाभि अचानक नहीं फूलती आमतौर पर यह देखा गया है कि जब बच्चे का पेट थोड़ा फूलता है …
(A child's navel does not swell suddenly; usually this is seen when the child's stomach bloats a little …)
Q2) अगर 5 महीने के बच्चे की नाभि फूली हुई हो तो क्या करें?
(What to do if a 5-month-old child's navel is swollen?)
A2) 5 महीने के बच्चे की नाभि आम तौर पर मसल्स की कमजोरी के कारण होती है इस में कोई चिंता ….
(A 5-month-old child's swollen navel is usually due to weak muscles; there is no worry in this ….)
Q3) नवजात बच्चे की फूली हुई नाभि होती है वह जब ठीक हो जाए तो क्या उसके बाद उसे गैस की प्रॉब्लम हो जाती है?
(A newborn has a swollen navel; once it becomes fine, does the child then get a gas problem?)
A3) यह एक भ्रम है बच्चे की नाभि जब फूली होती है तो वह खतरे के लक्षण नहीं है फूली हुई नाभि …
(This is a misconception; a swollen navel in a child is not a danger sign; a swollen navel …)
20
How to evaluate?
A user decides the relevancy of the top-k Question-Answer (QA) pairs
suggested by the chatbot for 4 different queries: the top-k suggestions for
q1, q2, q3, and q4, each a ranked list of Question (Q) and Answer (A) pairs.
21
Evaluation Metrics
1. Success Rate (SR)
2. Precision at k (P@k)
3. Mean Average Precision (mAP)
4. Mean Reciprocal Rank (MRR)
5. normalized Discounted Cumulative Gain (nDCG)
22
Success Rate is the % of user queries for which the chatbot suggested at
least one relevant QA pair among its top-k suggestions.
Evaluation techniques
1. Success Rate (SR)
2. Precision at k (P@k)
3. Mean Average Precision (mAP)
4. Mean Reciprocal Rank (MRR)
5. normalized Discounted Cumulative Gain (nDCG)
Precision at k is the proportion of recommended items in the top-k set that
are relevant.
23
Evaluation techniques
1. Success Rate (SR)
2. Precision at k (P@k)
3. Mean Average Precision (mAP)
4. Mean Reciprocal Rank (MRR)
5. normalized Discounted Cumulative Gain (nDCG)
For each user (or q):
    For each relevant item:
        Compute precision up to that item
    Average them (average precision for the query)
Average them over all queries (mAP)
24
Evaluation techniques
1. Success Rate (SR)
2. Precision at k (P@k)
3. Mean Average Precision (mAP)
4. Mean Reciprocal Rank (MRR)
5. normalized Discounted Cumulative Gain (nDCG)
25
Evaluation techniques
1. Success Rate (SR)
2. Precision at k (P@k)
3. Mean Average Precision (mAP)
4. Mean Reciprocal Rank (MRR)
5. normalized Discounted Cumulative Gain (nDCG)
26
Evaluation techniques
1. Success Rate (SR)
2. Precision at k (P@k)
3. Mean Average Precision (mAP)
4. Mean Reciprocal Rank (MRR)
5. normalized Discounted Cumulative Gain (nDCG)
27
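These five metrics can be computed directly from per-query relevance judgments. Below is a minimal Python sketch, assuming each query is reduced to the binary relevance labels of its ranked top-k suggestions (1 = relevant, 0 = not relevant); the function names and the toy `runs` data are illustrative, not taken from the thesis code.

```python
import math

def success_rate(runs):
    """% of queries with at least one relevant QA pair among the top-k suggestions."""
    return 100.0 * sum(any(r) for r in runs) / len(runs)

def precision_at_k(rels, k):
    """Proportion of the top-k suggested items that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Average of precision@i taken at every position i holding a relevant item."""
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_reciprocal_rank(runs):
    """Mean of 1/rank of the first relevant item per query (0 if none)."""
    rr = [next((1.0 / i for i, rel in enumerate(r, start=1) if rel), 0.0) for r in runs]
    return sum(rr) / len(rr)

def ndcg(rels):
    """Binary-relevance nDCG of one ranked list."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Toy relevance labels for the top-3 suggestions of 4 user queries (cf. slide 21).
runs = [[1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 1]]
mAP = sum(average_precision(r) for r in runs) / len(runs)
print(success_rate(runs), mAP, mean_reciprocal_rank(runs))
```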
Selected Literature
●Daniel, Jeanne E., et al. "Towards automating healthcare question answering in a noisy multilingual low-resource setting."
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
○Non-public healthcare dataset in African languages
○230K QA pairs → 150K Qs with 126 As
○Hence 126 clusters of Qs
○Predict a cluster for q
●Bhagat, Pranav, Sachin Kumar Prajapati, and Aaditeshwar Seth. "Initial Lessons from Building an IVR-based Automated
Question-Answering System." Proceedings of the 2020 International Conference on Information and Communication
Technologies and Development. 2020.
○Transcription based
○516 Qs with 90 As
○User had to specify broad category of q
●Sakata, Wataru, et al. "FAQ retrieval using query-question similarity and BERT-based query-answer relevance." Proceedings of
the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.
○Used a stackexchange dataset (719 QA and 1K q) and a scraped localGovFAQ dataset (1.7K QA and 900 q).
○Q-q similarity works better than QA-q or A-q similarity.
28
Selected Literature
●Daniel, Jeanne E., et al. "Towards automating healthcare question answering in a noisy multilingual low-resource setting."
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
○Non-public healthcare dataset in African languages
○230K QA pairs → 150K Qs with 126 As
○Hence 126 clusters of Qs
○Predict a cluster for q
●Bhagat, Pranav, Sachin Kumar Prajapati, and Aaditeshwar Seth. "Initial Lessons from Building an IVR-based Automated
Question-Answering System." Proceedings of the 2020 International Conference on Information and Communication
Technologies and Development. 2020.
○Transcription based
○516 Qs with 90 As
○User had to specify broad category of q
●Sakata, Wataru, et al. "FAQ retrieval using query-question similarity and BERT-based query-answer relevance." Proceedings of
the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2019.
○Used a stackexchange dataset (719 QA and 1K q) and a scraped localGovFAQ dataset (1.7K QA and 900 q).
○Q-q similarity works better than QA-q or A-q similarity.
Research Gap: There is a need for automated solutions that provide healthcare-related information
to end-users in remote parts of India.
29
The proposed FAQ-chatbot
The chatbot finds the top-k most similar QA pairs for the given q in the
FAQ database and shows them one at a time, asking the user whether Q is
similar to q. If yes, it responds with the corresponding A; if no and
count < k, it shows the next suggestion; once all k suggestions are
rejected, it responds that q cannot be answered.
30
Different SSC techniques
Sentence Similarity Calculator (SSC):
●Dependency Tree Pruning (DTP) method (training NOT needed)
●Cosine Similarity (COS) method (training NOT needed)
●Sentence-pair Classifier (SPC) method (training needed)
31
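Of the three SSC techniques, the COS method is the easiest to sketch: embed every FAQ question and the user query with a sentence encoder, then rank QA pairs by cosine similarity. The snippet below is a minimal illustration under assumptions, not the thesis implementation; in particular, the sentence-transformers dependency and the multilingual checkpoint name are stand-ins for whichever encoder was actually used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

# Stand-in multilingual encoder; any encoder covering Hindi would do here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy FAQ database of (Q, A) pairs.
faq = [
    ("What to do if the child's navel suddenly swells?", "A1 ..."),
    ("What to do if a 5-month-old child's navel is swollen?", "A2 ..."),
]

def top_k(query, faq, k=3):
    """Rank FAQ questions by cosine similarity with the user query (Q-q similarity)."""
    q_vec = model.encode([query])[0]
    q_mat = model.encode([q for q, _ in faq])
    sims = q_mat @ q_vec / (np.linalg.norm(q_mat, axis=1) * np.linalg.norm(q_vec))
    return [(float(sims[i]), *faq[i]) for i in np.argsort(-sims)[:k]]

for score, question, answer in top_k("The child's navel is swollen, what to do?", faq):
    print(f"{score:.2f}  {question}")
```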
Evaluation data
●On-field testing was not feasible due to COVID.
●We collected 336 queries from healthcare workers.
●We manually annotated each of them, finding complete/partial matching Qs.
32
For each of the 336 queries (q) collected from healthcare workers, we
manually found similar questions (Qs) from the FAQ database; 270 of these
queries form the hold-out test-set.
Results 1/2
Metric   DTP    DTP(q-e)*  SPC*   SPC+A  SPC(q-e)  COS    COS(q-e)*  ℇ
mAP      30.5   35.1       39.4   21.1   39.1      26.5   27.9       45.3
MRR      42.6   48.5       54.6   42.2   54.2      38.7   41.0       61.6
SR       27.1   59.6       66.2   49.6   64.4      47.7   51.1       70.3
nDCG     55.5   51.2       57.1   43.9   56.5      40.8   43.3       62.5
P@3      27.1   30.0       34.6   34.6   34.6      22.7   23.9       34.6
Table 1. Comparison of all three primary approaches on the hold-out test set for top-3 suggestions. The
Ensemble (ℇ) is obtained by combining the best-performing model from each primary approach, marked
above with an asterisk (*).
33
Results 2/2
Metric   ℇ-COS   ℇ-DTP   ℇ-SPC   ℇ
mAP      40.9    40.8    30.3    45.3
MRR      56.2    56.5    43.8    61.6
SR       66.2    66.2    51.1    70.3
nDCG     58.4    58.2    45.5    62.5
P@3      34.6    34.6    23.9    34.6
Table 2. Results of ablation study on the Ensemble method (ℇ). We observed that each
approach plays an important role in the performance of the developed chatbot.
34
Limitations
●The retrieval-based approaches are limited to extracting information from an
indexed database, which has to be curated manually.
●In order to retrieve information from a dump of raw sentences, a
knowledge graph has to be constructed from the unstructured text.
●OIE tools could be used to extract facts from raw sentences in different
domains.
35
Open Information Extraction
36
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
Chunking
Improve Contextual Embeddings
Open Information Extraction (OIE)
●Extract facts from raw sentences.
○Use triples to represent the facts.
○Format of triples is <head, relation, tail>.
●Example
○PM Modi to visit UAE in Jan marking 50 yrs of diplomatic ties.
■One possible meaningful triple would be:
<PM Modi, to visit, UAE>
37
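Since every extraction is just an ordered <head, relation, tail> record, a small data structure suffices to represent it. A minimal sketch (illustrative only, not IndIE's internal representation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # subject phrase
    relation: str  # predicate phrase
    tail: str      # object phrase

    def __str__(self) -> str:
        return f"<{self.head}, {self.relation}, {self.tail}>"

# The example from the slide above.
print(Triple("PM Modi", "to visit", "UAE"))  # <PM Modi, to visit, UAE>
```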
Why is OIE important?
●It can extract triples from large amounts of text in an unsupervised
manner.
●It serves as an initial step in building or augmenting a knowledge graph
from unstructured text.
●OIE tools have been used in downstream applications like
Question-Answering, Text Summarization, and Entity Linking.
38
OIE in Indian languages needs special attention
●Take an English sentence
○Ram ate an apple
○A sensible triple would be
■<Ram, ate, an apple>
○A nonsensical triple would be
■<an apple, ate, Ram>
●Now look at a Hindi sentence (same meaning)
○राम ने सेब खाया
○Both these triples would be sensible, because Hindi has relatively free
word order and the ergative marker ने marks the agent regardless of position
■<राम ने, खाया, सेब>
■<सेब, खाया, राम ने>
39
How are generated triples evaluated?
●Manual Annotations
○Triple-level labels: Valid, Well-formed, Reasonable, Concrete, True
○Triples shown to annotators as a Tabular representation or an Image
representation
40
How are generated triples evaluated?
●Manual Annotations
●Automatic Evaluations
41
sent_id:1 वह ऑस्ट्रेलिया के पहले प्रधान मंत्री के रूप में कार्यरत थे और ऑस्ट्रेलिया के उच्च न्यायालय के संस्थापक न्यायाधीश बने ।
(He served as the first Prime Minister of Australia and became a founding justice of the High Court of Australia .)
------ Cluster 1 ------
वह --> कार्यरत थे --> [ऑस्ट्रेलिया के]{a} [पहले] प्रधान मंत्री के रूप में
(He --> served --> as [the] [first] Prime Minister [of Australia]{a})
वह --> बने --> [ऑस्ट्रेलिया के उच्च न्यायालय के]{b} [संस्थापक] न्यायाधीश
(He --> became --> [founding] justice [of the High Court of Australia]{b})
{a} ऑस्ट्रेलिया के --> property --> [पहले] प्रधान मंत्री के रूप में |OR| वह --> [पहले] प्रधान मंत्री के रूप में कार्यरत थे --> ऑस्ट्रेलिया के
({a} of Australia --> property --> as [first] Prime Minister |OR| He --> served as [the] [first] Prime Minister --> of Australia)
{b} [ऑस्ट्रेलिया के]{c} उच्च न्यायालय के --> property --> [संस्थापक] न्यायाधीश |OR| वह --> [संस्थापक] न्यायाधीश बने --> [ऑस्ट्रेलिया के]{c} उच्च न्यायालय के
({b} of High Court [of Australia]{c} --> property --> [founding] justice |OR| He --> became [founding] justice --> of High Court [of
Australia]{c})
{c} ऑस्ट्रेलिया के --> property --> उच्च न्यायालय के
({c} of Australia --> property --> of High Court)
------ Cluster 2 ------
वह --> [पहले] प्रधान मंत्री के रूप में कार्यरत थे --> ऑस्ट्रेलिया के
(He --> served as [the] [first] Prime Minister --> of Australia)
वह --> [संस्थापक] न्यायाधीश बने --> [ऑस्ट्रेलिया के]{a} उच्च न्यायालय के
(He --> became [founding] justice --> of High Court [of Australia]{a})
{a} ऑस्ट्रेलिया के --> property --> उच्च न्यायालय के
({a} of Australia --> property --> of High Court)
Hindi-BenchIE
42
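Hindi-BenchIE follows the BenchIE idea that each fact is a cluster of acceptable triple variants: an extracted triple is correct if it exactly matches any variant of any fact, and a fact counts as recalled if at least one of its variants is extracted. Below is a minimal sketch of that scoring, assuming triples have been normalized to (head, relation, tail) string tuples; it illustrates the scheme, it is not the official scorer.

```python
def benchie_scores(extracted, gold_clusters):
    """extracted: set of (head, relation, tail) tuples from an OIE system.
    gold_clusters: list of sets, each holding the acceptable variants of one fact."""
    all_gold = set().union(*gold_clusters)
    correct = sum(1 for t in extracted if t in all_gold)
    precision = correct / len(extracted) if extracted else 0.0
    recalled = sum(1 for cluster in gold_clusters if cluster & extracted)
    recall = recalled / len(gold_clusters) if gold_clusters else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold set with two facts (clusters), written here in English for readability.
gold = [
    {("He", "served", "as the first Prime Minister of Australia"),
     ("He", "served as the first Prime Minister", "of Australia")},
    {("He", "became", "founding justice of the High Court of Australia")},
]
system = {("He", "served", "as the first Prime Minister of Australia")}
print(benchie_scores(system, gold))  # precision 1.0, recall 0.5
```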
Selected Literature
●Faruqui, Manaal, and Shankar Kumar. "Multilingual Open Relation Extraction Using Cross-lingual Projection." Proceedings of
the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies. 2015.
○Supported Hindi but it was translation based
○Didn’t release code but released triples
●Rao, Pattabhi RK, and Sobha Lalitha Devi. "EventXtract-IL: Event Extraction from Newswires and Social Media Text in Indian
Languages@ FIRE 2018-An Overview." FIRE (Working Notes) (2018): 282-290.
○Event specific and domain specific IE
●Ro, Youngbin, Yukyung Lee, and Pilsung Kang. "Multi^2OIE: Multilingual Open Information Extraction Based on Multi-Head
Attention with BERT." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
○Modelled triple extraction as a sequence labelling problem over BERT embeddings (end-to-end)
○Identifies relations first, and then their head-tail arguments
43
Selected Literature
●Faruqui, Manaal, and Shankar Kumar. "Multilingual Open Relation Extraction Using Cross-lingual Projection." Proceedings of
the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies. 2015.
○Supported Hindi but it was translation based
○Didn’t release code but released triples
●Rao, Pattabhi RK, and Sobha Lalitha Devi. "EventXtract-IL: Event Extraction from Newswires and Social Media Text in Indian
Languages@ FIRE 2018-An Overview." FIRE (Working Notes) (2018): 282-290.
○Event specific and domain specific IE
●Ro, Youngbin, Yukyung Lee, and Pilsung Kang. "Multi^2OIE: Multilingual Open Information Extraction Based on Multi-Head
Attention with BERT." Findings of the Association for Computational Linguistics: EMNLP 2020. 2020.
○Modelled triple extraction as a sequence labelling problem over BERT embeddings (end-to-end)
○Identifies relations first, and then their head-tail arguments
Research Gap:
(1) The field of Open Information Extraction (OIE) from unstructured text in Indic languages has not been
explored much. Moreover, the effectiveness of existing multilingual OIE techniques has to be evaluated on
Indic languages.
(2) There is a scarcity of annotated resources for automatic evaluation of automatically generated triples in
Indic languages.
(3) Construction of knowledge graphs from triples extracted from unstructured text in Indian languages also
needs to be explored.
44
IndIE: our Indic OIE tool
Raw Text → Output: sentence-segmented text, triples for each sentence, and
execution time.
Pipeline:
(1) Shallow parsing: off-the-shelf sentence segmentation and dependency
parsing by Stanford tools.
(2) The segmented sentences are passed to (a) an XLM-RoBERTa chunker
fine-tuned on chunk-annotated data.
(3) The predicted chunk tags and the dependency tree are combined in (b) to
create a Merged-phrases Dependency Tree (MDT).
(4) The MDT is passed to (c) triple extraction: a greedy algorithm based on
hand-crafted rules.
(5) Result accumulation.
45
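Step (c) applies hand-crafted rules over the merged-phrase dependency tree. The sketch below illustrates the flavor of one such rule with Stanza's word-level dependency parse, treating each word as its own chunk and emitting <nsubj, verb, obj> triples; the real tool first merges words into phrases (the MDT) and uses a much richer rule set, so this is a simplified stand-in.

```python
import stanza

# stanza.download("hi")  # one-time model download
nlp = stanza.Pipeline(lang="hi", processors="tokenize,pos,lemma,depparse")

def subject_verb_object(sentence):
    """One illustrative rule: <nsubj dependent, verb, obj dependent>."""
    triples = []
    for word in sentence.words:
        if word.upos != "VERB":
            continue
        subjects = [w for w in sentence.words
                    if w.head == word.id and w.deprel == "nsubj"]
        objects = [w for w in sentence.words
                   if w.head == word.id and w.deprel in ("obj", "iobj")]
        for s in subjects:
            for o in objects:
                triples.append((s.text, word.text, o.text))
    return triples

doc = nlp("राम ने सेब खाया")
for sent in doc.sentences:
    print(subject_verb_object(sent))  # e.g. [('राम', 'खाया', 'सेब')]
```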
Chunker
46
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
Chunking
Improve Contextual Embeddings
Chunker Implementation 2/2
The transformer tokenizer converts tokens to token IDs, and the pretrained
transformer maps them to (a) initial sub-word embeddings. Our approach (b)
takes the average of the sub-word embeddings belonging to each token,
yielding (c) final token-level embeddings. A feed-forward layer then
produces token-level predictions, and the loss is calculated against the
ground truth.
48
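The averaging step (b) can be written down concisely: tokenize with word alignment, run the transformer, pool the sub-word vectors of each word, and classify. A minimal sketch with HuggingFace Transformers; the chunk-tag count and the linear head are placeholders, and no fine-tuning is shown.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"  # the thesis fine-tunes an XLM-R based chunker
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

words = ["राम", "ने", "सेब", "खाया"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (num_subwords, dim)

# (b) Average the sub-word embeddings belonging to each original word.
ids = enc.word_ids()                              # sub-word index -> word index
word_vecs = torch.stack([
    hidden[[i for i, wid in enumerate(ids) if wid == w]].mean(dim=0)
    for w in range(len(words))
])                                                # (num_words, dim)

# A feed-forward layer maps each word vector to a chunk tag (placeholder head).
num_tags = 7                                      # e.g. B-NP/I-NP/... label count
classifier = torch.nn.Linear(word_vecs.size(-1), num_tags)
print(classifier(word_vecs).argmax(dim=-1))       # one predicted tag per word
```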
Results: Chunker
Classification Layers   Common approach in literature   A different approach   Our approach
1                       82±10 (50±20)                   89±0.5 (62±1.0)        91±0.0 (65±0.5)
2                       86±1.8 (51±6.2)                 89±0.5 (54±7.4)        90±0.5 (54±4.5)
3                       79±14 (43±13)                   82±11 (41±12)          90±0.5 (48±2.2)
Table 3. A comparison of three approaches for handling sub-word token embeddings in the chunking task. Four
different random seeds were used to calculate the mean and standard deviation. All experiments were run on the
combined TDIL and UD data. Numbers outside round brackets are accuracy; numbers inside round brackets are
the macro average.
Model Hindi English Urdu Nepali Gujarati Bengali
XLM 78% 60% 84% 65% 56% 66%
CRF 67% 56% 71% 58% 53% 53%
Table 4. A comparison of the (fine-tuned) XLM chunker and the CRF chunker on languages removed from the
training set. The numbers represent the accuracy obtained by each model when sentences from the given
language appear only in the test set.
49
Results: Triple Evaluation (manual)
Image options ArgOE M&K Multi2OIE PredPatt IndIE
No information 17% 28% 71% 5% 4%
Most Information 22% 44% 24% 29% 17%
All information 61% 28% 5% 66% 79%
Table 5. Percentage of sentences having no/most/all information in the image representation of their generated triples. The method
whose triples convey 'All information' for the largest share of sentences is considered the best.
#Triple ArgOE M&K Multi2OIE PredPatt IndIE
Total 45 180 50 40 272
Valid 38 142 10 39 252
Well-formed 32 75 9 36 240
Reasonable 26 69 6 32 227
Concrete 21 53 4 25 158
True 19 51 4 25 152
Table 6. Number of triples extracted by each OIE method for 106 Hindi sentences.
50
Results: Triple Evaluation (automatic)
            ArgOE   M&K    Multi2OIE   PredPatt   IndIE
Precision   0.26    0.14   0.12        0.37       0.62
Recall      0.06    0.16   0.03        0.08       0.62
F1-score    0.10    0.15   0.05        0.14       0.62
Table 7. Performance of different OIE methods on Hindi-BenchIE golden set consisting of 75
Hindi sentences. It is observed that IndIE outperforms other methods on all three metrics.
51
Limitations
●Small size of the evaluation datasets.
●Contextual embeddings (from transformers) are known not to take the
syntactic structure of a sentence into account.
●As a result, it has been observed that syntactic information is either
absent from transformer embeddings or not utilized while making predictions.
●Therefore, there is a need to explore the possibility of incorporating
syntactic (dependency) information into transformer embeddings.
52
Improve Contextual Embeddings
53
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
Chunking
Improve Contextual Embeddings
DaT Embedding: Graph
Dependency tree (nodes 1-5): node 1 is connected to nodes 2 and 3; node 2
is connected to nodes 4 and 5.

Adjacency Matrix A:
    1 2 3 4 5
1   0 1 1 0 0
2   1 0 0 1 1
3   1 0 0 0 0
4   0 1 0 0 0
5   0 1 0 0 0

Degree Matrix D:
    1 2 3 4 5
1   2 0 0 0 0
2   0 3 0 0 0
3   0 0 1 0 0
4   0 0 0 1 0
5   0 0 0 0 1
55
DaT Embedding: Graph Convolution Network (GCN)
(The same graph as on the previous slide: the tree with edges 1-2, 1-3,
2-4, 2-5, its adjacency matrix A, and its degree matrix D.)
56
DaT Embedding: GCN message passing
57
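One GCN layer updates each node's vector with a normalized average of its neighbors: H' = sigma(D̂^(-1/2) Â D̂^(-1/2) H W), where Â = A + I adds self-loops and D̂ is Â's degree matrix. Below is a minimal numpy sketch of this message passing on the 5-node dependency tree from the previous slides; the random H and W stand in for contextual embeddings and learned weights, so this only illustrates the mechanics, not the thesis implementation.

```python
import numpy as np

# Undirected tree from slides 55-56: edges 1-2, 1-3, 2-4, 2-5 (0-indexed here).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0]], dtype=float)

def gcn_layer(H, A, W):
    """One message-passing step: ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(len(A))                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # D_hat^-1/2 (diagonal)
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)          # ReLU activation

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))      # stand-in contextual embeddings of the 5 tokens
W = rng.normal(size=(8, 8))      # stand-in learnable projection
print(gcn_layer(H, A, W).shape)  # (5, 8): dependency-aware (DaT-style) embeddings
```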
Selected Literature
●Jie, Zhanming, Aldrian Obaja Muis, and Wei Lu. "Efficient dependency-guided named entity recognition." Thirty-First AAAI
Conference on Artificial Intelligence. 2017.
●Marcheggiani, Diego, and Ivan Titov. "Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling."
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
●Zhang, Yuhao, Peng Qi, and Christopher D. Manning. "Graph Convolution over Pruned Dependency Trees Improves Relation
Extraction." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.
58
Selected Literature
●Jie, Zhanming, Aldrian Obaja Muis, and Wei Lu. "Efficient dependency-guided named entity recognition." Thirty-First AAAI
Conference on Artificial Intelligence. 2017.
●Marcheggiani, Diego, and Ivan Titov. "Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling."
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.
●Zhang, Yuhao, Peng Qi, and Christopher D. Manning. "Graph Convolution over Pruned Dependency Trees Improves Relation
Extraction." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.
Research Gap: Incorporating dependency structure of a sentence in generating dependency-aware
contextual embeddings is an under-explored field in Indic-NLP.
59
Future Plans
●End-to-End Pronoun Resolution in Hindi
●Indic-SpanBERT
●Long-context Question Answering
64
Pronoun Resolution
65
Question-Answering
Retrieval-based
Open
Information
Extraction
Pronoun
resolution
ChunkingImprove Contextual Embeddings
Pronoun Resolution
Every speaker had to
present his paper .
Barack Obama visited
India. Modiji came to
receive Obama.
He was brave. He was
Ashoka the Great.
If you want to hire them,
then all candidates must
be treated nicely.
Ram went to the forest
with his wife.
Krishna was an avatar of
Vishnu.
CEO of Reliance, Mukesh
Ambani, inaugurated the
SOM building.
66
Dataset
Mujadia, Vandan, Palash Gupta, and Dipti Misra Sharma. "Coreference Annotation Scheme and Relation Types for Hindi."
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016.
67
Table 10. Corpus Details
Table 11. Distribution of co-referential entities
Coref chain
68
Annotation scheme: each mention token carries a unique mention ID and a
unique chain ID; a binary flag distinguishes intermediate tokens (0) from
the end token (1) of a mention; tokens that modify the mention, and the
source, are also marked.
Selected Literature
●Lee, Kenton, et al. "End-to-end Neural Coreference Resolution." Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. 2017.
●Joshi, Mandar, et al. "Spanbert: Improving pre-training by representing and predicting spans." Transactions of the Association
for Computational Linguistics 8 (2020): 64-77.
●Dakwale, Praveen, Vandan Mujadia, and Dipti Misra Sharma. "A hybrid approach for anaphora resolution in hindi." Proceedings
of the Sixth International Joint Conference on Natural Language Processing. 2013.
●Devi, Sobha Lalitha, Vijay Sundar Ram, and Pattabhi RK Rao. "A generic anaphora resolution engine for Indian languages."
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014.
●Sikdar, Utpal Kumar, Asif Ekbal, and Sriparna Saha. "A generalized framework for anaphora resolution in Indian languages."
Knowledge-Based Systems 109 (2016): 147-159.
69
Selected Literature
●Lee, Kenton, et al. "End-to-end Neural Coreference Resolution." Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing. 2017.
●Joshi, Mandar, et al. "Spanbert: Improving pre-training by representing and predicting spans." Transactions of the Association
for Computational Linguistics 8 (2020): 64-77.
●Dakwale, Praveen, Vandan Mujadia, and Dipti Misra Sharma. "A hybrid approach for anaphora resolution in hindi." Proceedings
of the Sixth International Joint Conference on Natural Language Processing. 2013.
●Devi, Sobha Lalitha, Vijay Sundar Ram, and Pattabhi RK Rao. "A generic anaphora resolution engine for Indian languages."
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014.
●Sikdar, Utpal Kumar, Asif Ekbal, and Sriparna Saha. "A generalized framework for anaphora resolution in Indian languages."
Knowledge-Based Systems 109 (2016): 147-159.
Research Gap:
(1) A publicly available tool for coreference resolution in Indic languages needs to be built using
state-of-the-art coreference resolution techniques.
(2) Pretraining of transformer models with different subword tokenization methods and pretraining
tasks needs to be investigated extensively.
70
Publications
1. Mishra, Ritwik, Simranjeet Singh, Rajiv Ratn Shah, Ponnurangam Kumaraguru, and
Pushpak Bhattacharyya. "IndIE: A Multilingual Open Information Extraction Tool For Indic
Languages". 2022. [Submitted to a special issue of TALLIP]
2. Mishra, Ritwik, Simranjeet Singh, Jasmeet Kaur, Rajiv Ratn Shah, and Pushpendra Singh.
"Exploring the Use of Chatbots for Supporting Maternal and Child Health in
Resource-constrained Environments". 2022. [Draft ready. Under internal review]
Miscellaneous
●Mishra, Ritwik, Ponnurangam Kumaraguru, Rajiv Ratn Shah, Aanshul Sadaria, Shashank
Srikanth, Kanay Gupta, Himanshu Bhatia, and Pratik Jain. "Analyzing traffic violations
through e-challan system in metropolitan cities (workshop paper)." In 2020 IEEE Sixth
International Conference on Multimedia Big Data (BigMM), pp. 485-493. IEEE, 2020.
78
Acknowledgements
●I would like to express my gratitude to the pillars group members for their valuable
guidance.
●I would like to thank Simranjeet, Samarth, Ajeet, and Jasmeet for being diligent
co-authors.
●I would also like to thank Prof Pushpak Bhattacharyya and the CFILT lab members for
providing me with great insights and hosting me at IIT Bombay under the Anveshan Setu
program.
●I would also like to thank the University Grants Commission (UGC) Junior Research
Fellowship (JRF) / Senior Research Fellowship (SRF) for funding my PhD program.
79