Machine Translation System: Chhattisgarhi to Hindi

padmametta7 1,222 views 67 slides Jun 02, 2017

About This Presentation

A brief description of machine translation systems (MTS).


Slide Content

Trends in Machine Translation M.V. Padmavati

A case study: Maya wants to pursue a Ph.D. at a foreign university. She is not interested in the USA. She gets an offer from France, so she goes to France.

Maya in France: The official language is French, and few people know English. She receives a written agreement from the institute by mail, and the document is in French. Possible solutions?

Options Maya has:
1. Search for a person who knows both languages. The person must have time and may charge for the work.
2. Search for a machine translator. It can be used at any time, at no charge. Example: Google Translate.

Overview: What is Machine Translation (MT)? An automated system that analyzes text in a Source Language (SL) and produces “equivalent” text in a Target Language (TL), ideally without human intervention.

Machine Translation: Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another. Example: Good Morning => शुभ प्रभात

The ideal: automatic translation of all kinds of documents at a quality equaling that of the best human translators. In any translation, the meaning of the statement must be preserved, using the right words in the right order.

Problems during Machine Translation

Word Order: Hindi is often described as an “SOV” language (<subject> <object> <verb>), whereas the typical word order of English sentences is SVO (<subject> <verb> <object>). Example: Maya likes mangoes => माया को आम पसंद है

Word Sense: A word may have more than one sense. Choosing the appropriate target word according to context is very important.

Pronoun Resolution: "their" can be उनका / उनकी / उनके, depending on the gender and number of the noun that follows.

Idioms: नौ दो ग्यारह होना means "to escape", but its literal translation is "nine two becomes eleven".

Is a dictionary sufficient? Translating "John went to office" word by word gives जॉन चला गया के लिए कार्यालय, which is not acceptable Hindi.

Approaches in MT: Approaches in MT can be classified into five categories:
1. Direct MT
2. Rule-based MT (transfer based MT, interlingua based MT)
3. Corpus-based MT (statistical MT, example based MT)
4. Knowledge-based MT
5. Neural MT

A brief history of MT: 1966-1980: a virtual end to MT research. The 1980s: rule-based and example-based MT. The 1990s: statistical MT. The 2000s: hybrid MT. 2015: neural MT.

Direct Machine Translation: Input (English sentence): Maya slept in the garden. Word translation: माया सो गई में बाग। Syntactic rearrangement: माया बाग में सो गई। Besides simple word translation and reordering, suffix handling and preposition handling are needed to make the translation acceptable. This is called idiomatization.
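
The two steps on this slide, word translation followed by syntactic rearrangement, can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's system: the dictionary `LEXICON`, the coarse tags and the two reordering rules are assumptions chosen so that the slide's single example works.

```python
# Toy bilingual dictionary (hypothetical entries, just enough for the slide's sentence).
LEXICON = {
    "maya": ("माया", "NOUN"),
    "slept": ("सो गई", "VERB"),
    "in": ("में", "ADP"),
    "the": ("", "DET"),  # Hindi has no article, so "the" maps to nothing
    "garden": ("बाग", "NOUN"),
}


def word_translation(sentence):
    """Step 1: replace each English word with its dictionary entry."""
    words = sentence.lower().rstrip(".").split()
    pairs = [LEXICON[w] for w in words]
    return [(hindi, tag) for hindi, tag in pairs if hindi]


def rearrange(pairs):
    """Step 2: move the verb to the end (SVO -> SOV) and turn each
    preposition into a postposition by swapping it with the following noun."""
    verbs = [p for p in pairs if p[1] == "VERB"]
    rest = [p for p in pairs if p[1] != "VERB"]
    i = 0
    while i < len(rest) - 1:
        if rest[i][1] == "ADP" and rest[i + 1][1] == "NOUN":
            rest[i], rest[i + 1] = rest[i + 1], rest[i]
            i += 2
        else:
            i += 1
    return rest + verbs


def direct_translate(sentence):
    pairs = word_translation(sentence)
    return " ".join(hindi for hindi, _ in rearrange(pairs))


print(direct_translate("Maya slept in the garden."))
```

Because the rules only look at word classes, any sentence outside this pattern would need further rules, which is the adaptability problem raised under the limitations of direct MT.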

Direct Machine Translation: It carries out translation word by word using a bilingual dictionary, usually followed by some syntactic rearrangement.

Direct Machine Translation: Direct MT is very difficult if the SL and TL do not share similar syntactic and morphological phenomena. For a Hindi to English or English to Hindi translation system, such word-by-word replacement and idiomatization will not produce understandable MT output.

Limitations of direct MT: It does not consider the structure of the sentence or the relationships between words. There is no attempt to disambiguate word senses. No adaptability: a system developed for a particular language pair will not be suitable for another language pair.

Morphology: How a verb inflects for gender, tense and case, e.g. कर (root word): करता, करती, करते, किये, करना. Identifying the root word and changing it according to gender, tense and case is called morphological analysis.

Rule-Based Machine Translation: Based on the specification of rules for morphology, syntax, lexical selection, semantic analysis, transfer and generation. The AnglaHindi MTS, developed by IIT Kanpur in 2003, is based on the rule-based MT approach.
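
As a rough illustration of morphological analysis, the sketch below strips a known suffix from an inflected form of कर to recover the root and a few features. The suffix table `SUFFIXES` and its feature labels are assumptions made for this example; irregular forms such as किये are deliberately left unanalysed, and a real analyzer would need a much richer lexicon.

```python
# Hypothetical suffix table covering a few regular Hindi verb endings.
SUFFIXES = {
    "ता": {"gender": "masculine", "number": "singular"},
    "ती": {"gender": "feminine"},
    "ते": {"gender": "masculine", "number": "plural"},
    "ना": {"form": "infinitive"},
}


def analyze(word):
    """Return (root, features); unknown or irregular forms come back unchanged."""
    for suffix, features in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], features
    return word, {}


def generate(root, suffix):
    """Inverse direction: attach a suffix to a root."""
    return root + suffix


for form in ["करता", "करती", "करते", "करना", "किये"]:
    print(form, analyze(form))
```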

Interlingua Based MT: Some systems make use of a so-called “interlingua” or intermediate language. The transfer stage is divided into two steps: one translates the source sentence into the interlingua, and the other translates the result into an abstract representation in the target language.

UNL Based MT: the scenario. [Figure: UNL at the centre, linked to English, Hindi, French and Russian; enconversion maps a language into UNL and deconversion maps UNL back into a language.]

Machine Translation Work at IIT Kanpur: ANGLABHARTI represents a machine-aided translation methodology specifically designed for translating English to Indian languages. Anglabharti uses a pseudo-interlingua approach. It analyses English only once and creates an intermediate structure with most of the disambiguation performed. The intermediate structure is then converted to each Indian language through a process of text generation. Analyzing the English sentences accounts for about 70% of the effort and text generation for the remaining 30%. Thus, with only an additional 30% effort, a new English to Indian language translator can be built.

Rule Based Machine Translation

[Figure: rule-based MT pipeline. English sentence -> English POS tagger, morph analyzer and chunker -> sentence parser -> parsed output (target-language independent) -> Word Sense Disambiguation (WSD), word senses marked -> transfer grammar rule application -> bilingual English to Indian-language (E-I) dictionary lookup -> tense, aspect and modality lookup -> Indian language generator (target-language dependent) -> Indian-language (IL) sentence.]

Function of Various Modules in the Architecture:
PoS Tagger: Given an input source sentence, the PoS tagger tags each word with its part of speech.
Parser: It generates a parse tree containing each word as a node along with its part-of-speech tag.
Reordering Module: It uses a “Transfer Link Rule File”, which describes how the source structure is transformed into the target structure.
Lexicalization Module: The target equivalents are found in the root word lexicon along with their part-of-speech category.
Synthesization Module: The final and most important stage of the proposed MT system synthesizes the target lexical items into the target sentence.

Part of Speech Tagging: Part-of-speech tagging is the process of identifying the part of speech of each word in a text, based on both its definition and its context (i.e. its relationship with adjacent and related words in a phrase or sentence). For example, the sentence ‘The white dog ate the biscuits’ receives the following tags: The [DT] white [JJ] dog [NN] ate [VBD] the [DT] biscuits [NNS]
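
The simplest possible tagger is a dictionary lookup, sketched below for the example sentence. It is only a baseline and an assumption of this note, not the tagger used in the presentation: it ignores context entirely, falls back to a default tag for unknown words, and uses the Penn Treebank plural tag NNS for "biscuits".

```python
# Hypothetical word-to-tag lexicon using Penn Treebank style tags.
TAG_LEXICON = {
    "the": "DT",
    "white": "JJ",
    "dog": "NN",
    "ate": "VBD",
    "biscuits": "NNS",
}


def tag(sentence, default="NN"):
    """Tag each word by lookup; unknown words receive the default tag."""
    return [(word, TAG_LEXICON.get(word.lower(), default)) for word in sentence.split()]


print(tag("The white dog ate the biscuits"))
```

A context-aware tagger would replace the plain lookup with a model that also conditions on neighbouring words or tags.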

Structural Transfer-RBMT

Issues in Chhattisgarhi to Hindi Machine Translation: The following are some of the issues to consider in the design of a Chhattisgarhi to Hindi machine translator.
Lexical differences: Sometimes a word in one language has no single-word equivalent in another language, which results in lexical differences between the languages. Example 1: the following Chhattisgarhi word has several meanings in Hindi: अँइठ - सं. 1. ऐंठने की क्रिया या भाव 2. अकड़ 3. घमंड 4. गर्व
Gender resolution: Hindi has two genders, masculine and feminine, but in Chhattisgarhi interrogative sentences it is difficult to identify the gender.

Issues in Chhattisgarhi to Hindi Machine Translation (contd.):
Example 2: The following interrogative sentence in Chhattisgarhi can be written in two different ways in Hindi, depending on the gender: ते हा जा थस का? 1. क्या तुम जा रही हो? (feminine) 2. क्या तुम जा रहे हो? (masculine)
Increase in words: When translating from Chhattisgarhi to Hindi, the number of words sometimes increases in the target language. Example 3: मैदान म पाहट खड़े है। => मैदान में भैंसों का समूह खड़ा है।

Issues in Chhattisgarhi to Hindi Machine Translation (contd.):
Decrease in words: When translating from Chhattisgarhi to Hindi, the number of words sometimes decreases in the target language. Example 4: मे ह एक ठन आमा खाये हुँ। => मैं एक आम खाया हूँ।

Proposed Methodology: The conversion of a Chhattisgarhi sentence to Hindi can be illustrated with the following example: वो हा घर जाथे। => वह घर जाता है। The stages of translation are:
1st stage: getting basic part-of-speech information for each source word: वो = सर्वनाम (pronoun); हा = विभक्ति (case marker); घर = संज्ञा (noun); जाथे = क्रिया (verb).
2nd stage: getting syntactic information about the verb “जाथे”: Present Simple, 3rd Person Singular, Active Voice.
3rd stage: parsing the source sentence: (सर्वनाम) (विभक्ति) (संज्ञा) (क्रिया).

Proposed Methodology (contd.):
4th stage: translating Chhattisgarhi words into Hindi:
वो (category = सर्वनाम) => वह (category = सर्वनाम)
हा (category = विभक्ति) => जाता (category = क्रिया)
घर (category = संज्ञा) => घर (category = संज्ञा)
जाथे (category = क्रिया) => है (category = स. क्रिया, auxiliary verb)
5th stage: mapping dictionary entries into appropriate forms (synthesization, or target sentence generation): वो हा घर जाथे। => वह घर जाता है।

Proposed Methodology (contd.): Rulebase for conversion:
Source rulebase: (सर्वनाम)1 (विभक्ति)2 (संज्ञा)3 (क्रिया)4
Target rulebase: (सर्वनाम)1 (संज्ञा)2 (क्रिया)3 (स. क्रिया)4
Reordering (Transfer Link Rule File): 1:1 || 2:3 || 3:2 || 4:4
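
The lexicalization and reordering stages can be sketched as follows. The link rule string is taken from the slide and read as "source position : target position". The word dictionary follows the 4th stage, with the legacy-font words written in Unicode Devanagari, which is this note's transliteration. The function names and the single-rule setup are assumptions for illustration, not the presenter's implementation.

```python
# Transfer link rule from the slide: source position -> target position.
LINK_RULE = "1:1 || 2:3 || 3:2 || 4:4"

# Word-level dictionary from the 4th stage (Chhattisgarhi -> Hindi).
DICTIONARY = {"वो": "वह", "हा": "जाता", "घर": "घर", "जाथे": "है"}


def parse_link_rule(rule):
    """Turn '1:1 || 2:3 || ...' into {source_position: target_position}."""
    links = {}
    for pair in rule.split("||"):
        src, tgt = pair.strip().split(":")
        links[int(src)] = int(tgt)
    return links


def translate(sentence):
    """Look up each source word and place it at its linked target position."""
    words = sentence.rstrip("।").split()
    links = parse_link_rule(LINK_RULE)
    target = [None] * len(words)
    for src_pos, word in enumerate(words, start=1):
        target[links[src_pos] - 1] = DICTIONARY[word]
    return " ".join(target) + "।"


print(translate("वो हा घर जाथे।"))
```

A fuller system would select the link rule by matching the parsed category sequence of the source sentence instead of hard-coding one rule.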

Snapshots of Chhattisgarhi to Hindi Dictionary

Complete Chhattisgarhi Dictionary

Snapshots of Chhattisgarhi POS Tagger

Snapshots of Chhattisgarhi Morphological Analyzer

Corpus-based MT: Corpus-based MT systems require sentence-aligned parallel text for each language pair. The corpus-based approach is further classified into Statistical Machine Translation and Example Based Machine Translation.

What is a corpus and how is it collected? A corpus is a collection of structured text used to study linguistic properties; the plural of corpus is corpora. Collection of corpora of the different languages. Collection of translation corpora (English to Hindi dictionaries, translations, etc.). Use n-grams: an n-gram is a contiguous sequence of n items from a given sequence of text or speech.

Collection of Translated Corpora: [Figure: Harry Potter in English + Harry Potter in Hindi -> Machine Learning Magic -> Probabilistic Model]

Basic statistics (SMT): 0 <= P(A) <= 1. P(A): probability that word A is present in the text. P(A,B): probability that words A and B are both present in the text. P(A|B): probability that word A is present in the text given that B is already present.
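
To make the n-gram definition concrete, the sketch below extracts and counts word n-grams from a tokenized sentence. The helper name `ngrams` and the sample sentence are assumptions of this note.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the white dog ate the biscuits".split()
print(ngrams(tokens, 2))           # bigrams
print(Counter(ngrams(tokens, 1)))  # unigram counts
```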

Basic statistics: Conditional probability: P(A|B) = P(A,B) / P(B)

Basic statistics: Use the definition of conditional probability to derive the chain rule: P(A,B) = P(A|B) P(B)

Goal (SMT): Translate. I’ll use English (E) into Hindi (H) as the running example.

Approach: Statistics. We are trying to model P(H|E): I give you an English sentence, and you give me back Hindi. How are we going to model this? We could use Bayes' rule: P(H|E) = P(E|H) P(H) / P(E)

Why Bayes' rule at all? Why not model P(H|E) directly? The P(E|H) P(H) decomposition allows us to be sloppy: P(H) worries about good Hindi, and P(E|H) worries about English that matches the Hindi text. The two can be trained independently.

Where will we get P(E|H)? Books in Hindi + the same books in English -> Machine Learning Magic -> P(E|H) model. We call collections stored in two languages parallel corpora or parallel texts. Want to update your system? Just add more text!
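
These quantities can be estimated by counting. The sketch below treats each sentence of a tiny made-up corpus as one "text", estimates P(A) and P(A,B) as the fraction of sentences containing the words, and checks the standard definition P(A|B) = P(A,B) / P(B) and the chain rule P(A,B) = P(A|B) P(B). The corpus and function names are assumptions for illustration.

```python
# Tiny made-up corpus; each sentence counts as one text.
CORPUS = [
    "maya likes mangoes",
    "maya went to france",
    "john went to office",
    "john likes mangoes",
]


def p(*words):
    """P(w1, w2, ...): fraction of sentences that contain all the given words."""
    hits = sum(all(w in sentence.split() for w in words) for sentence in CORPUS)
    return hits / len(CORPUS)


def p_given(a, b):
    """P(A|B) = P(A,B) / P(B)."""
    return p(a, b) / p(b)


print(p("maya"))                                  # P(maya)
print(p("maya", "likes"))                         # P(maya, likes)
print(p_given("likes", "maya"))                   # P(likes | maya)
print(p_given("likes", "maya") * p("maya"))       # chain rule, equals P(maya, likes)
```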

English to Hindi SMT: Consider an English to Hindi SMT system. Every Hindi sentence h is a possible translation of an English sentence e. For example, the probability that 'गाय खास खाता है।' is the translation of 'Murthy eats apple' is low compared to the probability that 'रवि खाना खाता है' is. Every pair of sentences (E, H) is assigned a probability P(H|E), the probability that a translator presented with the English sentence E will produce H as its Hindi translation. We can assume that when a native speaker of Hindi produces an English sentence, he has a Hindi sentence in mind and translates it into English mentally. The goal of SMT is to find the sentence H that the native speaker had in mind when he produced E.

English to Hindi SMT (contd.): The two components of SMT are the Language Model (LM) and the Translation Model (TM). A language model gives the probability of a sentence; these probabilities are calculated with n-gram techniques. The translation model helps to compute the conditional probability P(E|H). It is trained from a parallel corpus of Hindi/English pairs.
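
A bigram language model is one common n-gram technique for assigning a probability to a sentence. The sketch below uses unsmoothed maximum-likelihood estimates over a two-sentence Hindi training set; the training sentences and the function name `train_bigram_lm` are assumptions, and a usable model would need far more data plus smoothing for unseen bigrams.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function that scores a sentence with unsmoothed bigram estimates."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, cur in zip(tokens, tokens[1:]):
            if contexts[prev] == 0:
                return 0.0                    # unseen context word
            p *= bigrams[(prev, cur)] / contexts[prev]
        return p

    return prob


lm = train_bigram_lm(["माया बाग में सो गई", "माया घर में सो गई"])
print(lm("माया बाग में सो गई"))   # fluent word order
print(lm("माया सो गई में बाग"))   # word-by-word order
```

The model assigns a higher probability to the well-ordered sentence, which is the job the LM does inside an SMT system.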

Working of SMT

Statistical Machine Translation (SMT): The general idea in an SMT system is to output the most likely translation. The system consists of three components. The Language Model (LM) computes the probability of the target-language sentence T, P(T). The Translation Model (TM) helps to compute the conditional probability of the source sentence given a target sentence, P(S|T). The decoder searches for the target sentence that maximizes the product of the LM and TM probabilities.

RBMT vs SMT: RBMT can achieve good results, but the training and development costs are very high for a good quality system. In terms of investment, the customization cycle needed to reach the quality threshold can be long and costly. RBMT systems are built with much less data than SMT systems. Language is constantly changing, which means rules must be managed and updated where necessary in RBMT systems. SMT systems can be built in much less time and do not require linguistic experts to apply language rules to the system. SMT models require state-of-the-art computer processing power and storage capacity to build and manage large translation models. SMT systems can mimic the style of the training data and generate output based on the frequency of patterns, allowing them to produce more fluent output.
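
The decoder's job can be illustrated with a toy search over two candidate Hindi sentences for the English input "Maya slept in the garden". By Bayes' rule the best candidate maximizes P(H) x P(E|H), since P(E) is the same for every candidate. All probability values below are invented for illustration, and a real decoder searches a huge candidate space rather than a fixed list.

```python
# Hypothetical toy probabilities, invented for illustration only.
LANGUAGE_MODEL = {  # P(H): how fluent is the Hindi candidate?
    "माया बाग में सो गई": 0.020,
    "माया सो गई में बाग": 0.001,
}
TRANSLATION_MODEL = {  # P(E|H) for E = "Maya slept in the garden"
    "माया बाग में सो गई": 0.30,
    "माया सो गई में बाग": 0.30,
}


def decode(candidates):
    """Pick the candidate H that maximizes P(H) * P(E|H)."""
    return max(candidates, key=lambda h: LANGUAGE_MODEL[h] * TRANSLATION_MODEL[h])


print(decode(list(LANGUAGE_MODEL)))
```

Both candidates explain the English words equally well here, so the language model alone decides in favour of the fluent word order.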

Example Based Machine Translation It uses previous translation examples to generate translations for an input provided. When an input sentence is presented to the system, it retrieves a similar source sentence from the example-base and its translation. The system then adapts the example translation to generate the translation of the input sentence.
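
The retrieve-then-adapt loop can be sketched with the standard library. The example base, the word map and the substitution-only adaptation below are assumptions made for illustration; `difflib.SequenceMatcher` is used here merely as a convenient similarity measure, not as the method of any particular EBMT system.

```python
from difflib import SequenceMatcher

# Hypothetical example base of (source, translation) pairs.
EXAMPLE_BASE = [
    ("Maya likes mangoes", "माया को आम पसंद है"),
    ("Good morning", "शुभ प्रभात"),
]

# Hypothetical word-level dictionary used during adaptation.
WORD_MAP = {"Maya": "माया", "John": "जॉन"}


def retrieve(source):
    """Return the stored example whose source side is most similar to the input."""
    def similarity(pair):
        return SequenceMatcher(None, source.lower().split(), pair[0].lower().split()).ratio()
    return max(EXAMPLE_BASE, key=similarity)


def adapt(source, example):
    """Swap in translations for the words that differ from the retrieved example."""
    example_source, example_target = example
    output = example_target
    for new_word, old_word in zip(source.split(), example_source.split()):
        if new_word != old_word:
            output = output.replace(WORD_MAP[old_word], WORD_MAP[new_word])
    return output


query = "John likes mangoes"
print(adapt(query, retrieve(query)))
```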

Knowledge based MT: Early MT systems were characterized by syntax. In knowledge-based MT, semantic features are attached using AI techniques.

Neural Machine Translation: It uses a large neural network trained with deep learning. Google and Microsoft translation services have used NMT since December 2016. It requires less corpora than SMT.

Deep Learning Deep learning is essentially a set of techniques that help you to parameterize deep neural network structures, neural networks with many, many layers and parameters.

Recurrent Neural Network   A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behaviour. 
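
The "internal state" can be illustrated with a scalar recurrence, h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). The sketch below is a deliberately tiny assumption-laden example: real RNNs use vectors and weight matrices learned from data, and the weights here are arbitrary.

```python
import math


def rnn_step(x, h, w_x, w_h, b):
    """One recurrent step: the new state depends on the input and the previous state."""
    return math.tanh(w_x * x + w_h * h + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed the inputs one at a time and collect the state after each step."""
    h = 0.0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# A single non-zero input keeps influencing later states through the feedback loop.
print(run_rnn([1.0, 0.0, 0.0]))
```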

Neural Machine Translation: Encoder-Decoder. Encode the source sentence x: the encoder analyzes the source sentence, and the result of the analysis is a mysterious sequence of vectors. Decode that into the target sentence y.

Neural Machine Translation: [Figure: encoder-decoder example producing मैं एक छात्र हूँ. Source: OpenNMT - Open-Source Neural Machine Translation]

Automatic Evaluation of Machine Translators:
BLEU: BLEU was one of the first metrics to report high correlation with human judgments of quality. The metric is currently one of the most popular in the field.
NIST: The NIST metric is based on the BLEU metric, but with some alterations. Where BLEU simply calculates n-gram precision, giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is. That is, when a correct n-gram is found, the rarer that n-gram is, the more weight it is given. For example, if the bigram "on the" is correctly matched, it receives lower weight than a correct match of the bigram "interesting calculations", as the latter is less likely to occur. NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.

Where are we now? Huge potential and need due to the internet, globalization and international politics. Quick development time due to SMT and the availability of parallel data and computers. Translation is reasonable for language pairs with a large amount of resources. More “minor” languages are starting to be included.

Indian Institutes with Major Work in MT:
IIIT Hyderabad: Anusaaraka (Prof. Rajeev Sangal)
Centre for Development of Advanced Computing (CDAC), Pune: Mantra machine translation system
IIT Bombay: Prof. Pushpak Bhattacharyya, working on machine translation from English to Marathi and Bengali using the UNL (Universal Networking Language, an interlingua) formalism
Government of India, through its Technology Development for Indian Languages (TDIL) project
IIT Kanpur: AnglaBharti (English to Indian languages)

Machine Translation: India
Problem #1: Too many language pairs! Implication: the language barrier will continue to be a problem.
Problem #2: Fragmentation of efforts; no consolidated effort at solving MT problems.
Problem #3: Lack of NLP tools, lack of corpora, and lack of standardized methods of evaluation, encoding, etc. The result is highly specialized, poor-quality systems with no reusable components and no real learning from each other’s work.
Solutions: Problem #1: Statistical Machine Translation. Problem #2: Collaborative work (2-3 teams). Problem #3: Common tools framework plus standards.
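
A simplified BLEU score can be computed as below: clipped (modified) n-gram precision for each n, combined by a geometric mean with equal weights and multiplied by a brevity penalty. This sketch is a sentence-level variant with a single reference and n-grams up to length 2, chosen to keep the example short; standard BLEU is usually computed over a whole corpus with n-grams up to length 4 and possibly several references.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0


def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_average = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_average)


reference = "माया बाग में सो गई".split()
print(bleu("माया बाग में सो गई".split(), reference))  # identical to the reference
print(bleu("माया सो गई में बाग".split(), reference))  # same words, wrong order
```

The second candidate contains every reference word, so its unigram precision is perfect, but only one of its four bigrams matches, which pulls the score down.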

Thank You