mt.ppt: Machine Translation

shruti954781 14 views 72 slides Mar 05, 2025

Machine Translation across Indian Languages

Dipti Misra Sharma
LTRC, IIIT Hyderabad
Patiala
15-11-2013

Outline
• Introduction
• Information Dynamics in language
• Machine Translation (MT)
  – Approaches to MT
  – Practical MT systems
• Challenges in MT
  – Ambiguities
  – Syntactic differences in L1 and L2
• MT efforts in India
  – Sampark: IL to IL MT systems
  – Objective
  – Design
  – Issues
• Conclusions

Introduction
Natural Language Processing (NLP) involves
• Processing information contained in natural languages
• Natural as opposed to formal/artificial
  – Formal languages: programming languages, logic, mathematics etc.
  – Artificial: Esperanto

Natural Language Processing (NLP)
Helps in
• Communication between
  – Man-machine: question answering systems, e.g. interactive railway reservation
  – Man-man: machine translation

Communication
Transfer of information from one to the other
Language is a means of communication
Therefore, one can say
It encodes what is communicated <information>
We apply the processes of
Analysis (decoding) for understanding
Synthesis (encoding) for expression (speaking)

What do we communicate ?
Information
Spain delivered a football masterclass at Euro 2012
Intention <purpose>
Emphasis/focus
Euro 2012 bagged/won by Spain
Spain bags Euro 2012
•Introduces variation

How do we communicate ? Contd..
Arrangement of sentences (Discourse)
Sentences or parts of sentences are related to each other to provide a cohesive meaning
*Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot.
Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km.
Languages differ in the way they organise information in these entities
All of these interact in the organisation of information

Information Dynamics in Language (1/4)
• Languages encode information
Hindi: cuuhe maarate haiM kutte
       'rat-pl' 'kill-hab' 'pres-pl' 'dog-pl'
       rats kill dogs
• The Hindi sentence is ambiguous
• Possible interpretations:
  – Dogs kill rats
  – Rats kill dogs
However, the English sentence is not ambiguous

Information Dynamics in Language (2/4)
Ambiguity in Hindi is resolved if,
cuuhe maarate haiM kuttoM ko
rats kill-hab pres-pl dogs-obl acc
• Hindi encodes information in morphemes
• English encodes information in positions
Languages encode information differently
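
The contrast above can be sketched in a few lines of Python. This is only a toy illustration, not part of any system described in these slides: the gloss table `NOUN_GLOSS` and both function names are assumptions made for the example. English roles are read off word position, while the Hindi reader looks for the case marker 'ko' and returns both readings when it is absent.

```python
# Toy gloss table for the nouns on the slide (transliterated Hindi).
# The table and the function names are illustrative assumptions.
NOUN_GLOSS = {"cuuhe": "rats", "kutte": "dogs", "kuttoM": "dogs"}


def english_reading(sentence):
    """English: position decides the roles (Subject Verb Object)."""
    subject, _verb, obj = sentence.lower().split()
    return (subject, obj)  # (agent, patient)


def hindi_readings(sentence):
    """Hindi: a noun followed by 'ko' is the patient; without it, both readings survive."""
    tokens = sentence.split()
    nouns = [(i, t) for i, t in enumerate(tokens) if t in NOUN_GLOSS]
    marked = [t for i, t in nouns if i + 1 < len(tokens) and tokens[i + 1] == "ko"]
    if marked:
        patient = marked[0]
        agent = next(t for _, t in nouns if t != patient)
        return [(NOUN_GLOSS[agent], NOUN_GLOSS[patient])]
    first, second = (NOUN_GLOSS[t] for _, t in nouns)
    return [(first, second), (second, first)]


print(english_reading("rats kill dogs"))
print(hindi_readings("cuuhe maarate haiM kutte"))
print(hindi_readings("cuuhe maarate haiM kuttoM ko"))
```

The unmarked Hindi sentence yields two (agent, patient) pairs, the 'ko'-marked one yields a single pair.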

• English does not explicitly mark accusative case
(except in pronouns) – no morpheme
• English has no lexical item/morpheme for yes-no questions
(Eng: Is he coming? Hindi: kyaa vah aa rahaa hai?)
• Position plays an important role in encoding
information in English
• Subject is sacrosanct
• Hindi encodes information morphologically

Information Dynamics in Language (3/4)
Another example,
This chair has been sat on
– The chair has been used for sitting
– Someone sat on this chair, and it is known
– The sentence does not mention someone
Languages encode information partially

Information Dynamics in Language (4/4)
English pronouns he, she, it
Hindi pronoun vaha
He is going to Delhi ==> vaha dillii jaa rahaa hai
She is going to Delhi ==> vaha dillii jaa rahii hai
It broke ==> vaha TuuTa ??
Information does not always map fully from one language into
another
Conceptual worlds may be different
Gender Information

Information in Language
• Languages encode information differently
• Languages code information only partially
• Tension between BREVITY and PRECISION

Human beings use
• World knowledge
• Context (both linguistic and extra-linguistic)
• Cultural knowledge and
• Language conventions
to resolve ambiguities
Can all this knowledge be provided to the machine?

Languages differ
• Script (For written language)
• Vocabulary
• Grammar
These differences can be considered as a
measure of language distance

Language Distance
Pair               Script    Vocabulary   Grammar
Urdu -> Hindi      differs
Telugu -> Hindi    differs   differs
English -> Hindi   differs   differs      differs

Machine Translation
Machine translation aims at automatic translation of a text in the source language to a text in the target language.
Mohan gave Hari a book -> Mohan ne Hari ko kitAba dI

English to Hindi : An Example
SL (Eng) sentence: I met a boy who plays cricket with you everyday
Mapped to TL (Hin): I a boy met who everyday with you cricket plays
TL synthesis: mEM eka laDake se milA jo roza tumhAre sAtha kriketa khelatA hE
OR
mEM roza tumhAre sAtha kriketa khelanevAle eka laDake se milA
OR
mEM eka Ese laDake se milA jo roza tumhAre sAtha kriketa khelatA hE

Machine Translation : Challenges
• Languages encode information differently
• Language codes information only partially
• Tension between BREVITY and PRECISION
• Brevity wins leading to inherent ambiguity at different levels

Linguistic Issues in MT (1/2)
Look at the word 'plot' in the following examples
(a) The plot having rocks and boulders is not good.
(b) The plot having twists and turns is interesting.
'plot' in (a) means 'a piece of land' and
in (b) 'an outline of the events in a story'
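
One simple way to pick the sense of 'plot' is to look at the words around it. The sketch below is a toy illustration under stated assumptions: the cue sets hold only the words from sentences (a) and (b), the Hindi equivalents 'zamiin' and 'kathanak' are the ones this deck gives later for 'plot', and the function name is hypothetical. A real WSD module would need far richer context and resources.

```python
import string

# Context cues per target-language sense. The cue words come from the two
# example sentences; the table itself is an illustrative assumption.
SENSE_CUES = {
    "zamiin": {"rocks", "boulders"},    # 'a piece of land'
    "kathanak": {"twists", "turns"},    # 'outline of the events in a story'
}


def translate_plot(sentence):
    """Return the Hindi equivalent whose cues overlap most with the sentence, or None."""
    words = {w.strip(string.punctuation) for w in sentence.lower().split()}
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(translate_plot("The plot having rocks and boulders is not good."))
print(translate_plot("The plot having twists and turns is interesting."))
```

When no cue word is present the function returns None rather than guessing, which mirrors the point that the word alone is ambiguous.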

Linguistic Issues in MT (2/2)
• Ambiguity in language
  – Lexical level
  – Sentence level
• Structural differences between SL and TL

Lexical ambiguity
Lexical ambiguity can affect both
• Content words – nouns, verbs etc.
• Function words – prepositions, TAMs etc.
Content word ambiguity is of two types:
• Homonymy
• Polysemy

Homonymy
A word has two or more unrelated senses
Example :

I was walking on the bank (river-bank)
I deposited the money in the bank (money-bank)

Polysemy
'Act', an English noun
1. It was a kind act to help the blind man across the road (kArya)
2. The hero died in Act four, scene three (aMka)
3. Don't take her seriously, it's all an act (aBinaya)
4. The parliament has passed an Act (dhArA)

Function words can also pose problems (1/5)
• Prepositions
  – English prepositions in the target language
• Tense Aspect Modality (TAM)
  – Lexical correspondence of TAM

Function words can also pose problems (2/5)
Function words can also be ambiguous
For example – English preposition 'in'
(a) I met him in the garden
    mEM usase bagIce meM milA
(b) I met him in the morning
    mEM usase subaha 0 milA (0 = no overt postposition)
'Ambiguity' here refers to the 'appropriate correspondence' in the target language.

Function words can also pose problems (3/5)
He bought a shirt with tiny collars.
usane chote kOlaroM vAlI kamIza kharIdI
‘he tiny collars with shirt bought’
‘with’ gets translated as ‘vAlI’ in Hindi
He washed a shirt with soap.
usane sAbuna se kamIza dhoI
‘he soap with shirt washed’
‘with’ gets translated as ‘se’.
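
The 'in' and 'with' examples suggest that the Hindi correspondence depends on the kind of noun the preposition governs. A minimal sketch of such a rule table follows. The semantic classes ('place', 'time', 'part', 'instrument'), the table names and the function name are all assumptions for illustration; only the English words and the Hindi forms come from the slides.

```python
# Hypothetical semantic classes for the nouns in the examples.
NOUN_CLASS = {
    "garden": "place",
    "morning": "time",
    "collars": "part",
    "soap": "instrument",
}

# (preposition, noun class) -> Hindi correspondence; "" means no overt postposition.
PREP_RULES = {
    ("in", "place"): "meM",
    ("in", "time"): "",
    ("with", "part"): "vAlI",
    ("with", "instrument"): "se",
}


def translate_preposition(prep, noun):
    """Look up the Hindi form for `prep` given its noun; None if no rule applies."""
    return PREP_RULES.get((prep, NOUN_CLASS.get(noun)))


print(translate_preposition("in", "garden"))     # meM
print(translate_preposition("with", "soap"))     # se
print(translate_preposition("with", "collars"))  # vAlI
```

Keeping the rules in a table rather than in code makes it easy to add further prepositions or noun classes without touching the lookup function.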

Function words can also pose problems (4/5)
• TAM markers mark tense, aspect and modality
• Consist of inflections and/or auxiliary verbs in Hindi
• An important source of information
• Narrow down the meaning of a verb (e.g. lied, lay)

Function words can also pose problems (5/5)
English Simple Past vs Habitual
1a. He stayed in the guest house during his visit to our University in Jan (rahA)
1b. He stayed in the guest house whenever he visited us (rahatA thA)
2a. He went to the school just now (gayA)
2b. He went to the school everyday (jAtA thA)
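
In the four sentences above, the English verb form is the same and only an adverbial ('whenever', 'everyday') signals the habitual reading. The following toy sketch picks the Hindi form on that basis. The cue set and the verb table contain only what appears in these examples, and the function name is hypothetical; this is not how a full TAM module would work.

```python
# Adverbials that signal a habitual reading in the example sentences.
HABITUAL_CUES = {"whenever", "everyday"}

# English past form -> Hindi forms for the two readings (from the slide).
VERB_FORMS = {
    "stayed": {"past": "rahA", "habitual": "rahatA thA"},
    "went": {"past": "gayA", "habitual": "jAtA thA"},
}


def hindi_verb_form(sentence):
    """Choose the Hindi verb form for the first known verb in the sentence."""
    words = sentence.lower().rstrip(".").split()
    reading = "habitual" if HABITUAL_CUES & set(words) else "past"
    for word in words:
        if word in VERB_FORMS:
            return VERB_FORMS[word][reading]
    return None


print(hindi_verb_form("He went to the school just now"))
print(hindi_verb_form("He went to the school everyday"))
```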

Sentence level ambiguity
• I met the girl in the store
  Possible readings:
  a) I met the girl who works in the store
  b) I met the girl while I was in the store
• Time flies like an arrow.
  Possible parses:
  a) Time flies like an arrow (N V Prep Det N)
  b) Time flies like an arrow (N N V Det N)
  c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
  d) Time flies like an arrow (V N Prep Det N) (manner of timing)
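
To see where the four parses come from, one can count the derivations a small grammar allows. The sketch below uses a hand-written toy grammar and lexicon (both are assumptions made for this example, as are all names) and a memoised chart-style counter. With this grammar, 'Time flies like an arrow' gets four analyses, matching (a) to (d).

```python
from functools import lru_cache

# Toy lexicon: each word with its possible parts of speech.
LEXICON = {
    "time": {"N", "V"},
    "flies": {"N", "V"},
    "like": {"P", "V"},
    "an": {"Det"},
    "arrow": {"N"},
}

# Toy grammar with unary and binary rules only.
RULES = {
    "S": [("NP", "VP"), ("VP",)],                         # declarative or imperative
    "NP": [("N",), ("N", "N"), ("Det", "N"), ("NP", "PP")],
    "VP": [("V",), ("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
}


def count_parses(sentence, start="S"):
    """Count the distinct derivations of `sentence` from `start`."""
    words = sentence.lower().split()

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        if symbol not in RULES:  # part-of-speech symbol: must cover exactly one word
            return int(j - i == 1 and symbol in LEXICON.get(words[i], set()))
        total = 0
        for rhs in RULES[symbol]:
            if len(rhs) == 1:
                total += count(rhs[0], i, j)
            else:
                left, right = rhs
                for k in range(i + 1, j):  # every split point of the span
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(start, 0, len(words))


print(count_parses("Time flies like an arrow"))
```

The count grows quickly with sentence length and grammar size, which is one reason sentence-level ambiguity is hard for MT.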

Differences in SL and TL
Lexical level
(a) One word may translate into different words in different contexts (WSD)
    English 'plot' → zamiin, kathanak
(b) An SL word may not have a corresponding word in the TL (Gaps)
    English 'reads' in 'This book reads very well'
(c) Pronouns across Indian languages
    Hindi 'vaha' → Telugu 'adi', 'atanu', 'aame'

Differences in SL and TL
Structural differences
(a) Word order (English – Hindi)
(b) Nominal modification (Hindi – Tamil, Telugu etc.)
    (i) Relative clause vs relative participles
        Telugu: 'nenu tinnina camcaa'
        Hindi: *meraa khaayaa cammaca
               Maine jis cammaca se khaayaa hai vah cammaca
    (ii) Missing copula (Hindi – Telugu, Bengali, Tamil etc.)
        Telugu: raamudu mancivaadu
        Hindi: Ram acchaa ladakaa hai

Human beings use
World Knowledge
Context
Cultural knowledge and
Language conventions
To resolve ambiguities and interpret meaning

What to do for the machine?
Challenging problem!
• Providing all the knowledge may:
  - take too much time and effort
  - be difficult/become complex
  - not be possible (world knowledge is acquired from experience)
• Therefore,
  – Break the problem into smaller problems
  – Choose the solution as per the nature of the problem
  – Build language resources to the extent possible and continue to add to them
  – Engineer knowledge efficiently

Approaches to MT (1/2)
• Rule-based or transfer-based
  – Uses linguistic rules to map SL and TL, such as
    • Rules that map grammatical structures
    • Disambiguation rules
• Knowledge-based
  – Extensive knowledge of the domain
  – Concepts in the language
  – Ability to reason

Approaches to MT (2/2)
• Example-based
  – Mapping is based on stored example translations
  – Translation memory based
  – Uses phrases/words from earlier translations as examples
• Statistical
  – Does not formulate explicit linguistic knowledge
  – Develops rules based on probabilities
• Hybrid
  – Mixes two or more techniques

A Glance at MT Efforts in India (1/4)
• Domain specific
  – Mantra system (C-DAC, Pune)
    • Translation of govt. appointment letters
    • Uses Tree Adjoining Grammar
  – Public health campaign documents
    • Angla Bharati approach (C-DAC Noida & IIT Kanpur)

A Glance at MT Efforts in India (2/4)
• Application specific
  – Matra (human-aided MT) (NCST, now C-DAC, Mumbai)
• General purpose (not yet in use)
  – Angla Bharati approach (IIT Kanpur)
  – UNL based MT (IIT Bombay)
  – Shiva: EBMT (IIIT Hyderabad/IISc Bangalore)
  – Shakti: English-Hindi MT system (IIIT Hyderabad)

MT Efforts in India (3/4)
Major Government funded MT projects in consortium mode
• Indian Language to Indian Language Machine Translation (ILMT) (Lead institute: IIIT, Hyderabad)
• English to Indian Language Machine Translation
  – Mantra, Shakti etc. (Lead institute: C-DAC, Pune)
  – Anglabharati (Lead institute: IIT, Kanpur)
• Sanskrit to Hindi MT System (Lead institute: University of Hyderabad)

MT Efforts in India (4/4)
Anusaaraka: Language Accessor cum MT System
(IIIT, Hyderabad, Chinmaya Shodh Sansthan)

Our Focus
Sampark : Indian Language to Indian Language
MT systems
<sampark.org.in>

Sampark: Indian Language to Indian Language MT Systems
• Consortium mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions
IIIT, Hyderabad (Lead institute)
University of Hyderabad
IIT, Bombay
IIT, Kharagpur
AUKBC, Chennai
Jadavpur University, Kolkata
Tamil University, Thanjavur
IIIT, Trivandrum
IIIT, Allahabad
IISc, Bangalore
CDAC, Noida

Objectives
• Develop general purpose MT systems from one IL to another for 9 language pairs
  – Bidirectional
• Deliver domain specific versions of the MT systems. Domains are:
  – Tourism and pilgrimage
  – One additional domain (health/agriculture, box office reviews, electronic gadgets instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)
Tamil-Hindi
Telugu-Hindi
Marathi-Hindi
Bengali-Hindi
Tamil-Telugu
Urdu-Hindi
Kannada-Hindi
Punjabi-Hindi
Malayalam-Tamil

User Scenario
• Web based system for the tourism/pilgrimage domain
• A common traveller/tourist/pilgrim can access information in his language
• Access to selected Government portals in agriculture/health
• Automatic MT in domain
• General purpose web based translation
• Potential to attach to major search engines such as Google, Yahoo, Microsoft, Web-duniya

Design and Approach
• Largely transfer based
  – Analysis, Transfer, Generate
• Modular (modules could be statistical or rule based)
• Pipeline architecture
• Hybrid – some modules statistical, some rule based
• Analysis: shallow parser
• No deep parsing in the first phase
Approach
• Largely transfer based
  – Analysis, Transfer, Generate
• Modular
  – Modules could be statistical or rule based depending on the nature of the problem (Hybrid)
• Pipeline architecture
• Analysis: shallow parsing followed by a simple parser
Design
• Design decisions based on
  – the commonality in Indian languages
  – ease of extension to other languages
• Phase the development
  – Phase 1
    • Analysis at sentence level
    • Shallow parser
    • Simple parser
    • Transfer: map lexicon, structures, script
    • Generate the target
Design Contd
Phase 2
• Extend the analysis to discourse level
• Anaphora resolution
• Relations between clauses (discourse connectives)
• Word Sense Disambiguation (WSD)
• Named Entity Recognition (NER)
• Multi Word Expressions (MWE)
• Explore SMT for transfer rules

Transfer based MT
[Diagram: Source Sentence → (Analysis) → Source Analysis → (Transfer) → Analysis in Target Language → (Generation) → Target Sentence]
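
The three stages can be sketched as a pipeline of small functions, each consuming the previous one's output. This is a toy illustration only, not the Sampark code: it reuses the deck's earlier English to Hindi example ('Mohan gave Hari a book') for readability, assumes a fixed Subject Verb IndirectObject DirectObject pattern, and every name and table in it is hypothetical.

```python
# Toy transfer lexicon: nouns carry gender, the verb lists its perfective forms.
LEXICON = {
    "book": ("kitAba", "f"),
    "gave": {"m": "diyA", "f": "dI"},
}


def analyse(sentence):
    """Analysis: drop articles and label roles by position (toy assumption)."""
    words = [w for w in sentence.split() if w.lower() not in {"a", "an", "the"}]
    subj, verb, iobj, dobj = words
    return {"subj": subj, "verb": verb, "iobj": iobj, "dobj": dobj}


def transfer(analysis):
    """Transfer: map lexical items and keep the features generation needs."""
    noun, gender = LEXICON[analysis["dobj"]]
    return {
        "subj": analysis["subj"],
        "iobj": analysis["iobj"],
        "dobj": noun,
        "dobj_gender": gender,
        "verb_forms": LEXICON[analysis["verb"]],
    }


def generate(target):
    """Generation: SOV order, case markers, verb agreeing with the object."""
    verb = target["verb_forms"][target["dobj_gender"]]
    return f'{target["subj"]} ne {target["iobj"]} ko {target["dobj"]} {verb}'


def translate(sentence):
    result = sentence
    for module in (analyse, transfer, generate):  # pipeline architecture
        result = module(result)
    return result


print(translate("Mohan gave Hari a book"))  # Mohan ne Hari ko kitAba dI
```

Because each module only depends on the previous module's output format, any one of them could be swapped for a statistical or rule-based alternative without changing the others.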

[Diagram: Form (input sentence/text) → (Analysis) → Meaning → (Generation) → Form; labels: L1, L1]
Various types of linguistic information help in arriving from form to meaning.
It is complex.
Modularization helps in simplifying it.

Modularize
• Word
  – Structure (Morph Analyser)
  – In context
    • Syntactic: what it functions as (POS tagger)
    • Semantic: what it means (WSD)
• Relations between words
  – Local (local word grouping/chunking)
  – Non-local (subject, object/karaka)

[Diagram: Form (input sentence/text) → (Analysis) → Meaning → (Generation) → Form. Analysis layers shown: Semantic analysis, POS, Chunking, Parsing, Morph Analysis, Formal semantics]
All this information is implicit in language.
How to make it explicit?
Build resources – dictionaries, verb frames, treebanks

Sampark Architecture

Details
• Standards
  – Annotation standards – POS and Chunk
  – Input – output of each module
  – Representation – SSF
  – Data format – Dictionaries
• Emphasis on proper software engineering
  – Development environment – Dashboard
  – Blackboard architecture
  – CVS for version control
  – etc.

Machine Learning: Separating engines from language data
[Diagram: Training data (lang. L) + Engine for task T → Module for Task (T); Sentence in Language (L) → Module → Out → Manual Correction]
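
The idea of a language-independent engine plus language-specific training data can be sketched as follows. The engine here is a deliberately simple most-frequent-tag tagger, chosen only to keep the example short; it is not the tagger used in the project. The class name, the tiny training sets and the tag labels are illustrative assumptions.

```python
from collections import Counter, defaultdict


class TaggerEngine:
    """Language-independent engine: it learns everything from annotated data."""

    def __init__(self, default_tag="NN"):
        self.default_tag = default_tag
        self.best_tag = {}

    def train(self, tagged_sentences):
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word][tag] += 1
        # Remember the most frequent tag seen for each word.
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def tag(self, sentence):
        return [(w, self.best_tag.get(w, self.default_tag)) for w in sentence.split()]


# Same engine, different language data (toy, transliterated examples).
hindi_data = [[("cuuhe", "NN"), ("maarate", "VM"), ("haiM", "VAUX"),
               ("kuttoM", "NN"), ("ko", "PSP")]]
telugu_data = [[("raamudu", "NNP"), ("mancivaadu", "NN")]]

hindi_tagger = TaggerEngine().train(hindi_data)
telugu_tagger = TaggerEngine().train(telugu_data)

print(hindi_tagger.tag("cuuhe maarate haiM"))
print(telugu_tagger.tag("raamudu mancivaadu"))
```

Manually corrected output can be appended to the training data and the engine retrained, without any change to the engine code.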

Horizontal Tasks
H1 POS Tagging & Chunking engine
H2 Morph analyser engine
H3 Generator engine
H4 Lexical disambiguation engine
H5 Named entity engine
H6 Dictionary standards
H7 Corpora annotation standards
H8 Evaluation of output (comprehensibility)
H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker
V2 Morph analyzer
V3 Generator
V4 Named entity recognizer
V5 Bilingual dictionary – bidirectional
V6 Transfer grammar
V7 Annotated corpus
V8 Evaluation
V9 Co-ordination

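An Example : Hindi to Panjabi System
Panjabi (MT output):
ਭਾਰਤ ਵਿੱਚ ਆਰੀਆਂ ਦਾ ਆਗਮਨ ਈਸਾ ਦਾ ਕੋਈ 1500 ਸਾਲ ਪੂਰਵ ਹੋਇਆ.
ਆਰੀਆਂ ਦਾ ਪਹਲੀ ਖੇਪ ਰਿਗਵੈਦਿਕ ਆਰੀਆ ਕਹਾ ਹੈਂ.
ਰਿਗਵੇਦ ਦਾ ਰਚਨਾ ਇਹ ਸਮਾਂ ਹੋਈ.
ਰਿਗਵੇਦ ਦਾ ਕਈ ਬਾਤੇ ਅਵੇਸਤਾ ਨਾਲ ਮਿਲਦੀ ਹਨ.
ਅਵੇਸਤਾ ਈਰਾਨੀ ਭਾਸ਼ਾ ਦਾ ਪ੍ਰਾਚੀਨਤਮ ਗ੍ਰੰਥ ਹੈਂ.
Hindi (source):
भारत में आर्यों का आगमन ईसा के कोई 1500 वर्ष पूर्व हुआ ।
आर्यों की पहली खेप ऋग्वैदिक आर्य कहलाती है ।
ऋग्वेद की रचना इसी समय हुई ।
ऋग्वेद की कई बाते अवेस्ता से मिलती हैं ।
अवेस्ता ईरानी भाषा के प्राचीनतम ग्रंथ है ।
English gloss of the source:
The Aryans arrived in India about 1500 years before Christ. The first batch of Aryans is called the Rigvedic Aryans. The Rigveda was composed at this time. Many things in the Rigveda match the Avesta. The Avesta is the oldest text of the Iranian language.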

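Panjabi to Hindi
Hindi (MT output):
सरदार उपासक सिंह भारत का एक प्रमुख स्वतंत्रता संगरामिया था.
अमर बिंब बन जाने की कला में उन की कोई सानी नहीं.
उन ने केंद्रीय असंबली की बैठक में बम फेंक कर भी भागने से अस्वीकार कर दिया था.
उपासक सिंह को 23 मार्च 1931 को उन के साथियों, राजगुरू और सुखदेव का से फ़ांसी और लटका दिया गया था.
संपूर्ण देश ने उन की शहादत को याद किया.
Panjabi (source):
ਸਰਦਾਰ ਭਗਤ ਸਿੰਘ ਭਾਰਤ ਦੇ ਇੱਕ ਪ੍ਰਮੁੱਖ ਅਜ਼ਾਦੀ ਸੰਗਰਾਮੀਏ ਸਨ।
ਅਮਰ ਬਿੰਬ ਬਣ ਜਾਣ ਦੀ ਕਲਾ ਵਿੱਚ ਉਨ੍ਹਾਂ ਦਾ ਕੋਈ ਸਾਨੀ ਨਹੀਂ।
ਉਨ੍ਹਾਂ ਨੇ ਕੇਂਦਰੀ ਅਸੰਬਲੀ ਦੀ ਬੈਠਕ ਵਿੱਚ ਬੰਬ ਸੁੱਟ ਕੇ ਵੀ ਭੱਜਣ ਤੋਂ ਇਨਕਾਰ ਕਰ ਦਿੱਤਾ ਸੀ।
ਭਗਤ ਸਿੰਘ ਨੂੰ 23 ਮਾਰਚ 1931 ਨੂੰ ਉਨ੍ਹਾਂ ਦੇ ਸਾਥੀਆਂ, ਰਾਜਗੁਰੂ ਅਤੇ ਸੁਖਦੇਵ ਦੇ ਨਾਲ ਫ਼ਾਂਸੀ ਤੇ ਲਟਕਾ ਦਿੱਤਾ ਗਿਆ ਸੀ।
ਸਾਰੇ ਦੇਸ਼ ਨੇ ਉਨ੍ਹਾਂ ਦੀ ਸ਼ਹਾਦਤ ਨੂੰ ਯਾਦ ਕੀਤਾ।
English gloss of the source:
Sardar Bhagat Singh was one of India's leading freedom fighters. He has no equal in the art of becoming an immortal icon. Even after throwing a bomb at a session of the Central Assembly, he refused to flee. Bhagat Singh was hanged on 23 March 1931 along with his companions Rajguru and Sukhdev. The whole country remembered his martyrdom.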

Panjabi to Hindi
सरदार उपासक सिंह (NER) भारत का एक प्रमुख स्वतंत्रता संगरामिया था.
अमर बिंब (WSD) बन जाने की कला में उन की कोई सानी (Agreement) नहीं.
उन ने (word generation) केंद्रीय असंबली की बैठक में बम फेंक कर भी भागने से अस्वीकार कर दिया था.
उपासक सिंह को 23 मार्च 1931 को उन के साथियों, राजगुरू और सुखदेव का से (function word substitution) फ़ांसी और लटका दिया गया था.
संपूर्ण देश ने उन की शहादत को याद किया.

Evaluation
• Testing, system integration, and evaluation team – involvement of industry
• Regular in-house subjective evaluation
• Third party evaluation on system submission

Achievements of ILMT Project Phase I
18 MT systems built among Indian languages
Shallow parser for all 9 Indian languages
Lexical resources for all 9 languages
Largely built from scratch
Developed standards for all stages
Developed open architecture

Achievements – Deployment
• Deployed and running over web – 8 systems (sampark.org.in)
• Others deployed over ILMT test site
  – 4 more ready to go to Sampark soon
  – Rest are being evaluated and tested internally (they need a few more months to reach the quality levels required for the Sampark site)
• Constant quality improvement going on for various existing modules
• New modules are under testing and will soon be integrated

Future Tasks
• Enhance the quality of MT output
  – Enhancing dictionaries
  – Increasing coverage of grammar
• Adding new technology to ILMT systems
  – Full sentence parsing
  – Discourse processing – anaphora
• Target some users

Some Possibilities
• Possible tie-up with search engine companies
• Possible tie-up with content companies such as Dainik Jagran, Web duniya, Rediff, Yahoo
• Identify translation bureaus and agencies
  – Build MT workbench for their use, their domains, etc.
• Poised for major public impact with a unique technology.

Future Systems
• Add language pairs
  – Gujarati – Hindi
  – Kashmiri – Hindi
  – Manipuri – Hindi
  – Oriya – Hindi
  – etc.

CONCLUSION
Developing MT systems, though a challenging task,
is a useful effort particularly in the multilingual
context of India