Tamil Language Computing: The Present and the Future
About This Presentation
This presentation outlines the present and future of Tamil computing.
Added: May 30, 2024
Slides: 27 pages
Slide Content
Tamil Language Computing:
the Present and the Future
Dr. K. Sarveswaran
Chairperson, Section B, Jaffna Science Association.
Department of Computer Science, University of Jaffna, Sri Lanka.
29th Annual Sessions, Jaffna Science Association, 29-31 March 2023
Language Computing
What it is
Enabling machines to understand, analyse, generate, and
communicate in natural language!
-> Natural language: Text and Speech
Language Computing
Fields
●Natural Language Processing
○a field of Artificial Intelligence (AI)
○more technology focused
●Computational Linguistics
○understanding languages
○linguistics insights
●Language enabling
○encoding / input methods
○localisation, IDNs
Language Computing
Why
●Makes our lives easier:
○language computing is everywhere.
○breaks language barriers.
○makes processes more efficient and faster.
○helps analyse/construct knowledge.
●High commercial value.
●Supports research/exploration:
○language/linguistics.
○humanity.
Until very recent times
[Diagram labels: Structured information; Natural Language; Commands / Prog-Lang]
Recent advancements
[Diagram labels: Structured information; Natural Language; Self-Learning; Knowledge sources; Instructions on how/what to learn]
Language Computing
The big picture
Building applications:
●We use applications every day
●Examples:
○Google Search
○Machine Translators
○Content generators
○Google Assistant
●Built by technologists
Building computational resources:
●Computers use these to learn languages
○Corpora
○Analysers / Parsers
○Dictionaries
○Annotated voice recordings
●Analysers/parsers to create annotations
●Large Language Models
Language enabling work / Foundational work
The present!
Foundational work
●Encoding - Unicode.
●Several input methods.
●Unicode fonts are available.
●Applications support Unicode.
●However,
○even now, conference papers are in ASCII (legacy, non-Unicode font encodings).
○theses are written in ASCII too.
○Conversion issues.
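To make the encoding point concrete, here is a minimal Python sketch that lists the Unicode code points behind a Tamil word. The sample word comes from the POS example in these slides; the helper name `describe` is hypothetical and only for illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


word = "தமிழ்"  # sample word taken from the POS example in these slides

for code_point, name in describe(word):
    print(code_point, name)

# Tamil characters sit in the Unicode Tamil block (U+0B80 to U+0BFF)
in_tamil_block = all(0x0B80 <= ord(ch) <= 0x0BFF for ch in word)
# Code points in this range take three bytes each in UTF-8
utf8_length = len(word.encode("utf-8"))
print(in_tamil_block, utf8_length)
```

Text stored this way stays readable in any Unicode-aware application, which is not the case for text typed with fonts that reuse ASCII code points.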
Computational Resources
Introduction - raw resources
●Monolingual/Parallel Corpora
○Plenty, but far less than for English
■AI4Bharat - https://ai4bharat.org/
■Kaggle - https://www.kaggle.com/datasets/neechalkaran/venmurasu
●Dictionaries / word lists
○Wikidata
○Glossaries
○Not many are available
Computational Resources
Introduction - level of annotations
●Linguistic annotation
○Phonology - how phonemes make up a sound.
○Morphology - how morphemes make up a word.
○Syntax - how words make up a sentence.
○Semantics - ways in which a language conveys meaning.
○Pragmatics - ways in which a language is used in a context.
●Metadata annotation
○Date, Author, Genre, Source…
Annotated Resources
Morphology
●A few resources
●Morphological analysers / generators
○useful to generate data.
○learn morphology.
●Still no derivational analysis.
●No standards (the Universal Morphology scheme does not capture Tamil well)
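As a rough illustration of what a morphological analyser and generator do, the toy Python sketch below handles a single case: the plural suffix seen in வீடுகள் (lemma வீடு) from the Universal Dependencies example in these slides. The function names and the output format are assumptions made for this sketch. It ignores sandhi and every other suffix, so it is far from a real analyser.

```python
PLURAL_SUFFIX = "கள்"  # plural marker, as in வீடுகள் (lemma வீடு)


def analyse(word):
    """Toy analysis: split off the plural suffix if the word ends with it."""
    if word.endswith(PLURAL_SUFFIX) and len(word) > len(PLURAL_SUFFIX):
        lemma = word[: -len(PLURAL_SUFFIX)]
        return {"lemma": lemma, "feats": {"Number": "Plur"}}
    # No suffix recognised: return the word unchanged with no features
    return {"lemma": word, "feats": {}}


def generate(lemma, plural=False):
    """Toy generation: the inverse of analyse for the same single suffix."""
    return lemma + PLURAL_SUFFIX if plural else lemma


surface = generate("வீடு", plural=True)
print(surface, analyse(surface))
```

Running the generator over a word list and then analysing the output is one way such tools can produce labelled data, which is the use the slide points to.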
Computational (annotated) Resources
Part of Speech (POS)
●Part of Speech (POS) tagged corpus
தமிழ்/PROPN எங்கள்/PRON உயிருக்கு/NOUN நேர்/NOUN ./PUNCT
●Many annotation schemes
○BIS (Bureau of Indian Standards) / UPOS (Universal POS) / native ones
●Challenging task
○context is important
○requires more research - mixed categories, adjectivals
●Few POS taggers available
○there are open source & off-the-shelf taggers
○need annotated data
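The word/TAG format shown on this slide is easy to read programmatically. The following Python sketch, with the hypothetical helper `parse_tagged`, splits the example sentence into (word, tag) pairs and counts the tags; it assumes tokens are separated by spaces and that the tag follows the last slash.

```python
from collections import Counter


def parse_tagged(line):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash only, so a word containing '/' survives
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = "தமிழ்/PROPN எங்கள்/PRON உயிருக்கு/NOUN நேர்/NOUN ./PUNCT"
pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs)
print(tag_counts)
```

Pairs in this shape are the usual starting point for training or evaluating an off-the-shelf tagger.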
Computational (annotated) Resources
Grammar
●Dependency grammar works well for our languages.
●Two formalisms are used for annotations, so far:
○Lexical Functional Grammar
○the Universal Dependencies
●A significant amount of research is required in identifying structures.
○E.g.
■gapping constructions:
கண்ணன் கொழும்புக்கும் ராதா கண்டிக்கும் போனார்கள்
■multiword tokens: அவன் அமைச்சராகப் பார்த்தான்
■verbal constructions.
Computational (annotated) Resources
Grammar - The Universal Dependencies
1 அங்கு அங்கு ADV _ _ 4 advmod _ _
2 நிறைய நிறைய ADJ _ _ 3 amod _ _
3 வீடுகள் வீடு NOUN _ _ 4 nsubj _ _
4 இருக்கின்றன இரு VERB _ _ 0 root _ _
5 . . PUNCT _ _ 4 punct _ _
இருக்கின்றன =>
Gender=Neut|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Polite=Form|Tense=Pres|VerbForm=Fin|Voice=Act
●The Universal Dependencies
○captures morphology and dependency syntax
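The annotated sentence on this slide follows the ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The Python sketch below reads it and prints each dependency; the helper names are hypothetical. It splits on whitespace because the slide shows spaces, whereas real CoNLL-U files are tab-separated.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

SENTENCE = """\
1 அங்கு அங்கு ADV _ _ 4 advmod _ _
2 நிறைய நிறைய ADJ _ _ 3 amod _ _
3 வீடுகள் வீடு NOUN _ _ 4 nsubj _ _
4 இருக்கின்றன இரு VERB _ _ 0 root _ _
5 . . PUNCT _ _ 4 punct _ _
"""


def parse_sentence(text):
    """Turn each line into a dict keyed by the CoNLL-U column names."""
    tokens = []
    for line in text.strip().splitlines():
        token = dict(zip(FIELDS, line.split()))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


def parse_feats(feats):
    """Split a 'Key=Value|Key=Value' feature string into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


tokens = parse_sentence(SENTENCE)
by_id = {t["id"]: t for t in tokens}

for t in tokens:
    # Head 0 marks the root of the sentence
    head = by_id[t["head"]]["form"] if t["head"] else "ROOT"
    print(f'{t["form"]} --{t["deprel"]}--> {head}')

verb_feats = parse_feats(
    "Gender=Neut|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|"
    "Polite=Form|Tense=Pres|VerbForm=Fin|Voice=Act"
)
print(verb_feats)
```

The feature string is the one shown for இருக்கின்றன on the slide; in a complete file it would appear in the FEATS column rather than as `_`.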
Computational (annotated) Resources
Grammar - Lexical Functional Grammar (LFG)
●Captures morphology, dependency syntax, and constituency
structure.
Language Models
Introduction
●A computational resource - Statistical model?
●Trained on large datasets - these include nearly everything that has
been written on the internet or is available in digital form.
●Some are multilingual.
●Large language models recognize, summarize, translate, predict
and generate text and other content.
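To give a feel for what "statistical model" means here, the Python sketch below builds a tiny bigram model: it counts which word follows which and turns the counts into next-word probabilities. The three-sentence corpus is made up for illustration from words in these slides, and the function names are hypothetical. Large language models are neural networks trained on vastly more data, so this only shows the basic idea of predicting text from observed data.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (an assumption for this sketch, not real data)
corpus = [
    "அங்கு நிறைய வீடுகள் இருக்கின்றன",
    "அங்கு வீடுகள் இருக்கின்றன",
    "இங்கு நிறைய வீடுகள் இருக்கின்றன",
]


def train_bigrams(sentences):
    """Count how often each word follows the previous one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Relative frequency of each word seen after prev."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


model = train_bigrams(corpus)
first_sentence = corpus[0].split()

# Probabilities for the word that follows the third word of sentence one
print(next_word_probs(model, first_sentence[2]))
# Probabilities for the first word of a sentence
print(next_word_probs(model, "<s>"))
```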
Language Models
Introduction
●Various algorithms are available to train these models.
●Different kinds of models:
○convert the given text to numbers which can be understood by machines (encoder).
○convert the given text to another text form (encoder-decoder).
○convert the given numbers to text (decoder).
●These models can be customised for various tasks.
○For instance, Fairseq can be customised to do translation.
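The "text to numbers" and "numbers to text" directions can be shown with a toy character-level mapping in Python. This is only the simplest possible stand-in: real encoders and decoders work on subword units and learned vectors, not a lookup table. All names below are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct character in the texts."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Text to numbers: look up the id of each character."""
    return [vocab[ch] for ch in text]


def decode(ids, vocab):
    """Numbers to text: invert the lookup and join the characters."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


sample = "அங்கு நிறைய வீடுகள் இருக்கின்றன"
vocab = build_vocab([sample])
ids = encode(sample, vocab)

print(ids)
print(decode(ids, vocab))
```

The round trip returns the original sentence, which is the property any text-to-numbers scheme needs before a model can be trained on top of it.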
Language Models
Examples
●Example:
○https://huggingface.co/gpt2
○https://huggingface.co/abinayam/gpt-2-tamil
○https://huggingface.co/xlm-roberta-base
Language Models
Challenges
●Require
○HUGE amount of data - QUALITY data!!
○Computational power
■~$4M for training GPT-3 [1]
■ChatGPT running cost - ~$100,000/day [1]
●Training processes are not transparent
[1] https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html
Speech processing
Introduction
●There are commercial solutions
○Google Assistant
○processes are not transparent
●Some open data available
○https://commonvoice.mozilla.org/ta
●No (or very little) annotated voice data
Applications
not a complete list!
●Machine Translators - Google Translate / SiTa
●Spell Checkers - http://vaani.neechalkaran.com/
●Sentiment analysers - https://huggingface.co/Vasanth/tamil-sentiment-distilbert
●Text generators / synthesisers / summarisers
●Text to Voice - https://www.narakeet.com/languages/tamil-text-to-speech/
The future
Recommendation
Focus on quality language data and research collaboration
●We need quality data for training and testing
○garbage in, garbage out.
●No benchmark datasets.
●Require more linguistic studies
○we need to understand the data.
●Focus on local and dialect varieties
○each language has its own qualities.
●Use Tamil as much as possible.
●Publicise resources
●More collaboration
Recommendation
Focus on low-resource technologies
●Invest in low-resource language technologies/approaches
○Multilingual learning
○Transfer learning
○Data augmentation
○Domain adaptation
●Identify features to be tuned for Tamil
●Neuron-level interpretation
Recommendation
Focus on speech technologies
●Voice recognition will be a key part of the future of communication
○more data.
○research on prosody, etc.