Tamil Language Computing: The Present and the Future

iamsarves 276 views 27 slides May 30, 2024

About This Presentation

This presentation outlines the present and future of Tamil computing.


Slide Content

Tamil Language Computing:
the Present and the Future
Dr. K. Sarveswaran
Chairperson: Section B, Jaffna Science Association.
Department of Computer Science, University of Jaffna, Sri Lanka.
29th Annual Sessions, Jaffna Science Association, 29-31 March 2023

Language Computing
What it is
Enabling machines to understand, analyse, generate, and
communicate in natural language!

-> Natural language: Text and Speech

Language Computing
Fields
●Natural Language Processing
○a field of Artificial Intelligence (AI)
○more technology focused
●Computational Linguistics
○understanding languages
○linguistics insights
●Language enabling
○encoding / input methods
○localisation, IDNs

Language Computing

Why
●Makes our life easy:
○language computing is everywhere.
○break language barriers.
○make processes more efficient and fast.
○analyse/construct knowledge.
●High commercial value.
●Supports for research/exploration:
○language/linguistics.
○humanity.

Until very recent times
[Diagram. Labels: Structured information; Natural Language; Commands / programming languages]

Recent advancements
[Diagram. Labels: Structured information; Natural Language; Self-learning; Knowledge sources; Instructions on how/what to learn]

Language Computing
The big picture
Building applications:
●We use applications every day
●Examples:
○Google Search
○Machine Translators
○Content generators
○Google Assistant
●Built by technologists
Building computational resources:
●Computers use these to learn languages
○Corpora
○Analysers / Parsers
○Dictionaries
○Annotated voice recordings
●Analysers/parsers to create annotations
●Large Language Models
Language enabling work / Foundational work



The present!

Foundational work
●Encoding - Unicode.
●Several input methods.
●Unicode fonts are available.
●Applications support Unicode.
●However,
○even now, conference papers are in ASCII.
○Theses are written in ASCII.
○Conversion issues.
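
As a small illustration of what Unicode encoding means for Tamil text, the sketch below lists the code points of one word and shows that the same syllable can be stored in two different ways. It assumes only Python's standard `unicodedata` module. The choice of normalisation as the example is an assumption made here; the slide does not say which conversion issues it has in mind.

```python
import unicodedata

# The word "Tamil" (தமிழ்) written with explicit Unicode escapes.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(word):
    print(code_point, name)

# The syllable "ko" can be stored as KA + VOWEL SIGN O, or as
# KA + VOWEL SIGN E + VOWEL SIGN AA. NFC normalisation maps the
# second spelling onto the first, so the two compare equal afterwards.
composed = "\u0b95\u0bca"
decomposed = "\u0b95\u0bc6\u0bbe"
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(composed == decomposed, same_after_nfc)
```

Normalising text before comparing, searching, or counting words is one inexpensive way to avoid treating two spellings of the same syllable as different strings.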

Computational Resources
Introduction - raw resources
●Monolingual/Parallel Corpora
○Plenty, but far fewer than for English
■AI4Bharat - https://ai4bharat.org/
■Kaggle - https://www.kaggle.com/datasets/neechalkaran/venmurasu
●Dictionaries / word lists
○Wikidata
○Glossaries
○Not many are available

Computational Resources
Introduction - level of annotations
●Linguistic annotation
○Phonology - how phonemes make up a sound.
○Morphology - how morphemes make up a word.
○Syntax - how words make up a sentence.
○Semantics - ways in which a language conveys meaning.
○Pragmatics - ways in which a language is used in a context.
●Metadata annotation
○Date, Author, Genre, Source…

Annotated Resources

Morphology
●A few resources

●Morphological analysers / generators
○useful to generate data.
○learn morphology.
●Still no derivational analysis.
●No standards (Universal Morphology scheme does not capture
Tamil well)
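
To make the analyser/generator idea concrete, here is a deliberately tiny sketch. The suffix table, the function names, and the feature label are all hypothetical and cover only the plural marker கள்; a real Tamil analyser has to handle sandhi, case markers, verb morphology, and much more.

```python
# Hypothetical suffix table: only the plural marker is listed here.
SUFFIXES = {
    "கள்": "Number=Plur",
}

def analyse(word):
    """Split a word into (stem, features) using the toy suffix table."""
    for suffix, feature in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], [feature]
    return word, []

def generate(stem, feature):
    """Attach the suffix for a feature; ignores sandhi rules entirely."""
    for suffix, known_feature in SUFFIXES.items():
        if known_feature == feature:
            return stem + suffix
    raise ValueError(f"no suffix known for {feature}")

stem, features = analyse("வீடுகள்")
print(stem, features)
print(generate(stem, "Number=Plur"))
```

Running an analyser over raw text and a generator over a word list are two ways such tools can be used to produce annotated data.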

Computational (annotated) Resources

Part of Speech (POS)
●Part of Speech (POS) tagged corpus
தமிழ்/PROPN எங்கள்/PRON உயிருக்கு/NOUN நேர்/NOUN ./PUNCT
●Many annotation schemes
○BIS (Bureau of Indian Standards) / UPOS (Universal POS) / native ones
●Challenging task
○context is important
○requires more research - mixed categories, adjectivals
●Few POS taggers available
○there are open source & off-the-shelf taggers
○need annotated data
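
A POS-tagged corpus in the `token/TAG` layout shown on this slide can be read with a few lines of code. The sketch below is a minimal reader, assuming whitespace-separated items and a tag after the last slash; the function name `parse_tagged` is made up for this example.

```python
from collections import Counter

def parse_tagged(line):
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so tokens containing '/' stay intact.
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs

sentence = "தமிழ்/PROPN எங்கள்/PRON உயிருக்கு/NOUN நேர்/NOUN ./PUNCT"
pairs = parse_tagged(sentence)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs)
print(tag_counts)
```

Tag counts like these are a quick sanity check on an annotated corpus before it is used to train a tagger.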

Computational (annotated) Resources

Grammar
●Dependency grammar works well for our languages.
●Two formalisms are used for annotations, so far:
○Lexical Functional Grammar
○Universal Dependencies
●A significant amount of research is required to identify structures.
○E.g.
■gapping constructions:
கண்ணன் கொழும்புக்கும் ராதா கண்டிக்கும் போனார்கள்
("Kannan went to Colombo and Radha to Kandy")
■multiword tokens: அவன் அமைச்சராகப் பார்த்தான்
■verbal constructions.

Computational (annotated) Resources

Grammar - The Universal Dependencies
1 அங்கு அங்கு ADV _ _ 4 advmod _ _
2 நிறைய நிறைய ADJ _ _ 3 amod _ _
3 வீடுகள் வீடு NOUN _ _ 4 nsubj _ _
4 இருக்கின்றன இரு VERB _ _ 0 root _ _
5 . . PUNCT _ _ 4 punct _ _
இருக்கின்றன => Gender=Neut|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Polite=Form|Tense=Pres|VerbForm=Fin|Voice=Act
●The Universal Dependencies
○captures morphology and dependency syntax
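
The annotation above follows the ten-column CoNLL-U layout used by Universal Dependencies (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below reads those rows and the feature string into Python structures. It splits on whitespace because the slide text uses spaces, whereas real CoNLL-U files are tab-separated; the function names are made up for this example.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

SAMPLE = """\
1 அங்கு அங்கு ADV _ _ 4 advmod _ _
2 நிறைய நிறைய ADJ _ _ 3 amod _ _
3 வீடுகள் வீடு NOUN _ _ 4 nsubj _ _
4 இருக்கின்றன இரு VERB _ _ 0 root _ _
5 . . PUNCT _ _ 4 punct _ _
"""

def parse_rows(text):
    """Turn each token line into a dict keyed by CoNLL-U field name."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        row = dict(zip(CONLLU_FIELDS, line.split()))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        rows.append(row)
    return rows

def parse_feats(feats):
    """Turn 'Key=Value|Key=Value' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

rows = parse_rows(SAMPLE)
root = next(row for row in rows if row["head"] == 0)
dependent_ids = [row["id"] for row in rows if row["head"] == root["id"]]
feats = parse_feats(
    "Gender=Neut|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|"
    "Polite=Form|Tense=Pres|VerbForm=Fin|Voice=Act"
)

print(root["form"], dependent_ids)
print(feats)
```

The HEAD column is what encodes the dependency tree: token 4 is the root, and tokens 1, 3, and 5 attach to it directly.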

Computational (annotated) Resources

Grammar - Lexical Functional Grammar (LFG)
●Captures morphology, dependency syntax, and constituency
structure.

Language Models

Introduction
●A computational resource - Statistical model?
●Trained on very large datasets, which can include much of what has
been written on the internet or is available in digital form.
●Some are multilingual.
●Large language models recognize, summarize, translate, predict
and generate text and other content.
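
To show what "statistical model" can mean at its simplest, the sketch below trains a bigram model: it counts which word follows which and turns the counts into next-word probabilities. This is a toy, not how large language models are built. The three-sentence corpus, the whitespace tokenisation, and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Relative frequency of each token seen after prev."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts[prev].items()}

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

Nothing here is language-specific, so the same counting works on Tamil text. Larger and cleaner corpora give better estimates, which is why data quantity and quality matter so much.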

Language Models

Introduction
●Various algorithms are available to train these models.
●Different kinds of models:
○convert the given text to numbers which can be understood
by machines (encoder).
○convert the given text to another text form (encoder-decoder).
○convert the given numbers to text (decoder).
●These models can be customised for various tasks.
○For instance, models built with the Fairseq toolkit can be customised to do translation.
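
The text-to-numbers and numbers-to-text directions can be illustrated with a plain vocabulary lookup. This is only the interface, under loud assumptions: real encoders and decoders are trained neural networks that work with learned vectors, while the hypothetical `build_vocab`, `encode`, and `decode` below just map whitespace-separated tokens to integer ids and back.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct token; 0 is reserved."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Text to numbers: unknown tokens map to the <unk> id."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]

def decode(ids, vocab):
    """Numbers to text: invert the vocabulary lookup."""
    inverse = {index: token for token, index in vocab.items()}
    return " ".join(inverse[index] for index in ids)

sentence = "அங்கு நிறைய வீடுகள் இருக்கின்றன"
vocab = build_vocab([sentence])
ids = encode(sentence, vocab)
print(ids)
print(decode(ids, vocab))
```

Any token outside the vocabulary collapses to `<unk>`. This is one reason production systems split words into smaller subword units instead of using whole words.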

Language Models

Examples
●Example:
○https://huggingface.co/gpt2
○https://huggingface.co/abinayam/gpt-2-tamil
○https://huggingface.co/xlm-roberta-base

Language Models
Challenges
●Require
○HUGE amount of data - QUALITY data!!
○Computational power
■~$4M for training GPT-3 [1]
■ChatGPT running cost - ~$100,000/day [1]

●Training processes are not transparent
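
Taking the two cost figures quoted above at face value, a little arithmetic puts them on the same scale. The 365-day year is the only assumption added here.

```python
# Figures as quoted on the slide (approximate, in US dollars).
TRAINING_COST_USD = 4_000_000        # one-off cost of training GPT-3
RUNNING_COST_PER_DAY_USD = 100_000   # daily cost of running ChatGPT

# Assumption: the service runs every day of a 365-day year.
yearly_running_cost = RUNNING_COST_PER_DAY_USD * 365
days_to_match_training = TRAINING_COST_USD / RUNNING_COST_PER_DAY_USD

print(f"Running cost per year: ${yearly_running_cost:,}")
print(f"Days of running that equal the training cost: {days_to_match_training:.0f}")
```

On these numbers, about 40 days of serving the model costs as much as training it once. Budgets for such systems therefore need to cover operation as well as training.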
[1] https://www.cnbc.com/2023/03/13/chatgpt-and-generative-ai-are-booming-but-at-a-very-expensive-price.html

Speech processing
Introduction
●There are commercial solutions
○Google Assistant
○processes are not transparent
●Some open data available
○https://commonvoice.mozilla.org/ta
●Little or no annotated voice data

Applications
not a complete list!
●Machine Translators - Google Translate / SiTa
●Spell Checkers - http://vaani.neechalkaran.com/
●Sentiment analysers - https://huggingface.co/Vasanth/tamil-sentiment-distilbert
●Text generators / synthesisers / summarisers
●Text to Voice - https://www.narakeet.com/languages/tamil-text-to-speech/

The future

Recommendation
Focus on quality language data and research collaboration
●We need quality data for training and testing
○garbage in, garbage out.
●No benchmark datasets.
●Require more linguistic studies
○we need to understand the data.
●Focus on local and dialect varieties
○each variety has its own qualities.
●Use Tamil as much as possible.
●Publicise resources
●More collaboration

Recommendation
Focus on low-resource technologies
●Invest in low-resource language technologies/approaches
○Multilingual learning
○Transfer learning
○Data augmentation
○Domain adaptation
●Identify features to be tuned for Tamil
●Neuron-level interpretation

Recommendation
Focus on speech technologies
●Voice recognition will be a key part of the future of
communication
○more data.
○research on prosody, etc.

[email protected]
Thank you