Computational Representation of Language

About This Presentation

An overview of the computational representation of language, focusing in particular on the one-hot encoding technique.


Slide Content

Module 1 (Introducing Deep Learning): Computational Representation of Language

Computational Representation of Language
In order for deep learning models to process language, we have to supply that language to the model in a form it can digest. For all computer systems, this means a quantitative representation of language, such as a two-dimensional matrix of numerical values. Two popular methods for converting text into numbers are one-hot encoding and word vectors.

One-Hot Representation of Words
The traditional approach to encoding natural language numerically for machine processing is one-hot encoding (illustrated by the figure on this slide). In this approach, the words of a natural-language sentence (e.g., "the," "bat," "sat," "on," "the," and "cat") are represented by the columns of a matrix, while each row of the matrix represents a unique word. One-hot encodings of words, such as this example, predominate in the traditional machine learning approach to natural language processing.

One-Hot Representation of Words
If there are 100 unique words across the corpus of documents you're feeding into your natural language algorithm, then your matrix of one-hot-encoded words will have 100 rows; if there are 1,000 unique words across your corpus, then there will be 1,000 rows in your one-hot matrix, and so on. Cells within one-hot matrices hold binary values: each is a 0 or a 1. Each column contains at most a single 1 and is otherwise made up of 0s, meaning that one-hot matrices are sparse. A value of 1 indicates the presence of a particular word (row) at a particular position (column) within the corpus. Our example corpus has only six words, five of which are unique; the repeated word "the" shows up as the two 1s in the first row of the matrix.

One-Hot Representation of Words: Example
Corpus: "the bat ate the food dog"
Total words in the corpus: 6
Unique words: {the, bat, ate, food, dog} (only 5, because "the" repeats)
One-hot matrix setup: rows = unique words = 5; columns = total word positions in the corpus = 6

One-Hot Representation of Words: Filling in the Matrix
Now we mark where each word appears:
Position 1 = the → row 1, column 1 = 1
Position 2 = bat → row 2, column 2 = 1
Position 3 = ate → row 3, column 3 = 1
Position 4 = the → row 1, column 4 = 1
Position 5 = food → row 4, column 5 = 1
Position 6 = dog → row 5, column 6 = 1
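This construction is easy to sketch in code. Below is a minimal Python/NumPy version of the same example; the variable names (corpus, vocab, one_hot) are illustrative, not from the slides.

import numpy as np

corpus = ["the", "bat", "ate", "the", "food", "dog"]  # 6 word positions
vocab = ["the", "bat", "ate", "food", "dog"]          # 5 unique words, in order of first appearance
word_to_row = {word: i for i, word in enumerate(vocab)}

# Rows = unique words, columns = word positions in the corpus.
one_hot = np.zeros((len(vocab), len(corpus)), dtype=int)
for position, word in enumerate(corpus):
    one_hot[word_to_row[word], position] = 1

print(one_hot)
# [[1 0 0 1 0 0]   <- "the" appears at positions 1 and 4
#  [0 1 0 0 0 0]   <- "bat"
#  [0 0 1 0 0 0]   <- "ate"
#  [0 0 0 0 1 0]   <- "food"
#  [0 0 0 0 0 1]]  <- "dog"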

One-Hot Representation of Words: Key Observations
Sparse matrix: most entries are 0; there is only one 1 per column.
Repetition captured: the word "the" has two 1s (in columns 1 and 4).
Simple but limited: the encoding does not capture similarity between words (e.g., "dog" and "bat" are equally distant from "the"), and it grows very large for big vocabularies (e.g., 50,000 words = 50,000 rows!).
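The "equally distant" claim is easy to verify: any two distinct one-hot vectors have zero dot product and the same Euclidean distance (sqrt(2)), so no word is geometrically closer to any other. A quick check, reusing the vocab ordering from the sketch above:

import numpy as np

vocab = ["the", "bat", "ate", "food", "dog"]
# The one-hot vector for each unique word is a row of the identity matrix.
vectors = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

print(np.linalg.norm(vectors["dog"] - vectors["the"]))  # 1.414... (sqrt(2))
print(np.linalg.norm(vectors["bat"] - vectors["the"]))  # 1.414...: equally distant
print(vectors["dog"] @ vectors["the"])                  # 0.0: no overlap, hence no notion of similarity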

One-Hot Representation of Words: Why
In natural language and other categorical data, models cannot directly process words like "dog" or "bat"; we need to convert categorical values into numbers. One-hot encoding provides a simple numeric representation in which each unique category/word gets its own dimension, and only one dimension is active (1) while all others are zero. It acts as a bridge between raw text and machine learning models.
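That bridge can be written as a one-line lookup. A minimal sketch, assuming a fixed vocabulary (the function name one_hot_vector is hypothetical):

def one_hot_vector(word, vocab):
    # Return the one-hot encoding of `word`: a 1 in the dimension
    # reserved for that word, 0 everywhere else.
    if word not in vocab:
        raise ValueError(f"'{word}' is not in the vocabulary")
    return [1 if w == word else 0 for w in vocab]

vocab = ["the", "bat", "ate", "food", "dog"]
print(one_hot_vector("dog", vocab))  # [0, 0, 0, 0, 1]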

One-Hot Representation of Words: Benefits
1. Simple and intuitive: easy to implement and understand, with a direct mapping between word and vector position.
2. Preserves uniqueness: each word/category is represented uniquely, with no ambiguity.
3. Suitable for small vocabularies: works well for datasets with a limited number of categories.
4. Non-numeric data → numeric: enables ML models (which require numbers) to work with categorical/textual data.

One-Hot Representation of Words: Applications
1. Natural language processing (NLP): representing words in text before moving to embeddings, e.g., representing the words in "the cat sat" as one-hot vectors.
2. Machine learning on categorical data: encoding categorical features (like country = {India, USA, UK}) in classification/regression tasks.
3. Image processing (labels): class labels (e.g., dog, cat, horse) are often one-hot encoded for training neural networks (see the sketch after this list).
4. Recommender systems: users/items represented as one-hot vectors in collaborative filtering models.
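For application 3, a common NumPy idiom turns integer class labels into one-hot rows by indexing the identity matrix. A small sketch; the class list and labels are made up for illustration:

import numpy as np

classes = ["dog", "cat", "horse"]              # class indices 0, 1, 2
labels = np.array([0, 2, 1, 0])                # integer labels for four training images
one_hot_labels = np.eye(len(classes))[labels]  # each label selects a row of the identity matrix

print(one_hot_labels)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]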

One-Hot Representation of Words: Pros and Cons
Simplicity: easy to compute, interpret, and implement.
No ordinal assumption: unlike label encoding, it does not impose a false ranking (e.g., Cat=1, Dog=2 does not mean Dog > Cat).
Model-friendly: many ML models (such as logistic regression and neural networks) work better with one-hot encoded categorical features.
Cons: as noted under Key Observations, one-hot matrices are sparse, grow very large for big vocabularies, and capture no similarity between words.
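To make the "no ordinal assumption" point concrete, here is a tiny illustrative comparison of the two encodings (the mappings themselves are made up):

# Label encoding imposes an artificial order on unordered categories:
label_encoded = {"cat": 1, "dog": 2}
# A model consuming these numbers "sees" dog > cat, a ranking that does not exist.

# One-hot encoding gives each category its own dimension, with no implied order:
one_hot_encoded = {"cat": [1, 0], "dog": [0, 1]}
# Both vectors have the same norm and zero overlap, so neither outranks the other.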

THANK YOU