data science and analytics in computer science

uthradevia5 10 views 10 slides Aug 24, 2024

About This Presentation

Collecting raw text, text representation (tokenization and case folding), and TF-IDF.


Slide Content

NADAR SARASWATHI COLLEGE OF ARTS AND SCIENCE
DATA SCIENCE & ANALYTICS
A. Uthra Devi, II M.Sc Computer Science

COLLECT RAW TEXT The data science team investigates the problem, understands the necessary data sources, and formulates initial hypotheses. Data must be collected before anything else can happen. The team starts by actively monitoring various websites for user-generated content. The user-generated content being collected could include related articles from news portals and blogs, comments on ACME's products from online shops or review sites, or social media posts that contain the keywords bPhone or bEbook.

Regardless of the data source, the team deals with semi-structured data such as HTML web pages, Really Simple Syndication (RSS) feeds, XML, or JavaScript Object Notation (JSON) files. Enough structure needs to be imposed to find the parts of the raw text that the team really cares about. Many news portals and blogs provide data feeds in an open standard format, such as RSS or XML. Regular expressions can find words and strings that match particular patterns in the text effectively and efficiently.
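A minimal sketch of this step, assuming a hypothetical RSS/HTML snippet and the product keywords mentioned above; the tag names and sample text are made up for illustration:

```python
import re

# Hypothetical semi-structured feed content (RSS-like items).
raw_feed = """
<item><title>bPhone review</title>
<description>The new bPhone camera is great, but battery life is short.</description></item>
<item><title>Reader roundup</title>
<description>The bEbook still leads on screen quality.</description></item>
"""

# A regular expression imposes just enough structure to pull out
# the text the team cares about.
descriptions = re.findall(r"<description>(.*?)</description>", raw_feed, re.DOTALL)

# Keep only entries mentioning the product keywords.
product_pattern = re.compile(r"\b(bPhone|bEbook)\b")
relevant = [d for d in descriptions if product_pattern.search(d)]
print(relevant)
```

In practice a dedicated feed parser or HTML parser is safer than regular expressions for full documents; the pattern above only illustrates extracting a known, well-delimited field.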

REPRESENT TEXT In this data representation step, raw text is first transformed with text normalization techniques: tokenization and case folding. Tokenization, or tokenizing, is the task of separating words from the body of text. After tokenization, raw text is converted into collections of tokens, where each token is a word. A common approach is tokenizing on spaces.

TOKENIZATION: A common approach is tokenizing on spaces. Example: "text analysis is sometimes called text analytics" yields the tokens text, analysis, is, sometimes, called, text, analytics. Another way is to tokenize the text based on punctuation marks and spaces, so that a sentence such as "Data Science and Big Data Analytics" has become well accepted across academia and the industry splits into bare words with the quotation marks and period discarded.
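The two tokenization approaches above can be sketched as follows; the sample sentence is the one from the slide:

```python
import re

sentence = '"Data Science and Big Data Analytics" has become well accepted across academia and the industry.'

# Tokenizing on spaces keeps punctuation attached to words.
space_tokens = sentence.split(" ")

# Tokenizing on punctuation marks and spaces yields bare words.
word_tokens = re.findall(r"\w+", sentence)

print(space_tokens)
print(word_tokens)
```

Note that space tokenization leaves tokens like `'"Data'` with the quotation mark attached, which is why punctuation-aware tokenization is often preferred.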

CASE FOLDING: It reduces all letters to lowercase. Example: "Text Analysis is sometimes called Text Analytics" becomes "text analysis is sometimes called text analytics". One needs to be cautious when applying case folding to tasks such as information extraction, sentiment analysis, and machine translation. If case folding must be used, one way to reduce problems is to create a lookup table of words not to be case folded.
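A minimal sketch of case folding with such a lookup table; the exception entries (product and company names whose casing carries meaning) are hypothetical:

```python
# Hypothetical lookup table of words not to be case folded,
# e.g. proper nouns and product names.
DO_NOT_FOLD = {"ACME", "bPhone", "bEbook", "US"}

def case_fold(tokens):
    """Lowercase every token except those in the exception table."""
    return [t if t in DO_NOT_FOLD else t.lower() for t in tokens]

tokens = ["The", "ACME", "bPhone", "Is", "Popular", "in", "the", "US"]
print(case_fold(tokens))
# ['the', 'ACME', 'bPhone', 'is', 'popular', 'in', 'the', 'US']
```

Keeping "US" uppercase avoids conflating the country with the pronoun "us", which is exactly the kind of ambiguity the slide warns about.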

TF-IDF TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection (corpus) of documents. The relevance increases proportionally with the number of times the word appears in the document, but is offset by the frequency of the word across the corpus (data set).

TF-IDF FORMULA
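The original slide presumably showed the formula as an image; in its common form:

tf-idf(t, d) = tf(t, d) * idf(t),  where  idf(t) = log(N / df(t))

Here tf(t, d) is the number of times term t appears in document d, N is the total number of documents in the corpus, and df(t) is the number of documents that contain t.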

IDF Besides stop words, words that are more general in meaning tend to appear more often and thus have higher term frequencies. The IDF variable reduces the effect of term frequency as the term appears in more documents. For a given corpus, one can compare the terms with the highest corpus-wide term frequencies (TF), the highest document frequencies (DF), and the highest inverse document frequencies (IDF).
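A minimal from-scratch sketch of the TF, IDF, and TF-IDF calculations on a toy corpus; the three documents are hypothetical:

```python
import math

# Toy corpus: three hypothetical documents.
docs = [
    "the bPhone battery is short but the camera is great",
    "the bEbook screen is sharp",
    "battery life on the bEbook is long",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # number of documents in the corpus

def tf(term, doc_tokens):
    # Term frequency: how often the term appears in this document.
    return doc_tokens.count(term)

def idf(term):
    # Inverse document frequency: log(N / number of docs containing term).
    df = sum(1 for toks in tokenized if term in toks)
    return math.log(N / df)

def tfidf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "the" appears in every document, so idf = log(3/3) = 0 and its tf-idf is 0;
# a distinctive term like "camera" scores higher.
print(tfidf("the", tokenized[0]))
print(tfidf("camera", tokenized[0]))
```

This shows the point of the slide: a stop word like "the" has the highest TF but a zero IDF, so TF-IDF suppresses it, while rarer, more informative terms are promoted.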

THANK YOU