Text summarization

AkashKarwande 8,015 views 17 slides Dec 30, 2017

About This Presentation

Automatic text summarization is the process of reducing the text content and retaining the
important points of the document. Generally, there are two approaches for automatic text summarization:
Extractive and Abstractive. The process of extractive based text summarization can be divided into two
ph...


Slide Content

Text Summarization Using NLP
Presented by Akash N. Karwande (2016MNS011)
Guided by Prof. R.K. Chavan

Introduction
The goal of summarization is to produce a shorter version of a source text while preserving the meaning and key content of the original document. A well-written summary can significantly reduce the amount of work needed to digest large amounts of text.

Types of Text Summarization
There are two types of summaries: extractive summaries and abstractive summaries.

Extractive summaries
Extractive summaries are created by reusing portions (words, sentences, etc.) of the input text document.
The system extracts text from the entire collection without modifying it.
Most summarization research today focuses on extractive summarization.
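
As a rough illustration of the extractive idea, the sketch below scores each sentence by how often its content words occur in the whole text and returns the top-scoring sentences unchanged. The word-frequency scoring, the tiny stopword set, and the names `summarize` and `content_words` are assumptions made for this example; the slides do not specify the scoring method used.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); NLTK ships a fuller one.
STOPWORDS = {"a", "an", "the", "is", "it", "of", "and", "to", "in", "on", "at"}

def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, n=1):
    """Return the n highest-scoring sentences, unmodified and in original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for s in sentences for w in content_words(s))
    # Score a sentence by the summed document frequency of its content words.
    scores = {i: sum(freq[w] for w in content_words(s)) for i, s in enumerate(sentences)}
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n])
    return [sentences[i] for i in top]

text = ("Summarization shortens text. "
        "Extractive summarization selects sentences from the text. "
        "Cats sleep a lot.")
print(summarize(text, n=1))
```

Summing frequencies favours longer sentences; dividing each score by the sentence length is one common adjustment.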

Abstractive summaries
Abstractive summarization requires deep understanding of, and reasoning over, the text.
It produces its own summary of the input text without reusing the same words or sentences.
It determines the essential, condensed meaning of each element, such as words, sentences, and paragraphs.

Natural Language Toolkit
The Natural Language Toolkit (NLTK) is a leading platform for building Python programs that work with human language data.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
NLTK provides a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Continued…
The following NLTK components and supporting libraries are used in text summarization: the word tokenizer, the sentence tokenizer, the stopwords corpus, tagging, and parsing from NLTK, together with the separate BeautifulSoup and numpy libraries.

Linguistic Preprocessing for Automatic Summarization
Fig. Pipeline architecture of an information extraction process

Sentence Segmentation
Converts raw text into sentences, returned as a list of strings.
Uses the sentence tokenizer.
Input text: John owns a car. It is a Toyota.
Output:
Segm1: John owns a car.
Segm2: It is a Toyota.
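
The segmentation step can be approximated with a regular expression. This is only a minimal stand-in for a trained sentence tokenizer such as NLTK's `nltk.sent_tokenize` (which needs the Punkt model data to be downloaded first); the function name `split_sentences` and the splitting rule are assumptions for illustration.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    Naive rule: it will wrongly split after abbreviations such as 'Dr.'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("John owns a car. It is a Toyota.")
for i, s in enumerate(sentences, start=1):
    print(f"Segm{i}: {s}")
```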

Tokenization
Identifies the word tokens in a given sentence and provides a list of tokens as output.
Uses the word tokenizer.
Input: John owns a car.
Output: [[John], [owns], [a], [car], [.]]
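
A simple way to see what word tokenization does is a regex that treats runs of word characters as tokens and each punctuation mark as its own token. This is a sketch, not NLTK's algorithm; in practice `nltk.word_tokenize` handles contractions and other cases that this hypothetical `tokenize` helper does not.

```python
import re

def tokenize(sentence):
    """Return word tokens, with each punctuation mark as a separate token."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("John owns a car.")
print(tokens)  # ['John', 'owns', 'a', 'car', '.']
```

The result is a flat Python list containing the same five tokens that the slide lists.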

Part-of-speech tagging (POS tagging)
Assigns the appropriate part-of-speech tag to each word.
POS tags are useful for extracting nouns, adverbs, and adjectives, which carry meaningful information about the text.
Generates a list of tuples with POS annotations.
Input: [[John], [owns], [a], [car], [.]]
Output (shown as a bracketed parse tree): (NP (NNP John)) (VP (VBZ owns) (NP (DT a) (NN car))) (. .)
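
To make the "list of tuples" output concrete, here is a toy lookup tagger. The hand-written `LEXICON` and the helper names `lookup_tag` and `nouns` are hypothetical and cover only the example sentences; a real system would use a trained tagger such as `nltk.pos_tag`. The tags follow the Penn Treebank labels used on the slide.

```python
# Hypothetical hand-written lexicon; a real tagger learns this from a tagged corpus.
LEXICON = {
    "john": "NNP", "owns": "VBZ", "is": "VBZ", "a": "DT",
    "car": "NN", "toyota": "NNP", "it": "PRP", ".": ".",
}

def lookup_tag(tokens, default="NN"):
    """Tag each token by lexicon lookup, falling back to `default` for unknown words."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

def nouns(tagged):
    """Keep only tokens whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [tok for tok, tag in tagged if tag.startswith("NN")]

tagged = lookup_tag(["John", "owns", "a", "car", "."])
print(tagged)
print(nouns(tagged))  # ['John', 'car']
```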

Entity detection
Identifies predefined categories such as persons, locations, quantities, and organizations.
Named entity recognition (NER) provides entity detection for linguistic processing.
NER systems use linguistic grammar-based techniques as well as statistical models to identify entities.
Input: (NP (NNP John)) (VP (VBZ owns) (NP (DT a) (NN car))) (. .)
Output: John -> Person
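
A very small rule-based version of this step might group consecutive proper nouns and look them up in a name list. The `GAZETTEER` contents, the "Unknown" fallback, and the function name `detect_entities` are assumptions for this sketch; it covers only the grammar-style rule, not the statistical models the slide mentions (for those, NLTK offers `nltk.ne_chunk`).

```python
# Hypothetical gazetteer mapping known names to entity categories.
GAZETTEER = {"John": "Person", "Toyota": "Organization"}

def detect_entities(tagged):
    """Group consecutive NNP tokens into spans and label each span via the gazetteer."""
    entities, span = [], []
    for tok, tag in tagged + [("", "")]:  # the sentinel flushes a trailing span
        if tag == "NNP":
            span.append(tok)
        elif span:
            name = " ".join(span)
            entities.append((name, GAZETTEER.get(name, "Unknown")))
            span = []
    return entities

tagged = [("John", "NNP"), ("owns", "VBZ"), ("a", "DT"), ("car", "NN"), (".", ".")]
print(detect_entities(tagged))  # [('John', 'Person')]
```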

Relation detection
Identifies possible relations between two or more chunked sentences.
A co-reference chain relates two or more sentences by linking pronouns to their corresponding nouns.
Pronouns can then be replaced with the corresponding proper nouns.
Input text: John owns a car. It is a Toyota. (in the form of a parse tree)
Output: "a car" -> "a Toyota"; "It" -> "a Toyota"

Conclusion
Automatic text summarization has been shown to be useful for natural language processing tasks such as question answering and text classification, and for related fields of computer science such as information retrieval. It also reduces the access time for information searching.

Future work
Our summarization results show that removing all sentences that do not contain any geographic information may lead to a loss of information, since links may exist between those removed sentences. We will therefore analyse this issue in detail by studying graph-based algorithms that capture the relationships between sentences.


Thank You!