Final-speech based text summarizers.pptx



SPEECH BASED TEXT SUMMARIZER

ABSTRACT Enormous chunks of information are available on the internet, so it is important to provide a way to retrieve that information efficiently and accurately. Text summarization is a popular problem in this modern era: its main objective is to extract summaries from large chunks of text efficiently and accurately. Summarization reduces reading time, and an audio file can also be generated for the summarized text. This project covers the process of producing both a summarized-text PDF and an audio file from an input text document.

INTRODUCTION Text summarization condenses huge chunks of text into a shorter form without changing the semantics. It is in high demand in the modern world, and its main advantage is that it reduces the user's reading time. In this project, we focus on implementing text summarization with NLTK, a standard Python library with prebuilt functions and utilities for ease of use and implementation; it is one of the most widely used libraries for natural language processing and computational linguistics. Natural language processing (NLP) is a research field that presents many challenges, such as natural language understanding. In this project we use tokenization and stop words to summarize the text document. Stop words are common words like 'the', 'and', 'I' that occur very frequently in text and therefore convey little insight into the specific topic of a document.

INTRODUCTION We can remove these stop words from the text in a given corpus to clean up the data and to identify words that are rarer and potentially more relevant to what we are interested in. There is no universal list of stop words in NLP research, but the NLTK module ships with one. Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning; the first step of the NLP pipeline is taking the data (a sentence) and breaking it into understandable parts (words). The summarization is carried out with these NLP methods, and the summarized text is then converted into an audio file using pyttsx3, a text-to-speech conversion library for Python. A large text file is given as input, the summary is generated as a PDF, and the summary is converted into an audio file with pyttsx3. In this presentation, we show how to convert the summarized text into both an audio file and a PDF.
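
To make the stop-word idea concrete, here is a minimal sketch using NLTK's built-in English stop-word list (the example sentence is invented for illustration):

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # needed once to fetch the stop-word list
stop_words = set(stopwords.words('english'))

sentence = "the cat is on the mat and it is sleeping"
filtered = [word for word in sentence.split() if word not in stop_words]
print(filtered)  # ['cat', 'mat', 'sleeping']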

SCOPE OF THE PROJECT The project is wide in scope; the limitations stated below may seem to contradict that, but they are the only restrictions applied. The project addresses text-document summarization: the summaries produced are largely extracts of the document being summarized, and the summarized document is also made available in MP3 format. The parameters used are tuned for stories, news articles and similar content, although they can be changed easily.

LITERATURE SURVEY PAPER 1: TITLE: A Survey of Automatic Text Summarization: Progress, Process and Challenges AUTHOR: M. F. MRIDHA, AKLIMA AKTER LIMA, KAMRUDDIN NUR, SUJOY CHANDRA DAS, MAHMUD HASAN and MUHAMMAD YEAR: 2021 ABSTRACT: With the evolution of the Internet and multimedia technology, the amount of text data has increased exponentially. This text volume is a precious source of information and knowledge that needs to be efficiently summarized. Text summarization is the method of reducing the source text to a compact variant, preserving its knowledge and actual meaning. Here we thoroughly investigate automatic text summarization (ATS) and summarize the widely recognized ATS architectures. This paper outlines extractive and abstractive text summarization technologies and provides a deep taxonomy of the ATS domain. The taxonomy spans the classical ATS algorithms through modern deep learning ATS architectures. Every modern text summarization approach's workflow and significance are reviewed along with its limitations and potential recovery methods, including the feature extraction approaches, datasets, performance measurement techniques, and challenges of the ATS domain. In addition, the paper concisely presents the past, present, and future research directions in the ATS domain.

LITERATURE SURVEY PAPER 2: TITLE: An Overview of Automatic Text Summarization Techniques AUTHOR: Sanjan S Malagi, Rachana Radhakrishnan, Monisha R, Keerthana S YEAR: 2020 ABSTRACT: In this age, where an enormous amount of information is available on the Web, it is most basic to supply the advanced mechanism to remove the data quickly and most profitably. It is uncommonly troublesome for human beings to physically extricate the outline of expansive records of content. There is a huge amount of textual and image content available on the Internet. So there is an issue of looking for significant documents among the number of reports accessible, and retaining significant data from them. To solve these issues, programmed content summarization is very much necessary. Text Summarization (TS) is the method of recognizing the foremost imperative information in a report or set of related reports and abstracting it into a shorter form, protecting the general implications or meanings of the sentences. In this way we are able to summarize the substance so that it gets simpler to ingest the information, keeping up the substance, and understanding the information. A few content summarization approaches have been presented in the past for English and some other languages. The fundamental objective is to decrease a given body of content to a fraction of its size while keeping up coherence and semantics. In this paper, we present a brief study of the various methods in existence to achieve this summarization.

LITERATURE SURVEY PAPER 3: TITLE: Text Extraction from Image, Word, PDF and Text-to-Speech Conversion AUTHOR: Pooja Bendale, Sanika Badhe, Sarthak Bhagat, Pravin Rahate YEAR: 2022 ABSTRACT: Speech is one of the most seasoned and most natural means of information exchange between humans. Over the years, efforts have been made to develop vocally interactive computers that realize voice/speech synthesis. Clearly such an interface would yield great benefits. In this case a computer can take text and produce speech. Text-to-speech is a technology that provides a means of changing written text from a plain form into spoken language that is easily understandable by the end user (primarily in the English language).

LITERATURE SURVEY PAPER 4: TITLE: Text Extraction From Image and Text to Speech Conversion AUTHOR: Teena Varma, Stephen S Madari, Lenita L Montheiro, Rachna S Pooojary YEAR: 2021 ABSTRACT: The recent technological advancements in the fields of Image Processing and Natural Language Processing focus on developing smart systems to improve the quality of life. In this work, an effective approach is suggested for text recognition and extraction from images and for text-to-speech conversion. The incoming image is first enhanced by employing grayscale conversion. Afterwards, the text regions of the enhanced image are detected by employing the Maximally Stable Extremal Regions (MSER) feature detector. The next step applies geometric filtering in combination with the stroke width transform (SWT) to efficiently collect and filter text regions in an image; the non-text MSERs are removed using geometric properties and the stroke width transform. Subsequently, individual letters are grouped to detect text sequences, which are then fragmented into words. Finally, Optical Character Recognition (OCR) is employed to digitize the words, and the detected text is fed into a text-to-speech synthesizer (TTS) for text-to-speech conversion. The proposed algorithm is tested on images ranging from documents to natural scenes. Promising results are reported that demonstrate the accuracy and robustness of the proposed framework and encourage its practical implementation in real-world applications.

LITERATURE SURVEY PAPER 5: TITLE: Detection of Hate Speech using Text Mining and Natural Language Processing AUTHOR: G. Priyadharshini YEAR: 2020 ABSTRACT: In today's modern world, technology connected with humanity is doing wonderful things. On the other hand, people drawn to social networks, where they have anonymity, are bringing out the very worst in some users in the form of hate speech. Social media hate speech is a serious societal problem that can magnify violence ranging from lynching to ethnic cleansing. One of the critical tasks in automatic detection of hate speech is differentiating it from other forms of offensive language. Existing works that distinguish the two categories using lexical methods showed very low performance metrics, leading to major misclassification. Works with supervised machine learning approaches gave significant results in distinguishing hate from offensive speech, but the presence or absence of certain words in both classes can be both a merit and a demerit for achieving accurate classification. In this paper, a ternary classification of tweets into hate speech, offensive and neither is performed using multi-class classifiers. Among the four classifiers considered (Logistic Regression, Random Forest, Support Vector Machines (SVM) and Naive Bayes), the Random Forest classifier performs significantly well with almost all feature combinations, giving a maximum accuracy of 0.90 with the TF-IDF feature technique.

MOTIVATION OF THE PROJECT Text summarization is an active field of research in the natural language processing community. It is increasingly used in the educational sector, for example in word-processing tools, and the many approaches differ in how they formulate the problem. Automatic text summarization is an important step for information-management tasks: it solves the problem of selecting the most important portions of the text, and high-quality summarization requires sophisticated NLP techniques. Generating an MP3 of the summary is an additional advantage.

OBJECTIVE The objective of the project is to understand the concepts of natural language processing and to create a tool for text summarization. Interest in automatic summarization is growing rapidly because it removes the manual work. The project concentrates on creating a tool that automatically summarizes a document and generates an audio file for the summarized text.

EXISTING SYSTEM The existing approach to summarization involves manual work: a person reads the paragraph, understands it, and then writes out the summarized text, which takes a long time for large documents. Producing audio for a text file likewise involves a recording process in which a person reads the entire document aloud before the audio file is available. Both text summarization and audio-file generation therefore require a great deal of time.

DISADVANTAGES Requires a large amount of time. Involves a large amount of manual work. Risk of information loss due to manual summarization.

PROPOSED SYSTEM The proposed system contains only two processes. The user uploads a text document containing stories, articles or any other content (only text documents are accepted). After the file is uploaded, the summarization process starts and a summary PDF is generated using natural language processing (NLP). An MP3 audio file is then generated from the summarized PDF, and the user can listen to the audio to understand the contents of the text document.
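
As a rough sketch of these two processes (the input file name 'input.txt' and the placeholder summary step are assumptions for illustration; the actual frequency-based summarizer appears in the CODING slides), the flow is: read the text, write the summary to a PDF with fpdf, then render the same summary to MP3 with pyttsx3:

import fpdf
import pyttsx3

# Process 1: summarize the uploaded text and write the summary to a PDF
with open('input.txt', encoding='utf-8') as f:   # assumed input file name
    text = f.read()
summary = text[:500]                             # placeholder; the real summarizer scores sentences by word frequency
summary = summary.encode('latin-1', 'replace').decode('latin-1')  # fpdf 1.x accepts latin-1 text only

pdf = fpdf.FPDF(format='letter')
pdf.add_page()
pdf.set_font('Arial', size=12)
pdf.write(10, summary)
pdf.output('summarized.pdf')

# Process 2: render the same summary as an audio file
engine = pyttsx3.init()
engine.save_to_file(summary, 'summarized.mp3')
engine.runAndWait()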

ADVANTAGES No one needs to read the whole document to get its theme. Less time spent summarizing. Less time spent creating the audio. Cost-efficient process.

SYSTEM CONFIGURATION H/W SYSTEM CONFIGURATION Processor – Intel i3/i5/i7; RAM – 8 GB; Hard Disk – 500 GB

SYSTEM CONFIGURATION S/W SYSTEM CONFIGURATION Operating System – Windows 7/8/10; Front End – GUI (Tkinter); Scripts – Python; Tool – Python IDLE

BLOCK DIAGRAM

MODULE NAMES NLP IN TEXT SUMMARIZATION; TOKENIZATION AND STOPWORDS; AUDIO FILE GENERATION

NLP IN TEXT SUMMARIZATION Natural language processing (NLP) is about developing applications and services that are able to understand human languages. Practical examples of NLP include speech recognition (e.g., Google voice search), understanding what a piece of content is about, and sentiment analysis. In our project we use NLTK, the Natural Language Toolkit. This toolkit is one of the most powerful NLP libraries; it contains packages that help machines understand human language and respond appropriately. Tokenization and stop words are two of these packages and are discussed in the following modules.
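
A minimal setup sketch for NLTK (assuming a standard installation; the resource names below are the ones the later code relies on):

# pip install nltk
import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize / sent_tokenize
nltk.download('stopwords')  # NLTK's built-in stop-word lists, including English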

TOKENIZATION AND STOPWORDS Tokenization is one of the first steps in any NLP pipeline. It splits the raw text into small chunks, called tokens: if the text is split into words it is called 'word tokenization', and if it is split into sentences it is called 'sentence tokenization'. Stop words are words so common that they are essentially ignored by typical tokenizers. NLTK (the Natural Language Toolkit) ships with a built-in list of English stop words, including 'a', 'an', 'the', 'of', 'in', and so on. These are the most common words in the data; they occur in abundance and hence provide little to no unique information that can be used for classification or clustering. Using these natural language processing steps, the summarized text is generated.
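
The sketch below illustrates both tokenization styles and stop-word filtering with NLTK (the sample text is invented; it assumes the 'punkt' and 'stopwords' resources have already been downloaded as shown earlier):

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

text = "Text summarization shortens a document. It keeps only the important sentences."

sentences = sent_tokenize(text)               # sentence tokenization
words = word_tokenize(text)                   # word tokenization
stop_words = set(stopwords.words('english'))  # common English words to ignore

# Keep alphabetic tokens that are not stop words; these drive the word-frequency scores
content_words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
print(sentences)
print(content_words)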

AUDIO FILE GENERATION In the audio-file generation module we use pyttsx3 to generate audio from the summarized PDF; an MP3-formatted audio file is produced. pyttsx3 is a text-to-speech conversion library in Python. Unlike some alternative libraries, it works offline and is compatible with both Python 2 and 3. An application invokes the pyttsx3.init() factory function to get a reference to a pyttsx3.Engine instance; it is a very easy-to-use tool that converts the entered text into speech. On Windows the 'sapi5' driver provides two voices, one female and one male. pyttsx3 supports three TTS engines, sapi5, nsss and espeak; in our project we use sapi5.
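
A minimal pyttsx3 sketch (Windows with the sapi5 driver is assumed; whether the saved file is genuinely MP3-encoded depends on the driver, the project simply names the output summarized.mp3):

import pyttsx3

engine = pyttsx3.init('sapi5')                 # use the Windows SAPI5 driver
voices = engine.getProperty('voices')          # sapi5 typically exposes a male and a female voice
engine.setProperty('voice', voices[0].id)      # select the first available voice

summary = "This is the summarized text."
engine.save_to_file(summary, 'summarized.mp3') # queue: write the speech to a file
engine.say(summary)                            # queue: also speak it aloud
engine.runAndWait()                            # process the queued commands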

CODING

import tensorflow as tf
import keras
from keras import applications
import numpy as np
import easygui
import os
import serial
import tkinter as tk
from tkinter import *
from tkinter import filedialog
from tkinter.filedialog import askopenfile
import json
import requests
import string
import re
import nltk
import itertools
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('punkt')
import pke
from nltk.corpus import stopwords
from nltk.corpus import wordnet
import traceback
from nltk.tokenize import sent_tokenize
from flashtext import KeywordProcessor
import textwrap
import fpdf
from PIL import Image, ImageTk
from pdf2image import convert_from_path
from tkPDFViewer import tkPDFViewer as pdfsd
import tkinter.messagebox
import textract
from pdf2image import convert_from_path
import webbrowser
from datetime import datetime
from textwrap3 import wrap
import torch
import random
import numpy as np
import nltk
# nltk.download('punkt')
# nltk.download('brown')
# nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize
# nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import traceback
from flashtext import KeywordProcessor
import fpdf
import pyttsx3
import fitz
from nltk.tokenize import word_tokenize
import threading

my_w = tk.Tk()
sw = my_w.winfo_screenwidth()
sh = my_w.winfo_screenheight()
w = sw - 10
print(sw, sh)
my_w.geometry('%dx%d' % (w, sh))
my_w.title('speech Based Text summarizer')
my_font1 = ('times', 18, 'bold')
bg = PhotoImage(file='backgroundpic (2).png')
bgLabel = Label(my_w, image=bg)
bgLabel.place(x=0, y=0)

l1 = tk.Label(my_w, text='\n Upload Text file, get \n summarized text pdf \n & \n Listen summarized text \n',
              width=45, font=('italic', 20, 'bold'), bg='black', fg='white')
l1.place(x=470, y=250, width=420)

# -------------------- upload button --------------------
b1 = tk.Button(my_w, text='UPLOAD FILES', width=20, command=lambda: qsnpaper(),
               activebackground='skyblue', font=('italic', 17, 'bold'), bg='black', fg='#fc3a52')
b1.place(x=80, y=600, width=180, height=40)
# --------------------------------------------------------

print(tf.__version__)
print(".\n. \n. \n. \n. \n. \n. \n. \n. \n. \n. \n. \n")
print("upload text file for summarize.....")

titleLabel = Label(my_w, text='Speech Based Text Summarizer',
                   font=('italic', 25, 'bold'), bg='black', fg='#fc3a52')
titleLabel.place(x=2, y=6, width=1366, height=80)


def result():
    text = upload_file()
    print("original Text \n", text)
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text)

    # Creating a frequency table to keep the score of each word
    freqTable = dict()
    for word in words:
        word = word.lower()
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    # Creating a dictionary to keep the score of each sentence
    sentences = sent_tokenize(text)
    sentenceValue = dict()
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq

    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence]

    # Average value of a sentence from the original text
    average = int(sumValues / len(sentenceValue))

    # Storing sentences into our summary
    summary = ''
    for sentence in sentences:
        if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
            summary += " " + sentence
    print("summarized text \n", summary)

    wrapper = textwrap.TextWrapper(width=80)
    # word_list = wrapper.wrap(text=summary)
    word_list = wrapper.fill(text=summary)
    summary = word_list
    # ques = summary
    ques = summary.encode('latin-1', 'replace').decode('latin-1')
    # print(ques)
    # print(answer.capitalize())
    print("\n")
    # print(quesful)
    # print("Summarized Text")

    pdf = fpdf.FPDF(format='letter')
    pdf.add_page()
    pdf.set_font("Arial", size=30)
    pdf.write(30, "Summarized Text")
    pdf.ln()
    pdf.set_font("Arial", size=12)
    for i in ques:
        pdf.write(10, str(i))
        # pdf.ln(15)
    pdf.output("summarized.pdf")
    return pdf


def qsnpaper():
    finalprint = result()
    l3 = tk.Label(my_w, text='\n press Play Audio & \n Check the Folder for \n \n',
                  font=('italic', 20, 'bold'), fg='white', bg="black")
    l3.place(x=470, y=250, width=420)
    l5 = tk.Label(my_w, text='summarized.pdf \n summarized.mp3',
                  font=('italic', 20, 'bold'), fg='#fc3a52', bg="black")
    l5.place(x=560, y=350)

    # ---------- SEE PDF button ----------
    b2 = tk.Button(my_w, text='SEE PDF', width=20, command=lambda: see_pdf(),
                   activebackground='skyblue', font=('italic', 17, 'bold'), bg='black', fg='#fc3a52')
    b2.place(x=370, y=600, width=155, height=40)

    # ---------- Play Audio button ----------
    b2 = tk.Button(my_w, text='Play Audio', width=20, command=lambda: playaudio(),
                   activebackground='skyblue', font=('italic', 17, 'bold'), bg='black', fg='#fc3a52')
    b2.place(x=670, y=600, width=155, height=40)

    # ---------- CLOSE button ----------
    b3 = tk.Button(my_w, text='CLOSE', width=20, command=lambda: close(),
                   activebackground='skyblue', font=('italic', 17, 'bold'), bg='black', fg='#fc3a52')
    b3.place(x=1190, y=600, width=140, height=40)
    return 0


def upload_file():
    f_types = [('Text Files', '*.txt'), ('Doc Files', '*.docx')]
    filename = tk.filedialog.askopenfilename(multiple=False, filetypes=f_types)
    text = open(filename, "r+", encoding='utf-8')
    text = text.read()
    return text


def see_pdf():
    df_location = r"D:\Dowload D\OWN\text summarization and speech on wrk\summarized.pdf"
    webbrowser.open(df_location)


def playaudio():
    doc = fitz.open('summarized.pdf')
    text = ""
    for page in doc:
        text += page.get_text()
    # print(text)
    engine = pyttsx3.init()
    engine.save_to_file(text, 'summarized.mp3')
    engine.say("Listen the text carefully")
    engine.say(text)
    engine.say("Summarized text is ended")
    engine.runAndWait()


def close():
    my_w.destroy()


my_w.mainloop()

SCREENSHOTS

CONCLUSION This project focused on creating a system that produces concise summaries of articles or other large chunks of text from a text file. The implication is that knowledge gathering becomes easier and less time-consuming. The project frees the user from reading the whole document: it provides a summarized PDF as well as an audio file that can be played to get the content of the text file. Implementing NLP (natural language processing) yields efficient results and improves the access time for information searching, and converting the summarized text into an audio file is useful in various real-time scenarios.

REFERENCES [1] J. N. Madhuri and R. Ganesh Kumar, "Extractive Text Summarization Using Sentence Ranking," 2019 International Conference on Data Science and Communication (IconDSC), Bangalore, India, 2019, pp. 1-3. doi: 10.1109/IconDSC.2019.8817040 [2] Dr. Geetha C Megharaj, Varsha Jituri, 2022, TFIDF Model based Text Summerization, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) RTCSIT – 2022 (Volume 10 – Issue 12) [3] Akshat Gupta, Dr. Ashutosh Singh, Dr. A. K. Shankhwar, 2022, A Quantitative Analysis of Automatic Text Summarization (ATS), INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 11, Issue 09 (September 2022) [4] N G Gopikakrishna, Parvathy Sreenivasan, Vinayak Chandran, Yadhu Krishna K P, Sanuj S Dev, Krishnaveni V V, 2021, Comparative Study on Text Summarization using NLP and RNN Methods, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) NCREIS – 2021 (Volume 09 – Issue 13) [5] Sonali Agarwal, Pranita Redkar, Aditi Gaur, Swati Varma, 2021, Text Summarization, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) NTASU – 2020 (Volume 09 – Issue 03)

REFERENCES [6] Sanjan S Malagi, Rachana Radhakrishnan, Monisha R, Keerthana S, Dr. D V Ashoka, 2020, An Overview of Automatic Text Summarization Techniques, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) NCAIT – 2020 (Volume 8 – Issue 15) [7] Akash Bedi, Mohit Bahrani, Divya Santwani, 2020, Texterizer, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 09, Issue 07 (July 2020) [8] Teena Varma, Stephen S Madari, Lenita L Montheiro, Rachna S Pooojary, 2021, Text Extraction From Image and Text to Speech Conversion, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) NTASU – 2020 (Volume 09 – Issue 03) [9] Asmi P, Sanaj M S, 2021, Toxic Speech Classification via Deep Learning using Combined Features from BERT & FastText Embedding, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) ICCIDT – 2021 (Volume 09 – Issue 07) [10] G. Priyadharshini, 2020, Detection of Hate Speech using Text Mining and Natural Language Processing, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH & TECHNOLOGY (IJERT) Volume 09, Issue 11 (November 2020)