Data Science Process.pptx

WidsoulDevil 2,454 views 79 slides Feb 28, 2023
About This Presentation

Data science in Python


Slide Content

Data Science Process

1. Setting the Research Goal: Spend time understanding the goals and context of your research. Create a project charter.

2. Retrieving Data: Data within the company can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals. The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing that data. A data mart is a subset of the data warehouse geared toward serving a specific business unit. While data warehouses and data marts are home to preprocessed data, data lakes contain data in its natural or raw format. Data can also come from open source sources.

3. Data Preparation: Cleansing

Why should errors be corrected as soon as possible? Not everyone spots data anomalies. Decision-makers may make costly mistakes based on incorrect data from applications that fail to correct for the faulty data. If errors are not corrected early in the process, the cleansing will have to be redone for every project that uses that data. Data errors may point to defective equipment, such as broken transmission lines and defective sensors. Data errors can also point to bugs in software, or in the integration of software, that may be critical to the company. While doing a small project at a bank, we discovered that two software applications used different locale settings. This caused problems with numbers greater than 1,000: for one application the string "1.000" meant one, and for the other it meant one thousand.
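The locale pitfall above can be made concrete in a short sketch. The `parse_number` helper is hypothetical (not from the slides); it simply shows how the same string yields values a thousand apart under two separator conventions.

```python
# Hypothetical helper illustrating the locale pitfall: "1.000" means
# one under a US-style convention but one thousand under a
# European-style convention.

def parse_number(text, decimal_sep=".", thousands_sep=","):
    """Parse a numeric string using explicit separator conventions."""
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

# US-style application: "." is the decimal separator
us_value = parse_number("1.000", decimal_sep=".", thousands_sep=",")
# European-style application: "." is the thousands separator
eu_value = parse_number("1.000", decimal_sep=",", thousands_sep=".")

print(us_value)  # 1.0
print(eu_value)  # 1000.0
```

Parsing with explicit, per-source separator conventions avoids the silent thousand-fold error.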

Combining Data: joining tables, appending tables, creating views.
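The three combining operations can be sketched with pandas on small made-up tables (the table and column names are illustrative only):

```python
# Sketch of joining, appending, and creating a view with pandas.
import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"id": [1, 1, 2], "amount": [10, 20, 5]})

# Joining tables: match rows on a shared key column
joined = customers.merge(orders, on="id")

# Appending tables: stack rows of tables with the same columns
more_customers = pd.DataFrame({"id": [3], "name": ["Cara"]})
appended = pd.concat([customers, more_customers], ignore_index=True)

# Creating a view: a derived, aggregated table that leaves the
# source data untouched
totals_view = orders.groupby("id", as_index=False)["amount"].sum()

print(joined.shape)    # (3, 3)
print(appended.shape)  # (3, 2)
```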

Data Transformation

4. Exploratory Data Analysis (EDA)

5. Build the Model: Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and on the type of technique you want to use. Either way, most models consist of the following main steps: (1) selection of a modeling technique and the variables to enter into the model, (2) execution of the model, and (3) diagnosis and model comparison.
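As a minimal illustration of those three steps, the sketch below fits two candidate models to synthetic data with NumPy and compares them by mean squared error. The data, the candidate models, and the `mse` helper are all illustrative, not a prescribed workflow.

```python
# Step 1: select techniques and variables -- here a constant
# baseline versus a straight-line fit on one input variable.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)  # synthetic data

# Step 2: execute both candidate models
baseline_pred = np.full_like(y, y.mean())  # predict the mean everywhere
slope, intercept = np.polyfit(x, y, 1)     # least-squares line
linear_pred = slope * x + intercept

# Step 3: diagnose and compare via mean squared error
def mse(pred, actual):
    return float(np.mean((pred - actual) ** 2))

print(mse(linear_pred, y) < mse(baseline_pred, y))  # True: keep the line
```

In practice this loop repeats: a diagnosed weakness sends you back to step 1 with a different technique or variable set.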

6. Presentation and Automation: Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.

Working with data from files: Working with different data types, formats, compression schemes, and parsing behavior on different systems makes preparing data a challenging task. Dealing with different formats can become tedious. Thus, every data scientist should be aware of the different file formats, the common challenges in handling them, and the most efficient ways to handle this data in real life.

What is a file format?

Why should a data scientist understand different file formats? The files you encounter depend on the application you are building. For example, an image processing system takes image files as input and output, so you will mostly see files in JPEG, GIF, or PNG format. As a data scientist, you need to understand the underlying structure of various file formats and their advantages and disadvantages. Choosing the optimal file format for storing data can improve the performance of your models in data processing.

Why should a data scientist understand different file formats? Different file formats: XLSX, Comma-Separated Values (CSV), ZIP, plain text (TXT), JSON, XML, HTML, images, Hierarchical Data Format (HDF), PDF, DOCX, MP3, MP4.

Different file formats and how to read them in Python. Comma-Separated Values (CSV): the CSV format falls under the spreadsheet file format, in which data is stored in cells organized into rows and columns. Each column in a spreadsheet file can have a different type; for example, a column can be of string type, date type, or integer type. Some of the most popular spreadsheet file formats are Comma-Separated Values (CSV), Microsoft Excel Spreadsheet (XLS), and Microsoft Excel Open XML Spreadsheet (XLSX). Some files are separated using tabs instead; that format is known as TSV (Tab-Separated Values).

Different file formats and how to read them in Python. The image below shows a CSV file opened in Notepad.

Reading the data from CSV in Python. For loading the data, you can use the pandas library:

import pandas as pd
pd.read_csv(r'F:\IT DEPT\WINTER 2022\10212IT105 - DATA SCIENCE IN PYTHON/addresses.csv')

Different file formats and how to read them in Python. Read Excel file: XLSX is the Microsoft Excel Open XML file format. It also comes under the spreadsheet file format; it is an XML-based file format created by Microsoft Excel. In XLSX, data is organized into cells, arranged in rows and columns within a sheet. Each XLSX file may contain one or more sheets; therefore, a workbook can contain multiple sheets.

Different file formats and how to read them in Python. Excel file:

Different file formats and how to read them in Python. Read Excel file:

import pandas as pd
pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx')

Different file formats and how to read them in Python. Read some particular columns:

import pandas as pd
pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx', index_col=0, usecols="A:C")

Different file formats and how to read them in Python. Read some particular columns:

import pandas as pd
pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx', index_col=0, usecols=[3, 5, 6])

Different file formats and how to read them in Python. Read a particular sheet:

import pandas as pd
pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx', index_col=0, sheet_name=0)

Different file formats and how to read them in Python. Read a particular sheet:

import pandas as pd
pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx', index_col=0, sheet_name="Second Year")

Different file formats and how to read them in Python. Read Microsoft Word file: DOCX is the Microsoft Word Open XML file format, with extension .docx.

Different file formats and how to read them in Python. Read Microsoft Word file:

pip install python-docx

Different file formats and how to read them in Python. Read Microsoft Word file:

from docx import Document
document = Document(r'F:\IT DEPT\WINTER 2022\10212IT105 - DATA SCIENCE IN PYTHON\test.docx')
type(document)

Different file formats and how to read them in Python Read Microsoft Word file: document.paragraphs

Different file formats and how to read them in Python Read Microsoft Word file: type(document.paragraphs)

Different file formats and how to read them in Python Read Microsoft Word file: document.paragraphs[1] document.paragraphs[0]

Different file formats and how to read them in Python. Read Microsoft Word file: document.paragraphs[0].text document.paragraphs[1].text

Different file formats and how to read them in Python. Read Microsoft Word file: document.paragraphs[2].text

Exploratory Data Analysis

Exploratory Data Analysis: A method used to analyze and summarize data sets. Data scientists use Exploratory Data Analysis (EDA) to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps data scientists discover patterns, spot anomalies, test hypotheses, and check assumptions. EDA is primarily used to provide a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate.

Exploratory Data Analysis: Why is exploratory data analysis important in data science? It helps identify obvious errors, understand patterns, detect outliers or anomalous events, and find interesting relations among the variables. It ensures the results the data scientist produces are valid and applicable to the desired business outcomes and goals. EDA helps stakeholders by confirming they are asking the right questions, and it can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can be used for more sophisticated data analysis or modeling, including machine learning.

Exploratory Data Analysis: Exploratory Data Analysis Tools: Data scientists spend a lot of time doing EDA to get a better understanding of the data. The manual effort can be reduced by using auto-visualization tools such as: 1. Pandas-profiling, 2. Sweetviz, 3. Autoviz, 4. D-Tale.

Exploratory Data Analysis: Exploratory Data Analysis Tools: EDA involves many steps, including statistical tests and visualization of the data using different kinds of plots. Data quality check: can be done using pandas functions like describe() and info() and the dtypes attribute; it is used to find each feature's datatype, duplicate values, missing values, etc. Statistical tests: tests like Pearson correlation, Spearman correlation, and the Kendall test are used to measure correlation between features; they can be implemented in Python using the scipy.stats library.
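The correlation tests named above can be run with scipy.stats; the two series here are made up for illustration.

```python
# Pearson, Spearman, and Kendall correlation on two toy series.
import scipy.stats as stats

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

pearson_r, pearson_p = stats.pearsonr(x, y)     # linear correlation
spearman_r, spearman_p = stats.spearmanr(x, y)  # rank correlation
kendall_t, kendall_p = stats.kendalltau(x, y)   # ordinal association

print(round(pearson_r, 2))  # 0.85
```

Each test returns both a correlation statistic and a p-value, so you can judge strength and significance together.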

Exploratory Data Analysis: Exploratory Data Analysis Tools: Quantitative test: find the spread of numerical features and the counts of categorical features; this can be implemented in Python using functions from the pandas library. Visualization: used to get an understanding of the data. Graphical techniques like bar plots and pie charts give an understanding of categorical features, whereas scatter plots and histograms are used for numerical features.
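Those quantitative checks can be sketched with pandas on a toy frame (the columns and values are illustrative):

```python
# Spread of a numerical feature and counts of a categorical one.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 31, 42, 35],
    "dept": ["IT", "HR", "IT", "IT", "HR"],
})

spread = df["age"].describe()       # count, mean, std, min, quartiles, max
counts = df["dept"].value_counts()  # frequency of each category

print(int(spread["count"]))  # 5
print(int(counts["IT"]))     # 3
```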

Exploratory Data Analysis Tools. Pandas-Profiling: Pandas-profiling is an open-source Python library that automates the EDA process and creates a detailed report. It can be used easily for large datasets, as it is fast and creates reports in a few seconds. Installation: pip install pandas-profiling

Exploratory Data Analysis Tools. Pandas-Profiling:

# Install the libraries below before importing
import pandas as pd
from pandas_profiling import ProfileReport
# EDA using pandas-profiling
profile = ProfileReport(pd.read_excel('Mentees List.xlsx'), explorative=True)
# Saving results to an HTML file
profile.to_file("output.html")

Exploratory Data Analysis Tools Pandas-Profiling Report : The pandas-profiling library generates a report having: An overview of the dataset Variable properties Interaction of variables Correlation of variables Missing values Sample data

Exploratory Data Analysis Tools. Pandas-Profiling report: file:///C:/Users/NITHI/output.html

Exploratory Data Analysis Tools. Sweetviz: Sweetviz is an open-source Python auto-visualization library that generates a report, exploring the data with the help of high-density plots. It not only automates EDA but can also be used to compare datasets and draw inferences from them. A comparison of two datasets can be done by treating one as training and the other as testing. Installation: pip install sweetviz

Exploratory Data Analysis Tools. Sweetviz:

# Install the libraries below before importing
import pandas as pd
import sweetviz as sv
# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx'))
# Saving results to an HTML file
sweet_report.show_html('sweet_report.html')

Exploratory Data Analysis Tools Sweetviz Report : The Sweetviz library generates a report having: An overview of the dataset Variable properties Categorical associations Numerical associations Most frequent, smallest, largest values for numerical features

Exploratory Data Analysis Tools Sweetviz Report : file:///C:/Users/NITHI/sweet_report.html

Exploratory Data Analysis Tools. Autoviz: Autoviz is an open-source Python auto-visualization library that mainly focuses on visualizing relationships in the data by generating different types of plots. Installation: pip install autoviz

Exploratory Data Analysis Tools. Autoviz:

# Install the libraries below before importing
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
# EDA using Autoviz
autoviz = AutoViz_Class().AutoViz(r'C:\Users\NITHI\Desktop\Mentees List.xlsx')

Exploratory Data Analysis Tools Autoviz Report: The Autoviz library generates a report having: An overview of the dataset Pairwise scatter plot of continuous variables Distribution of categorical variables Heatmaps of continuous variables Average numerical variable by each categorical variable

Exploratory Data Analysis Tools. D-Tale: D-Tale is an open-source Python auto-visualization library, and one of the best: it gives you a detailed EDA of the data, and it can export the code for every plot or analysis in the report. Installation: pip install dtale

Exploratory Data Analysis Tools. D-Tale:

import dtale
import pandas as pd
dtale.show(pd.read_excel(r'C:\Users\NITHI\Desktop\Mentees List.xlsx'))

Exploratory Data Analysis Tools D-Tale Report: The dtale library generates a report having: An overview of the dataset Custom filters Correlation, Charts, and Heatmaps Highlight datatypes, missing values, ranges Code export

Data Management

Data Management. What is Data Management? Data management is the practice of collecting, organizing, protecting, and storing an organization's data so it can be analyzed for business decisions. As organizations create and consume data at unprecedented rates, data management solutions become essential for making sense of the vast quantities of data. Today's leading data management software ensures that reliable, up-to-date data is always used to drive decisions.

Data Management. Types of Data Management: Data management plays several roles in an organization's data environment, making essential functions easier and less time-intensive. Data preparation is used to clean and transform raw data into the right shape and format for analysis, including making corrections and combining data sets. Data pipelines enable the automated transfer of data from one system to another. ETL (Extract, Transform, Load) pipelines are built to take data from one system, transform it, and load it into the organization's data warehouse.
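The extract-transform-load pattern can be sketched with in-memory stand-ins for the source system and the warehouse (both hypothetical; real pipelines would talk to databases or files):

```python
# Toy ETL: pull raw rows, normalize them, load into a destination.
source_system = [
    {"name": " Ann ", "amount": "10"},
    {"name": "Bob", "amount": "25"},
]
warehouse = []  # stand-in for the warehouse table

def extract():
    # read raw rows from the source system
    return list(source_system)

def transform(rows):
    # trim names and cast amounts from strings to integers
    return [{"name": r["name"].strip(), "amount": int(r["amount"])}
            for r in rows]

def load(rows):
    # append the cleaned rows to the destination store
    warehouse.extend(rows)

load(transform(extract()))
print(warehouse[0])  # {'name': 'Ann', 'amount': 10}
```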

Data Management Types of Data Management (cont...) Data Catalogs - help manage metadata to create a complete picture of the data, providing a summary of its changes, locations, and quality while also making the data easy to find. Data Warehouses are places to consolidate various data sources, contend with the many data types businesses store, and provide a clear route for data analysis. Data Governance defines standards, processes, and policies to maintain data security and integrity .

Data Management Types of Data Management (cont...) Data Architecture provides a formal approach for creating and managing data flow. Data Security protects data from unauthorized access and corruption. Data Modeling documents the flow of data through an application or organization.

Data Management. Why is data management important? Data management is a crucial first step toward adding value for our customers and improving our business bottom line. With effective data management, people across an organization can find and access trusted data for their queries. Some benefits of an effective data management solution include: visibility, reliability, security, scalability.

Data Management. Importance of Data Management: Visibility – increases the visibility of your organization's data assets, making it easier for people to quickly and confidently find the right data for their analysis. Reliability – establishes processes and policies that build trust in the data being used to make decisions across your organization. Security – protects your organization and its employees from data losses, thefts, and breaches with authentication and encryption tools. Scalability – allows organizations to effectively scale data and usage occasions with repeatable processes that keep data and metadata up to date.

Data Management. Data Management Challenges: Traditional data management processes make it difficult to scale capabilities without compromising governance or security. Modern data management software must address several challenges to ensure trusted data can be found. Challenge 1: Increased data volumes – an organization can become unaware of what data it has, where the data is, and how to use it. Challenge 2: New roles for analytics – understanding naming conventions, complex data structures, and databases can be a challenge. Challenge 3: Compliance requirements – constantly changing compliance requirements make it a challenge to ensure people are using the right data.

Data Management. Establish Best Data Management Practices: An effective data management strategy should: clearly identify your business goals; focus on the quality of data; allow the right people to access the data; prioritize data security.

Data Cleaning. Data Cleaning – The process of identifying the incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data, and then modifying, replacing, or deleting them as necessary. Data cleaning is considered a foundational element of basic data science.

Data Cleaning. Data Cleaning – Data is the most valuable thing for analytics and machine learning. In computing and in business, data is needed everywhere. Real-world data is likely to contain incomplete, inconsistent, or missing values. If the data is corrupted, it may hinder the process or produce inaccurate results.
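Handling the incomplete, duplicate, and missing values described above can be sketched with pandas on a small made-up frame:

```python
# Identify missing values, drop duplicates, fill and drop as needed.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", None],
    "score": [88.0, np.nan, np.nan, 70.0],
})

missing_per_column = df.isna().sum()                  # identify the gaps
df = df.drop_duplicates()                             # delete duplicate rows
df["score"] = df["score"].fillna(df["score"].mean())  # replace missing scores
df = df.dropna(subset=["name"])                       # delete unnamed rows

print(len(df))  # 2
```

Whether to fill, replace, or delete depends on the necessity of each field, as the definition above notes.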

Data Cleaning