
REGULATION – 2021
CS3361 – DATA SCIENCE LABORATORY
LAB MANUAL

YEAR / SEMESTER: II / III

Prepared by
P.SANTHIYA
Assistant Professor
Department of Computer Science and Engineering

CS3361 DATA SCIENCE LABORATORY L T P C
0 0 4 2
COURSE OBJECTIVES:
To understand the Python libraries for data science.
To understand the basic statistical and probability measures for data science.
To learn descriptive analytics on the benchmark data sets.
To apply correlation and regression analytics on standard data sets.
To present and interpret data using visualization packages in Python.
LIST OF EXPERIMENTS:
1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
2. Working with Numpy arrays
3. Working with Pandas data frames
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
a. Normal curves
b. Density and contour plots
c. Correlation and scatter plots
d. Histograms
e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap

LIST OF EQUIPMENTS: (30 Students per Batch)
Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh

Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc.

TOTAL: 60 PERIODS

COURSE OUTCOMES:
At the end of this course, the students will be able to:
Make use of the python libraries for data science.
Make use of the basic statistical and probability measures for data science.
Perform descriptive analytics on the benchmark data sets.
Perform correlation and regression analytics on standard data sets.
Present and interpret data using visualization packages in Python.


Ex.No 1

Download, install and explore the features of NumPy, SciPy,
Jupyter, Statsmodels and Pandas packages
Date:


AIM:
To download, install and explore the features of the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
Downloading and Installing Anaconda on Linux:
1. Getting Started:

2. Getting through the License Agreement:


3. Choose Installation Location:



4. Extracting Files and packages:


5. Initializing Anaconda Installation:


6. Finishing up the Installation:


7. Working with Anaconda:
>> anaconda-navigator


a) Installing Jupyter Notebook using Anaconda:
To install Jupyter using Anaconda, just go through the following instructions:
1. Launch Anaconda Navigator:

2. Click on the Install Jupyter Notebook Button:


3. Beginning the Installation:


4. Loading Packages:


5. Finished Installation:

6. Launching Jupyter:




b) Installing Jupyter Notebook using pip:
To install Jupyter using pip, first run the following command to update pip:
>> python3 -m pip install --upgrade pip

After updating the pip version, follow the instructions provided below to install Jupyter:

Command to install Jupyter:

>>pip3 install Jupyter

1. Beginning Installation:


2. Collecting Files and Data:


3. Downloading Packages:


4. Running Installation:


5. Finished Installation:


6. Launching Jupyter:
Use the following command to launch Jupyter using command-line:

>>jupyter notebook


Explore the following features of python packages:
1. NumPy:
NumPy stands for Numerical Python. NumPy is an open-source library for the Python
programming language. It is used for scientific computing and working with arrays. The
source code for NumPy is located at this GitHub repository: https://github.com/numpy/numpy.
Features:

1. High-performance N-dimensional array object.
2. It contains tools for integrating code from C/C++ and Fortran.
3. It contains a multidimensional container for generic data.
4. Additional linear algebra, Fourier transform, and random number capabilities.
5. It consists of broadcasting functions.
6. It has data type definition capability to work with varied databases.
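
Below is a minimal sketch (not part of the original manual's required output) illustrating features 1, 5 and 6: an N-dimensional array with an explicit dtype, and broadcasting a one-dimensional row across a two-dimensional array. The values are illustrative only.

import numpy as np
# explicit dtype (feature 6): a 2x3 array of 32-bit float ones
a = np.ones((2, 3), dtype=np.float32)
# broadcasting (feature 5): the 1-D row is stretched across both rows
row = np.array([10, 20, 30], dtype=np.float32)
print(a + row)
# [[11. 21. 31.]
#  [11. 21. 31.]]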


2. SciPy:
SciPy stands for Scientific Python. SciPy is a scientific computation library that uses NumPy
underneath. The source code for SciPy is located at this github repository
https://github.com/scipy/scipy
Features:

1. SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems,
algebraic equations, differential equations, statistics and many other classes of problems.
2. It provides more utility functions for optimization, stats and signal processing.
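
As a short, hedged illustration of these utility functions (assuming SciPy is installed), the sketch below integrates sin(x) over [0, pi] and minimizes a simple quadratic; the function and starting point are made up for illustration.

import numpy as np
from scipy import integrate, optimize
# numerical integration: integral of sin(x) from 0 to pi (exact value: 2)
value, abs_error = integrate.quad(np.sin, 0, np.pi)
print(value)            # 2.0 up to floating-point error
# optimization: minimize (x - 3)^2, starting the search at x = 0
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0)
print(result.x)         # approximately [3.]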

Numpy vs. SciPy
NumPy and SciPy are both used for mathematical and numerical analysis. NumPy is suitable for
basic operations such as sorting, indexing and many more because it provides the underlying array data
structure, whereas SciPy builds on that structure to provide the numerical algorithms.
NumPy contains many functions for linear algebra, Fourier transforms, etc., whereas the SciPy
library contains full-featured versions of the linear algebra modules as well as many other
numerical algorithms.


3. Pandas:
Python Pandas is defined as an open-source library that provides high-performance data
manipulation in Python. The name Pandas is derived from the term "panel data", an
econometrics term for multidimensional structured data sets. It is used for data analysis in Python. Pandas is built
on top of the NumPy package, which means NumPy is required for operating Pandas.
Features:

1. Group by data for aggregations and transformations.
2. It has a fast and efficient DataFrame object with the default and customized indexing.
3. Used for reshaping and pivoting of the data sets.
4. It is used for data alignment and integration of the missing data.
5. Provide the functionality of Time Series.
6. Process a variety of data sets in different formats like matrix data, tabular heterogeneous, time
series.
7. Handle multiple operations of the data sets such as subsetting, slicing, filtering, groupBy, re-
ordering, and re-shaping.
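
A minimal sketch of features 1, 4 and 7 (group-by aggregation, handling missing data, and filtering); the small table here is made up purely for illustration.

import pandas as pd
import numpy as np
df = pd.DataFrame({'dept': ['cse', 'cse', 'ece'],
                   'marks': [80, np.nan, 65]})
# feature 4: fill the missing mark with the column mean
df['marks'] = df['marks'].fillna(df['marks'].mean())
# feature 1: group-by aggregation
print(df.groupby('dept')['marks'].mean())
# feature 7: filtering (subsetting rows)
print(df[df['marks'] > 70])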

4. Statsmodels:
statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests, and statistical data
exploration. The package is released under the open source Modified BSD (3-clause) license. The
online documentation is hosted at statsmodels.org.
Features:
1. Linear regression models like Ordinary least squares, Generalized least
squares, Weighted least squares, Least squares with autoregressive errors.
2. Bayesian Mixed GLM for Binomial and Poisson
3. GEE: Generalized Estimating Equations for one-way clustered or longitudinal data
4. Nonparametric statistics: Univariate and multivariate kernel density estimators
5. Datasets: Datasets used for examples and in testing
6. Sandbox: statsmodels contains a sandbox folder with code in various
stages of development and testing.
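
A minimal ordinary least squares example of the kind listed in feature 1, fit on synthetic data (the data is generated here purely for illustration):

import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)   # true slope 2, intercept 1
X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()    # estimate by ordinary least squares
print(model.params)           # approximately [1.0, 2.0]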
RESULT:
Thus, the NumPy, SciPy, Jupyter and Statsmodels packages were successfully downloaded and
installed, and their features were explored.


Ex.No 2

Working with Numpy arrays

Date:

AIM:
To write a Numpy arrays program to demonstrate basic array concepts in Jupyter Notebook.

PROGRAM:

1. Creating Arrays from Python Lists:
In[1]: import numpy as np

In[2]: # integer array:
np.array([1, 4, 2, 5, 3])
Out[2]: array([1, 4, 2, 5, 3])

#NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will upcast if possible (here, integers are upcast to floating point):
In[3]: np.array([3.14, 4, 2, 3])
Out[3]: array([ 3.14, 4. , 2. , 3. ])


#If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:
In[4]: np.array([1, 2, 3, 4], dtype='float32')
Out[4]: array([ 1., 2., 3., 4.], dtype=float32)

In[5]: # nested lists result in multidimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])


Out[5]: array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])

2. NumPy Array Attributes:
In[1]: import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array


Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and
size (the total size of the array):

In[2]: print("x3 ndim: ",
x3.ndim) print("x3 shape:",
x3.shape) print("x3 size: ",
x3.size)
Out[2]:x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60

In[3]: print("dtype:", x3.dtype)# data type of the array
Out[3]:dtype: int64

In[4]: print("itemsize:", x3.itemsize,
"bytes") print("nbytes:", x3.nbytes,
"bytes")
Out[4]:itemsize: 8 bytes
Out[4]:nbytes: 480 bytes

3. Array Indexing: Accessing Single Elements:
In[5]: x1
Out[5]: array([5, 0, 3, 3, 7, 9])
In[6]: x1[0]
Out[6]: 5
In[7]: x1[4]
Out[7]: 7
#To index from the end of the array, you can use negative indices
In[8]: x1[-1]
Out[8]: 9
In[9]: x1[-2]
Out[9]: 7


#In a multidimensional array, you access items using a comma-separated tuple of indices
In[10]: x2
Out[10]: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In[11]: x2[0, 0]
Out[11]: 3
In[12]: x2[2, 0]
Out[12]: 1
In[13]: x2[2, -1]
Out[13]: 7
#modify values using any of the above index notation
In[14]: x2[0, 0] = 12
x2
Out[14]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[15]: x1[0] = 3.14159 # this will be truncated!
x1
Out[15]: array([3, 0, 3, 3, 7, 9])
4. Array Slicing: Accessing Subarrays
#One-dimensional subarrays
In[16]: x = np.arange(10)
x
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17]: x[:5] # first five elements
Out[17]: array([0, 1, 2, 3, 4])
In[18]: x[5:] # elements after index 5
Out[18]: array([5, 6, 7, 8, 9])
In[19]: x[4:7] # middle subarray
Out[19]: array([4, 5, 6])

In[20]: x[::2] # every other element
Out[20]: array([0, 2, 4, 6, 8])
In[21]: x[1::2] # every other element, starting at index 1
Out[21]: array([1, 3, 5, 7, 9])


In[22]: x[::-1] # all elements, reversed
Out[22]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In[23]: x[5::-2] # reversed every other from index 5
Out[23]: array([5, 3, 1])
5. Multidimensional subarrays:
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3] # two rows, three columns
Out[25]: array([[12, 5, 2],
[ 7, 6, 8]])
In[26]: x2[:3, ::2] # all rows, every other column
Out[26]: array([[12, 2],
[ 7, 8],
[ 1, 7]])
#Finally, subarray dimensions can even be reversed together:
In[27]: x2[::-1, ::-1]
Out[27]: array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])

6. Accessing array rows and columns:
In[28]: print(x2[:, 0]) # first column of x2
[12 7 1]

In[29]: print(x2[0, :]) # first row of x2
[12 5 2 4]

#In the case of row access, the empty slice can be omitted for a more compact syntax:
In[30]: print(x2[0]) # equivalent to x2[0, :]
[12 5 2 4]
In[31]: print(x2)
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]

#extract a 2×2 subarray from this:
In[32]: x2_sub = x2[:2, :2]
print(x2_sub)
[[12  5]
 [ 7  6]]
#modify this subarray

In[33]: x2_sub[0, 0] = 99
print(x2_sub)
[[99 5]
[ 7 6]]
In[34]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]

7. Creating copies of arrays:
In[35]: x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99  5]
 [ 7  6]]

#modify this subarray

In[36]: x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42  5]
 [ 7  6]]
In[37]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]

Reshaping of Arrays:

In[38]: grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]


In[39]: x = np.array([1, 2, 3])
# row vector via reshape
x.reshape((1, 3))
Out[39]: array([[1, 2, 3]])
In[40]: # row vector via newaxis
x[np.newaxis, :]
Out[40]: array([[1, 2, 3]])

In[41]: # column vector via reshape
x.reshape((3, 1))
Out[41]: array([[1],
                [2],
                [3]])
In[42]: # column vector via newaxis
x[:, np.newaxis]
Out[42]: array([[1],
[2],
[3]])

8. Array Concatenation and Splitting:
In[43]: x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[43]: array([1, 2, 3, 3, 2, 1])
#concatenate more than two arrays at once:
In[44]: z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1  2  3  3  2  1 99 99 99]
#np.concatenate can also be used for two-dimensional arrays:
In[45]: grid = np.array([[1, 2, 3],
[4, 5, 6]])
In[46]: # concatenate along the first axis
np.concatenate([grid, grid])
Out[46]: array([[1, 2, 3],
                [4, 5, 6],
                [1, 2, 3],
                [4, 5, 6]])
In[47]: # concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
Out[47]: array([[1, 2, 3, 1, 2, 3],
                [4, 5, 6, 4, 5, 6]])

In[48]: x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])
Out[48]: array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
In[49]: # horizontally stack the arrays
y = np.array([[99], [99]])
np.hstack([grid, y])
Out[49]: array([[ 9,  8,  7, 99],
                [ 6,  5,  4, 99]])

Splitting of arrays:

In[50]: x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]

In[51]: grid = np.arange(16).reshape((4, 4))
grid
Out[51]: array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
In[52]: upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]
In[53]: left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]

9. Exploring NumPy’s UFuncs:
In[54]: x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
Out[54]: x = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [ 0. 0.5 1. 1.5]
x // 2 = [0 0 1 1]

In[8]: print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
Out[8]:-x = [ 0 -1
-2 -3]
x ** 2 = [0 1 4 9]


x % 2 = [0 1 0 1]

In[9]: -(0.5*x + 1) ** 2
Out[9]: array([-1. , -2.25, -4. , -6.25])

In[10]: np.add(x, 2)
Out[10]: array([2, 3, 4, 5])
10. Absolute value:
In[11]: x = np.array([-2, -1, 0, 1, 2])
abs(x)
Out[11]: array([2, 1, 0, 1, 2])

In[12]: np.absolute(x)
Out[12]: array([2, 1, 0, 1, 2])
In[13]: np.abs(x)
Out[13]: array([2, 1, 0, 1, 2])

In[14]: x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0
+ 1j]) np.abs(x)
Out[14]: array([ 5., 5., 2., 1.])

11. Trigonometric functions:
In[15]: theta = np.linspace(0, np.pi, 3)

In[16]: print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
Out[16]:theta = [ 0. 1.57079633 3.14159265]
sin(theta) = [ 0.00000000e+00 1.00000000e+00 1.22464680e-16]
cos(theta) = [ 1.00000000e+00 6.12323400e-17 -1.00000000e+00]
tan(theta) = [ 0.00000000e+00 1.63312394e+16 -1.22464680e-16]
In[17]: x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
Out[17]:x = [-1, 0, 1]
arcsin(x) = [-1.57079633 0. 1.57079633]


arccos(x) = [ 3.14159265 1.57079633 0. ]
arctan(x) = [-0.78539816 0. 0.78539816]

12. Exponents and logarithms:
In[18]: x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
Out[18]:x = [1, 2, 3]
e^x = [ 2.71828183 7.3890561 20.08553692]
2^x = [ 2. 4. 8.]
3^x = [ 3 9 27]

In[19]: x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
Out[19]:x = [1, 2, 4, 10]
ln(x) = [ 0. 0.69314718 1.38629436 2.30258509]
log2(x) = [ 0. 1. 2. 3.32192809]
log10(x) = [ 0. 0.30103 0.60205999 1. ]

In[20]: x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
Out[20]:exp(x) - 1 = [ 0. 0.0010005 0.01005017 0.10517092]
log(1 + x) = [ 0. 0.0009995 0.00995033 0.09531018]

13. Specialized ufuncs:
In[21]: from scipy import special
In[22]: # Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
Out[22]:gamma(x) = [ 1.00000000e+00 2.40000000e+01
3.62880000e+05] ln|gamma(x)| = [ 0. 3.17805383 12.80182748]


beta(x, 2) = [ 0.5 0.03333333 0.00909091]
In[23]: # Error function (integral of Gaussian), its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
Out[23]:erf(x) = [ 0. 0.32862676 0.67780119 0.84270079]
erfc(x) = [ 1. 0.67137324 0.32219881 0.15729921]
erfinv(x) = [ 0. 0.27246271 0.73286908 inf]

14. Aggregates:
In[26]: x = np.arange(1, 6)
np.add.reduce(x)
Out[26]: 15
In[27]: np.multiply.reduce(x)
Out[27]: 120

In[28]: np.add.accumulate(x)
Out[28]: array([ 1, 3, 6, 10, 15])

In[29]: np.multiply.accumulate(x)
Out[29]: array([ 1, 2, 6, 24, 120])

15. Outer products:
In[30]: x = np.arange(1, 6)
np.multiply.outer(x, x)
Out[30]: array([[ 1, 2, 3, 4, 5],
[ 2, 4, 6, 8, 10],
[ 3, 6, 9, 12, 15],
[ 4, 8, 12, 16, 20],
[ 5, 10, 15, 20, 25]])







RESULT:
Thus, the Numpy array program was successfully executed and verified.


Ex.No 3

Working with Pandas DataFrames

Date:


AIM:
To write a Pandas program using a sample dictionary DataFrame and perform operations on it.

Sample DataFrame:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael',
'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

PROGRAM:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',
'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print(df)
print("Summary of the basic information about this DataFrame and its data:")
print(df.info())



Sample Output:

   attempts       name qualify  score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Summary of the basic information about this DataFrame and its data:
<class 'pandas.core.frame.DataFrame'> Index:
10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype

0 name 10 non-null object
1 score 8 non-null float64
2 attempts 10 non-null int64
3 qualify 10 non-null object
dtypes: float64(1), int64(1), object(2) memory
usage: 400.0+ bytes
None

i. To get the first 3 rows of a given DataFrame.
print("First three rows of the data frame:")
print(df.iloc[:3])
Sample Output:

First three rows of the data frame:
attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5


ii. To select the 'name' and 'score' columns from the following DataFrame.
print("Select specific columns:")
print(df[['name', 'score']])


Sample Output:

Select specific columns:
name score
a Anastasia 12.5
b Dima 9.0
c Katherine 16.5
d James NaN
e Emily 9.0
f Michael 20.0
g Matthew 14.5
h Laura NaN
i Kevin 8.0
j Jonas 19.0

iii. To select the specified columns and rows from a given DataFrame. Select 'name' and 'score'
columns in rows 1, 3, 5, 6 from the following data frame.
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes

iv. To select the rows where the number of attempts in the examination is greater than 2.
print("Number of attempts in the examination is greater than 2:")
print(df[df['attempts'] > 2])

Sample Output:

Number of attempts in the examination is greater than 2:
name score attempts qualify
b Dima 9.0 3 no
d James NaN 3 no
f Michael 20.0 3 yes



v. To select the rows where the score is missing, i.e. is NaN.
print("Rows where score is missing:")
print(df[df['score'].isnull()])


Sample Output:

Rows where score is missing:
   attempts   name qualify  score
d         3  James      no    NaN
h         1  Laura      no    NaN
vi. To change the score in row 'd' to 11.5.
print("\nOriginal data frame:") print(df)
print("\nChange the score in row 'd' to 11.5:") df.loc['d',
'score'] = 11.5
print(df)

Sample Output:

Original data frame:
attempts name qualify score

a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
Change the score in row 'd' to 11.5:
Attempts name qualify score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no 11.5
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0




vii. To calculate the sum of the examination score by the students.
print("\nSum of the examination attempts by the students:")
print(df['score'].sum())
Sample Output:

Sum of the examination score by the students:
108.5

viii. To append a new row 'k' to DataFrame with given values for each column. Now delete the new
row and return the original data frame.

print("Original rows:") print(df)
print("\nAppend a new row:") df.loc['k'] =
[1, 'Suresh', 'yes', 15.5]
print("Print all records after insert a new record:") print(df)
print("\nDelete the new row and display the original rows:") df =
df.drop('k')
print(df)

Sample Output:

Original rows:
attempts name qualify score

a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0


Append a new row:
Print all records after insert a new record:
   attempts       name qualify  score


a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0
k 1 Suresh yes 15.5


Delete the new row and display the original rows:
   attempts       name qualify  score
a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0


ix. To delete the 'attempts' column from the DataFrame.
print("Original rows:") print(df)
print("\nDelete the 'attempts' column from the data frame:")
df.pop('attempts')
print(df)


Sample Output:

Original rows:
attempts name qualify score

a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0


Delete the 'attempts' column from the data frame:
name qualify score
a Anastasia yes 12.5
b Dima no 9.0
c Katherine yes 16.5
d   James      no    NaN
e Emily no 9.0
f Michael yes 20.0
g Matthew yes 14.5
h Laura no NaN
i Kevin no 8.0
j Jonas yes 19.0


x. To insert a new column in existing DataFrame.
print("Original rows:") print(df)
color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red'] df['color'] =
color


print("\nNew DataFrame after inserting the 'color' column") print(df)
Sample Output

Original rows:
attempts name qualify score

a 1 Anastasia yes 12.5
b 3 Dima no 9.0
c 2 Katherine yes 16.5
d 3 James no NaN
e 2 Emily no 9.0
f 3 Michael yes 20.0
g 1 Matthew yes 14.5
h 1 Laura no NaN
i 2 Kevin no 8.0
j 1 Jonas yes 19.0


New DataFrame after inserting the 'color' column
Attempts name qualify score color
a 1 Anastasia yes 12.5 Red
b 3 Dima no 9.0 Blue
c 2 Katherine yes 16.5 Orange
d 3 James no NaN Red
e 2 Emily no 9.0 White
f 3 Michael yes 20.0 White
g 1 Matthew yes 14.5 Blue
h 1 Laura no NaN Green
i 2 Kevin no 8.0 Green
j 1 Jonas yes 19.0 Red





RESULT:
Thus, the working of Pandas Dataframe using Dictionary was executed and verified successfully.


Ex.No:4

Descriptive analytics on the Iris data set

Date:


AIM:
To read data from text files, Excel and the web, and to explore various commands for doing
descriptive analytics on the Iris data set.
PROCEDURE:
Download the Iris.csv file from https://www.kaggle.com/datasets/uciml/iris and use the Pandas library to
load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.
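
The AIM also mentions text files, Excel and the web; a brief sketch of the corresponding Pandas readers is given below. The file names and URL are placeholders, and read_excel additionally needs an engine such as openpyxl installed.

import pandas as pd
# text file (comma- or tab-separated) -- placeholder file name
df_txt = pd.read_csv("iris.txt", sep=",")
# Excel workbook -- placeholder file name; requires openpyxl for .xlsx
df_xls = pd.read_excel("iris.xlsx", sheet_name=0)
# directly from the web -- placeholder URL pointing at a raw CSV file
df_web = pd.read_csv("https://example.com/iris.csv")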
PROGRAM:

import pandas as pd
df = pd.read_csv("Music/Iris.csv")  # Reading the CSV file
print(df)
print(df.dtypes)
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \

0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8


Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa


.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
[150 rows x 6 columns]
Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species           object
dtype: object

# Printing top 5 rows
print(df.head())
#Use the shape attribute to get the shape of the dataset.

print(df.shape)

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa

(150, 6)

#To know the columns and their data types use the info() method.

df.info()

<class
'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to
149


Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

print(df.describe())
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000


# Missing values can occur when no information is provided
print(df.isnull().sum())


Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0

PetalWidthCm 0
Species 0
dtype: int64


# To check whether the dataset contains any duplicates or not
data = df.drop_duplicates(subset="Species")
print(data)
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
0      1            5.1           3.5            1.4           0.2      Iris-setosa
50    51            7.0           3.2            4.7           1.4  Iris-versicolor
100  101            6.3           3.3            6.0           2.5   Iris-virginica
#To find unique species from the given dataset
print(df.value_counts("Species"))


Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64


#matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#seaborn
import seaborn as sns
#plot the variable 'SepalWidthCm'
plt.scatter(df.index, df['SepalWidthCm'])
plt.show()
#visualize the same plot by considering its variety using the sns.scatterplot() function of the
seaborn library.
sns.scatterplot(x=df.index,y=df['SepalWidthCm'],hue=df['Species'])



#visualizes data by connecting the data points via line segments.
plt.figure(figsize=(6,6))
plt.title("line plot for petal length")
plt.xlabel('index',fontsize=20)
plt.ylabel('PetalLengthCm',fontsize=20)
plt.plot(df.index, df['PetalLengthCm'], markevery=1, marker='d')
for name, group in df.groupby('Species'):
    plt.plot(group.index, group['PetalLengthCm'], label=name, markevery=1, marker='d')
plt.legend()
plt.show()


#Plotting histogram using the matplotlib plt.hist() function :

plt.hist(df["PetalWidthCm"])




sns.distplot(df["PetalWidthCm"],kde=False,color='RED',bins=10)
<AxesSubplot:xlabel='PetalWidthCm'>

RESULT:
Thus, the descriptive analysis on the iris data set was successfully executed and practically
verified.


Ex.No:5.a

Univariate analysis using the UCI diabetes data set

Date:


AIM:
To read data from CSV files and explore various commands for doing univariate analysis using the
UCI diabetes data set.
PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/datasets/uciml/pima-
indians-diabetes-database and use the Pandas library to load this CSV file and convert it into a dataframe.
The read_csv() method is used to read CSV files.
PROGRAM:

import pandas as pd
df = pd.read_csv("E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.CSV")
print(df)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
..           ...      ...            ...            ...      ...    ...
763 10 101 76 48 180 32.9
764 2 122 70 27 0 36.8
765 5 121 72 23 112 26.2
766 1 126 60 0 0 30.1
767 1 93 70 31 0 30.4

DiabetesPedigreeFunction Age Outcome

0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
.. ... ... ...

763 0.171 63 0
764 0.340 27 0


765 0.245 30 0
766 0.349 47 1
767 0.315 23 0


[768 rows x 9 columns]
# To know the data type
print(df.dtypes)
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
#To print first 5 rows
print(df.head())
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1


DiabetesPedigreeFunction Age Outcome

0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
#Use the shape attribute to get the shape of the dataset.
print(df.shape)
(768, 9)


#calculate mean
print("Mean of Pregnancies: %f" %df['Pregnancies'].mean())
print("Mean of BloodPressure: %f" %df['BloodPressure'].mean())
print("Mean of Glucose: %f" %df['Glucose'].mean())
print("Mean of Age: %f" %df['Age'].mean())
Sample Output:
Mean of Pregnancies: 3.845052
Mean of BloodPressure: 69.105469
Mean of Glucose: 120.894531
Mean of Age: 33.240885
#calculate median
print("Median of Pregnancies: %f" %df['Pregnancies'].median())
print("Median of BloodPressure: %f" %df['BloodPressure'].median())
print("Median of Glucose: %f" %df['Glucose'].median())
print("Median of Age: %f" %df['Age'].median())
Sample Output:
Median of Pregnancies: 3.000000
Median of BloodPressure: 72.000000
Median of Glucose: 117.000000
Median of Age: 29.000000
#calculate standard deviation
print("Standard deviation for BloodPressure: %f" % df['BloodPressure'].std())
print("Standard deviation for Glucose: %f" % df['Glucose'].std())
print("Standard deviation for Pregnancies: %f" % df['Pregnancies'].std())
Sample Output:
Standard deviation for BloodPressure: 19.355807
Standard deviation for Glucose: 31.972618
Standard deviation for Pregnancies: 3.369578
#To describe the data
df.Glucose.describe()
Sample Output:

count 768.000000
mean 120.894531
std 31.972618
min 0.000000
25% 99.000000
50% 117.000000
75% 140.250000
max 199.000000
Name: Glucose, dtype: float64
#create frequency table
df['Glucose'].value_counts()
Sample Output:
99     17
100    17
111    14
129    14
125    14
       ..
191     1
177     1
44      1
62      1
190     1
Name: Glucose, Length: 136, dtype: int64
#skewness and kurtosis
print("Skewness: %f" % df['Pregnancies'].skew())
print("Kurtosis: %f" % df['Pregnancies'].kurt())
Sample Output:
Skewness: 0.901674
Kurtosis: 0.159220
#find the frequency of each outcome
pd.crosstab(index=df['Outcome'], columns='count')
Sample Output:
col_0    count
Outcome
0          500
1          268

#create frequency table for 'Pregnancies'
df['Pregnancies'].value_counts()
Sample Output:
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28


10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64

#find the frequency of each pregnancy count
pd.crosstab(index=df['Pregnancies'], columns='count')
Sample Output:
col_0 count
Pregnancies
0 111
1 135
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
12 9
13 10
14 2
15 1
17 1
import matplotlib.pyplot as plt
df.hist(column='BloodPressure', grid=False, edgecolor='black')
Sample Output:
array([[<AxesSubplot:title={'center':'BloodPressure'}>]], dtype=object)



#to create a density curve
import seaborn as sns
sns.kdeplot(df['BloodPressure'])
<AxesSubplot:xlabel='BloodPressure', ylabel='Density'>


#visualize the same plot by considering its variety using the sns.scatterplot() function of the seaborn library.
sns.scatterplot(x=df.index, y=df['Age'], hue=df['Outcome'])


import numpy as np
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100, dtype=int)
preg = pd.DataFrame({'month':preg_month, 'count_of_preg_prop':preg_proportion, 'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'], inplace=True)
preg.head(10)
Sample Output:

month count_of_preg_prop percentage_proportion
1 135 17
0 111 14
2 103 13
3 75 9
4 68 8
5 57 7
6 50 6
7 45 5
8 38 4
9 28 3


import warnings
warnings.filterwarnings("ignore")
fig, axes = plt.subplots(nrows=3, ncols=2, dpi=120, figsize=(8,6))

plot00 = sns.countplot('Pregnancies', data=df, ax=axes[0][0], color='green')
axes[0][0].set_title('Count', fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.', fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count', fontdict={'fontsize':7})
plt.tight_layout()

plot01 = sns.countplot('Pregnancies', data=df, hue='Outcome', ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.', fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count', fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.distplot(df['Pregnancies'], ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution', fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class', fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist', fontdict={'fontsize':7})
plt.tight_layout()

plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1], label='Non-Diab.')
plot11_2 = df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1], label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class', fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist', fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6')   # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6')   # for legend title
plt.tight_layout()

plot20 = sns.boxplot(df['Pregnancies'], ax=axes[2][0], orient='v')
axes[2][0].set_title('Pregnancies', fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy', fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary', fontdict={'fontsize':7})
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome', y='Pregnancies', data=df, ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.', fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy', fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary', fontdict={'fontsize':7})
plt.xticks(ticks=[0,1], labels=['Non-Diab.','Diab.'], fontsize=7)
plt.tight_layout()
plt.show()

Sample Output:



RESULT:

Thus, the Univariate analysis using the UCI diabetes data set was successfully executed and
practically verified.


Ex.No:5.b

Bivariate analysis using the UCI diabetes data set

Date:
AIM:
To read data from text files, Excel and the web, and to explore various commands for doing bivariate
analysis using the UCI diabetes data set.
PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/ and use the Pandas
library to load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.
PROGRAM:

Linear regression modelling

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-50]
diabetes_X_test = diabetes_X[-50:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-50]
diabetes_y_test = diabetes_y[-50:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print("Coefficients: \n", regr.coef_)
Sample output:
Coefficients: [945.4992184]

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
Sample output:
Mean squared error: 3471.92

# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))
Sample output:
Coefficient of determination: 0.41
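
For reference, the coefficient of determination compares the model's residual sum of squares against a mean-only baseline,

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

so the value 0.41 here means the single feature explains roughly 41% of the variance in the test targets.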

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()


Logistic regression modelling

#Import Sklearn Packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#to create plot_bar, histogram, boxplot etc
import seaborn as sns
import matplotlib.pyplot as plt
#calculate accuracy measure and confusion matrix
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore")
#Loading Data
diabetes = pd.read_csv("E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
diabetes
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

768 rows × 9 columns

#Train/Test split
X = diabetes.drop("Outcome", axis=1)
Y = diabetes[["Outcome"]]   # target variable
# split data into training and validation datasets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
# instantiate the model
model = LogisticRegression()
# fitting the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred[0:5]
# metrics
print("Accuracy for test set is {}.".format(round(metrics.accuracy_score(y_test, y_pred), 4)))
print("Precision for test set is {}.".format(round(metrics.precision_score(y_test, y_pred), 4)))
print("Recall for test set is {}.".format(round(metrics.recall_score(y_test, y_pred), 4)))
Sample Output:
Accuracy for test set is 0.7917.
Precision for test set is 0.7115.
Recall for test set is 0.5968.
print(metrics.classification_report(y_test, y_pred))
Sample Output:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       130
           1       0.71      0.60      0.65        62

    accuracy                           0.79       192
   macro avg       0.77      0.74      0.75       192
weighted avg       0.79      0.79      0.79       192
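
The precision and recall above come directly from the confusion matrix. As an optional, hedged addition (using the metrics module already imported in this program), it can be printed as follows:

# rows = actual class, columns = predicted class:
# [[TN FP]
#  [FN TP]]  -- precision = TP/(TP+FP), recall = TP/(TP+FN)
print(metrics.confusion_matrix(y_test, y_pred))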


#Visualization
f, ax = plt.subplots(figsize=(8,6))
sns.heatmap(diabetes.corr(), cmap="GnBu", annot=True, linewidths=0.5, fmt='.1f', ax=ax)
plt.show()







Result:

Thus, the Bivariate analysis using the UCI diabetes data set was successfully executed and
practically verified.


Ex.No:5.c

Multiple Regression analysis using the UCI diabetes data set

Date:
AIM:
To read data from Excel and explore various commands for doing multiple regression analysis
using the UCI diabetes data set.
PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/ and use the Pandas
library to load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.
PROGRAM:

#import our Libraries
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
#Loading Data
diabetes = pd.read_csv("E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
diabetes

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0

768 rows × 9 columns

# calculate the correlation matrix
corr = diabetes.corr()
# display the correlation matrix
display(corr)

                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age   Outcome
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683                 -0.033523  0.544341  0.221898
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071                  0.137337  0.263514  0.466581
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805                  0.041265  0.239528  0.065068
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573                  0.183928 -0.113970  0.074752
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859                  0.185071 -0.042163  0.130548
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000                  0.140647  0.036242  0.292695
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647                  1.000000  0.033561  0.173844
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242                  0.033561  1.000000  0.238356
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695                  0.173844  0.238356  1.000000

# plot the correlation heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu')
<AxesSubplot:>





#Train/Test split
X = diabetes.drop("Outcome", axis=1)
Y = diabetes[["Outcome"]]   # target variable
# split data into training and validation datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & Y_train data set
regression_model.fit(X_train, Y_train)
LinearRegression()

# let's grab the coefficients of our model and the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]
print("The intercept for our model is {:.4}".format(intercept))
print('-'*100)
# loop through the columns and print the coefficients
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {:.2}".format(coef[0], coef[1]))
Sample output:
The intercept for our model is -0.879
The Coefficient for Pregnancies is 0.015
The Coefficient for Glucose is 0.0057
The Coefficient for BloodPressure is -0.0021
The Coefficient for SkinThickness is 0.001
The Coefficient for Insulin is -0.00017
The Coefficient for BMI is 0.013
The Coefficient for DiabetesPedigreeFunction is 0.14
The Coefficient for Age is 0.0038
# Get multiple predictions
y_predict = regression_model.predict(X_test)

# Show the first 5 predictions
y_predict[:5]
array([[1.01391226],
[0.21532924],
[0.09157383],
[0.60583158],
[0.15988782]])
# define our input
X2 = sm.add_constant(X)

# create a OLS model
model=sm.OLS(Y, X2)

# fit the data
est = model.fit()


# print out a summary
print(est.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                Outcome   R-squared:                       0.303
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     41.29
Date:                Sat, 15 Oct 2022   Prob (F-statistic):           7.36e-55
Time:                        19:14:26   Log-Likelihood:                -381.91
No. Observations:                 768   AIC:                             781.8
Df Residuals:                     759   BIC:                             823.6
Df Model:                           8
Covariance Type:            nonrobust
============================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
const                      -0.8539      0.085     -9.989      0.000      -1.022      -0.686
Pregnancies                 0.0206      0.005      4.014      0.000       0.011       0.031
Glucose                     0.0059      0.001     11.493      0.000       0.005       0.007
BloodPressure              -0.0023      0.001     -2.873      0.004      -0.004      -0.001
SkinThickness               0.0002      0.001      0.139      0.890      -0.002       0.002
Insulin                    -0.0002      0.000     -1.205      0.229      -0.000       0.000
BMI                         0.0132      0.002      6.344      0.000       0.009       0.017
DiabetesPedigreeFunction    0.1472      0.045      3.268      0.001       0.059       0.236
Age                         0.0026      0.002      1.693      0.091      -0.000       0.006
==============================================================================
Omnibus:                       41.539   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.183
Skew:                           0.395   Prob(JB):                     1.69e-07
Kurtosis:                       2.408   Cond. No.                     1.10e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that
there are strong multicollinearity or other numerical problems.
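
Since note [2] flags possible multicollinearity, a short hedged sketch using the variance_inflation_factor already imported at the top of this program can quantify it; a VIF above roughly 10 is commonly read as problematic.

# VIF for each column of X2 (column 0 is the added constant)
for i, col in enumerate(X2.columns):
    print(col, round(variance_inflation_factor(X2.values, i), 2))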
# make some confidence intervals, 95% by default
est.conf_int()


0 1
const -1.021709 -0.686079
Pregnancies 0.010521 0.030663
Glucose 0.004909 0.006932
BloodPressure -0.003925 -0.000739
SkinThickness -0.002029 0.002338
Insulin -0.000475 0.000114
BMI 0.009146 0.017343
DiabetesPedigreeFunction 0.058792 0.235682
Age -0.000419 0.005662
# estimate the p-values
est.pvalues
Sample output:

const                       3.707465e-22
Pregnancies                 6.561462e-05
Glucose                     2.691192e-28
BloodPressure               4.178788e-03
SkinThickness               8.895424e-01
Insulin                     2.285711e-01
BMI                         3.853484e-10
DiabetesPedigreeFunction    1.131733e-03
Age                         9.092163e-02
dtype: float64

import math
# calculate the mean squared error
model_mse = mean_squared_error(Y_test, y_predict)

# calculate the mean absolute error
model_mae = mean_absolute_error(Y_test, y_predict)

# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)


# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
MSE 0.148
MAE 0.322
RMSE 0.384
model_r2 = r2_score(Y_test, y_predict)
print("R2: {:.2}".format(model_r2))
R2: 0.32

import pickle
# pickle the model
with open('my_mulitlinear_regression.sav', 'wb') as f:
    pickle.dump(regression_model, f)
# load it back in
with open('my_mulitlinear_regression.sav', 'rb') as pickle_file:
    regression_model_2 = pickle.load(pickle_file)
# make a new prediction
regression_model_2.predict([X_test.loc[150]])
array([[0.42308994]])

Result:

Thus, the multiple regression analysis using the UCI diabetes data set was successfully
executed and practically verified.


Ex.No:6

Apply and explore various plotting functions on UCI data sets

Date:
AIM:
To read data from Excel and to apply and explore various plotting functions on UCI data sets.
PROCEDURE:
Download the Pima Indians Diabetes data as a CSV file from https://www.kaggle.com/ and use the Pandas
library to load this CSV file and convert it into a dataframe. The read_csv() method is used to read CSV files.
PROGRAM:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#Loading Data
diabetes = pd.read_csv("E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
#To run numerical descriptive stats for the data set
diabetes.describe()


       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000

sns.kdeplot(diabetes["Pregnancies"], color="green", shade=True)
plt.show()
plt.figure()








plt.figure(figsize=(6,6))
sns.kdeplot(diabetes["Glucose"], color = "green",shade = True)
plt.show()
plt.figure()



plt.figure(figsize=(8,8))
sns.kdeplot(diabetes["Age"], diabetes["BloodPressure"], cmap="RdYlBu", shade=True)
plt.show()
plt.figure()


plt.figure(figsize=(6,6))
sns.kdeplot(x=diabetes.Age, y=diabetes.Glucose, cmap="PRGn", shade=True, bw_adjust=1)
plt.show()


# calculate the correlation matrix
corr = diabetes.corr()
# display the correlation matrix
display(corr)


                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  DiabetesPedigreeFunction       Age   Outcome
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683                 -0.033523  0.544341  0.221898
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071                  0.137337  0.263514  0.466581
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805                  0.041265  0.239528  0.065068
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573                  0.183928 -0.113970  0.074752
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859                  0.185071 -0.042163  0.130548
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000                  0.140647  0.036242  0.292695
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647                  1.000000  0.033561  0.173844
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242                  0.033561  1.000000  0.238356
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695                  0.173844  0.238356  1.000000

import seaborn as sns
sns.scatterplot(x="Pregnancies", y="Glucose", data=corr);


sns.lmplot(x="Pregnancies", y="Glucose", hue="Outcome", data=corr);



# Histogram+Density Plot
sns.distplot(diabetes["Age"], color="green")
plt.show()
plt.figure()

# Adding Two Plots In One
sns.kdeplot(diabetes[diabetes.Outcome == 0]['Age'], color="blue")
sns.kdeplot(diabetes[diabetes.Outcome == 1]['Age'], color="orange", shade=True)
plt.show()


dia1 = diabetes[diabetes.Outcome==1]
dia0 = diabetes[diabetes.Outcome==0]
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(diabetes.Glucose, kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Glucose, kde=False, color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label="Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=diabetes.Outcome, y=diabetes.Glucose)
plt.title("Boxplot for Glucose by Outcome")
Text(0.5, 1.0, 'Boxplot for Glucose by Outcome')


Three dimensional plotting:

import numpy as np               # linear algebra
import pandas as pd              # data processing, CSV file I/O (e.g. pd.read_csv)
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
import matplotlib
import functools
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#Loading Data
diabetes = pd.read_csv("E:\DATA SCIENCE\Pima_indian_diabetes\diabetes.csv")
diabetes
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0

768 rows × 9 columns
x = diabetes.Age[:20]
y = diabetes.Glucose[:20]
def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z');


fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,cmap='viridis', edgecolor='none')
ax.set_title('surface');
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z')




fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.scatter(X,Y,Z, cmap='viridis', linewidth=0.5);
ax.set_title('scatter');
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
ax.set_zlabel('z')


Result:

Thus, the various plotting functions, including three-dimensional plotting, were successfully executed on the UCI diabetes data set and practically verified.


Ex.No:7

Visualizing Geographic Data with Basemap

Date:


AIM:
To read data from CSV files and visualize geographic data with Basemap.
PROCEDURE:
Download the CSV file from https://www.kaggle.com/ and use the Pandas library to load this CSV file and
convert it into a dataframe. The read_csv() method is used to read CSV files.
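
Note: Basemap is not bundled with Anaconda by default; assuming a conda environment, it can typically be installed from the conda-forge channel before running the program:

>> conda install -c conda-forge basemap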
PROGRAM:

import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl
# for plots
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
%matplotlib inline
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
import warnings
warnings.filterwarnings("ignore")
cities = pd.read_csv(r"C:\Users\Admin\Downloads\datasets_557_1096_cities_r2.csv")
cities.head()
fig = plt.figure(figsize=(10,8))
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize=20)
plt.grid(b=True, which='both', color='Black', linestyle='-')
plt.xlabel('No of cities taken for analysis', fontsize=20)
plt.show()




fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
map = Basemap(llcrnrlon=67, llcrnrlat=5, urcrnrlon=99, urcrnrlat=37, projection="lcc", lat_0=28, lon_0=77)
#map.bluemarble()
#map.fillcontinents(color="red")
map.drawmapboundary(color="red")
map.drawcountries(color="brown")
map.drawcoastlines(color="blue")
# draw state boundaries from the shapefile
map.readshapefile(r"C:\Users\Admin\Music\India_State_Shapefile\India_State_Boundary", "India_State_Boundary")
# the 'location' column holds "lat,lon" strings; split them and convert to float
cities['latitude'] = cities['location'].apply(lambda x: x.split(',')[0]).astype(float)
cities['longitude'] = cities['location'].apply(lambda x: x.split(',')[1]).astype(float)
print("The Top 10 Cities sorted according to the Total Population (Descending Order)")
top_pop_cities = cities.sort_values(by='population_total', ascending=False)


top10_pop_cities = top_pop_cities.head(10)
#plt.subplots(figsize=(20, 15))
lg = array(top10_pop_cities['longitude'])
lt = array(top10_pop_cities['latitude'])
pt = array(top10_pop_cities['population_total'])
nc = array(top10_pop_cities['name_of_city'])
x, y = map(lg, lt)   # 'map' is the Basemap instance: projects lon/lat to map coordinates
population_sizes = top10_pop_cities["population_total"].apply(lambda x: int(x / 5000))
plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, cmap=cm.Dark2, alpha=0.7)
for ncs, xpt, ypt in zip(nc, x, y):
    plt.text(xpt + 60000, ypt + 30000, ncs, fontsize=10, fontweight='bold')
plt.title('Top 10 Populated Cities in India', fontsize=20)

The Top 10 Cities sorted according to the Total Population (Descending Order)





Result:

Thus, visualizing geographic data with Basemap was successfully executed and practically verified.


VIVA QUESTIONS

NumPy

1. What is Numpy?
Ans: NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python, offering a powerful N-dimensional array object and sophisticated (broadcasting) functions.
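
A minimal sketch of the two features named in the answer (the N-dimensional array and broadcasting); the array values are purely illustrative:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # a 2-D ndarray of shape (2, 3)
b = np.array([10, 20, 30])     # a 1-D ndarray of shape (3,)

# Broadcasting: b is stretched across each row of a without copying data
print(a + b)                   # [[11 22 33], [14 25 36]]
print(a.shape, a.ndim)         # (2, 3) 2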


2. Why NumPy is used in Python?
Ans: NumPy is a package in Python used for scientific computing. The NumPy package is used to perform different operations. The ndarray (NumPy array) is a multidimensional array used to store values of the same datatype. These arrays are indexed just like sequences, starting with zero.
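
A short illustration of zero-based indexing and slicing on an ndarray (values are arbitrary):

import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])     # 10 -- indexing starts at zero, like Python sequences
print(arr[-1])    # 50 -- negative indices count from the end
print(arr[1:4])   # [20 30 40] -- slicing works like it does for lists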


3. What does NumPy mean in Python?
Ans: NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for
the Python programming language, adding support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate on these arrays.


4. Where is NumPy used?
Ans: NumPy is an open source numerical Python library. NumPy contains multi-dimensional array and matrix data structures. It can be utilised to perform a number of mathematical operations on arrays, such as trigonometric, statistical, and algebraic routines. NumPy is the successor to the earlier Numeric and Numarray libraries.
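
A brief sketch of the three kinds of routines mentioned above; the input arrays are made up for illustration:

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
print(np.sin(data))                 # trigonometric: element-wise sine
print(np.mean(data), np.std(data))  # statistical: mean and standard deviation

m = np.array([[1, 2], [3, 4]])
print(np.dot(m, m))                 # algebraic: matrix multiplication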


Pandas
1. What is Pandas?
Ans: Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in Python.



2. What is Python pandas used for?
Ans: Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for manipulating
numerical tables and time series. pandas is free software released under the three-clause BSD
license.



3. What is a Series in Pandas?
Ans: A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas Series is essentially like a single column in an Excel sheet.
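
A minimal example of a labelled Series; the labels and values are chosen only for illustration:

import pandas as pd

s = pd.Series([25, 30, 35], index=["alice", "bob", "carol"])
print(s["bob"])   # 30 -- access by label, like a column keyed by row labels
print(s.index)    # Index(['alice', 'bob', 'carol'], dtype='object')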



4. Mention the different types of data structures in pandas?
Ans: There are two data structures supported by the pandas library, Series and DataFrame. Both data structures are built on top of NumPy. Series is a one-dimensional data structure and DataFrame is the two-dimensional data structure. There was also a three-dimensional structure known as Panel, with items, major_axis, and minor_axis, but it has been deprecated and removed in recent versions of pandas.
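
A short sketch contrasting the two current structures; the column names and values are illustrative:

import pandas as pd

s = pd.Series([1, 2, 3])                   # 1-D: a single labelled column
df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [4.0, 5.0, 6.0]})  # 2-D: labelled rows and columns
print(s.ndim, df.ndim)   # 1 2
print(df["a"])           # each DataFrame column is itself a Series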



5. Explain Reindexing in pandas?
Ans: Re-indexing means conforming a DataFrame to a new index with optional filling logic, placing NA/NaN in locations that had no value in the previous index. It changes the row labels and column labels of a DataFrame.
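
A small sketch of reindex() introducing NaN for labels that were missing from the original index:

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = s.reindex(["a", "c", "d"])   # 'd' was not in the old index
print(s2)
# a    10.0
# c    30.0
# d     NaN  <- filled with NaN because no previous value existed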



6. What are the key features of the pandas library?
Ans: There are various features in the pandas library; some of them are mentioned below:

 Data Alignment
 Memory Efficient
 Reshaping
 Merge and join
 Time Series


7. What is pandas Used For ?
Ans: This library is written for the Python programming language for performing operations like data
manipulation, data analysis, etc. The library provides various operations as well as data structures to
manipulate time series and numerical tables.




8. How can we create a copy of a Series in Pandas?
Ans: Use pandas.Series.copy:

Series.copy(deep=True)

This makes a deep copy, including a copy of the data and the indices. With deep=False, neither the indices nor the data are copied. Note that even when deep=True, actual Python objects stored in the Series are not copied recursively, only the references to them.
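
A brief illustration of a deep copy; the values are arbitrary, and the deep=False behaviour noted in the comment may differ under pandas' newer copy-on-write mode:

import pandas as pd

s = pd.Series([1, 2, 3])
deep = s.copy(deep=True)   # independent copy of data and index

s.iloc[0] = 99
print(s.iloc[0])      # 99
print(deep.iloc[0])   # 1 -- the deep copy is unaffected by the change
# With s.copy(deep=False) the copy shares data with s, so the change
# would typically be visible in both.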



9. What is Time Series in pandas?
Ans: A time series is an ordered sequence of data which represents how some quantity changes over time. pandas contains extensive capabilities and features for working with time series data in all domains.

pandas supports the following (a short sketch follows the list):

 Parsing time series information from various sources and formats
 Generating sequences of fixed-frequency dates and time spans
 Manipulating and converting datetimes with timezone information
 Resampling or converting a time series to a particular frequency
 Performing date and time arithmetic with absolute or relative time increments
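
A minimal sketch of two of these capabilities, fixed-frequency date generation and resampling; the values are synthetic:

import pandas as pd
import numpy as np

# Generate a fixed-frequency date range: 10 consecutive days
idx = pd.date_range("2022-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10), index=idx)

# Resample the daily series to weekly frequency, summing within each week
print(ts.resample("W").sum())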



10. What is pylab?
Ans: PyLab is a package that bundles NumPy, SciPy, and Matplotlib into a single namespace.
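
A tiny sketch of the PyLab style; note that explicit imports (import numpy as np, import matplotlib.pyplot as plt) are generally preferred in modern code:

from pylab import *   # pulls NumPy and Matplotlib names into one namespace

x = linspace(0, 2 * pi, 100)   # linspace and pi come from NumPy
plot(x, sin(x))                # plot comes from Matplotlib
show()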


Jupyter Notebook
1. What is Jupyter Notebook?
Jupyter Notebook is a web-based interactive computing platform that allows users to create and share
code, equations, visualizations, and narrative text. Jupyter Notebook is popular among data scientists
and engineers as it allows for rapid prototyping and iteration.

2. What are the main features of Jupyter Notebook?
Jupyter Notebook's main features include in-browser editing and execution of code; rich output (plots, tables, images, and formatted text) displayed inline with the code that produced it; Markdown cells for narrative text and equations; support for many language kernels; and shareable .ipynb documents. These features make it easy to mix code, output, and explanatory text in one place, which is why it is popular among data scientists, engineers, and educators.

3. How can you create a new notebook in Jupyter?
You can create a new notebook in Jupyter by clicking on the “New” button in the upper right corner
and selecting “Notebook” from the drop-down menu.

4. Can you explain what the data science workflow involves?
The data science workflow generally involves four main steps: data wrangling, exploratory data
analysis, modeling, and evaluation. Data wrangling is the process of cleaning and preparing data for analysis.
Exploratory data analysis is the process of exploring data to find patterns and relationships. Modeling
is the process of building models to make predictions or recommendations based on data. Evaluation is
the process of assessing the accuracy of models and using them to make decisions.
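
A compressed sketch of the four steps using pandas and scikit-learn; the file name, column names, and model choice are all hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data wrangling: load and clean (hypothetical file with numeric columns)
df = pd.read_csv("data.csv").dropna()

# 2. Exploratory data analysis: summary statistics and pairwise correlations
print(df.describe())
print(df.corr())

# 3. Modeling: fit a simple classifier on a train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["target"]), df["target"], test_size=0.2)
model = LogisticRegression().fit(X_train, y_train)

# 4. Evaluation: assess accuracy on the held-out test data
print(accuracy_score(y_test, model.predict(X_test)))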

5. What are some common use cases for Jupyter Notebook?
Jupyter Notebook is a popular tool for data scientists and analysts because it allows for an interactive coding experience. Jupyter Notebook is often used for exploratory data analysis and for visualizing data.

*******