XII IP Ch 2 Python Pandas - II DataFrame.pdf

wecoyi4681 377 views 62 slides Jul 22, 2024


Python Pandas -II

DataFrame
•It is a data structure that stores data in two-dimensional (tabular) form.
•Different columns may store values of different data types.
•All the values in a single column have the same data type.

•It has two indices: a row index (axis 0) and a column index (axis 1).
•The indices can be numeric or string.
•It is value mutable, i.e. we can change the values.
•It is size mutable, i.e. we can add/delete rows/columns.
•The row labels are specified using the index argument.
•The column labels are specified using the columns argument.
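
The properties listed above can be checked with a short sketch. The two-city table below is hypothetical sample data, used only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({'City': ['Delhi', 'Mumbai'], 'Maxtemp': [40, 29]},
                  index=['a', 'b'])     # row labels via index, column labels via the dict keys

print(df.dtypes)                # each column reports one data type of its own

df.loc['a', 'Maxtemp'] = 41     # value mutable: an existing value is changed
df['Mintemp'] = [32, 21]        # size mutable: a new column is added
print(df.shape)                 # (rows, columns) after the change
```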

Creating Empty DataFrame
import pandas as pd
df=pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []

Method 1 -Using a nested list
import pandas as pd
df= pd.DataFrame([['Delhi',40,32,24.1],
['Bengaluru',31,25,36.2],
['Chennai',35,27,40.8],
['Mumbai',29,21,35.2],
['Kolkata',39,23,41.8]],
index=[1,2,3,4,5],
columns=['City','Maxtemp','Mintemp','Rainfall'])
print(df)

Method 1 -Using a nested list
import pandas as pd
L = [['Delhi',40,32,24.1],
['Bengaluru',31,25,36.2],
['Chennai',35,27,40.8],
['Mumbai',29,21,35.2],
['Kolkata',39,23,41.8]]
df= pd.DataFrame(L , index=[1,2,3,4,5],
columns=['City','Maxtemp','Mintemp','Rainfall'])
print(df)

Method 2 -Using a Dictionary / Dictionary of Lists
import pandas as pd
df= pd.DataFrame(
{'City':['Delhi','Bengaluru','Chennai','Mumbai','Kolkata'],
'Maxtemp':[40,31,35,29,39],
'Mintemp':[32,25,27,21,23],
'Rainfall':[24.1, 36.2, 40.8, 35.2, 41.8]},
index=[1,2,3,4,5])
print(df)

Method 3 -Using a Nested Dictionary
import pandas as pd
df= pd.DataFrame(
{'City':{1:'Delhi',2:'Bengaluru',3:'Chennai',4:'Mumbai',5:'Kolkata'},
'Maxtemp':{1:40, 2:31, 3:35, 4:29, 5:39},
'Mintemp':{1:32, 2:25, 3:27, 4:21, 5:23},
'Rainfall':{1:24.1, 2:36.2, 3:40.8, 4:35.2, 5:41.8}})
print(df)

Method 4 -Using List of Dictionaries
import pandas as pd
df= pd.DataFrame(
[{'City':'Delhi', 'Maxtemp':40, 'Mintemp':32, 'Rainfall':24.1},
{'City':'Bengaluru', 'Maxtemp':31, 'Mintemp':25, 'Rainfall':36.2},
{'City':'Chennai', 'Maxtemp':35, 'Mintemp':27, 'Rainfall':40.8},
{'City':'Mumbai', 'Maxtemp':29, 'Mintemp':21, 'Rainfall':35.2},
{'City':'Kolkata', 'Maxtemp':39, 'Mintemp':23, 'Rainfall':41.8}],
index=[1,2,3,4,5])
print(df)

Method 5 -Using Series Objects
import pandas as pd
A = pd.Series(['Delhi','Bengaluru','Chennai','Mumbai','Kolkata'],
index=[1,2,3,4,5])
B = pd.Series([40,31,35,29,39],index=[1,2,3,4,5])
C = pd.Series([32,25,27,21,23],index=[1,2,3,4,5])
D = pd.Series([24.1,36.2,40.8,35.2,41.8],index=[1,2,3,4,5])
df= pd.DataFrame({'City':A, 'Maxtemp':B, 'Mintemp':C, 'Rainfall':D})
print(df)

Creating DataFrame from 2D NumPy Array
import numpy as np
import pandas as pd
A=np.array([[10,20,30],[40,50,60],[70,80,90]])
D=pd.DataFrame(A)
print(D)
    0   1   2
0  10  20  30
1  40  50  60
2  70  80  90

Attributes of DataFrame
Attribute   Description
index       The index (row labels)
columns     The column labels
axes        A list of both the axes: axis 0 (the index) and axis 1 (the columns)
values      Values in the DataFrame
dtypes      The data type of each column
size        Number of elements
shape       A tuple representing the dimensions
ndim        Number of dimensions
empty       True/False (whether the DataFrame is empty or not)
T           Transposes the index and columns

Application of Attributes
D.index Int64Index([1, 2, 3, 4, 5], dtype='int64')
D.columns Index(['City', 'Maxtemp', 'Mintemp', 'Rainfall'], dtype='object')
D.axes [Int64Index([1, 2, 3, 4, 5], dtype='int64'),
Index(['City', 'Maxtemp', 'Mintemp', 'Rainfall'],dtype='object')]
D.values array([['Delhi', 40, 32, 24.1],
['Bengaluru', 31, 25, 36.2],
['Chennai', 35, 27, 40.8],
['Mumbai', 29, 21, 35.2],
['Kolkata', 39, 23, 41.8]], dtype=object)

D.dtypes City object
Maxtemp int64
Mintemp int64
Rainfall float64
dtype: object
D.size 20
D.shape (5,4)
D.ndim 2
D.empty False
D.T

Getting number of rows in a DataFrame
•The len() function can be used to find the number of rows in a DataFrame.
print(len(D))
5

Indexing
•Indexing in pandas means simply selecting
particular rows and columns of data from a
DataFrame.
•Indexing could mean selecting all the rows and
some of the columns, some of the rows and all of
the columns, or some of each of the rows and
columns.

Selecting a Column
print(df.Rainfall)
OR
print(df['Rainfall'])

Selecting multiple Columns
To display multiple columns, we need to use
double square brackets.
print(df[['City','Rainfall','Maxtemp']])

Selecting a Row
print(df.loc[2])

Selecting multiple Rows
To display multiple rows, we need to use double
square brackets, or a range can be specified.
print(df.loc[[2,4,5]])
print(df.loc[2:4])

Obtaining a Subset using Row/Column names
We use loc to obtain a subset in the following format:
df.loc[row, col]
Here, row/col can be an individual value, a range (both end labels are included) or a list.

>>> print(df.loc[3,'Mintemp'])
27
>>> print(df.loc[2,'City'])
Bengaluru
>>> print(df.loc[3:5 ,'Mintemp'])
3 27
4 21
5 23
Name: Mintemp, dtype: int64

>>> print(df.loc[3,'City':'Mintemp'])
City       Chennai
Maxtemp         35
Mintemp         27
Name: 3, dtype: object
>>> print(df.loc[3:5,'City':'Mintemp'])
      City  Maxtemp  Mintemp
3  Chennai       35       27
4   Mumbai       29       21
5  Kolkata       39       23

>>> print(df.loc[[1,4],'Mintemp'])
1 32
4 21
Name: Mintemp, dtype: int64
>>> print(df.loc[2,['Maxtemp','Rainfall']])
Maxtemp       31
Rainfall    36.2
Name: 2, dtype: object
>>> print(df.loc[[1,2,4],['Maxtemp','Rainfall']])
   Maxtemp  Rainfall
1       40      24.1
2       31      36.2
4       29      35.2

>>> print(df.loc[1:4,['Maxtemp','Rainfall']])
   Maxtemp  Rainfall
1       40      24.1
2       31      36.2
3       35      40.8
4       29      35.2
>>> print(df.loc[:,['Maxtemp','Rainfall']])
   Maxtemp  Rainfall
1       40      24.1
2       31      36.2
3       35      40.8
4       29      35.2
5       39      41.8

Obtaining a Subset using in-built indexes
We use iloc to obtain a subset using the in-built (positional) indexes, which start at 0.
>>> df.iloc[0,2]
32
>>> df.iloc[0:3, 0:2]
        City  Maxtemp
1      Delhi       40
2  Bengaluru       31
3    Chennai       35

Accessing Individual Value
For accessing an individual value, we can also use at in place of loc, and iat in place of iloc.
>>> df.loc[3,'Maxtemp']    # OR
>>> df.at[3,'Maxtemp']
35
>>> df.iloc[0,2]           # OR
>>> df.iat[0,2]
32

Boolean Indexing
•If a DataFrame has boolean values (True and False) as its row indexes, accessing its rows is called Boolean Indexing.
•The rows of such a DataFrame can be accessed using loc, as we do in any other DataFrame.
•1 and 0 can also be used to represent the boolean values True and False respectively.

import pandas as pd
# named data rather than dict, so that the built-in dict type is not shadowed
data = {'name':["aparna", "pankaj", "sudhir", "Girish"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score':[90, 40, 80, 98]}
df= pd.DataFrame(data, index = [True, False, True, False])
print(df)

print(df.loc[True])

Modifying the data in a DataFrame
The values in a DataFrame can be modified by the same
method as we access the values.
>>> df['Mintemp']=[33,27,29,22,20]
>>> print(df)
        City  Maxtemp  Mintemp  Rainfall
1      Delhi       40       33      24.1
2  Bengaluru       31       27      36.2
3    Chennai       35       29      40.8
4     Mumbai       29       22      35.2
5    Kolkata       39       20      41.8

>>> df.loc[4] =['Mumbai',33,20,35]
>>> print(df)
        City  Maxtemp  Mintemp  Rainfall
1      Delhi       40       33      24.1
2  Bengaluru       31       27      36.2
3    Chennai       35       29      40.8
4     Mumbai       33       20      35.0
5    Kolkata       39       20      41.8
>>> df.loc[5,'Maxtemp']=42
>>> print(df)
        City  Maxtemp  Mintemp  Rainfall
1      Delhi       40       33      24.1
2  Bengaluru       31       27      36.2
3    Chennai       35       29      40.8
4     Mumbai       33       20      35.0
5    Kolkata       42       20      41.8

Adding data in a DataFrame
If the column name or row index specified already exists in the DataFrame, the assignment modifies the existing values.
If the column name or row index specified does not exist in the DataFrame, it is added as a new column/row.

>>> df['Humidity']=[30,40,55,38,60]
>>> print(df)
        City  Maxtemp  Mintemp  Rainfall  Humidity
1      Delhi       40       33      24.1        30
2  Bengaluru       31       27      36.2        40
3    Chennai       35       29      40.8        55
4     Mumbai       33       20      35.0        38
5    Kolkata       42       20      41.8        60
>>> df['Avgtemp'] = (df['Maxtemp']+df['Mintemp'])/2
>>> print(df)
        City  Maxtemp  Mintemp  Rainfall  Humidity  Avgtemp
1      Delhi       40       33      24.1        30     36.5
2  Bengaluru       31       27      36.2        40     29.0
3    Chennai       35       29      40.8        55     32.0
4     Mumbai       33       20      35.0        38     26.5
5    Kolkata       42       20      41.8        60     31.0

>>> df.loc[6] = ['Jaipur',48,26,12.0,16,37]
>>> print(df)
        City  Maxtemp  Mintemp  Rainfall  Humidity  Avgtemp
1      Delhi       40       33      24.1        30     36.5
2  Bengaluru       31       27      36.2        40     29.0
3    Chennai       35       29      40.8        55     32.0
4     Mumbai       33       20      35.0        38     26.5
5    Kolkata       42       20      41.8        60     31.0
6     Jaipur       48       26      12.0        16     37.0

Deleting Rows
To remove rows from the DataFrame, we use the function drop().
It returns a new DataFrame without the row index mentioned in the drop() function; the original DataFrame is not changed.
To remove the row permanently, the parameter inplace=True has to be mentioned.

df.drop(3) OR
df.drop(3, axis=0) OR
df.drop([3], axis=0) OR
df.drop(index=3)
To remove the row permanently,
df.drop(3, inplace=True)
print(df)
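
The difference between the returned copy and the original can be seen in a minimal sketch. The three-row table and the name smaller are illustrative assumptions, not part of the slides.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration
df = pd.DataFrame({'City': ['Delhi', 'Bengaluru', 'Chennai'],
                   'Maxtemp': [40, 31, 35]}, index=[1, 2, 3])

smaller = df.drop(3)            # returns a new DataFrame without row 3
print(len(df), len(smaller))    # the original still has all of its rows

df = df.drop(3)                 # reassigning is an alternative to inplace=True
print(list(df.index))
```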

Deleting Columns
To remove columns from the DataFrame, we use the function drop() / pop() / the del command.
To remove a column with drop(), we need to specify the parameter axis=1. It returns a new DataFrame without the column mentioned; the original DataFrame is not changed.
To remove the column permanently, the parameter inplace=True has to be mentioned.

df.drop('Humidity', axis=1)
df.drop(['Humidity'], axis=1)
df.drop(columns = 'Humidity')
To remove the column permanently,
df.drop('Humidity', axis=1, inplace=True)

The pop() function or the del command can also
be used to remove column permanently from the
DataFrame.
df.pop('Humidity')
print(df)
OR
del df['Humidity']
print(df)
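
A small sketch of how pop() and del behave. The sample table and the name humidity are assumptions made for illustration. pop() hands back the removed column as a Series, while del removes the column without returning anything.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({'City': ['Delhi', 'Mumbai'],
                   'Humidity': [30, 38],
                   'Maxtemp': [40, 29]})

humidity = df.pop('Humidity')   # removes the column and returns it as a Series
print(list(humidity))

del df['Maxtemp']               # removes the column, nothing is returned
print(list(df.columns))
```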

Renaming the Row indexes / Column headings
New indexes/column headings can be specified using the attributes index and columns.
The rename() function can also be used to rename existing indices/column labels in a DataFrame.
The old and new index/column labels are to be provided in the form of a dictionary, where the keys are the old index/column labels and the values are the new names for the same.
To make the changes permanent, inplace=True needs to be used.

Using attributes
df.index= ['A','B','C','D','E']
df.columns= ['P','Q','R','S','T']
print(df)

Renaming Rows
df.rename({1:'A', 2:'B', 6:'E'})
df.rename({1:'A', 2:'B', 6:'E'}, axis=0)
df.rename(index={1:'A', 2:'B', 6:'E'})
# To make the changes permanent
df.rename({1:'A', 2:'B', 6:'E'}, axis=0, inplace=True)

Renaming Columns
df.rename({'Maxtemp':'High', 'Mintemp':'Low'}, axis=1)
df.rename(columns={'Maxtemp':'High', 'Mintemp':'Low'})
# to make changes permanent in the DataFrame
df.rename({'Maxtemp':'High', 'Mintemp':'Low'}, axis=1,
inplace=True)

To change the index column
To change the index column we can use the
function set_index()
To change the index back to the default indexes
(0,1,2…) we use the function reset_index()
To make the changes permanent, inplace=True
needs to be used.

df.set_index('City')
# To make the changes permanent
df.set_index('City', inplace=True)
print(df)

df.reset_index()
# To make the changes permanent
df.reset_index(inplace=True)
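
A minimal sketch of both functions on a hypothetical three-city table (the names by_city and restored are illustrative). Note that reset_index() brings back the default 0, 1, 2… labels, not the labels the DataFrame had before set_index().

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({'City': ['Delhi', 'Bengaluru', 'Chennai'],
                   'Maxtemp': [40, 31, 35]}, index=[1, 2, 3])

by_city = df.set_index('City')              # 'City' becomes the row index
print(by_city.loc['Chennai', 'Maxtemp'])    # rows can now be looked up by city name

restored = by_city.reset_index()            # 'City' is an ordinary column again
print(list(restored.columns))
print(list(restored.index))                 # default positional labels
```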

Iterating over a DataFrame
•To iterate over horizontal subsets, row wise
for i in df.iterrows():
    print(i)
•To iterate over vertical subsets, column wise
for i in df.items():    # iteritems() was removed in pandas 2.0; items() replaces it
    print(i)
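
Each item produced by these loops is a pair, so it is often unpacked directly. The sketch below assumes a hypothetical two-row table and uses items(), the replacement for iteritems() in recent pandas versions.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({'City': ['Delhi', 'Mumbai'], 'Maxtemp': [40, 29]},
                  index=[1, 2])

# Row wise: each item is (row label, Series holding that row)
for label, row in df.iterrows():
    print(label, row['City'], row['Maxtemp'])

# Column wise: each item is (column name, Series holding that column)
for name, column in df.items():
    print(name, list(column))
```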

Binary Operations in a DataFrame
•Operations requiring two values are called binary operations.
•In a binary operation, the data from the two DataFrames are aligned. For each matching row and column index the given operation is performed, and for each non-matching index NaN is stored as the result.

df1=pd.DataFrame(
[[10,20,30],[40,50,60],[70,80,90]],
index=[1,2,3],
columns=['A','B','C'])
print(df1)
df2=pd.DataFrame([[1,2],[3,4]],
index=[1,2],
columns=['A','B'])
print(df2)
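
Adding the two DataFrames defined above shows the alignment in practice. This is a sketch; the names total and filled are illustrative, and add() with fill_value is one possible way to avoid NaN where only one of the two DataFrames has a value.

```python
import pandas as pd

df1 = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]],
                   index=[1, 2, 3], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2], [3, 4]],
                   index=[1, 2], columns=['A', 'B'])

# Row 3 and column C exist only in df1, so those positions become NaN
total = df1 + df2
print(total)

# add() with fill_value=0 treats a value missing from one DataFrame as 0
filled = df1.add(df2, fill_value=0)
print(filled)
```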

Statistics with Pandas
1. min()
It is used to find the minimum value from a DataFrame.
2. max()
It is used to find the maximum value from a DataFrame.

DF.max() DF.min()                  # column wise (default, axis=0)
DF.max(axis=1) DF.min(axis=1)      # row wise

3. mean()
It is used to find mean (average)
DF.mean() DF.mean(axis=1)

4. count()
count() can be used to find the number of non-NA
values along the rows / columns.
D.count()                    # non-NA values in each column
OR D.count(0)
OR D.count(axis=0)
OR D.count(axis='index')
D.count(1)                   # non-NA values in each row
OR D.count(axis=1)
OR D.count(axis='columns')
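
The effect of missing values on count() can be sketched with a hypothetical table that contains NaN. The data and the names per_column and per_row are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values, used only for illustration
D = pd.DataFrame({'Maxtemp': [40, np.nan, 35],
                  'Rainfall': [24.1, 36.2, np.nan]}, index=[1, 2, 3])

per_column = D.count()        # non-NA values in each column
per_row = D.count(axis=1)     # non-NA values in each row
print(per_column)
print(per_row)
```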

5. sum()
It is used to find the sum of values.
DF.sum() DF.sum(axis=1)

Applying functions on particular row/column
•Particular columns
DF[2016].min()
•Particular rows
DF.loc['Qtr1'].min()
•Particular subset
DF.loc['Qtr3':'Qtr4',2018:2019].count()
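
The slides do not show the contents of DF. The sketch below assumes a hypothetical table with quarters as row labels and years as column labels, with made-up figures, so that the calls above can be run.

```python
import pandas as pd

# Hypothetical quarterly figures; the numbers are made up for illustration
DF = pd.DataFrame({2016: [10, 20, 30, 40],
                   2017: [15, 25, 35, 45],
                   2018: [12, 22, 32, 42],
                   2019: [18, 28, 38, 48]},
                  index=['Qtr1', 'Qtr2', 'Qtr3', 'Qtr4'])

col_min = DF[2016].min()                                  # minimum of one column
row_min = DF.loc['Qtr1'].min()                            # minimum of one row
subset_count = DF.loc['Qtr3':'Qtr4', 2018:2019].count()   # count per column of the subset
row_sum = DF.sum(axis=1)                                  # total of each quarter
col_mean = DF.mean()                                      # average of each year

print(col_min, row_min)
print(subset_count)
print(row_sum)
print(col_mean)
```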

Sorting
Sorting means arranging the contents in ascending
or descending order.
The default sort order is ascending.
To arrange the data in descending order, add the
argument ascending=False

import pandas as pd
d = {'Name':['Sachin','Dhoni','Virat','Rohit','Shikhar'],
'Age':[26,25,25,24,31], 'Score':[87,67,89,55,47]}
df= pd.DataFrame(d)
print("DataFrame contents without sorting")
print (df)

df.sort_values('Score')
df.sort_values(by=['Age', 'Score'],ascending=[True,False])
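
A sketch of what these calls return for the DataFrame defined above (the names by_score and by_age_score are illustrative). sort_values() returns a new sorted DataFrame and leaves df itself unchanged.

```python
import pandas as pd

d = {'Name': ['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar'],
     'Age': [26, 25, 25, 24, 31], 'Score': [87, 67, 89, 55, 47]}
df = pd.DataFrame(d)

# Highest score first
by_score = df.sort_values('Score', ascending=False)
print(by_score)

# Age ascending; players of the same age are ordered by Score descending
by_age_score = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(by_age_score)

print(list(df['Name']))   # the original order is unchanged
```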

Head and Tail functions
•The head() function returns the first n rows
and tail() function returns the last n rows.
•If n is not specified, the default value is 5.
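
A minimal sketch of both functions on a hypothetical ten-row DataFrame. The column name n and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ten-row DataFrame, used only for illustration
df = pd.DataFrame({'n': range(1, 11)})

first_rows = df.head()     # n not specified, so the first 5 rows
last_rows = df.tail(3)     # the last 3 rows
print(first_rows)
print(last_rows)
```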