KBN Pandas in python for Btech students.pptx

gandhamcharan2006 5 views 59 slides Sep 10, 2025

About This Presentation

KBN pandas for BTech students


Slide Content

DATA ANALYTICS WITH PANDAS
Dr. C. Naga Raju, B.Tech (CSE), M.Tech (CSE), PhD (CSE), MIEEE, MCSI, MISTE
Associate Professor, Department of CSE
YSR Engineering College of YVU, Proddatur
Dr. C. NAGARAJU, YSRCE OF YOGI VEMANA UNIVERSITY, 9949218570
https://archive.ics.uci.edu/ml/datasets.php

INTRODUCTION TO PANDAS
Pandas is a high-level data manipulation library developed by Wes McKinney. It provides data analysis features comparable to those of R and MATLAB. Pandas is built on NumPy and works closely with SciPy and Matplotlib, so it can use the features of these packages.
The key data structures of pandas are:
1) Series
2) DataFrame
A Series is a one-dimensional, array-like object that holds data together with labels (the index). A DataFrame is a two-dimensional, table-like object that stores data in rows and columns: rows represent observations and columns represent variables.

Pandas Series: a Series is a one-dimensional object containing data and labels (an index). A Series can be created in several ways using the pd.Series constructor.
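A minimal sketch of the creation options mentioned above: a Series built from a list, from a list with explicit labels, and from a dictionary. The values and labels are illustrative and do not come from the slides.

```python
import pandas as pd

# From a plain list: pandas assigns the default integer index 0, 1, 2, ...
s1 = pd.Series([4, 7, -5, 3])

# From a list with explicit labels (illustrative labels)
s2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# From a dictionary: keys become the index, values become the data
s3 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000})

print(s1)
print(s2)
print(s3)
```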

A single value is selected from a Series with a single index label. Multiple values are selected with a list of index labels.
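The two kinds of selection can be sketched as follows, assuming a small illustrative Series (the data is not from the slides).

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

single = s['a']               # one label -> one scalar value
several = s[['c', 'a', 'd']]  # list of labels -> a new Series, in the requested order

print(single)
print(several)
```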

A Series behaves like a fixed-length, ordered dictionary (dict). However, unlike dictionary keys, index items do not have to be unique.
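A short sketch of both points, using illustrative data: the index supports dictionary-style membership tests, and a repeated label is allowed.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'x'])  # label 'x' appears twice

has_y = 'y' in s            # membership test on the index, like a dict key lookup
dup = s['x']                # a duplicated label returns a Series with every match
unique = s.index.is_unique  # False here, because 'x' is repeated

print(has_y, unique)
print(dup)
```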

Series operations: filtering and NumPy-like operations on the data.
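As a sketch of what these operations might look like on an illustrative Series: boolean filtering, element-wise arithmetic, and a NumPy function applied to a Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]        # boolean filtering keeps the index labels
doubled = s * 2            # element-wise arithmetic, as with a NumPy array
roots = np.sqrt(positive)  # NumPy functions accept a Series and return a Series

print(positive)
print(doubled)
print(roots)
```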

Pandas can accommodate incomplete data: missing values are stored as NaN.
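A minimal sketch of incomplete data, assuming an illustrative dictionary that lacks one of the requested labels. The missing entry becomes NaN and can then be detected, filled or dropped.

```python
import pandas as pd

data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

# 'California' has no entry in data, so its value becomes NaN
s = pd.Series(data, index=states)

missing = s.isnull()  # True where a value is missing
filled = s.fillna(0)  # replace missing values with 0
dropped = s.dropna()  # or discard them

print(s)
print(missing)
```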

Unlike a NumPy ndarray, the data in a Series is automatically aligned by index label.
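Alignment can be illustrated by adding two Series whose labels only partly overlap. The data below is illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are matched by label, not by position.
# Labels present in only one operand ('a' and 'd') yield NaN.
total = s1 + s2
print(total)
```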

DATA FRAMES
A DataFrame is a two-dimensional, table-like object that stores data in rows and columns: rows represent observations and columns represent variables. It has both a row index and a column index. It can also be regarded as a collection of Series stored as a dictionary (dict).

A DataFrame is created with the DataFrame constructor of pandas. A DataFrame can be created from a dictionary of equal-length lists.
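A sketch of this construction with illustrative data: each dictionary key becomes a column label, and the optional `columns` and `index` arguments set the column order and the row labels.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Nevada'],
    'year': [2000, 2001, 2001],
    'pop': [1.5, 1.7, 2.4],
}

# Keys -> column labels; rows get the default index 0, 1, 2
frame = pd.DataFrame(data)

# Same data with an explicit column order and row labels
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three'])

print(frame)
print(frame2)
```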

A DataFrame can also be created from a dictionary of dictionaries.
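With a dictionary of dictionaries, the outer keys become columns and the inner keys become row labels. The sketch below uses illustrative data in which one inner key is absent from one column.

```python
import pandas as pd

nested = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Outer keys -> columns, inner keys -> row index; gaps become NaN
frame = pd.DataFrame(nested)
print(frame)
```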

Create a file in Excel with a given name, e.g. Book1.xlsx. Create a file in Notepad with a given name, e.g. abc.csv.
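Once such files exist they can be loaded into a DataFrame. The sketch below writes a small CSV into a temporary folder, standing in for abc.csv, and reads it back with `pd.read_csv`. The file contents are an assumption made for illustration. Reading Book1.xlsx with `pd.read_excel` works the same way but needs an Excel engine such as openpyxl installed, so that line is left commented out.

```python
import os
import tempfile
import pandas as pd

# Write a small CSV file, standing in for abc.csv typed in Notepad
csv_text = "name,age\njohn,23\nmary,78\n"
path = os.path.join(tempfile.mkdtemp(), 'abc.csv')
with open(path, 'w') as f:
    f.write(csv_text)

df_csv = pd.read_csv(path)  # the first line is used as the column header
print(df_csv)

# An Excel workbook is read in the same way (requires an engine such as openpyxl):
# df_xlsx = pd.read_excel('Book1.xlsx')
```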

Working with the whole DataFrame
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
df = pd.DataFrame(
    [[4, 7, 10, 20, 30], [5, 8, 11, 33, 34], [6, 9, 12, 23, 12],
     [4, 7, 10, 20, 30], [4, 7, 10, 20, 30], [4, 7, 10, 20, 30]],
    index=[1, 2, 3, 4, 5, 6], columns=['a', 'b', 'c', 'd', 'e'])
print('dataframe\n', df)
print('information about dataframe')
df.info()  # info() prints its own report and returns None, so it is not wrapped in print()
Output:
dataframe
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
information about dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 1 to 6
Data columns (total 5 columns):
a    6 non-null int64
b    6 non-null int64
c    6 non-null int64
d    6 non-null int64
e    6 non-null int64
dtypes: int64(5)
memory usage: 288.0 bytes

n = 2
dfh = df.head(n)                     # first n rows
print('head\n', dfh)
dft = df.tail(n)                     # last n rows
print('tail\n', dft)
dfs = df.describe()                  # summary statistics per column
print('describe\n', dfs)
top_left_corner_df = df.iloc[:5, :5] # first 5 rows and 5 columns by position
print('top_left_corner_df\n', top_left_corner_df)
dfT = df.T                           # transpose rows and columns
print('transpose\n', dfT)
Output:
head
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
tail
   a  b   c   d   e
5  4  7  10  20  30
6  4  7  10  20  30
describe
             a        b         c          d          e
count  6.00000  6.00000   6.00000   6.000000   6.000000
mean   4.50000  7.50000  10.50000  22.666667  27.666667
std    0.83666  0.83666   0.83666   5.202563   7.840068
min    4.00000  7.00000  10.00000  20.000000  12.000000
25%    4.00000  7.00000  10.00000  20.000000  30.000000
50%    4.00000  7.00000  10.00000  20.000000  30.000000
75%    4.75000  7.75000  10.75000  22.250000  30.000000
max    6.00000  9.00000  12.00000  33.000000  34.000000
top_left_corner_df
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
transpose
    1   2   3   4   5   6
a   4   5   6   4   4   4
b   7   8   9   7   7   7
c  10  11  12  10  10  10
d  20  33  23  20  20  20
e  30  34  12  30  30  30

idx = df.columns            # get col index
print('Column index\n', idx)
label = df.columns[0]       # 1st col label
print('Column Label\n', label)
lst = df.columns.tolist()   # get as a list
print('Column as List\n', lst)
s = df['a']                 # select col to Series
print('col to Series\n', s)
s = df[['a']]               # select col to df
print('col to df\n', s)
s = df[['a', 'b']]          # select 2 or more
print('select 2 or more columns\n', s)
Output:
Column index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Column Label
a
Column as List
['a', 'b', 'c', 'd', 'e']
col to Series
1    4
2    5
3    6
4    4
5    4
6    4
Name: a, dtype: int64
col to df
   a
1  4
2  5
3  6
4  4
5  4
6  4
select 2 or more columns
   a  b
1  4  7
2  5  8
3  6  9
4  4  7
5  4  7
6  4  7

s = df[['c', 'a', 'b']]     # change order
print('change order of columns\n', s)
s = df[df.columns[1]]       # select by number
print('select by number\n', s)
f = df.columns[[0, 3, 4]]
print('Column name by number\n', f)
s = df.pop('c')             # removes column 'c' from df and returns it
print('Deleting a column\n', df)
idx = df.index              # get row index
print('Row index\n', idx)
Output:
change order of columns
    c  a  b
1  10  4  7
2  11  5  8
3  12  6  9
4  10  4  7
5  10  4  7
6  10  4  7
select by number
1    7
2    8
3    9
4    7
5    7
6    7
Name: b, dtype: int64
Column name by number
Index(['a', 'd', 'e'], dtype='object')
Deleting a column
   a  b   d   e
1  4  7  20  30
2  5  8  33  34
3  6  9  23  12
4  4  7  20  30
5  4  7  20  30
6  4  7  20  30
Row index
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

label = df.index[0]                   # 1st row label
print('Row Label\n', label)
lst = df.index.tolist()               # get as a list
print('Index as List\n', lst)
df.sort_index(inplace=True)           # sort by row
df = df.sort_index(ascending=False)
print('Sorting by row\n', df)
Output:
Row Label
1
Index as List
[1, 2, 3, 4, 5, 6]
Sorting by row
   a  b   d   e
6  4  7  20  30
5  4  7  20  30
4  4  7  20  30
3  6  9  23  12
2  5  8  33  34
1  4  7  20  30

s = df.dtypes
print('serial col data type\n', s)
b = df.empty
print('true for empty data type: ', b)
i = df.ndim
print('\n no of dimensions: ', i)
(r, c) = df.shape
print('\n no of rows and cols: ', (r, c))
i = df.size
print('\n size: ', i)
a = df.values
print('\n values\n', a)
dfc = df.copy()
print('copy\n', dfc)
dfr = df.rank()
print('rank\n', dfr)
Output:
serial col data type
a    int64
b    int64
c    int64
d    int64
e    int64
dtype: object
true for empty data type: False
no of dimensions: 2
no of rows and cols: (6, 5)
size: 30
values
[[ 4  7 10 20 30]
 [ 5  8 11 33 34]
 [ 6  9 12 23 12]
 [ 4  7 10 20 30]
 [ 4  7 10 20 30]
 [ 4  7 10 20 30]]
copy
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
rank
     a    b    c    d    e
1  2.5  2.5  2.5  2.5  3.5
2  5.0  5.0  5.0  6.0  6.0
3  6.0  6.0  6.0  5.0  1.0
4  2.5  2.5  2.5  2.5  3.5
5  2.5  2.5  2.5  2.5  3.5
6  2.5  2.5  2.5  2.5  3.5

dfab = df.abs()
print('Absolute\n', dfab)
dfad = df.add(1)
print('Add\n', dfad)
s = df.count()
print('count\n', s)
dfmax = df.cummax()
print('cumulative max\n', dfmax)
dfmin = df.cummin()
print('cumulative min\n', dfmin)
Output:
Absolute
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
Add
   a   b   c   d   e
1  5   8  11  21  31
2  6   9  12  34  35
3  7  10  13  24  13
4  5   8  11  21  31
5  5   8  11  21  31
6  5   8  11  21  31
count
a    6
b    6
c    6
d    6
e    6
dtype: int64
cumulative max
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  33  34
4  6  9  12  33  34
5  6  9  12  33  34
6  6  9  12  33  34
cumulative min
   a  b   c   d   e
1  4  7  10  20  30
2  4  7  10  20  30
3  4  7  10  20  12
4  4  7  10  20  12
5  4  7  10  20  12
6  4  7  10  20  12

dfcs = df.cumsum()
print('cumulative sum\n', dfcs)
dfpr = df.cumprod()
print('cumulative product\n', dfpr)
dif = df.diff()      # difference between each row and the previous row
print('list difference\n', dif)
div1 = df.div(2)
print('division\n', div1)
Output:
cumulative sum
    a   b   c    d    e
1   4   7  10   20   30
2   9  15  21   53   64
3  15  24  33   76   76
4  19  31  43   96  106
5  23  38  53  116  136
6  27  45  63  136  166
cumulative product
      a       b        c          d          e
1     4       7       10         20         30
2    20      56      110        660       1020
3   120     504     1320      15180      12240
4   480    3528    13200     303600     367200
5  1920   24696   132000    6072000   11016000
6  7680  172872  1320000  121440000  330480000
list difference
     a    b    c     d     e
1  NaN  NaN  NaN   NaN   NaN
2  1.0  1.0  1.0  13.0   4.0
3  1.0  1.0  1.0 -10.0 -22.0
4 -2.0 -2.0 -2.0  -3.0  18.0
5  0.0  0.0  0.0   0.0   0.0
6  0.0  0.0  0.0   0.0   0.0
division
     a    b    c     d     e
1  2.0  3.5  5.0  10.0  15.0
2  2.5  4.0  5.5  16.5  17.0
3  3.0  4.5  6.0  11.5   6.0
4  2.0  3.5  5.0  10.0  15.0
5  2.0  3.5  5.0  10.0  15.0
6  2.0  3.5  5.0  10.0  15.0

s = df.max()
print('max of axis (col def)\n', s)
s = df.mean()
print('mean (col default axis)\n', s)
s = df.median()
print('median (col default)\n', s)
s = df.min()
print('min of axis (col def)\n', s)
mul = df.mul(1)
print('mul by df Series val\n', mul)
s = df.sum()
print('sum of axis\n', s)
Output:
max of axis (col def)
a     6
b     9
c    12
d    33
e    34
dtype: int64
mean (col default axis)
a     4.500000
b     7.500000
c    10.500000
d    22.666667
e    27.666667
dtype: float64
median (col default)
a     4.0
b     7.0
c    10.0
d    20.0
e    30.0
dtype: float64
min of axis (col def)
a     4
b     7
c    10
d    20
e    12
dtype: int64
mul by df Series val
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
sum of axis
a     27
b     45
c     63
d    136
e    166
dtype: int64

DataFrame filters for selection of rows and columns
dffi = df.filter(items=['a', 'b'])
print('Filter by col\n', dffi)
dfrow = df.filter(items=[2], axis=0)
print('filter by row\n', dfrow)
dfin = df.filter(like='%a%')   # like does plain substring matching, so the literal '%a%' matches no column
print('Filter in col\n', dfin)
Output:
Filter by col
   a  b
1  4  7
2  5  8
3  6  9
4  4  7
5  4  7
6  4  7
filter by row
   a  b   c   d   e
2  5  8  11  33  34
Filter in col
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6]
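The empty result for `like='%a%'` arises because `like` looks for its argument as a literal substring of each label; it does not use SQL-style wildcards. A sketch on a reduced, illustrative frame shows the literal pattern next to a plain substring and a regular expression.

```python
import pandas as pd

df = pd.DataFrame([[4, 7, 10], [5, 8, 11]], index=[1, 2], columns=['a', 'b', 'c'])

no_match = df.filter(like='%a%')      # searches for the literal text '%a%': no column matches
by_substring = df.filter(like='a')    # columns whose label contains 'a'
by_regex = df.filter(regex='^[ab]$')  # columns whose label matches the regular expression

print(no_match)
print(by_substring)
print(by_regex)
```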

Basic Statistics
s = df['a'].describe()
print('describe col a\n', s)
cor = df.corr()
print('correlation\n', cor)
cov = df.cov()
print('covariance\n', cov)
kur = df.kurt()
print('kurtosis\n', kur)
Output:
describe col a
count    6.00000
mean     4.50000
std      0.83666
min      4.00000
25%      4.00000
50%      4.00000
75%      4.75000
max      6.00000
Name: a, dtype: float64
correlation
          a         b         c         d         e
a  1.000000  1.000000  1.000000  0.505424 -0.762257
b  1.000000  1.000000  1.000000  0.505424 -0.762257
c  1.000000  1.000000  1.000000  0.505424 -0.762257
d  0.505424  0.505424  0.505424  1.000000  0.173252
e -0.762257 -0.762257 -0.762257  0.173252  1.000000
covariance
     a    b    c          d          e
a  0.7  0.7  0.7   2.200000  -5.000000
b  0.7  0.7  0.7   2.200000  -5.000000
c  0.7  0.7  0.7   2.200000  -5.000000
d  2.2  2.2  2.2  27.066667   7.066667
e -5.0 -5.0 -5.0   7.066667  61.466667
kurtosis
a    1.428571
b    1.428571
c    1.428571
d    4.837353
e    5.231624
dtype: float64

mdev = df.mad()   # DataFrame.mad() was removed in pandas 2.0; this call needs an older pandas
print('mean absolute deviation\n', mdev)
serr = df.sem()
print('standard error of mean\n', serr)
vaco = df.var()
print('variance over cols\n', vaco)
s = df['a'].value_counts()
print('value count in col a\n', s)
Output:
mean absolute deviation
a    0.666667
b    0.666667
c    0.666667
d    3.555556
e    5.222222
dtype: float64
standard error of mean
a    0.341565
b    0.341565
c    0.341565
d    2.123938
e    3.200694
dtype: float64
variance over cols
a     0.700000
b     0.700000
c     0.700000
d    27.066667
e    61.466667
dtype: float64
value count in col a
4    4
6    1
5    1
Name: a, dtype: int64
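On pandas 2.0 and later, where `DataFrame.mad()` is no longer available, the mean absolute deviation can be computed directly from its definition. The sketch below is one possible replacement, not the slide author's code; it uses columns a and d of the example frame.

```python
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 6, 4, 4, 4], 'd': [20, 33, 23, 20, 20, 20]})

# Mean absolute deviation: average distance of each value from its column mean
mdev = (df - df.mean()).abs().mean()
print('mean absolute deviation\n', mdev)
```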

Cross-tabulation (frequency count)
ct = pd.crosstab(index=df['a'], columns=df['b'])
print('Crosstab\n', ct)
Quantiles and ranking
quants = [0.05, 0.25, 0.5, 0.75, 0.95]
q = df.quantile(quants)
print('Quantile\n', q)
r = df.rank()
print('Rank\n', r)
Output:
Crosstab
b  7  8  9
a
4  4  0  0
5  0  1  0
6  0  0  1
Quantile
         a     b      c      d     e
0.05  4.00  7.00  10.00  20.00  16.5
0.25  4.00  7.00  10.00  20.00  30.0
0.50  4.00  7.00  10.00  20.00  30.0
0.75  4.75  7.75  10.75  22.25  30.0
0.95  5.75  8.75  11.75  30.50  33.0
Rank
     a    b    c    d    e
1  2.5  2.5  2.5  2.5  3.5
2  5.0  5.0  5.0  6.0  6.0
3  6.0  6.0  6.0  5.0  1.0
4  2.5  2.5  2.5  2.5  3.5
5  2.5  2.5  2.5  2.5  3.5
6  2.5  2.5  2.5  2.5  3.5

Working with strings
# assume that df['col'] is a Series of strings
df['col'] = 'niki'
s = df['col'].str.lower()
print('Lower\n', s)
s = df['col'].str.upper()
print('Upper\n', s)
s = df['col'].str.len()
print('Length\n', s)
Output:
Lower
1    niki
2    niki
3    niki
4    niki
5    niki
6    niki
Name: col, dtype: object
Upper
1    NIKI
2    NIKI
3    NIKI
4    NIKI
5    NIKI
6    NIKI
Name: col, dtype: object
Length
1    4
2    4
3    4
4    4
5    4
6    4
Name: col, dtype: int64

df['col'] += 'suffix'            # append a string to every value
print('Append\n', df['col'])
df['col'] = 'niki'
df['col'] *= 2                   # repeat every string twice
print('duplicate\n', df['col'])
df['col'] = 'niki'
s = df['col'] + df['col']        # element-wise concatenation
print('concatenate\n', s)
Output:
Append
1    nikisuffix
2    nikisuffix
3    nikisuffix
4    nikisuffix
5    nikisuffix
6    nikisuffix
Name: col, dtype: object
duplicate
1    nikiniki
2    nikiniki
3    nikiniki
4    nikiniki
5    nikiniki
6    nikiniki
Name: col, dtype: object
concatenate
1    nikiniki
2    nikiniki
3    nikiniki
4    nikiniki
5    nikiniki
6    nikiniki
Name: col, dtype: object

Working with Columns
idx = df.columns
print('Column index\n', idx)
label = df.columns[0]
print('Column Label\n', label)
lst = df.columns.tolist()
print('Column as List\n', lst)
s = df['a']
print('col to Series\n', s)
s = df[['a']]
print('col to df\n', s)
Output:
Column index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Column Label
a
Column as List
['a', 'b', 'c', 'd', 'e']
col to Series
1    4
2    5
3    6
4    4
5    4
6    4
Name: a, dtype: int64
col to df
   a
1  4
2  5
3  6
4  4
5  4
6  4

s = df[['a', 'b']]
print('select 2 or more columns\n', s)
s = df[['c', 'a', 'b']]
print('change order of columns\n', s)
s = df[df.columns[1]]
print('select by number\n', s)
f = df.columns[[0, 3, 4]]
print('Column name by number\n', f)
s = df.pop('c')   # removes column 'c' from df and returns it
print('Deleting a column\n', df)
Output:
select 2 or more columns
   a  b
1  4  7
2  5  8
3  6  9
4  4  7
5  4  7
6  4  7
change order of columns
    c  a  b
1  10  4  7
2  11  5  8
3  12  6  9
4  10  4  7
5  10  4  7
6  10  4  7
select by number
1    7
2    8
3    9
4    7
5    7
6    7
Name: b, dtype: int64
Column name by number
Index(['a', 'd', 'e'], dtype='object')
Deleting a column
   a  b   d   e
1  4  7  20  30
2  5  8  33  34
3  6  9  23  12
4  4  7  20  30
5  4  7  20  30
6  4  7  20  30

Working with rows
idx = df.index
print('Row index\n', idx)
label = df.index[0]
print('Row Label\n', label)
lst = df.index.tolist()
print('Index as List\n', lst)
df.sort_index(inplace=True)
df = df.sort_index(ascending=False)
print('Sorting by row\n', df)
Output:
Row index
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
Row Label
1
Index as List
[1, 2, 3, 4, 5, 6]
Sorting by row
   a  b   c   d   e
6  4  7  10  20  30
5  4  7  10  20  30
4  4  7  10  20  30
3  6  9  12  23  12
2  5  8  11  33  34
1  4  7  10  20  30

import pandas as pd
df = pd.DataFrame({
    'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
    'age': [23, 78, 22, 19, 45, 33, 20],
    'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
    'state': ['california', 'dc', 'california', 'dc', 'california', 'texas', 'texas'],
    'num_children': [2, 0, 0, 3, 2, 1, 4],
    'num_pets': [5, 1, 0, 5, 2, 2, 3]
})

Plot two DataFrame columns as a scatter plot
# a scatter plot comparing num_children and num_pets
import matplotlib.pyplot as plt
import pandas as pd
df.plot(kind='scatter', x='num_children', y='num_pets', color='red')
plt.show()

Plot column values as a bar plot
import matplotlib.pyplot as plt
import pandas as pd
df.plot(kind='bar', x='name', y='age')
plt.show()

Line plot with multiple columns
import matplotlib.pyplot as plt
import pandas as pd
ax = plt.gca()   # get the current axes so that both lines are drawn on the same plot
df.plot(kind='line', x='name', y='num_children', ax=ax)
df.plot(kind='line', x='name', y='num_pets', color='red', ax=ax)
plt.show()

Bar plot with group by
import matplotlib.pyplot as plt
import pandas as pd
# number of distinct names per state, drawn as bars
df.groupby('state')['name'].nunique().plot(kind='bar')
plt.show()

Plot histogram of column values
import matplotlib.pyplot as plt
import pandas as pd
df[['age']].plot(kind='hist', bins=[0, 20, 40, 60, 80, 100], rwidth=0.8)
plt.show()