KBN Pandas in python for Btech students.pptx

DATA ANALYTICS WITH PANDAS
Dr. C. Naga Raju, B.Tech (CSE), M.Tech (CSE), PhD (CSE), MIEEE, MCSI, MISTE
Associate Professor, Department of CSE
YSR Engineering College of YVU, Proddatur
Dr. C. Naga Raju, YSRCE of Yogi Vemana University, 9949218570
https://archive.ics.uci.edu/ml/datasets.php

INTRODUCTION TO PANDAS
Pandas is a high-level data manipulation library developed by Wes McKinney.
Pandas provides data analytics features similar to those of R and MATLAB.
Pandas is built on NumPy and works closely with SciPy and Matplotlib, so it can use features of these packages.
The key data structures of pandas are:
1) Series
2) DataFrame
A Series is a one-dimensional array-like object that contains data and labels (the index).
A DataFrame is a two-dimensional array-like object that stores data in rows and columns. Rows represent observations and columns represent variables.

Pandas Series: a Series is a one-dimensional object containing data and labels (an index). A Series can be created in different ways using the Series constructor.
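As a minimal sketch of the different ways a Series can be created, the block below builds one from a list, from a list with explicit labels, from a NumPy array, and from a dictionary. The variable names and values are illustrative assumptions, not taken from the slides.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns the default integer index 0, 1, 2, ...
s_list = pd.Series([10, 20, 30])

# From a list with explicit index labels
s_labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# From a NumPy array
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'x': 1, 'y': 2, 'z': 3})

print(s_labeled)
print(s_labeled.index.tolist())  # the labels
print(s_labeled.values)          # the underlying data as a NumPy array
```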

A single value can be selected from a Series with a single index label. Multiple values can be selected with a list of index labels.
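A short sketch of both kinds of selection follows. The Series contents are made-up illustration data.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# A single label returns a single value
one = s['a']

# A list of labels returns a new Series, in the order requested
several = s[['c', 'a', 'd']]

print(one)
print(several)
```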

A Series is like a fixed-length, ordered dictionary (dict). However, unlike dictionary keys, index labels do not have to be unique.
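The dictionary analogy can be sketched as follows: membership tests look at the index the way `in` looks at dict keys, while a repeated label returns every matching value. The data is hypothetical.

```python
import pandas as pd

# A Series built from a dict keeps the key order as its index
s = pd.Series({'ohio': 35000, 'texas': 71000, 'utah': 5000})
has_texas = 'texas' in s   # checks the index labels, like dict keys
has_iowa = 'iowa' in s

# Unlike dict keys, index labels may repeat
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
a_values = dup['a']                 # a Series holding both 'a' entries
is_unique = dup.index.is_unique

print(has_texas, has_iowa)
print(a_values)
print(is_unique)
```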

Series operations: filtering and NumPy-like operations on the data.
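A minimal sketch of these operations, using made-up values: a boolean condition filters the Series, arithmetic is applied element by element, and NumPy functions keep the index labels attached to the results.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Filtering with a boolean condition keeps only the matching entries
positive = s[s > 0]

# Arithmetic is element-wise, as with NumPy arrays
doubled = s * 2

# NumPy functions accept a Series and preserve its index
roots = np.sqrt(positive)

print(positive)
print(doubled)
print(roots)
```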

Pandas can accommodate incomplete data
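One way to see this, sketched with hypothetical data: asking for an index label that has no value produces NaN, and pandas provides methods to detect, fill, or drop such entries.

```python
import pandas as pd

sales = {'ohio': 35000, 'texas': 71000}

# 'utah' has no entry in the dict, so its value becomes NaN
s = pd.Series(sales, index=['ohio', 'texas', 'utah'])

missing = s.isnull()   # True where the value is missing
filled = s.fillna(0)   # replace missing values with 0
dropped = s.dropna()   # remove missing values
total = s.sum()        # aggregations skip NaN by default

print(s)
print(missing)
print(total)
```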

Unlike NumPy ndarrays, Series data is automatically aligned by index label in arithmetic operations.
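The difference can be sketched as follows, with illustrative values: two Series are added label by label, and labels present in only one of them yield NaN, whereas plain ndarrays are added purely by position.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched on labels; 'a' and 'd' appear in only one Series, so they become NaN
total = s1 + s2

# Plain ndarrays have no labels and are added position by position
arr = np.array([1, 2, 3]) + np.array([10, 20, 30])

print(total)
print(arr)
```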

DATA FRAMES
A DataFrame is a two-dimensional array-like object that stores data in rows and columns. Rows represent observations and columns represent variables.
It has both a row index and a column index.
It can also be considered a collection of Series, stored like a dictionary (dict) of columns.
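The "collection of Series" view can be sketched like this: a DataFrame is assembled from two Series that share an index, and each column read back out is again a Series. The column names and values are assumptions made for illustration.

```python
import pandas as pd

height = pd.Series([150, 160, 170], index=['r1', 'r2', 'r3'])
weight = pd.Series([50, 60, 70], index=['r1', 'r2', 'r3'])

# Dictionary keys become column labels; the shared Series index becomes the row index
df = pd.DataFrame({'height': height, 'weight': weight})

col = df['height']   # one column, returned as a Series
row = df.loc['r2']   # one row (observation), also returned as a Series

print(df)
print(type(col))
print(row)
```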

A DataFrame is created using the DataFrame constructor of pandas. A DataFrame can be created from a dictionary of equal-length lists.
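A minimal sketch of this construction, with made-up data: each dictionary key becomes a column, and the optional `columns` and `index` arguments control the column order and the row labels.

```python
import pandas as pd

data = {
    'state': ['ohio', 'ohio', 'texas', 'texas'],
    'year': [2000, 2001, 2001, 2002],
    'population': [1.5, 1.7, 3.6, 2.4],
}

# Each key becomes a column; rows get the default index 0, 1, 2, ...
df = pd.DataFrame(data)

# Column order and row labels can be given explicitly
df2 = pd.DataFrame(data,
                   columns=['year', 'state', 'population'],
                   index=['one', 'two', 'three', 'four'])

print(df)
print(df2)
```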

A DataFrame can be created from a dictionary of dictionaries. The outer keys become the column labels and the inner keys become the row index.
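A short sketch with hypothetical values: where an inner key is missing for a column, pandas fills the cell with NaN.

```python
import pandas as pd

nested = {
    'ohio': {2000: 1.5, 2001: 1.7},
    'texas': {2001: 3.6, 2002: 2.4},
}

# Outer keys -> columns, inner keys -> row labels (the union of all inner keys)
df = pd.DataFrame(nested)

print(df)
```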

Create a file in Excel with a given name, e.g. Book1.xlsx. Create a file in Notepad with a given name, e.g. abc.csv.
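Once such files exist, they can be loaded into a DataFrame. The sketch below assumes a small hypothetical abc.csv (the slide does not specify the file's contents), writes it to a temporary folder so the example is self-contained, and reads it back with `pd.read_csv`. Reading Book1.xlsx works the same way with `pd.read_excel`, which needs an Excel engine such as openpyxl installed, so that line is left commented out.

```python
import os
import tempfile
import pandas as pd

# Hypothetical file contents, standing in for a file typed in Notepad
csv_text = "name,age\njohn,23\nmary,78\n"

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'abc.csv')
with open(csv_path, 'w') as f:
    f.write(csv_text)

# The first line of the file is used as the column header
df_csv = pd.read_csv(csv_path)
print(df_csv)

# For an Excel workbook (requires an engine such as openpyxl):
# df_xlsx = pd.read_excel('Book1.xlsx')
```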

Working with the whole DataFrame

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series

df = pd.DataFrame(
    [[4, 7, 10, 20, 30],
     [5, 8, 11, 33, 34],
     [6, 9, 12, 23, 12],
     [4, 7, 10, 20, 30],
     [4, 7, 10, 20, 30],
     [4, 7, 10, 20, 30]],
    index=[1, 2, 3, 4, 5, 6],
    columns=['a', 'b', 'c', 'd', 'e'])
print('dataframe\n', df)
# df.info() prints its report itself and returns None, so it is not wrapped in print()
print('information about dataframe')
df.info()

Output:

dataframe
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
information about dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 1 to 6
Data columns (total 5 columns):
a    6 non-null int64
b    6 non-null int64
c    6 non-null int64
d    6 non-null int64
e    6 non-null int64
dtypes: int64(5)
memory usage: 288.0 bytes

n = 2
dfh = df.head(n)            # first n rows
print('head\n', dfh)
dft = df.tail(n)            # last n rows
print('tail\n', dft)
dfs = df.describe()         # summary statistics per column
print('describe\n', dfs)
top_left_corner_df = df.iloc[:5, :5]   # first 5 rows and 5 columns by position
print('top_left_corner_df\n', top_left_corner_df)
dfT = df.T                  # transpose rows and columns
print('transpose\n', dfT)

Output:

head
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
tail
   a  b   c   d   e
5  4  7  10  20  30
6  4  7  10  20  30
describe
             a        b         c          d          e
count  6.00000  6.00000   6.00000   6.000000   6.000000
mean   4.50000  7.50000  10.50000  22.666667  27.666667
std    0.83666  0.83666   0.83666   5.202563   7.840068
min    4.00000  7.00000  10.00000  20.000000  12.000000
25%    4.00000  7.00000  10.00000  20.000000  30.000000
50%    4.00000  7.00000  10.00000  20.000000  30.000000
75%    4.75000  7.75000  10.75000  22.250000  30.000000
max    6.00000  9.00000  12.00000  33.000000  34.000000
top_left_corner_df
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
transpose
    1   2   3   4   5   6
a   4   5   6   4   4   4
b   7   8   9   7   7   7
c  10  11  12  10  10  10
d  20  33  23  20  20  20
e  30  34  12  30  30  30

idx = df.columns             # get column index
print('Column index\n', idx)
label = df.columns[0]        # first column label
print('Column Label\n', label)
lst = df.columns.tolist()    # get as a list
print('Column as List\n', lst)
s = df['a']                  # select column to Series
print('col to Series\n', s)
s = df[['a']]                # select column to DataFrame
print('col to df\n', s)
s = df[['a', 'b']]           # select 2 or more columns
print('select 2 or more columns\n', s)

Output:

Column index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Column Label
a
Column as List
['a', 'b', 'c', 'd', 'e']
col to Series
1    4
2    5
3    6
4    4
5    4
6    4
Name: a, dtype: int64
col to df
   a
1  4
2  5
3  6
4  4
5  4
6  4
select 2 or more columns
   a  b
1  4  7
2  5  8
3  6  9
4  4  7
5  4  7
6  4  7

s = df[['c', 'a', 'b']]      # change order
print('change order of columns\n', s)
s = df[df.columns[1]]        # select by number
print('select by number\n', s)
f = df.columns[[0, 3, 4]]    # column names by position
print('Column name by number\n', f)
s = df.pop('c')              # removes column 'c' from df and returns it
print('Deleting a column\n', df)
idx = df.index               # get row index
print('Row index\n', idx)

Output:

change order of columns
    c  a  b
1  10  4  7
2  11  5  8
3  12  6  9
4  10  4  7
5  10  4  7
6  10  4  7
select by number
1    7
2    8
3    9
4    7
5    7
6    7
Name: b, dtype: int64
Column name by number
Index(['a', 'd', 'e'], dtype='object')
Deleting a column
   a  b   d   e
1  4  7  20  30
2  5  8  33  34
3  6  9  23  12
4  4  7  20  30
5  4  7  20  30
6  4  7  20  30
Row index
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

label = df.index[0]                    # first row label
print('Row Label\n', label)
lst = df.index.tolist()                # get as a list
print('Index as List\n', lst)
df.sort_index(inplace=True)            # sort by row label, in place
df = df.sort_index(ascending=False)    # sort by row label, descending
print('Sorting by row\n', df)

Output:

Row Label
1
Index as List
[1, 2, 3, 4, 5, 6]
Sorting by row
   a  b   d   e
6  4  7  20  30
5  4  7  20  30
4  4  7  20  30
3  6  9  23  12
2  5  8  33  34
1  4  7  20  30

s = df.dtypes                # data type of each column
print('serial col data type\n', s)
b = df.empty                 # True if the DataFrame has no elements
print('true for empty data type: ', b)
i = df.ndim                  # number of dimensions
print('\nno of dimensions: ', i)
(r, c) = df.shape            # (rows, columns)
print('\nno of rows and cols: ', (r, c))
i = df.size                  # rows x columns
print('\nsize: ', i)
a = df.values                # underlying NumPy array
print('\nvalues\n', a)
dfc = df.copy()              # independent copy
print('copy\n', dfc)
dfr = df.rank()              # rank within each column; ties get the average rank
print('rank\n', dfr)

Output:

serial col data type
a    int64
b    int64
c    int64
d    int64
e    int64
dtype: object
true for empty data type:  False

no of dimensions:  2

no of rows and cols:  (6, 5)

size:  30

values
[[ 4  7 10 20 30]
 [ 5  8 11 33 34]
 [ 6  9 12 23 12]
 [ 4  7 10 20 30]
 [ 4  7 10 20 30]
 [ 4  7 10 20 30]]
copy
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
rank
     a    b    c    d    e
1  2.5  2.5  2.5  2.5  3.5
2  5.0  5.0  5.0  6.0  6.0
3  6.0  6.0  6.0  5.0  1.0
4  2.5  2.5  2.5  2.5  3.5
5  2.5  2.5  2.5  2.5  3.5
6  2.5  2.5  2.5  2.5  3.5

dfab = df.abs()              # absolute values
print('Absolute\n', dfab)
dfad = df.add(1)             # add 1 to every element
print('Add\n', dfad)
s = df.count()               # non-missing values per column
print('count\n', s)
dfmax = df.cummax()          # running maximum down each column
print('cumulative max\n', dfmax)
dfmin = df.cummin()          # running minimum down each column
print('cumulative min\n', dfmin)

Output:

Absolute
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
Add
   a   b   c   d   e
1  5   8  11  21  31
2  6   9  12  34  35
3  7  10  13  24  13
4  5   8  11  21  31
5  5   8  11  21  31
6  5   8  11  21  31
count
a    6
b    6
c    6
d    6
e    6
dtype: int64
cumulative max
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  33  34
4  6  9  12  33  34
5  6  9  12  33  34
6  6  9  12  33  34
cumulative min
   a  b   c   d   e
1  4  7  10  20  30
2  4  7  10  20  30
3  4  7  10  20  12
4  4  7  10  20  12
5  4  7  10  20  12
6  4  7  10  20  12

dfcs = df.cumsum()           # running sum down each column
print('cumulative sum\n', dfcs)
dfpr = df.cumprod()          # running product down each column
print('cumulative product\n', dfpr)
dif = df.diff()              # difference from the previous row
print('list difference\n', dif)
div1 = df.div(2)             # divide every element by 2
print('division\n', div1)

Output:

cumulative sum
    a   b   c    d    e
1   4   7  10   20   30
2   9  15  21   53   64
3  15  24  33   76   76
4  19  31  43   96  106
5  23  38  53  116  136
6  27  45  63  136  166
cumulative product
      a       b        c          d          e
1     4       7       10         20         30
2    20      56      110        660       1020
3   120     504     1320      15180      12240
4   480    3528    13200     303600     367200
5  1920   24696   132000    6072000   11016000
6  7680  172872  1320000  121440000  330480000
list difference
     a    b    c     d     e
1  NaN  NaN  NaN   NaN   NaN
2  1.0  1.0  1.0  13.0   4.0
3  1.0  1.0  1.0 -10.0 -22.0
4 -2.0 -2.0 -2.0  -3.0  18.0
5  0.0  0.0  0.0   0.0   0.0
6  0.0  0.0  0.0   0.0   0.0
division
     a    b    c     d     e
1  2.0  3.5  5.0  10.0  15.0
2  2.5  4.0  5.5  16.5  17.0
3  3.0  4.5  6.0  11.5   6.0
4  2.0  3.5  5.0  10.0  15.0
5  2.0  3.5  5.0  10.0  15.0
6  2.0  3.5  5.0  10.0  15.0

s = df.max()                 # maximum of each column (default axis)
print('max of axis (col def)\n', s)
s = df.mean()                # mean of each column
print('mean (col default axis)\n', s)
s = df.median()              # median of each column
print('median (col default)\n', s)
s = df.min()                 # minimum of each column
print('min of axis (col def)\n', s)
mul = df.mul(1)              # multiply every element by 1
print('mul by df Series val\n', mul)
s = df.sum()                 # sum of each column
print('sum of axis\n', s)

Output:

max of axis (col def)
a     6
b     9
c    12
d    33
e    34
dtype: int64
mean (col default axis)
a     4.500000
b     7.500000
c    10.500000
d    22.666667
e    27.666667
dtype: float64
median (col default)
a     4.0
b     7.0
c    10.0
d    20.0
e    30.0
dtype: float64
min of axis (col def)
a     4
b     7
c    10
d    20
e    12
dtype: int64
mul by df Series val
   a  b   c   d   e
1  4  7  10  20  30
2  5  8  11  33  34
3  6  9  12  23  12
4  4  7  10  20  30
5  4  7  10  20  30
6  4  7  10  20  30
sum of axis
a     27
b     45
c     63
d    136
e    166
dtype: int64

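DataFrame filters for selection of rows and columns

dffi = df.filter(items=['a', 'b'])       # keep the named columns
print('Filter by col\n', dffi)
dfrow = df.filter(items=[2], axis=0)     # keep the row labelled 2
print('filter by row\n', dfrow)
# like= does plain substring matching, not SQL-style patterns, so '%a%' matches
# no column name and the result is empty; like='a' would select column 'a'
dfin = df.filter(like='%a%')
print('Filter in col\n', dfin)

Output:

Filter by col
   a  b
1  4  7
2  5  8
3  6  9
4  4  7
5  4  7
6  4  7
filter by row
   a  b   c   d   e
2  5  8  11  33  34
Filter in col
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6]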

Basic Statistics

s = df['a'].describe()       # summary statistics of column a
print('describe col a\n', s)
cor = df.corr()              # pairwise correlation of columns
print('correlation\n', cor)
cov = df.cov()               # pairwise covariance of columns
print('covariance\n', cov)
kur = df.kurt()              # kurtosis of each column
print('kurtosis\n', kur)

Output:

describe col a
count    6.00000
mean     4.50000
std      0.83666
min      4.00000
25%      4.00000
50%      4.00000
75%      4.75000
max      6.00000
Name: a, dtype: float64
correlation
          a         b         c         d         e
a  1.000000  1.000000  1.000000  0.505424 -0.762257
b  1.000000  1.000000  1.000000  0.505424 -0.762257
c  1.000000  1.000000  1.000000  0.505424 -0.762257
d  0.505424  0.505424  0.505424  1.000000  0.173252
e -0.762257 -0.762257 -0.762257  0.173252  1.000000
covariance
     a    b    c          d          e
a  0.7  0.7  0.7   2.200000  -5.000000
b  0.7  0.7  0.7   2.200000  -5.000000
c  0.7  0.7  0.7   2.200000  -5.000000
d  2.2  2.2  2.2  27.066667   7.066667
e -5.0 -5.0 -5.0   7.066667  61.466667
kurtosis
a    1.428571
b    1.428571
c    1.428571
d    4.837353
e    5.231624
dtype: float64

# The slide used df.mad(); that method was removed in pandas 2.0, and the
# expression below computes the same mean absolute deviation per column
mdev = (df - df.mean()).abs().mean()
print('mean absolute deviation\n', mdev)
serr = df.sem()              # standard error of the mean
print('standard error of mean\n', serr)
vaco = df.var()              # variance of each column
print('variance over cols\n', vaco)
s = df['a'].value_counts()   # how often each value occurs in column a
print('value count in col a\n', s)

Output:

mean absolute deviation
a    0.666667
b    0.666667
c    0.666667
d    3.555556
e    5.222222
dtype: float64
standard error of mean
a    0.341565
b    0.341565
c    0.341565
d    2.123938
e    3.200694
dtype: float64
variance over cols
a     0.700000
b     0.700000
c     0.700000
d    27.066667
e    61.466667
dtype: float64
value count in col a
4    4
6    1
5    1
Name: a, dtype: int64

Cross-tabulation (frequency count)

ct = pd.crosstab(index=df['a'], columns=df['b'])
print('Crosstab\n', ct)

Quantiles and ranking

quants = [0.05, 0.25, 0.5, 0.75, 0.95]
q = df.quantile(quants)
print('Quantile\n', q)
r = df.rank()
print('Rank\n', r)

Output:

Crosstab
b  7  8  9
a
4  4  0  0
5  0  1  0
6  0  0  1
Quantile
         a     b      c      d     e
0.05  4.00  7.00  10.00  20.00  16.5
0.25  4.00  7.00  10.00  20.00  30.0
0.50  4.00  7.00  10.00  20.00  30.0
0.75  4.75  7.75  10.75  22.25  30.0
0.95  5.75  8.75  11.75  30.50  33.0
Rank
     a    b    c    d    e
1  2.5  2.5  2.5  2.5  3.5
2  5.0  5.0  5.0  6.0  6.0
3  6.0  6.0  6.0  5.0  1.0
4  2.5  2.5  2.5  2.5  3.5
5  2.5  2.5  2.5  2.5  3.5
6  2.5  2.5  2.5  2.5  3.5

Working with strings

# assume that df['col'] is a Series of strings
df['col'] = 'niki'           # the same string is assigned to every row
s = df['col'].str.lower()
print('Lower\n', s)
s = df['col'].str.upper()
print('Upper\n', s)
s = df['col'].str.len()
print('Length\n', s)

Output:

Lower
1    niki
2    niki
3    niki
4    niki
5    niki
6    niki
Name: col, dtype: object
Upper
1    NIKI
2    NIKI
3    NIKI
4    NIKI
5    NIKI
6    NIKI
Name: col, dtype: object
Length
1    4
2    4
3    4
4    4
5    4
6    4
Name: col, dtype: int64

df['col'] += 'suffix'        # append a string to every value
print('Append\n', df['col'])
df['col'] = 'niki'
df['col'] *= 2               # repeat every string twice
print('duplicate\n', df['col'])
df['col'] = 'niki'
s = df['col'] + df['col']    # concatenate two string columns element-wise
print('concatenate\n', s)

Output:

Append
1    nikisuffix
2    nikisuffix
3    nikisuffix
4    nikisuffix
5    nikisuffix
6    nikisuffix
Name: col, dtype: object
duplicate
1    nikiniki
2    nikiniki
3    nikiniki
4    nikiniki
5    nikiniki
6    nikiniki
Name: col, dtype: object
concatenate
1    nikiniki
2    nikiniki
3    nikiniki
4    nikiniki
5    nikiniki
6    nikiniki
Name: col, dtype: object

Working with Columns

idx = df.columns             # column index
print('Column index\n', idx)
label = df.columns[0]        # first column label
print('Column Label\n', label)
lst = df.columns.tolist()    # column labels as a list
print('Column as List\n', lst)
s = df['a']                  # single brackets return a Series
print('col to Series\n', s)
s = df[['a']]                # double brackets return a DataFrame
print('col to df\n', s)

Output:

Column index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Column Label
a
Column as List
['a', 'b', 'c', 'd', 'e']
col to Series
1    4
2    5
3    6
4    4
5    4
6    4
Name: a, dtype: int64
col to df
   a
1  4
2  5
3  6
4  4
5  4
6  4

s = df[['a', 'b']]           # select 2 or more columns
print('select 2 or more columns\n', s)
s = df[['c', 'a', 'b']]      # change order of columns
print('change order of columns\n', s)
s = df[df.columns[1]]        # select a column by position
print('select by number\n', s)
f = df.columns[[0, 3, 4]]    # column names by position
print('Column name by number\n', f)
s = df.pop('c')              # removes column 'c' from df and returns it
print('Deleting a column\n', df)

Output:

select 2 or more columns
   a  b
1  4  7
2  5  8
3  6  9
4  4  7
5  4  7
6  4  7
change order of columns
    c  a  b
1  10  4  7
2  11  5  8
3  12  6  9
4  10  4  7
5  10  4  7
6  10  4  7
select by number
1    7
2    8
3    9
4    7
5    7
6    7
Name: b, dtype: int64
Column name by number
Index(['a', 'd', 'e'], dtype='object')
Deleting a column
   a  b   d   e
1  4  7  20  30
2  5  8  33  34
3  6  9  23  12
4  4  7  20  30
5  4  7  20  30
6  4  7  20  30

Working with rows

idx = df.index                         # row index
print('Row index\n', idx)
label = df.index[0]                    # first row label
print('Row Label\n', label)
lst = df.index.tolist()                # row labels as a list
print('Index as List\n', lst)
df.sort_index(inplace=True)            # sort by row label, in place
df = df.sort_index(ascending=False)    # sort by row label, descending
print('Sorting by row\n', df)

Output:

Row index
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
Row Label
1
Index as List
[1, 2, 3, 4, 5, 6]
Sorting by row
   a  b   c   d   e
6  4  7  10  20  30
5  4  7  10  20  30
4  4  7  10  20  30
3  6  9  12  23  12
2  5  8  11  33  34
1  4  7  10  20  30

import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
    'age': [23, 78, 22, 19, 45, 33, 20],
    'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
    'state': ['california', 'dc', 'california', 'dc', 'california', 'texas', 'texas'],
    'num_children': [2, 0, 0, 3, 2, 1, 4],
    'num_pets': [5, 1, 0, 5, 2, 2, 3]
})

Plot two DataFrame columns as a scatter plot

# a scatter plot comparing num_children and num_pets
import matplotlib.pyplot as plt
import pandas as pd

df.plot(kind='scatter', x='num_children', y='num_pets', color='red')
plt.show()

Plot column values as a bar plot

import matplotlib.pyplot as plt
import pandas as pd

df.plot(kind='bar', x='name', y='age')
plt.show()   # display the figure, as in the other plotting examples

Line plot with multiple columns

import matplotlib.pyplot as plt
import pandas as pd

ax = plt.gca()   # get the current axes so both lines are drawn on the same plot
df.plot(kind='line', x='name', y='num_children', ax=ax)
df.plot(kind='line', x='name', y='num_pets', color='red', ax=ax)
plt.show()

Bar plot with group by

import matplotlib.pyplot as plt
import pandas as pd

# number of distinct names per state, drawn as bars
df.groupby('state')['name'].nunique().plot(kind='bar')
plt.show()

Plot histogram of column values

import matplotlib.pyplot as plt
import pandas as pd

# histogram of ages with explicit bin edges; rwidth leaves a gap between bars
df[['age']].plot(kind='hist', bins=[0, 20, 40, 60, 80, 100], rwidth=0.8)
plt.show()
