Pandas in Python.pptx

AamnaRaza1 · 59 slides · Mar 03, 2025


Slide Content

Pandas in Python

Data manipulation package in Python for tabular data. Pandas' functionality ranges from data transformations, like sorting rows and taking subsets, to calculating summary statistics such as the mean, reshaping DataFrames, and joining DataFrames together. Very suitable for Data Analytics and Data Science.

The name comes from "Panel Data". Pandas is an open-source Python library. To install in Python: pip install pandas

Uses of pandas:
- Import datasets from databases, spreadsheets, comma-separated values (CSV) files, and more.
- Clean datasets, for example, by dealing with missing values.
- Tidy datasets by reshaping their structure into a suitable format for analysis.
- Aggregate data by calculating summary statistics such as the mean of columns, the correlation between them, and more.

- Visualize datasets and uncover insights.

pandas also contains functionality for time series analysis and for analyzing text data.

Pandas deals with the following three data structures:
- Series
- DataFrame
- Panel
(Note: Panel has been removed from modern versions of pandas; Series and DataFrame are the two structures in everyday use.)

A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.
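A minimal sketch of this relationship (the column names are illustrative, borrowed from the purchases example a few slides below):

import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

col = df['apples']   # selecting a single column gives back a Series
print(type(col))     # <class 'pandas.core.series.Series'>
print(type(df))      # <class 'pandas.core.frame.DataFrame'>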

Series

A pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A pandas Series is essentially a single column of an Excel sheet. A Series can be created by loading a dataset from existing storage (a SQL database, a CSV file, an Excel file), or from lists, dictionaries, scalar values, etc.

Creating a Series

import pandas as pd
import numpy as np

# an empty Series; dtype is given explicitly because a bare pd.Series() warns in some pandas versions
ser = pd.Series(dtype=object)
print(ser)

# a Series from a simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
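As a complementary sketch of the other creation routes mentioned above (the values here are illustrative):

# from a dictionary: the keys become the index
ser = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(ser)

# from a scalar value: an index must be supplied, and the scalar is repeated for each label
ser = pd.Series(5, index=[0, 1, 2])
print(ser)

# from a plain list
ser = pd.Series([10, 20, 30])
print(ser)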

DataFrame

A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns.

Creating DataFrames

# creating a DataFrame from a dictionary
data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data)
purchases

Note: each (key, value) item in data corresponds to a column in the resulting DataFrame.

Providing a customized index

The index, if not specified, defaults to 0, 1, 2, ... In case we want to provide an index:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases

Importing data in pandas

Importing CSV files: use read_csv() with the path to the CSV file to read a comma-separated values file.

Example:
df = pd.read_csv("diabetes.csv")

Note: CSVs don't carry an index the way DataFrames do, so to use one of the file's columns as the index, designate index_col when reading:
df = pd.read_csv("diabetes.csv", index_col=0)

If the file is not present in the current working directory, give the full path:
df = pd.read_csv(r'C:\Users\vivek\Documents\SRCC\Python/' + 'diabetes.csv')

More information on the read_csv() function (Important): pandas read_csv() Tutorial: Importing Data | DataCamp

Importing text files

The separator argument (sep) refers to the symbol used to separate the values within each row of the file. Comma (sep=","), whitespace (sep="\s"), tab (sep="\t"), and colon (sep=":") are the commonly used separators.

df = pd.read_csv("diabetes.txt", sep="\s")

Importing Excel files

df = pd.read_excel('diabetes.xlsx')
df = pd.read_excel('diabetes_multi.xlsx', sheet_name=1)

Outputting a DataFrame

df.to_csv("diabetes_out.csv", index=False)

The arguments include the filename (with path) and index, where index=True writes the DataFrame's index out as a column. Similarly, JSON and Excel files can be imported and exported.
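A short sketch of the analogous writers (the output filenames are illustrative; to_excel needs an engine such as openpyxl installed):

df.to_excel("diabetes_out.xlsx", index=False)
df.to_json("diabetes_out.json")
df = pd.read_json("diabetes_out.json")   # reading JSON back in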

Viewing and analysing DataFrames

The first few and last few rows of a DataFrame can be read using the head() and tail() functions:

df.head()
df.tail()
df.head(n=**)
df.tail(n=**)

[** Specify the value of n according to choice]

The .describe() method prints the summary statistics of all numeric columns, such as the count, mean, standard deviation, range, and quartiles.

df.describe()

Modifying what describe() reports

The reported percentiles can be modified using the percentiles argument. Here, for example, we're looking at the 30th, 50th, and 70th percentiles of the numeric columns in DataFrame df.

df.describe(percentiles=[0.3, 0.5, 0.7])

What is the output?

df.describe(include=[int])   # summarising columns with integer data types only
df.describe(exclude=[int])   # excluding columns with int types; summary stats of non-integer columns only

Transpose of the DataFrame

df.T
df.describe().T

The info() method

The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame. Here, we're setting the show_counts argument to True, which gives a count of the non-missing values in each column. We're also setting memory_usage to True, which shows the total memory usage of the DataFrame elements. When verbose is set to True, the full summary from .info() is printed.

df.info(show_counts=True, memory_usage=True, verbose=True)

Getting the structure of the DataFrame

df.shape      # Get the number of rows and columns; the output is a tuple (rows, columns)
df.shape[0]   # Get the number of rows only
df.shape[1]   # Get the number of columns only

Fetching the columns and column names

df.columns
list(df.columns)   # What does this statement do on execution?

Slicing and extracting data in pandas

df['Outcome']   # isolating a single column using square brackets [] with a column name inside

What is the datatype/class of the result of the code above?

df[['Pregnancies', 'Outcome']]   # isolating more than one column from a DataFrame

Extracting rows from a DataFrame

df[df.index == 1]

A single row can be fetched by passing in a Boolean series with one True value. In the example above, the second row, with index = 1, is returned. Here, .index returns the row labels of the DataFrame, and the comparison turns them into a one-dimensional Boolean array.

What will df.index output?

Extracting more than one row

df[df.index.isin(range(2, 10))]

Using .loc and .iloc to fetch rows and columns in a DataFrame

dataFrame.loc[<ROWS RANGE>, <COLUMNS RANGE>]

The rows or columns range can also be ':', in which case all entries are included for the corresponding rows or columns.

Note: .loc[] uses a label to point to a row, column, or cell, whereas .iloc[] uses the numeric position.
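A small sketch contrasting the two, on a toy DataFrame with non-default row labels (the names and columns are made up for illustration):

import pandas as pd

scores = pd.DataFrame({'maths': [90, 75, 60], 'physics': [85, 80, 70]},
                      index=['amy', 'bob', 'cara'])

print(scores.loc['bob', 'maths'])   # by label: 75
print(scores.iloc[1, 0])            # by position: the same cell, 75
print(scores.loc[:, 'maths'])       # ':' selects all rows
print(scores.loc['amy':'bob', :])   # .loc label slices include both endpoints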

Creating a copy of the DataFrame

df2 = df.copy()

This is done so as not to affect the original DataFrame; operations are performed on the copy instead.

Demonstrating loc and iloc

df2.index = range(1, 769)   # modifying the index of the df2 created above
df2.loc[1]
df2.iloc[1]

The 1 in .loc[] represents the row index (label), whereas the 1 in .iloc[] is the row position (first row).

Fetching multiple rows

df2.loc[100:110]

df2.iloc[100:110]

Getting a subset of rows

df2.loc[[100, 200, 300]]

Try df2.iloc[[100, 200, 300]]

Selecting specific columns along with rows

df2.loc[100:110, ['Pregnancies', 'Glucose', 'BloodPressure']]

What does df2.iloc[100:110, :3] output?

Note: this is where .iloc[] differs from .loc[]: it requires the column position, not column labels.

Practice

df2.loc[760:, ['Pregnancies', 'Glucose', 'BloodPressure']]   # passing the starting index of a row as a range

What is the output? df2.iloc[760:, :3]

Conditional slicing

df[df.BloodPressure == 122]
df[df.Outcome == 1]
df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]

The last line fetches Pregnancies, Glucose, and BloodPressure for all records with BloodPressure greater than 100.

Updating the value of a column

df2.loc[df2['Age'] == 81, 'Age'] = 80

The Boolean mask selects the rows where the value of the 'Age' column is 81, and the statement sets the 'Age' column to 80 for those rows. Note that the mask is built from df2 itself: writing df['Age'] == 81 here would fail, because df2 was re-indexed earlier and a Boolean mask must align with the index of the DataFrame being updated.

Isolating rows based on a condition in a DataFrame (recap):

df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]

Cleaning data using pandas

Checking for missing values. Let's create a copy of the original DataFrame and populate it with null values:

df3 = df.copy()
df3.loc[2:5, "Pregnancies"] = None   # populating 4 rows (labels 2 to 5; .loc slices are inclusive) of the Pregnancies column with NaN values
df3.head(7)

Checking for null values

df3.isnull().head(7)
df3.isnull().sum()   # getting the total number of null values in each column

The Pregnancies column has 4 null values in total; the rest of the columns have none.

What does this give as output? df3.isnull().sum().sum()

Dropping missing values

The df3 DataFrame contains 4 NaN values. Let's create a df4 DataFrame as df4 = df3.copy(). Check the shape of df3, then check the shape of df4 after executing the statement df4 = df4.dropna(). The df4 DataFrame will contain 4 fewer rows after dropping the NaN values.

Another way of dropping missing values

1. df4.dropna(inplace=True, axis=1)     # axis=0 drops rows, axis=1 drops columns
2. df4.dropna(inplace=True, how="all")  # how="all" drops a row (or column, with axis=1) only when all of its values are missing
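A tiny demonstration of the how argument on a made-up frame (row 2 below is entirely missing, row 1 only partially):

import pandas as pd
import numpy as np

tmp = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [4, 5, np.nan]})

print(tmp.dropna(how='all'))   # drops only row 2, where every value is NaN
print(tmp.dropna(how='any'))   # the default: drops every row containing any NaN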

Dropping duplicates

The drop_duplicates() function is used to remove duplicate rows.
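A minimal sketch of drop_duplicates() on a hypothetical frame with a repeated row:

import pandas as pd

dup = pd.DataFrame({'name': ['ann', 'bob', 'ann'],
                    'city': ['pune', 'agra', 'pune']})

print(dup.drop_duplicates())                 # keeps the first occurrence of each row
print(dup.drop_duplicates(keep='last'))      # keep the last occurrence instead
print(dup.drop_duplicates(subset=['name']))  # consider only the 'name' column when comparing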

Data analysis in pandas

df.mean()
df.median()
df.mode()

Creating new columns based on existing columns

df1 = df.copy()   # create a copy of the DataFrame df
df1['Glucose_Insulin_Ratio'] = df1['Glucose'] / df1['Insulin']
df1.head()

Working with categorical values

Category values can be counted using the .value_counts() method. Here, for example, we are counting the number of observations where Outcome is diabetic (1) and the number of observations where Outcome is non-diabetic (0).

df['Outcome'].value_counts()

Applying .value_counts() on a subset of columns

df.value_counts(subset=['Pregnancies', 'Outcome'])

Aggregating data with .groupby() in pandas

pandas lets you aggregate values by grouping them by specific column values. You can do that by combining the .groupby() method with a summary method of your choice. The code below displays the mean of each of the numeric columns grouped by Outcome.

df.groupby('Outcome').mean()

Another example:

df.groupby(['Pregnancies', 'Outcome']).mean()
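For several summary statistics at once, .agg() can be combined with .groupby(); a hedged sketch using columns from the diabetes dataset:

df.groupby('Outcome')['Glucose'].agg(['mean', 'median', 'count'])

# different statistics for different columns
df.groupby('Outcome').agg({'Glucose': 'mean', 'Insulin': 'median'})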

Suggested read (Important): Python pandas tutorial: The ultimate guide for beginners | DataCamp