Data manipulation package in Python for tabular data. Pandas' functionality ranges from data transformations, such as sorting rows and taking subsets, to calculating summary statistics such as the mean, reshaping DataFrames, and joining DataFrames together. Very suitable for Data Analytics and Data Science.
The name comes from "Panel Data". Open-source Python library. To install in Python: pip install pandas
Use of pandas: Import datasets from databases, spreadsheets, comma-separated values (CSV) files, and more. Clean datasets, for example, by dealing with missing values. Tidy datasets by reshaping their structure into a suitable format for analysis. Aggregate data by calculating summary statistics such as the mean of columns, the correlation between them, and more.
Visualize datasets and uncover insights. pandas also contains functionality for time series analysis and analyzing text data.
Pandas traditionally dealt with the following three data structures − Series, DataFrame, and Panel. (Panel has been deprecated and removed in modern versions of pandas; Series and DataFrame are the two structures in everyday use.)
A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.
Series: A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). Axis labels are collectively called the index. A Pandas Series is essentially a single column of an Excel sheet. A Series can be created by loading data from existing storage − a SQL database, a CSV file, or an Excel file − or from a list, a dictionary, a scalar value, etc.
Creating a series
import pandas as pd
import numpy as np
ser = pd.Series(dtype=object)  # empty Series (dtype given to avoid a warning)
print(ser)
# Series from a simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
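The other construction routes mentioned above − from a dictionary and from a scalar − can be sketched like this (a minimal illustration with made-up values, not from the slides):

```python
import pandas as pd

# From a dictionary: the keys become the index labels
s1 = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From a scalar: the value is repeated once per index label
s2 = pd.Series(5, index=['x', 'y', 'z'])

print(s1)
print(s2)
```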
Data Frame: A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
Creating dataframes
data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}
purchases = pd.DataFrame(data)  # creating a DataFrame from a dictionary
purchases
Note: Each (key, value) item in data corresponds to a column in the resulting DataFrame.
Providing a customized index: if not specified, the index defaults to 0, 1, 2, … In case we want to provide our own index:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases
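The difference between the default index and a custom one can be seen directly (a small sketch reusing the purchases data from above):

```python
import pandas as pd

data = {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}

default = pd.DataFrame(data)  # index defaults to 0, 1, 2, 3
custom = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

print(list(default.index))
print(list(custom.index))
```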
Importing data in pandas. Importing CSV files: use read_csv() with the path to the CSV file to read a comma-separated values file. Example: df = pd.read_csv("diabetes.csv"). Note: CSVs don't have indexes like our DataFrames, so all we need to do is designate the index_col when reading: df = pd.read_csv("diabetes.csv", index_col=0). df = pd.read_csv(r'C:\Users\vivek\Documents\SRCC\Python/' + 'diabetes.csv') [to be done if the file is not present in the current working directory]
More information on the read_csv() function (Important): pandas read_csv() Tutorial: Importing Data | DataCamp
Importing text files: the separator (sep) argument specifies the symbol used to separate the values within each row of the file. Comma (sep=","), whitespace (sep="\s+", a regular expression matching one or more whitespace characters), tab (sep="\t"), and colon (sep=":") are the commonly used separators. df = pd.read_csv("diabetes.txt", sep="\s+")
Outputting a dataframe: df.to_csv("diabetes_out.csv", index=False). The arguments include the filename (with path) and index, where index=True writes the DataFrame's index as a column. Similarly, JSON and Excel files can be imported and exported.
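A quick way to see the export/import pair work together, without touching the file system, is a round trip through an in-memory string (a sketch with made-up data; to_csv returns the CSV text when no path is given):

```python
import io
import pandas as pd

df_small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# to_csv with no path returns the CSV text as a string
csv_text = df_small.to_csv(index=False)

# read it back as if it were a file on disk
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```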
Viewing and analysing dataframes: the first few and last few rows of a dataframe can be read using the head() and tail() functions. df.head() df.tail() df.head(n=**) df.tail(n=**) [**specify the value of n according to choice]
The .describe() method prints summary statistics of all numeric columns, such as the count, mean, standard deviation, range, and quartiles. df.describe()
Modifying what we want to read: the describe() function can also report custom percentiles via the percentiles argument. Here, for example, we're looking at the 30th, 50th, and 70th percentiles of the numeric columns in DataFrame df. df.describe(percentiles=[0.3, 0.5, 0.7]) What is the output?
df.describe(include=[int]) summarizes columns with integer data types only. df.describe(exclude=[int]) excludes columns having int types, giving summary stats of the non-integer columns only.
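The effect of include/exclude is easiest to see on a tiny mixed-type frame (toy data of my own, not the diabetes dataset):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'n': [1, 2, 3, 4],            # integer column
    'x': [1.0, 2.0, 3.0, 4.0],    # float column
    's': ['a', 'b', 'a', 'b'],    # object (string) column
})

# include=[int] keeps only the integer column
int_only = df_toy.describe(include=[int])

# exclude=[int] summarizes everything else; object columns
# get count/unique/top/freq instead of numeric statistics
non_int = df_toy.describe(exclude=[int])

print(int_only.columns.tolist())
print(non_int.columns.tolist())
```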
Transpose of the DataFrame: df.T df.describe().T
The info() method: .info() is a quick way to look at the data types, missing values, and data size of a DataFrame. Here, we're setting the show_counts argument to True, which gives the count of non-missing values in each column. We're also setting memory_usage to True, which shows the total memory usage of the DataFrame elements. When verbose is set to True, it prints the full summary from .info(). df.info(show_counts=True, memory_usage=True, verbose=True)
Getting the structure of the dataframe
df.shape     # get the number of rows and columns; output is a tuple (rows, columns)
df.shape[0]  # get the number of rows only
df.shape[1]  # get the number of columns only
Fetching the columns and column names: df.columns list(df.columns) What does this statement do on execution?
Slicing and Extracting data in Pandas. df['Outcome'] isolates a single column using square brackets [ ] with a column name in them. What is the datatype/class of the result of the code above? df[['Pregnancies', 'Outcome']] # isolating more than one column from a dataframe
Extracting rows from a dataframe: df[df.index == 1]. A single row can be fetched by passing in a boolean series with one True value. In the example above, the second row, with index = 1, is returned. Here, .index returns the row labels of the DataFrame, and the comparison turns that into a boolean one-dimensional array. What will df.index output?
Extracting more than one row: df[df.index.isin(range(2, 10))]
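A small sketch of the same boolean-mask idea on toy data (my own values, not the diabetes dataset):

```python
import pandas as pd

df_toy = pd.DataFrame({'v': [10, 20, 30, 40, 50]})  # default index 0..4

# df_toy.index.isin(range(2, 4)) is a boolean array that is True
# for the rows whose label is 2 or 3
subset = df_toy[df_toy.index.isin(range(2, 4))]
print(subset)
```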
Using .loc and .iloc to fetch rows and columns in a dataframe: dataframe.loc[<ROWS RANGE>, <COLUMNS RANGE>]. The rows or columns range can also be ':', and if ':' is given in the rows or columns range parameter then all entries will be included for the corresponding rows or columns. Note: .loc[] uses a label to point to a row, column or cell, whereas .iloc[] uses the numeric position.
Creating a copy of the dataframe: df2 = df.copy(). This is done so as not to affect the original dataframe; the operations are performed on the copy instead.
Demonstrating loc and iloc
df2.index = range(1, 769)  # modifying the index of df2 so labels start at 1
df2.loc[1]
df2.iloc[1]
The 1 in .loc[] is the row label (here, the first row, since the index now starts at 1), whereas the 1 in .iloc[] is the row position, which is the second row because positions start at 0.
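The label-versus-position distinction can be reproduced on a three-row toy frame (my own data, mirroring the index shift above):

```python
import pandas as pd

# Index labels start at 1, like df2 after reindexing
df_toy = pd.DataFrame({'v': [10, 20, 30]}, index=[1, 2, 3])

print(df_toy.loc[1]['v'])   # label 1  -> first row
print(df_toy.iloc[1]['v'])  # position 1 -> second row
```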
Fetching multiple rows df2.loc[100:110]
df2.iloc[100:110]
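Note that the two slices above do not return the same number of rows: .loc slicing is inclusive of both endpoints, while .iloc follows normal Python slicing. A toy demonstration (my own data):

```python
import pandas as pd

df_toy = pd.DataFrame({'v': range(10)})  # default index 0..9

# .loc[2:5] keeps labels 2, 3, 4 AND 5 (inclusive)
# .iloc[2:5] keeps positions 2, 3, 4 (end-exclusive, like list slicing)
print(len(df_toy.loc[2:5]))
print(len(df_toy.iloc[2:5]))
```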
Getting a subset of rows df2.loc[[100, 200, 300]] Try df2.iloc[[100,200,300]]
Selecting specific columns along with rows: df2.loc[100:110, ['Pregnancies', 'Glucose', 'BloodPressure']]
What does df2.iloc[100:110, :3] output? Note: this is where .iloc[] is different from .loc[] − it requires column positions, not column labels.
Practice: df2.loc[760:, ['Pregnancies', 'Glucose', 'BloodPressure']] # passing the starting index of a row as an open-ended range
What is the output? df2.iloc[760:, :3]
Conditional Slicing
df[df.BloodPressure == 122]
df[df.Outcome == 1]
df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]
The last line fetches Pregnancies, Glucose, and BloodPressure for all records with BloodPressure greater than 100.
Updating the value of a column: df2.loc[df['Age'] == 81, ['Age']] = 80. Here df['Age'] == 81 selects the rows where the value of 'Age' is 81, and ['Age'] selects the 'Age' column. This statement updates the 'Age' column of the new dataframe df2 to 80 at the location of all the rows in df where 'Age' is 81.
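The same conditional update on a self-contained toy frame (my own values), so the before/after effect is visible:

```python
import pandas as pd

df_toy = pd.DataFrame({'Age': [25, 81, 40, 81]})

# Set Age to 80 wherever it is currently 81
df_toy.loc[df_toy['Age'] == 81, 'Age'] = 80
print(df_toy['Age'].tolist())
```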
Isolating rows based on a condition in a DataFrame: df.loc[df['BloodPressure'] > 100, ['Pregnancies', 'Glucose', 'BloodPressure']]
Cleaning data using pandas. Checking for missing values: let's create a copy of the original dataframe and populate it with null values. Say, df3 = df.copy()
df3.loc[2:5, "Pregnancies"] = None  # populating 4 rows (labels 2 through 5, inclusive) of the Pregnancies column with NaN values
df3.head(7)
Checking for null values: df3.isnull().head(7) df3.isnull().sum() # getting the total number of null values in each column. The column Pregnancies has 4 null values in total; the rest of the columns do not have any null values.
What does this give as output? df3.isnull().sum().sum()
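The per-column versus grand-total distinction can be checked on a small frame with known gaps (toy data of my own):

```python
import pandas as pd
import numpy as np

df_toy = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [np.nan, np.nan, 6],
})

per_column = df_toy.isnull().sum()          # one count per column
grand_total = df_toy.isnull().sum().sum()   # single number for the whole frame
print(per_column.tolist(), grand_total)
```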
Dropping Missing Values. The df3 dataframe contains 4 NaN values. Let's create a df4 dataframe as df4 = df3.copy(). Check the shape of df3, then check the shape of df4 after executing the statement df4 = df4.dropna(). The df4 dataframe will contain 4 fewer rows after dropping the NaN values.
Another way of dropping Missing Values 1. df4.dropna(inplace=True, axis=1) # axis=0 drops rows, axis=1 drops columns 2. df4.dropna(inplace=True, how="all") # setting the how argument to 'all' drops a row (or column, with axis=1) only when all of its values are missing
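Both variants sketched on small frames with known missing values (toy data of my own, one frame per variant):

```python
import pandas as pd
import numpy as np

# Variant 1: axis=1 drops every column that contains a missing value
df_cols = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
kept_cols = df_cols.dropna(axis=1)        # 'a' has a NaN, so only 'b' survives

# Variant 2: how='all' drops only rows where every value is missing
df_rows = pd.DataFrame({'x': [1.0, np.nan], 'y': [2.0, np.nan]})
kept_rows = df_rows.dropna(how='all')     # second row is all-NaN, so it is dropped
print(list(kept_cols.columns), len(kept_rows))
```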
Dropping Duplicates: the drop_duplicates() function removes duplicate rows, keeping the first occurrence by default.
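A minimal sketch of drop_duplicates() on made-up data:

```python
import pandas as pd

df_dup = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# Rows 0 and 1 are identical; only the first is kept
deduped = df_dup.drop_duplicates()
print(len(deduped))
```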
Data Analysis in pandas: df.mean() df.median() df.mode()
Creating new columns based on existing columns. Create a copy of the dataframe df: df1 = df.copy()
df1['Glucose_Insulin_Ratio'] = df1['Glucose'] / df1['Insulin']
df1.head()
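The division is element-wise, pairing values row by row. A self-contained sketch with made-up numbers:

```python
import pandas as pd

df_toy = pd.DataFrame({'Glucose': [100, 150], 'Insulin': [50, 75]})

# New column computed element-wise from two existing columns
df_toy['Ratio'] = df_toy['Glucose'] / df_toy['Insulin']
print(df_toy['Ratio'].tolist())
```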
Working with categorical values. Category values can be counted using the .value_counts() method. Here, for example, we are counting the number of observations where Outcome is diabetic (1) and the number of observations where Outcome is non-diabetic (0). df['Outcome'].value_counts()
Applying .value_counts() on a subset of columns: df.value_counts(subset=['Pregnancies', 'Outcome'])
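With a subset, value_counts() counts each combination of the named columns. A sketch on toy data shaped like the diabetes columns (my own values):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Pregnancies': [0, 0, 1, 1, 1],
    'Outcome':     [0, 0, 0, 1, 1],
})

# One count per (Pregnancies, Outcome) combination
counts = df_toy.value_counts(subset=['Pregnancies', 'Outcome'])
print(counts)
```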
Aggregating data with .groupby() in pandas: pandas lets you aggregate values by grouping them by specific column values. You can do that by combining the .groupby() method with a summary method of your choice. The below code displays the mean of each of the numeric columns grouped by Outcome. df.groupby('Outcome').mean()
Another example: df.groupby(['Pregnancies', 'Outcome']).mean()
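The group-then-summarize pattern can be verified on a tiny frame (my own values) where the group means are easy to compute by hand:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'Outcome': [0, 0, 1, 1],
    'Glucose': [90, 110, 150, 170],
})

# Mean Glucose per Outcome group:
# Outcome 0 -> (90 + 110) / 2 = 100, Outcome 1 -> (150 + 170) / 2 = 160
means = df_toy.groupby('Outcome').mean()
print(means)
```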
Suggested Read ( Important ) Python pandas tutorial: The ultimate guide for beginners | DataCamp