Group B - Pandas
Pandas is a powerful Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of the NumPy library and is designed to make working with data both easy and efficient.

HarshitChauhan88 52 views 60 slides Jun 29, 2024


PANDAS
UNIVERSITY OF DELHI, DEPARTMENT OF OPERATIONAL RESEARCH
PYTHON B TEAM
SUBMITTED TO: DR. ADARSH ANAND

TEAM - B MEMBERS SHIVAM KUMAR RAMASHISH KUMAR NANDUNAM SAI KUIMAR ANUSHA SINGH GAURAV SURABHI SUDIN JANA RIJUL ANAND PRIYA RAWAT AKASH BALIYAN AADARSH GAUTAM HARSHIT PAWAN KUMAR HIMANSHU RAHUL NAGLE ALBIN GEO

CONTENTS
1. Introduction to Pandas
2. Basics of DataFrame
3. Import of Data
4. Functions of DataFrame
5. Data Extraction
6. Creating Charts for DataFrame

INTRODUCTION TO PANDAS
Pandas is a powerful Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of the NumPy library and is designed to make working with data both easy and efficient.
Pandas is a popular choice for data analysis because it offers a wide range of features, including: DataFrame, Series, indexing, data manipulation, time series, and plotting.

Introduction to DataFrames

What is a DataFrame?
A DataFrame is a two-dimensional, tabular data structure in the Pandas library for Python. It is similar to a spreadsheet or a SQL table, where data is organized in rows and columns. The DataFrame provides a powerful and flexible way to manipulate, analyze, and visualize structured data.
Key characteristics of a Pandas DataFrame: two-dimensional structure, column names and index, heterogeneous data types, flexibility in data operations, integration with other libraries, and data input and output.

Series vs DataFrame
In Pandas, a Series is a one-dimensional labeled array, whereas a DataFrame is a two-dimensional labeled data structure with columns of potentially different types. The two differ in shape, data types, indexing, and supported operations.
E.g., a Series can be used to represent a single column of data, such as the heights of a group of people, while a DataFrame can be used to represent a table of data, such as the results of a survey.
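The difference in shape can be seen in a minimal sketch. The names and values below are made up for illustration and are not taken from the slides.

```python
import pandas as pd

# Hypothetical data: heights (cm) of three people, labelled by name.
heights = pd.Series([172, 165, 180], index=["Asha", "Ben", "Chen"], name="height_cm")

# A small survey table whose columns have different types.
survey = pd.DataFrame(
    {"age": [31, 27, 45], "city": ["Delhi", "Pune", "Agra"]},
    index=["Asha", "Ben", "Chen"],
)

print(heights.ndim, heights.shape)  # 1 (3,)
print(survey.ndim, survey.shape)    # 2 (3, 2)

# Selecting a single column of a DataFrame gives back a Series.
age = survey["age"]
print(type(age).__name__)           # Series
```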

Creating a DataFrame
A DataFrame can be created from a dictionary, from a list of dictionaries, or by reading from external sources.
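The code shown on this slide was not preserved, so the following is only a sketch of the three creation routes. The column names and values are hypothetical, and an in-memory `io.StringIO` buffer stands in for a CSV file on disk.

```python
import io
import pandas as pd

# From a dictionary: keys become column names, values become columns.
df_dict = pd.DataFrame({"name": ["Luke", "Yuki"], "score": [81, 93]})

# From a list of dictionaries: each dict is one row; missing keys become NaN.
df_rows = pd.DataFrame([{"name": "Luke", "score": 81}, {"name": "Yuki"}])

# From an external source: StringIO stands in for a real CSV file here.
csv_text = "name,score\nLuke,81\nYuki,93\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_dict)
print(df_rows)
print(df_csv)
```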

Adding Rows and Columns
This slide covers adding a new row to the DataFrame and adding a new column to the DataFrame.
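One possible way to do both, assuming a small hypothetical table. Rows are added with `pd.concat` here because `DataFrame.append` is no longer available in pandas 2.0 and later.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Luke", "Yuki"], "score": [81, 93]})

# Add a column: assign a list with one value per existing row.
df["passed"] = [True, True]

# Add a row: build a one-row DataFrame and concatenate it onto the original.
new_row = pd.DataFrame([{"name": "Mira", "score": 58, "passed": False}])
df = pd.concat([df, new_row], ignore_index=True)

print(df)
```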

Deleting Rows and Columns
This slide covers deleting a column from the DataFrame and deleting a row from the DataFrame.
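A short sketch of both deletions using `drop`, again on hypothetical data. `drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Luke", "Yuki", "Mira"], "score": [81, 93, 58], "passed": [True, True, False]}
)

# Delete a column by label.
without_column = df.drop(columns=["passed"])

# Delete a row by its index label (here the default integer label 1).
without_row = df.drop(index=1)

print(without_column)
print(without_row)
```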

Import of Data

DATA FILE FORMATS
Types of file formats, utility, and performance.

WHY Different File Formats?
Storage and processing. Continuously evolving schema. Time taken to read from one location and write to another.
HOW to Choose the Right Format?
Row-based or columnar. Read-heavy or write-heavy workloads. Splittable. Supports schema evolution. Compression.

SOME USEFUL DATA FILE FORMATS
01 CSV (Comma-Separated Values): Row-based format. Record delimiter: newline. Header: optional. Does not support block compression. Needs an explicit encoding such as UTF-8 to display non-ASCII characters in the file.
02 JSON (JavaScript Object Notation): Row-based format represented by key-value pairs in a partially structured format, e.g. {"ID":1,"Name":"Luke","Interests":["Psychology"]} {"ID":2,"Name":"Yuki","Interests":["Ballet","Travelling"]}. Compressible. Supports schema evolution.
03 Parquet: Columnar format developed by Cloudera and Twitter. Only the required columns are read, reducing disk I/O. Stores data in the form of binary files. Parquet files are splittable. Supports block-level and file-level compression.
04 HDF5 (Hierarchical Data Format version 5): Open-source file format that supports large, complex, and heterogeneous data with its "directory-like" grouping mechanism. Can store and modify compressed data, i.e. fast I/O.

SUMMARY
Source: https://aaltoscicomp.github.io/python-for-scicomp/data-formats/

PERFORMANCE
Source: https://aaltoscicomp.github.io/python-for-scicomp/data-formats/

IMPORT THE DATASET
Import the dataset using the pandas read_<format>() functions, e.g. read_csv(filename), read_json(filename).
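As a rough illustration of the two readers named above, the sketch below reads CSV and line-delimited JSON from in-memory buffers instead of real files, so it runs without any data on disk. The records reuse the Luke/Yuki example from the file-formats slide. With a real file, the filename would be passed in place of the buffer.

```python
import io
import pandas as pd

csv_text = "ID,Name\n1,Luke\n2,Yuki\n"
json_lines = (
    '{"ID": 1, "Name": "Luke", "Interests": ["Psychology"]}\n'
    '{"ID": 2, "Name": "Yuki", "Interests": ["Ballet", "Travelling"]}\n'
)

# read_csv parses comma-separated text; the first line is used as the header.
df_csv = pd.read_csv(io.StringIO(csv_text))

# lines=True tells read_json that each line is a separate JSON record.
df_json = pd.read_json(io.StringIO(json_lines), lines=True)

print(df_csv)
print(df_json)
```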

Functions of DataFrames

Basic Information Function

Working with CSV Files
Steps shown: loading the pandas library, reading the CSV file, displaying the DataFrame, and the output.

Pandas DataFrame info() Method
The info() method prints information about the DataFrame. The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.
Syntax: dataframe.info(verbose, buf, max_cols, memory_usage, show_counts, null_counts)
Note: null_counts is the older name for show_counts and was removed in pandas 2.0.

Using the info() Function
Return value: None. The info() method does not return any value; it prints the information.
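The slide's screenshot is not available, so here is a small sketch on hypothetical data. It shows that `info()` returns `None` and that the `buf` parameter can capture the printed summary as text.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Luke", "Yuki", None], "score": [81.0, None, 70.0]})

# info() prints its summary to standard output and returns None.
result = df.info()
print(result)  # None

# Passing a text buffer via buf captures the summary instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
```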

Some Basic Information Functions of DataFrame
Checking the number of rows and columns, finding the size of the DataFrame, and examining the dimension of the DataFrame.
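The original screenshots are missing. One plausible reading, assumed here, is that these three checks correspond to the `shape`, `size`, and `ndim` attributes.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple.
rows, cols = df.shape
print(rows, cols)  # 3 2

# size is the total number of cells (rows * columns).
print(df.size)     # 6

# ndim is the number of axes; always 2 for a DataFrame.
print(df.ndim)     # 2
```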

DataFrame Functions and Methods
Function: Description
pd.DataFrame(): Creates a new DataFrame.
df.head(n): Returns the first n rows of the DataFrame.
df.tail(n): Returns the last n rows of the DataFrame.
df.info(): Provides a concise summary of the DataFrame.
df.describe(): Generates descriptive statistics of numeric columns.
df.columns: Returns the column labels of the DataFrame.
df.index: Returns the row labels of the DataFrame.
df.dtypes: Returns the data types of each column.
df.shape: Returns a tuple representing the dimensions of the DataFrame.
df.values: Returns a NumPy representation of the DataFrame.
df.drop(index): Deletes a row by index.
df.append(new_row): Adds a new row to the DataFrame. Removed in pandas 2.0; use pd.concat() instead.
df.merge(df2): Merges two DataFrames based on a common column.
df.isnull(): Returns a DataFrame of the same shape as df with True and False values indicating missing values.
df.dropna(): Drops rows containing any missing values.
df.fillna(value): Fills missing values with a specified value.
df.pivot_table(): Creates a spreadsheet-style pivot table.
df.rename(columns={'old_name': 'new_name'}): Renames columns.
df.groupby('col').agg(func): Groups the DataFrame by a column and applies an aggregation function.
df.sort_values('col'): Sorts the DataFrame by the values in a specific column.
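To make a few of the entries above concrete, here is a small sketch on invented sales data. It chains the missing-value helpers, a group-by aggregation, and a merge.

```python
import pandas as pd

sales = pd.DataFrame(
    {"city": ["Delhi", "Pune", "Delhi", "Pune"], "units": [10, None, 30, 5]}
)

# isnull() marks missing cells; summing the mask counts them.
missing = int(sales["units"].isnull().sum())

# fillna() replaces missing values; dropna() removes rows that contain any.
filled = sales.fillna({"units": 0})
dropped = sales.dropna()

# groupby().agg() computes one aggregate per group.
totals = filled.groupby("city").agg({"units": "sum"})

# merge() joins two DataFrames on a common column.
regions = pd.DataFrame({"city": ["Delhi", "Pune"], "region": ["North", "West"]})
merged = filled.merge(regions, on="city")

print(missing)
print(totals)
print(merged)
```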

Mathematical & Statistical Functions

Pandas Addition: add()
The pandas addition function performs addition of DataFrames. The addition is performed element-wise.
Syntax: pandas.DataFrame.add(other, axis='columns', level=None, fill_value=None)
Pandas Subtract: sub()
The subtract function of pandas is used to perform subtraction on DataFrames.
Syntax: pandas.DataFrame.sub(other, axis='columns', level=None, fill_value=None)

Pandas Multiply: mul()
The multiplication function of pandas is used to perform multiplication on DataFrames.
Syntax: pandas.DataFrame.mul(other, axis='columns', level=None, fill_value=None)
Pandas Division: div()
The division function of pandas is used to perform division on DataFrames.
Syntax: pandas.DataFrame.div(other, axis='columns', level=None, fill_value=None)
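A minimal sketch of the four element-wise methods, using two made-up DataFrames with matching labels. It also shows how `fill_value` substitutes for a value that is missing on one side only.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, np.nan]})
b = pd.DataFrame({"x": [10.0, 20.0], "y": [30.0, 40.0]})

# Element-wise addition; a missing operand makes the result missing.
total = a.add(b)

# fill_value replaces a value missing on one side before adding.
total_filled = a.add(b, fill_value=0)

# The other arithmetic methods follow the same pattern.
diff = b.sub(a)
prod = a.mul(b)
ratio = b.div(a)

print(total)
print(total_filled)
```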

Pandas Sum: sum()
The sum function finds the sum of the values along the desired axis.
Syntax: pandas.DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Note: this is the signature from older pandas releases; the level argument was removed in pandas 2.0.
Pandas Aggregate: agg()
The pandas aggregate function is used to aggregate using one or more operations over the desired axis.
Syntax: pandas.DataFrame.agg(func, axis=0, *args, **kwargs)
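The two methods can be sketched on a small hypothetical table as follows. `agg` accepts either a list of operations applied to every column or a dictionary mapping columns to operations.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Default axis sums down each column; axis=1 sums across each row.
col_sums = df.sum()
row_sums = df.sum(axis=1)

# A list of function names gives one result row per function.
stats = df.agg(["sum", "min", "max"])

# A dict applies a different function to each column.
mixed = df.agg({"x": "sum", "y": "mean"})

print(col_sums)
print(row_sums)
print(stats)
print(mixed)
```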

Pandas Statistical Functions
1. Percent change
Series and DataFrames both have the method pct_change() (as did Panel, which has since been removed from pandas). This method compares every element with its prior element and computes the fractional change between them.
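A quick sketch with made-up prices. Note that the result is a fraction (0.10 for a 10% rise), and that the first element has no prior value to compare with.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])

# Each element is compared with the one before it: (current - previous) / previous.
change = prices.pct_change()

print(change)
# The first entry is NaN; the next two are a 10% rise and a 10% fall.
```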

2. Covariance
Covariance is applied to Series data. The Series object has a method cov() to compute the covariance between two Series objects. NA values are excluded automatically.
3. Correlation
Correlation shows the linear relationship between any two arrays of values (Series). There are multiple methods to compute the correlation: pearson (the default), spearman, and kendall.
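The sketch below uses two invented Series in which `y` is exactly twice `x`, apart from one missing value. It sticks to the default Pearson method; the other methods are selected with the `method` argument of `corr`.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0, np.nan])

# The pair containing NaN is excluded, so only the first four pairs are used.
cov_xy = x.cov(y)

# Pearson correlation (the default) of a perfectly linear relationship.
corr_xy = x.corr(y)

print(cov_xy)
print(corr_xy)
```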

Sort Functions

Sorting
Sorting in Pandas refers to the process of arranging the elements (rows or columns) of a DataFrame or Series in a specified order based on the values they contain. The primary purpose of sorting is to organize the data in a structured way, making it easier to analyze and interpret. Pandas provides the sort_values() function for sorting DataFrame rows based on one or more columns, and it also offers methods like sort_index() for sorting based on the index.
Sort by value: You use .sort_values() to sort values in a DataFrame along either axis (columns or rows). Typically, you want to sort the rows in a DataFrame by the values of one or more columns.
Sort by index: You use .sort_index() to sort a DataFrame by its row index or column labels. The difference from using .sort_values() is that you're sorting the DataFrame based on its row index or column names, not by the values in these rows or columns.

How to Sort DataFrames in Pandas
The following actions related to sorting are covered:
1. Sorting based on a single column
2. Sorting based on multiple columns
3. Sorting by multiple columns with varying sort orders
4. Sorting by index
5. Disregarding the index while sorting
6. Selection of the sorting algorithm
7. Managing missing values during sorting

# Sorting based on a single column
sort_values() sorts your data in ascending order by default.

# Sorting based on multiple columns
# Sorting by multiple columns with varying sort orders
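The value-based sorts above might look like the following on a small hypothetical table. The `ascending` list gives one sort direction per column.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Agra", "Delhi"], "units": [5, 30, 12, 10]}
)

# Single column, ascending by default.
by_units = df.sort_values("units")

# Multiple columns: sort by city first, then by units within each city.
by_city_units = df.sort_values(["city", "units"])

# Varying orders: city ascending, units descending within each city.
mixed_order = df.sort_values(["city", "units"], ascending=[True, False])

print(by_units)
print(by_city_units)
print(mixed_order)
```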

# Sorting by index

# Selection of the sorting algorithm
Available algorithms are merge sort, quick sort, and heap sort.
# Managing missing values during sorting
Null values can be placed at the top.
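The remaining sorting actions (by index, disregarding the index, choosing the algorithm, and placing missing values) can be sketched as below. The data is invented; `kind` and `na_position` are the `sort_values` parameters assumed to correspond to the slide's screenshots.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"units": [5.0, np.nan, 12.0]}, index=["c", "a", "b"])

# Sort rows by their index labels.
by_index = df.sort_index()

# ignore_index=True discards the old labels and renumbers the result 0..n-1.
renumbered = df.sort_values("units", ignore_index=True)

# kind selects the algorithm: "quicksort", "mergesort", "heapsort", or "stable".
merge_sorted = df.sort_values("units", kind="mergesort")

# na_position="first" puts missing values at the top instead of the bottom.
nulls_first = df.sort_values("units", na_position="first")

print(by_index)
print(renumbered)
print(nulls_first)
```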

Data Extraction

What is Data Extraction?
In the context of Pandas, data extraction refers to the process of retrieving specific subsets of data from a larger dataset based on certain criteria or conditions. The Pandas library provides several methods and functions for efficiently extracting and filtering data from DataFrames.
Why Pandas for Data Extraction?
Tabular data handling, data cleaning, powerful data structures, integration with NumPy, and a wide range of I/O functions.

Ways to Extract Data in Pandas
There are many ways to extract data from a Pandas DataFrame. Here are a few examples:
1. Using the loc attribute
2. Using the iloc attribute
3. Using the at attribute
4. Using the iat attribute
5. Using the get method
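A compact sketch of all five accessors on a hypothetical table with string row labels. `loc` and `at` use labels, while `iloc` and `iat` use integer positions.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Luke", "Yuki", "Mira"], "score": [81, 93, 58]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r2", "score"]       # label-based: row "r2", column "score"
by_position = df.iloc[0, 0]            # position-based: first row, first column
single_label = df.at["r3", "name"]     # fast scalar access by label
single_position = df.iat[2, 1]         # fast scalar access by position
column = df.get("score")               # column lookup, like dict.get
absent = df.get("missing")             # returns None when the column is absent

print(by_label, by_position, single_label, single_position)
```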

Relational Operators in Pandas
Relational operators are used for making comparisons in Pandas. They are often used for filtering and querying data within DataFrame objects. A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Relational operators in pandas are used to create boolean masks, which are then used to filter rows of a DataFrame based on specified conditions.
Common operators: <, >, <=, >=, ==, !=.

EXAMPLE: Using a relational operator to extract specific data
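The slide's example image was not preserved. A minimal stand-in, with hypothetical names and scores, builds a boolean mask and uses it to filter rows.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Luke", "Yuki", "Mira"], "score": [81, 93, 58]})

# The comparison produces a boolean Series, one True/False per row.
mask = df["score"] > 80

# Indexing with the mask keeps only the rows where it is True.
high_scores = df[mask]

print(mask.tolist())  # [True, True, False]
print(high_scores)
```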

Logical Operators in Pandas
Logical operators in Python pandas are used to combine or modify conditions when filtering and querying data within DataFrame objects. Logical operators, including AND (&), OR (|), and NOT (~), play a crucial role in creating complex conditions for data selection.
Example: df[(df['column1'] > 50) & (df['column2'] == 'value')] selects rows where 'column1' is greater than 50 and 'column2' is equal to 'value'.

EXAMPLE: Using a logical operator to extract specific data
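In place of the missing screenshot, here is a sketch on invented data that combines conditions with each of the three operators. Each condition is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Luke", "Yuki", "Mira", "Ravi"],
        "city": ["Delhi", "Pune", "Delhi", "Agra"],
        "score": [81, 93, 58, 40],
    }
)

# AND: both conditions must hold.
both = df[(df["score"] > 60) & (df["city"] == "Delhi")]

# OR: at least one condition must hold.
either = df[(df["score"] > 90) | (df["city"] == "Agra")]

# NOT: invert a condition.
not_high = df[~(df["score"] > 60)]

print(both)
print(either)
print(not_high)
```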

iloc in Pandas
The iloc indexer in pandas is used for integer-location based indexing, allowing you to select data from a DataFrame based on its numerical position in the DataFrame. It is primarily used for selecting rows and columns by their integer indices.
Example: df.iloc[2:5, 0:3] selects rows 2 to 4 and columns 0 to 2.

EXAMPLE: Extracting a limited number of rows with the iloc indexer
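A small runnable version of the `df.iloc[2:5, 0:3]` example, on a made-up six-row table. As with ordinary Python slices, the end position is excluded.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": range(6),
        "b": range(10, 16),
        "c": range(20, 26),
        "d": range(30, 36),
    }
)

# Rows at positions 2, 3, 4 and columns at positions 0, 1, 2.
block = df.iloc[2:5, 0:3]

# Only a row slice: the first three rows, all columns.
first_three = df.iloc[:3]

print(block)
print(first_three)
```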

Charts for DataFrame

WHY DO WE MAKE CHARTS ?

TYPES OF CHART
Line plot, bar diagram, histogram, box plot, and area plot.

Line Plot

Box Plot

Area Plot

Histogram

Bar Diagram

Pie Chart
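The chart images are not available, so the following is only a sketch of how each chart type could be produced with `DataFrame.plot`. It assumes matplotlib is installed, uses invented monthly figures, and selects the non-interactive "Agg" backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import pandas as pd

# Hypothetical monthly figures.
sales = pd.DataFrame(
    {"units": [10, 25, 18, 30], "returns": [1, 3, 2, 4]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# Each call to DataFrame.plot creates a new figure and returns its Axes.
axes = {
    kind: sales.plot(kind=kind, title=kind)
    for kind in ["line", "bar", "hist", "box", "area"]
}

# A pie chart is drawn from a single column; slice labels come from the index.
axes["pie"] = sales.plot(kind="pie", y="units", title="pie")

print(sorted(axes))
```

Calling `axes["line"].figure.savefig(...)` with a filename of your choice would write any of these charts to disk.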

PROBLEMS

Consider this... Guess the output? 1. 2.

ANSWER “Two identical things do not exist at all.” 1 2

Q&A TIME

THANK YOU!