Data Science.pptx00000000000000000000000

Data Science and its Applications, VI Sem

Data Visualization

Matplotlib

For simple bar charts, line charts, and scatterplots, matplotlib works pretty well. If you are interested in producing elaborate interactive visualizations for the Web, it is likely not the right choice. We will be using the matplotlib.pyplot module. Importing it with the standard convention, import matplotlib.pyplot as plt, gives you access to a wide range of functions for creating various types of plots, customizing them, and adding annotations. Common plot types include line plots, scatter plots, bar plots, and histograms.

    import matplotlib.pyplot as plt

    # Sample data
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 6, 8, 10]

    # Plotting the data
    plt.plot(x, y)

    # Adding labels and title
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.title('Simple Line Plot')

    # Display the plot
    plt.show()

    from matplotlib import pyplot as plt

    years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
    gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

    # create a line chart, years on x-axis, gdp on y-axis
    plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

    # add a title
    plt.title("Nominal GDP")

    # add a label to the y-axis
    plt.ylabel("Billions of $")
    plt.show()

    import matplotlib.pyplot as plt

    # Sample data
    categories = ['A', 'B', 'C', 'D', 'E']
    values = [10, 20, 15, 25, 30]

    # Creating a bar plot
    plt.bar(categories, values)

    # Adding labels and title
    plt.xlabel('Categories')
    plt.ylabel('Values')
    plt.title('Basic Bar Plot')

    # Display the plot
    plt.show()

    from matplotlib import pyplot as plt

    movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
    num_oscars = [5, 11, 3, 8, 10]

    # bars are by default width 0.8, so we'll add 0.1 to the left coordinates
    # so that each bar is centered on i + 0.5
    xs = [i + 0.1 for i, _ in enumerate(movies)]

    # plot bars with left x-coordinates [xs], heights [num_oscars];
    # align="edge" tells matplotlib that xs are left edges (its default is "center")
    plt.bar(xs, num_oscars, align="edge")
    plt.ylabel("# of Academy Awards")
    plt.title("My Favorite Movies")

    # label x-axis with movie names at bar centers
    plt.xticks([i + 0.5 for i, _ in enumerate(movies)], movies)
    plt.show()

In this list comprehension, enumerate(movies) loops through the list of movies and yields both the index and the movie name. Since we only need the index, it is bound to i and the movie name is bound to the throwaway variable _. Adding 0.1 to each index gives the left edge of a bar of width 0.8, so that the bar is centered on i + 0.5. These adjusted indices are stored in the list xs, which is used as the x-coordinates for plotting the bars.

In this code:
- plt.xticks() sets the x-axis ticks. The first argument is the list of x-coordinates where the ticks should appear. The second argument is the list of tick labels, which in this case are the movie names.
- plt.show() is called to display the plot.
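To see what the list comprehension produces, the short sketch below prints the index/name pairs from enumerate and the resulting coordinates. The names pairs, left_edges, and tick_positions are illustrative and do not appear on the original slide.

```python
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]

# enumerate yields (index, item) pairs; the underscore marks the item as unused
pairs = list(enumerate(movies))

# left edge of each bar, and the position of each tick label (illustrative names)
left_edges = [i + 0.1 for i, _ in enumerate(movies)]
tick_positions = [i + 0.5 for i, _ in enumerate(movies)]

print(pairs[:2])       # [(0, 'Annie Hall'), (1, 'Ben-Hur')]
print(tick_positions)  # [0.5, 1.5, 2.5, 3.5, 4.5]

# a bar of width 0.8 starting at i + 0.1 has its center at i + 0.5
centers = [left + 0.8 / 2 for left in left_edges]
```

Each center coincides with the matching tick position (up to floating-point rounding), which is why the labels sit under the middle of the bars.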

    from collections import Counter
    from matplotlib import pyplot as plt

    grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
    decile = lambda grade: grade // 10 * 10
    histogram = Counter(decile(grade) for grade in grades)

    plt.bar([x - 4 for x in histogram.keys()],  # shift each bar to the left by 4
            list(histogram.values()),           # give each bar its correct height
            8,                                  # give each bar a width of 8
            align="edge")                       # treat the x values as left edges
    plt.axis([-5, 105, 0, 5])                   # x-axis from -5 to 105,
                                                # y-axis from 0 to 5
    plt.xticks([10 * i for i in range(11)])     # x-axis labels at 0, 10, ..., 100
    plt.xlabel("Decile")
    plt.ylabel("# of Students")
    plt.title("Distribution of Exam 1 Grades")
    plt.show()

    from collections import Counter

    # Create a Counter from a list
    my_list = ['a', 'b', 'c', 'a', 'b', 'a']
    my_counter = Counter(my_list)

    # Access counts
    print(my_counter['a'])      # Output: 3 (since 'a' appears 3 times)

    # Access unique elements and their counts
    print(my_counter.keys())    # Output: dict_keys(['a', 'b', 'c'])
    print(my_counter.values())  # Output: dict_values([3, 2, 1])

    # Arithmetic operations
    other_list = ['a', 'b', 'c', 'a', 'a']
    other_counter = Counter(other_list)
    print(my_counter + other_counter)  # Output: Counter({'a': 6, 'b': 3, 'c': 2})
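The same Counter behaviour is what the grade histogram relies on: each grade is rounded down to its decile and the deciles are counted. The sketch below uses a small made-up list of grades (an assumption for illustration, not the slide's data) and shows that a decile with no grades simply reports a count of 0.

```python
from collections import Counter

# hypothetical sample of grades, chosen only for illustration
grades = [83, 95, 91, 87, 70, 85, 82, 100, 67, 73, 77]

def decile(grade):
    """Round a grade down to the nearest multiple of 10."""
    return grade // 10 * 10

histogram = Counter(decile(g) for g in grades)

print(sorted(histogram.items()))  # [(60, 1), (70, 3), (80, 4), (90, 2), (100, 1)]
print(histogram[50])              # 0, because a Counter returns 0 for missing keys
```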

Line Charts

As we saw already, we can make line charts using plt.plot().

    from matplotlib import pyplot as plt

    variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
    bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
    total_error = [x + y for x, y in zip(variance, bias_squared)]
    xs = [i for i, _ in enumerate(variance)]

    # we can make multiple calls to plt.plot
    # to show multiple series on the same chart
    plt.plot(xs, variance, 'g-', label='variance')        # green solid line
    plt.plot(xs, bias_squared, 'r-.', label='bias^2')     # red dot-dashed line
    plt.plot(xs, total_error, 'b:', label='total error')  # blue dotted line

    # because we've assigned labels to each series
    # we can get a legend for free
    # loc=9 means "top center"
    plt.legend(loc=9)
    plt.xlabel("model complexity")
    plt.title("The Bias-Variance Tradeoff")
    plt.show()

Scatterplots

A scatterplot is the right choice for visualizing the relationship between two paired sets of data. For example, the relationship between the number of friends your users have and the number of minutes they spend on the site every day:

    from matplotlib import pyplot as plt

    friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
    minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
    labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

    plt.scatter(friends, minutes)

    # label each point
    for label, friend_count, minute_count in zip(labels, friends, minutes):
        plt.annotate(label,
                     xy=(friend_count, minute_count),  # put the label with its point
                     xytext=(5, -5),                   # but slightly offset
                     textcoords='offset points')

    plt.title("Daily Minutes vs. Number of Friends")
    plt.xlabel("# of friends")
    plt.ylabel("daily minutes spent on the site")
    plt.show()

In the line

    for label, friend_count, minute_count in zip(labels, friends, minutes):

Python's zip() function is used to iterate over multiple lists (labels, friends, and minutes) simultaneously:
- labels contains the label for each data point.
- friends contains the number of friends for each data point.
- minutes contains the minutes spent on the site for each data point.

In each iteration, label, friend_count, and minute_count hold the current elements of labels, friends, and minutes, respectively.

    plt.annotate(label, xy=(friend_count, minute_count), xytext=(5, -5), textcoords='offset points')

is then used to annotate the scatter plot with the label for each point. It anchors the text at the data coordinates xy=(friend_count, minute_count) and draws it with a small offset from that point, xytext=(5, -5), measured in points because textcoords='offset points'.
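A minimal sketch of what zip() hands to the loop, using the first three values of each list. The names points and short are illustrative. One caveat worth knowing: zip() stops at the shortest input, so a missing value silently drops a point instead of raising an error.

```python
labels = ["a", "b", "c"]
friends = [70, 65, 72]
minutes = [175, 170, 205]

# each element is one (label, friend_count, minute_count) triple
points = list(zip(labels, friends, minutes))
print(points)  # [('a', 70, 175), ('b', 65, 170), ('c', 72, 205)]

# zip stops at the shortest input: the third label has no partner here
short = list(zip(labels, friends[:2]))
print(short)   # [('a', 70), ('b', 65)]
```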

Bar Chart: use bar charts to represent categorical data or data that can be divided into distinct groups.
- Best for comparing values between different categories or groups.
- Useful for showing discrete data points or data that doesn't have a natural order.
- Suitable for showing changes over time when time is divided into distinct intervals (e.g., months, years).

Examples of when to use bar charts:
- Comparing sales performance of different products.
- Showing population distribution by country.
- Displaying the frequency of occurrence of different categories.

Line Chart: use line charts to visualize trends and patterns in continuous data over time.
- Best for showing changes and trends over a continuous scale (e.g., time, temperature, distance).
- Ideal for illustrating relationships between variables and identifying patterns such as growth, decline, or fluctuations.
- Also suitable for displaying multiple series of data on the same chart for comparison.

Examples of when to use line charts:
- Showing stock price fluctuations over time.
- Visualizing temperature changes throughout the year.
- Displaying trends in website traffic over months or years.

In summary, choose a bar chart when you want to compare discrete categories or groups, and opt for a line chart when you need to visualize trends or patterns in continuous data over time.

Linear Algebra

Vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors. For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age). If you're teaching a class with four exams, you can treat student grades as four-dimensional vectors (exam1, exam2, exam3, exam4).

    height_weight_age = [70,   # inches
                         170,  # pounds
                         40]   # years

    grades = [95,  # exam1
              80,  # exam2
              75,  # exam3
              62]  # exam4

    def vector_add(v, w):
        """adds corresponding elements"""
        return [v_i + w_i for v_i, w_i in zip(v, w)]

Similarly, to subtract two vectors we just subtract corresponding elements:

    def vector_subtract(v, w):
        """subtracts corresponding elements"""
        return [v_i - w_i for v_i, w_i in zip(v, w)]

    def vector_sum(vectors):
        """sums all corresponding elements"""
        result = vectors[0]                      # start with the first vector
        for vector in vectors[1:]:               # then loop over the others
            result = vector_add(result, vector)  # and add them to the result
        return result

We'll also need to be able to multiply a vector by a scalar, which we do simply by multiplying each element of the vector by that number:

    def scalar_multiply(c, v):
        """c is a number, v is a vector"""
        return [c * v_i for v_i in v]
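One way these two helpers combine is a componentwise mean: sum the vectors, then scale by 1/n. The sketch below repeats the definitions so that it runs on its own. The function vector_mean and the second row of grades are assumptions added for illustration and are not part of the original slides.

```python
def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    """sums all corresponding elements"""
    result = vectors[0]
    for vector in vectors[1:]:
        result = vector_add(result, vector)
    return result

def scalar_multiply(c, v):
    """c is a number, v is a vector"""
    return [c * v_i for v_i in v]

def vector_mean(vectors):
    """hypothetical helper: componentwise average of same-length vectors"""
    n = len(vectors)
    return scalar_multiply(1 / n, vector_sum(vectors))

# two students' grades on four exams (second row is made up)
grades = [[95, 80, 75, 62],
          [85, 90, 65, 78]]

print(vector_sum(grades))   # [180, 170, 140, 140]
print(vector_mean(grades))  # [90.0, 85.0, 70.0, 70.0]
```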

In Python, a tuple is a collection data type similar to a list, but with one key difference: tuples are immutable, meaning that once they are created, their elements cannot be changed. Tuples are defined by enclosing comma-separated values within parentheses (). Here's a basic example of a tuple:

    my_tuple = (1, 2, 'a', 'b', True)

Tuples can contain elements of different data types, including integers, floats, strings, booleans, and even other tuples or data structures. You can access elements of a tuple using indexing, just like with lists:

    print(my_tuple[0])  # Output: 1
    print(my_tuple[2])  # Output: a

Tuples support many of the same operations as lists, such as slicing, concatenation, and repetition:

    tuple1 = (1, 2, 3)
    tuple2 = ('a', 'b', 'c')

    # Slicing
    print(tuple1[:2])  # Output: (1, 2)

    # Concatenation
    tuple3 = tuple1 + tuple2
    print(tuple3)      # Output: (1, 2, 3, 'a', 'b', 'c')

    # Repetition
    tuple4 = tuple2 * 2
    print(tuple4)      # Output: ('a', 'b', 'c', 'a', 'b', 'c')

However, because tuples are immutable, you cannot modify individual elements:

    my_tuple[0] = 5  # This raises a TypeError because tuples are immutable

Tuples are commonly used in Python for various purposes, such as representing fixed collections of items, returning multiple values from a function, and as keys in dictionaries (since they are immutable).
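The three uses just mentioned can be sketched in a few lines. The function min_and_max and the distances dictionary are made-up names for illustration.

```python
def min_and_max(values):
    """Return two results at once; Python packs them into a tuple."""
    return min(values), max(values)

# unpack the returned tuple into two variables
lowest, highest = min_and_max([3, 5, 2, 8])
print(lowest, highest)  # 2 8

# tuples of hashable items are hashable, so they can serve as dictionary keys
distances = {("A", "B"): 5, ("B", "C"): 7}
print(distances[("A", "B")])  # 5

# attempting to modify a tuple raises TypeError
point = (1, 2)
try:
    point[0] = 5
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```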

Vector addition

If two vectors v and w are the same length, their sum is just the vector whose first element is v[0] + w[0], whose second element is v[1] + w[1], and so on. (If they're not the same length, then we're not allowed to add them.) For example, adding the vectors [1, 2] and [2, 1] results in [1 + 2, 2 + 1], or [3, 3].
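The rule that vectors of different lengths may not be added can be enforced explicitly. The variant below is a sketch, not the slides' definition: it adds a length check, and raising ValueError is a choice made here for illustration. Without the check, zip() would silently truncate to the shorter vector.

```python
def vector_add(v, w):
    """Add corresponding elements; refuse vectors of different lengths."""
    if len(v) != len(w):
        raise ValueError("vectors must be the same length")
    return [v_i + w_i for v_i, w_i in zip(v, w)]

total = vector_add([1, 2], [2, 1])
print(total)  # [3, 3]

try:
    vector_add([1, 2], [1, 2, 3])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```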

Notes

The %matplotlib inline command is used in Jupyter Notebooks or IPython environments to display Matplotlib plots directly within the notebook. It ensures that plots are rendered inline, meaning they appear directly below the code cell that generates them. With %matplotlib inline, the notebook shows Matplotlib plots without the need for additional commands like plt.show().

The command plt.plot(x, y, 'r') plots the data in arrays x and y in red. Each part of the command does the following:
- plt.plot(): creates a line plot.
- x and y: the data arrays plotted along the x and y axes, respectively.
- 'r': the color of the line; 'r' stands for red. You can use other color abbreviations ('b' for blue, 'g' for green, etc.) or full color names ('red', 'blue', 'green', etc.).

So plt.plot(x, y, 'r') creates a line plot of y against x with a red line.

The np.arange() function in NumPy creates an array of evenly spaced values within a specified interval:

    np.arange(start, stop, step)

- start: the starting value of the sequence.
- stop: the end value of the sequence, not included.
- step: the step size between each pair of consecutive values. It defaults to 1 if not provided.

For example, np.arange(0, 10) generates an array containing the integers from 0 up to (but not including) 10, with the default step size of 1. The resulting array is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].

np.linspace(start, stop, num) returns num evenly spaced values from start to stop, with both endpoints included:

    np.linspace(0, 10, 5)
    array([ 0. ,  2.5,  5. ,  7.5, 10. ])

    np.linspace(0, 10, 50)
    array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
            1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
            2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
            3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
            4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
            5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
            6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
            7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
            8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
            9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
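The spacing in these outputs follows from dividing the interval into num - 1 equal steps. The pure-Python function below is a sketch of that arithmetic only; it is not NumPy's implementation, it assumes num is at least 2, and the name linspace is reused here just for readability.

```python
def linspace(start, stop, num):
    """Sketch: num evenly spaced values from start to stop, both included.

    Assumes num >= 2; this mirrors the arithmetic, not NumPy's actual code.
    """
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

five = linspace(0, 10, 5)
print(five)  # [0.0, 2.5, 5.0, 7.5, 10.0]

fifty = linspace(0, 10, 50)
print(round(fifty[1], 8))  # 0.20408163, i.e. 10 / 49
```

With 5 values the step is 10 / 4 = 2.5; with 50 values it is 10 / 49, which is why the second array does not land on round numbers.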

    import numpy as np

    labels = ['a', 'b', 'c']
    my_list = [10, 20, 30]
    arr = np.array([10, 20, 30])
    d = {'a': 10, 'b': 20, 'c': 30}

- labels: a list containing three strings: 'a', 'b', and 'c'.
- my_list: a list containing three integers: 10, 20, and 30.
- arr: a NumPy array created using np.array(), containing the same integers as my_list.
- d: a dictionary whose keys are strings ('a', 'b', 'c') and whose values are integers (10, 20, 30).

These data structures store similar information but in different ways and with different functionality. Lists are ordered collections, NumPy arrays are arrays of homogeneous data, and dictionaries are mappings of keys to values.

Pandas is a popular open-source Python library used for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, such as tables or spreadsheet-like data, making it a fundamental tool for data scientists, analysts, and researchers.

Key features of Pandas:
- DataFrame: the primary data structure in Pandas is the DataFrame, a two-dimensional labeled data structure with columns of potentially different types. It resembles a spreadsheet or SQL table, and you can think of it as a dictionary of Series objects, where each Series represents a column.

    import matplotlib.pyplot as plt

    # Data for demonstration
    x = [1, 2, 3, 4]
    y = [1, 4, 9, 16]

    # Create a figure to hold 4 rows and 2 columns of subplots
    plt.figure(figsize=(10, 10))

    # Loop through each subplot position in the 4x2 grid
    for i in range(1, 9):  # 1 to 8 for a 4x2 grid
        plt.subplot(4, 2, i)
        plt.plot(x, y)
        plt.title(f'Subplot {i}')  # Label each subplot

    # Adjust layout to prevent overlap
    plt.tight_layout()

    # Display the plot
    plt.show()

Resulting layout: Subplot 1 and Subplot 2 share the first row, Subplot 3 and Subplot 4 the second row, and so on down to Subplot 7 and Subplot 8 in the fourth row.

Statistics

Central Tendencies

We'll want some notion of where our data is centered. We'll use the mean (or average), which is just the sum of the data divided by its count:

    def mean(x):
        return sum(x) / len(x)

    mean(num_friends)

We'll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).

    from collections import Counter
    import matplotlib.pyplot as plt

    # Example list of friend counts
    num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

    friend_counts = Counter(num_friends)
    xs = range(101)                       # assuming the largest value is 100
    ys = [friend_counts[x] for x in xs]   # height is the number of people with x friends

    plt.bar(xs, ys)
    plt.axis([0, 101, 0, max(ys) + 1])    # y-axis limit is one more than the maximum count
    plt.title("Histogram of Friend Counts")
    plt.xlabel("# of friends")
    plt.ylabel("# of people")
    plt.show()

    sorted_values = sorted(num_friends)       # [2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
    smallest_value = sorted_values[0]         # 2
    second_largest_value = sorted_values[-2]  # 7

Median

    def median(v):
        """finds the 'middle-most' value of v"""
        n = len(v)
        sorted_v = sorted(v)
        midpoint = n // 2

        if n % 2 == 1:
            # if odd, return the middle value
            return sorted_v[midpoint]
        else:
            # if even, return the average of the middle values
            lo = midpoint - 1
            hi = midpoint
            return (sorted_v[lo] + sorted_v[hi]) / 2
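A quick check of both branches, with the definition repeated so that the snippet runs on its own. The two short lists are made up for illustration, and the comparison against the standard library's statistics.median is an added sanity check rather than part of the slides.

```python
import statistics

def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        return sorted_v[midpoint]                # odd: the middle value
    lo, hi = midpoint - 1, midpoint
    return (sorted_v[lo] + sorted_v[hi]) / 2     # even: average of the middle two

odd_case = median([1, 10, 2, 9, 5])   # sorted: [1, 2, 5, 9, 10]
even_case = median([1, 9, 2, 10])     # sorted: [1, 2, 9, 10]
print(odd_case)   # 5
print(even_case)  # 5.5

num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
print(median(num_friends) == statistics.median(num_friends))  # True
```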

    def quantile(x, p):
        p_index = int(p * len(x))
        return sorted(x)[p_index]

    x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
    p = 0.25  # desired quantile (25th percentile)
    result = quantile(x, p)

p_index = int(p * len(x)) calculates the index corresponding to the quantile p in the sorted dataset x:
- len(x) returns the number of elements in the dataset x, i.e., the length of the dataset.
- p is the quantile you want to calculate, represented as a value between 0 and 1.
- p × len(x) calculates the position in the sorted dataset corresponding to the desired quantile. Since p is a fraction between 0 and 1, multiplying it by the length of the dataset gives the position at which the quantile would sit if the data were sorted.
- int(p × len(x)) takes the integer part of the result. This ensures that the index is an integer, as indices in Python must be integers.

For example, suppose x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6] and you want to calculate the 25th percentile, p = 0.25. Then p × len(x) = 0.25 × 15 = 3.75. Taking the integer part of 3.75 gives 3, so the 25th percentile of the dataset x is the element at index 3 in the sorted dataset, which is 3.
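The same arithmetic can be repeated for several values of p. The sketch below reuses the quantile function as defined on the slide; the dictionary name results is illustrative. Note one edge case of this simple definition: p = 1.0 gives the index len(x), which is out of range, so p must stay below 1.

```python
def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]

x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
# sorted(x) is [2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]

results = {p: quantile(x, p) for p in (0.10, 0.25, 0.50, 0.75, 0.90)}
print(results)  # {0.1: 3, 0.25: 3, 0.5: 5, 0.75: 6, 0.9: 7}

# p = 1.0 would ask for index 15 in a list whose last index is 14
try:
    quantile(x, 1.0)
    out_of_range = False
except IndexError:
    out_of_range = True
```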

Uncertainty Randomness

Dependence and Independence

If we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about whether the second flip is Heads. These events are independent. On the other hand, knowing whether the first flip is Heads certainly gives us information about whether both flips are Tails. (If the first flip is Heads, then definitely it's not the case that both flips are Tails.) These two events are dependent.
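The two claims can be checked by enumerating the four equally likely outcomes of two flips and applying the standard product rule: events A and B are independent exactly when P(A and B) = P(A) × P(B). The helper prob and the event names below are assumptions introduced for this sketch.

```python
from fractions import Fraction
from itertools import product

# the four equally likely outcomes: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_heads(o):
    return o[0] == "H"

def second_heads(o):
    return o[1] == "H"

def both_tails(o):
    return o == ("T", "T")

# first flip Heads vs. second flip Heads: the product rule holds
p_joint = prob(lambda o: first_heads(o) and second_heads(o))
independent = p_joint == prob(first_heads) * prob(second_heads)
print(p_joint, independent)  # 1/4 True

# first flip Heads vs. both flips Tails: the product rule fails
p_joint_tails = prob(lambda o: first_heads(o) and both_tails(o))
dependent = p_joint_tails != prob(first_heads) * prob(both_tails)
print(p_joint_tails, dependent)  # 0 True
```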

    import random

    def random_kid():
        return random.choice(["boy", "girl"])

    both_girls = 0
    older_girl = 0
    either_girl = 0

    random.seed(0)
    for _ in range(10000):
        younger = random_kid()
        older = random_kid()
        if older == "girl":
            older_girl += 1
        if older == "girl" and younger == "girl":
            both_girls += 1
        if older == "girl" or younger == "girl":
            either_girl += 1

    print("P(both | older):", both_girls / older_girl)    # close to 1/2
    print("P(both | either):", both_girls / either_girl)  # close to 1/3

We want to calculate two conditional probabilities for a family with two children:
- The probability that both children are girls given that the older child is a girl.
- The probability that both children are girls given that at least one of the children is a girl.

random_kid() function: returns either "boy" or "girl" at random with equal probability (0.5 each).

Counters:
- both_girls: counts the number of times both children are girls.
- older_girl: counts the number of times the older child is a girl.
- either_girl: counts the number of times at least one child is a girl.

Simulation loop: for 10,000 iterations, the code simulates the gender of two children (older and younger) and updates the counters accordingly.

Probabilities

P(both | older): the probability that both children are girls given that the older child is a girl. It is estimated as both_girls / older_girl, the number of times both children are girls divided by the number of times the older child is a girl.

P(both | either): the probability that both children are girls given that at least one of the children is a girl. It is estimated as both_girls / either_girl, the number of times both children are girls divided by the number of times at least one child is a girl.

Mathematical Explanation

P(both | older): given that the older child is a girl, the younger child is a girl or a boy with equal probability. Therefore, the probability that both are girls is 1/2.

P(both | either): there are 4 equally likely combinations (Girl-Girl, Girl-Boy, Boy-Girl, Boy-Boy). Three of them have at least one girl (Girl-Girl, Girl-Boy, Boy-Girl), and only one of those three is Girl-Girl. Therefore, the probability is 1/3.

Simulation Results

P(both | older): the simulation result should be close to 0.5 (which is 1/2).
P(both | either): the simulation result should be close to 1/3.
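The counting argument can also be carried out exactly instead of by simulation. The sketch below enumerates the four equally likely families and mirrors the three counters of the simulation; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# each family is an (older, younger) pair; all four are equally likely
families = list(product(["boy", "girl"], repeat=2))

both_girls = [f for f in families if f == ("girl", "girl")]
older_girl = [f for f in families if f[0] == "girl"]
either_girl = [f for f in families if "girl" in f]

# every "both girls" family is also in the conditioning set, so a ratio of counts suffices
p_both_given_older = Fraction(len(both_girls), len(older_girl))
p_both_given_either = Fraction(len(both_girls), len(either_girl))

print(p_both_given_older)   # 1/2
print(p_both_given_either)  # 1/3
```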

Normal Distribution

Program 5

Code:

    import re

    import numpy as np
    import pandas as pd

    # Import the data into a DataFrame
    books_df = pd.read_csv('desktop/BL-Flickr-Images-Book.csv')

    # Display the first few rows of the DataFrame
    print("Original DataFrame:")
    print(books_df.head())

    # Find and drop the columns which are irrelevant for the book information
    columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                       'Former owner', 'Engraver', 'Contributors', 'Issuance type',
                       'Shelfmarks']
    books_df.drop(columns=columns_to_drop, inplace=True)

    # Change the index of the DataFrame
    books_df.set_index('Identifier', inplace=True)

    # Tidy up fields such as date of publication with a simple regular expression
    def clean_date(date):
        if isinstance(date, str):
            match = re.search(r'\d{4}', date)
            if match:
                return match.group()
        return np.nan

    books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)

    # Combine str methods with NumPy to clean columns
    # (na=False treats missing values as non-matches)
    place = books_df['Place of Publication']
    books_df['Place of Publication'] = np.where(
        place.str.contains('London', na=False), 'London',
        np.where(
            place.str.contains('Oxford', na=False), 'Oxford',
            place.replace(r'^\s*$', 'Unknown', regex=True)
        )
    )

    # Display the cleaned DataFrame
    print("\nCleaned DataFrame:")
    print(books_df.head())

Function Definition: the function clean_date takes one parameter, date.

    def clean_date(date):

Type Check: it first checks whether the input date is a string, using the isinstance function.

    if isinstance(date, str):

Regular Expression Search: if date is indeed a string, the function uses re.search to look for four consecutive digits (which typically represent a year) in the string.

    match = re.search(r'\d{4}', date)

re.search scans the input string for the first location where the regular expression pattern \d{4} (any four consecutive digits) matches. If such a location is found, re.search returns a match object; otherwise, it returns None.

Extracting the Year: if a match is found (i.e., the match object is not None), the function extracts the matched string (the year) using the group method of the match object.

    if match:
        return match.group()

Handling No Match: if date is not a string, or if no four-digit number is found in the string, the function returns np.nan, which represents a missing value in data analysis.

    return np.nan

Libraries
- re: provides regular expression matching operations.
- numpy as np: typically used for numerical and array operations. np.nan is a special floating-point value that represents "Not a Number" and is used to denote missing values.

Example:
- Input: "April 20, 1995"    Output: "1995"
- Input: "The year is 2023"  Output: "2023"
- Input: "No year here"      Output: np.nan
- Input: 12345               Output: np.nan (since the input is not a string)
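These four cases can be reproduced with the standard library alone. In the sketch below, math.nan stands in for np.nan (both are the ordinary floating-point NaN), which is a substitution made here so that the snippet runs without NumPy. The last call shows a limitation worth knowing: the pattern matches the first four digits of any longer run of digits inside a string.

```python
import math
import re

def clean_date(date):
    """Return the first four-digit run in a string, or NaN if there is none."""
    if isinstance(date, str):
        match = re.search(r"\d{4}", date)
        if match:
            return match.group()
    return math.nan  # stand-in for np.nan (assumption for this sketch)

print(clean_date("April 20, 1995"))    # 1995
print(clean_date("The year is 2023"))  # 2023
print(clean_date("No year here"))      # nan
print(clean_date(12345))               # nan, because the input is not a string

# a five-digit string still matches: the first four digits are returned
print(clean_date("12345"))             # 1234
```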

r'^\s*$': this is a regular expression pattern.
- ^: asserts the position at the start of the string.
- \s*: matches zero or more whitespace characters (spaces, tabs, newlines).
- $: asserts the position at the end of the string.

Therefore, r'^\s*$' matches any string that is completely empty or contains only whitespace characters.

'Unknown': this is the replacement value. Any string that matches the regular expression pattern is replaced with the string 'Unknown'.

regex=True: this tells the replace method to interpret the first argument as a regular expression pattern. Without regex=True, the method would treat the pattern as a plain string and look for the exact text ^\s*$, which wouldn't match anything in most cases.

What does this code do The code replaces any entry in the 'Place of Publication' column that is empty or contains only whitespace with the string 'Unknown'. By setting regex=True, it ensures that the regular expression pattern is correctly used to identify these entries.
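The effect of the pattern can be tried out on plain strings with Python's re module. This is only a sketch of the matching rule: it uses a list of made-up values rather than a pandas column, and the names places and cleaned are illustrative.

```python
import re

blank = re.compile(r"^\s*$")  # empty, or whitespace only

# hypothetical column values
places = ["London", "", "   ", "\t\n", " Oxford "]

# replace blank entries with 'Unknown'; leave everything else untouched
cleaned = ["Unknown" if blank.match(place) else place for place in places]
print(cleaned)  # ['London', 'Unknown', 'Unknown', 'Unknown', ' Oxford ']
```

Note that ' Oxford ' is left as it is: surrounding whitespace does not make a string blank, because the pattern requires the whole string to be whitespace.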
