Lecture 1 Introduction to statistics and data Science last.pdf

alqoumri789 36 views 50 slides Aug 27, 2024

Dr. Belal Murshed
Statistical Analysis for Data Science
Lecturer
Dr. Belal Abdullah Murshed
Lecture 1 and 2
Dr. Belal Abdullah Hezam
Statistical Analysis for Data Science


Course Contents
Introduction to Statistics and Data Science
Data Analysis and Data Representation
Measures of Central Tendency and Dispersion
Exploratory Data Analysis
Statistical Experiments and Significance Testing
Regression and Correlation Analysis
Evidence, Probabilities, and Basic Classification and Clustering Methods
Final Project Discussion


Course Intended Learning Outcomes (CILOs)
•Demonstrate an understanding of the methods, techniques, and tools for obtaining, organizing, exploring, and analyzing data.
•Demonstrate in-depth knowledge of using and adapting statistical programming tools to manipulate data.
•Recognize how data analysis, inferential statistics, and statistical computing can be used together in an integrated way.
•Analyze data using open-source statistical tools related to the field of data science.
•Propose descriptive-analysis solutions to real-world computing problems by means of statistical methods, tables, measures, graphs, and linear regression analyses.
•Apply descriptive statistics methods and AI and data science fundamentals by working with data in Python to design and develop statistics-based solutions.
•Choose the appropriate statistical test according to the nature of the data to meet a given set of computing requirements in the context of AI and data science.
•Work effectively as a member or leader of a team to complete a statistical analysis project, using appropriate resources and tools efficiently.

Assessments
Assessment                 Marks                    Grading Percentage
Assignments                10                       10%
Project                    20                       20%
Mid Term Practical         20                       20%
Attendance                 10                       10%
Final Exam                 40                       40%
Total                      100                      100%
Extra marks for sharing    1 to 10 extra marks      +10%

Reference Book
1- Required Textbook(s) (maximum two):
1. De Smith, M. J. (2015). STATSREF: Statistical Analysis Handbook - a web-based statistics resource. The Winchelsea Press, Winchelsea, UK. http://statsref.com/StatsRefSample.pdf
2. Bruce, P. and Bruce, A. (2017). Practical Statistics for Data Scientists: 50 Essential Concepts, 2nd Edition. O’Reilly Media, Inc., 1005 Gravenstein Highway North.
2- Essential References:
1. Cowan, G. (2015). Statistical Data Analysis (Oxford Science Publications), 1st Edition. http://www.amazon.com/GlenCowan/e/B001HCU9Y2/ref=dp_byline_cont_boo
2. Dasgupta, N. (2018). Practical Big Data Analytics. Packt Publishing, Birmingham, UK.

Introduction to Statistics and Data Science

Objectives
•To understand the meaning of statistics.
•To explore some statistics terminology.
•To define statistical analysis.
•To understand the objectives of statistical analysis.
•To explain the types of statistical analysis.
•To explain the importance of statistical analysis.
•To understand what data science is.
•To explain the foundations of data science.
•To explain the data science lifecycle.
•To know the skills required for data science.
•To present the data science tools.

Introduction to Statistics
•Statistics is a branch of math focused on collecting, organizing, and
understanding numerical data.

•It involves analyzing and interpreting data to solve real-life problems,
using various quantitative models.

•Some view statistics as a separate scientific discipline rather than
just a branch of math. It simplifies complex tasks and offers clear
insights into regular activities.

•Statistics finds applications in diverse fields like weather forecasting,
stock market analysis, insurance, and data science.



What is Statistics?
•Statistics in Mathematics is the study and manipulation of data. It
involves the analysis of numerical data, enabling the extraction of meaningful
conclusions from the collected and analyzed data sets.

According to Merriam-Webster:
•Statistics is the science of collecting, analyzing, interpreting, and
presenting masses of numerical data.

According to Oxford English Dictionary:
•Statistics is a branch of mathematics dealing with the collection, analysis,
interpretation, presentation, and organization of data.

Statistics Terminologies
•Some of the most common terms you might come across in statistics are:
o Population: The complete collection of individual objects or events whose properties are to be analyzed.
o Sample: A subset of a population.
o Variable: A characteristic that can take different values.
o Parameter: A numerical characteristic of a population.
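
The four terms above can be illustrated with a short Python sketch. The population of marks is randomly generated and purely hypothetical, and the names `population`, `sample`, `population_mean`, and `sample_mean` are chosen here for illustration. The sketch also uses the word "statistic" for a value computed from a sample, a term that is not defined on the slide.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: exam marks (the variable) for 1,000 students.
population = [random.randint(40, 100) for _ in range(1000)]

# Parameter: a numerical characteristic of the whole population.
population_mean = statistics.mean(population)

# Sample: a subset of the population, here 50 students chosen at random.
sample = random.sample(population, k=50)

# The sample mean is a statistic that estimates the population parameter.
sample_mean = statistics.mean(sample)

print(f"Population size: {len(population)}, mean mark: {population_mean:.2f}")
print(f"Sample size: {len(sample)}, mean mark: {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean, which is why the two are given different names.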

Statistics Examples
Some real-life examples of statistics :
•Example 1: In a class of 45 students, we calculate the mean mark to evaluate the performance of the class.

•Example 2: Around elections, you might have seen opinion polls and exit polls. These polls collect the opinions of a sample of the population and are used to predict election results.

Statistical analysis Definition
Statistical analysis:
•It is a systematic process for collecting, analyzing, interpreting,
and presenting large volumes of data in order to identify trends
and develop valuable insights.

•It involves applying statistical methods to understand patterns,
trends, correlations, and variability within datasets.

•Numerous disciplines, including (1) business, (2) economics, (3) social sciences, (4) science, and (5) engineering, rely heavily on statistical analysis.

Main objectives of Statistical analysis
The primary objectives of statistical analysis are:
•To make defensible decisions.
•To gain valuable insights.
•To derive reliable conclusions from data.


Types of Statistical Analysis
•There are different types of statistical analysis that can be used in
the process of data science

The main types of statistical analysis are:
1.Descriptive Statistical Analysis

2.Inferential Statistical Analysis


1.Descriptive Statistical Analysis
•It is a type of analysis that deals with collecting, analyzing, interpreting, and summarizing data in order to present it in the form of tables, charts, and graphs rather than to draw conclusions.

•This type of analysis makes the data simpler to analyze.
•It focuses on summarizing and describing data sets without drawing conclusions beyond the data at hand.
•Descriptive statistical analysis employs the following measures to provide a concise overview of the data’s characteristics:
1.Measures of central tendency
o Mean
o Median
o Mode
2.Measures of dispersion
o Range
o Variance
o Standard deviation
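
As a minimal sketch, the measures listed above can be computed with Python's standard `statistics` module. The list `marks` is hypothetical data made up for illustration, and the variance and standard deviation shown are the sample versions (dividing by n - 1), which is one of two common conventions.

```python
import statistics

# Hypothetical marks for a small class (illustrative values only).
marks = [55, 60, 60, 65, 70, 75, 80, 85, 90]

# Measures of central tendency
mean_mark = statistics.mean(marks)
median_mark = statistics.median(marks)
mode_mark = statistics.mode(marks)

# Measures of dispersion
mark_range = max(marks) - min(marks)
variance = statistics.variance(marks)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(marks)      # sample standard deviation

print(f"mean={mean_mark:.2f} median={median_mark} mode={mode_mark}")
print(f"range={mark_range} variance={variance:.2f} std_dev={std_dev:.2f}")
```

If the data represent an entire population rather than a sample, `statistics.pvariance` and `statistics.pstdev` would be the corresponding calls.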

2.Inferential Statistical Analysis

•Inferential statistical analysis draws conclusions about a population from sample data.
•It helps in understanding and analyzing the population from which the sample was drawn.
•This type of analysis delves deeper than description, drawing conclusions about a population based on a sample of data. Hypothesis testing, chi-square tests, t-tests, and ANOVA are some of the commonly used inferential statistical techniques.
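
One of the techniques named above, the t-test, can be sketched by hand for the one-sample case. The data, the hypothesized mean of 50, and the variable names are assumptions made for illustration. The critical value 2.365 is the two-sided 5% value for 7 degrees of freedom from a standard t table.

```python
import math
import statistics

# Hypothetical sample of 8 measurements (illustrative values only).
sample = [52, 55, 49, 58, 61, 50, 54, 57]
hypothesized_mean = 50  # null hypothesis: the population mean is 50

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # uses n - 1 in the denominator
standard_error = sample_sd / math.sqrt(n)

# One-sample t statistic with n - 1 degrees of freedom.
t_stat = (sample_mean - hypothesized_mean) / standard_error

# Two-sided critical value for alpha = 0.05 and 7 degrees of freedom,
# taken from a standard t table.
t_critical = 2.365
reject_null = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}, reject null hypothesis: {reject_null}")
```

In practice a library routine such as `scipy.stats.ttest_1samp` is the more usual choice, since it returns the statistic together with a p-value instead of relying on a table lookup.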






Importance of Statistical Analysis
•It plays an important role in data science, offering valuable insights into patterns, trends, and relationships within datasets.
o It helps in understanding the patterns, trends, and relationships between different variables in the data.

o Statistical analysis techniques can be used to identify and handle missing values, outliers, and inconsistencies in the data.

o Statistical analysis techniques help in selecting appropriate features and creating new features for a model, which can improve the model's efficiency.
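
The point about missing values and outliers can be sketched as follows. The readings are hypothetical, dropping missing values is only one of several possible strategies, and the 1.5 × IQR rule used here is a common convention rather than a method taken from the slides.

```python
import statistics

# Hypothetical readings; None marks a missing value (illustrative data only).
readings = [12.0, 13.5, None, 12.8, 13.1, 45.0, 12.4, None, 13.0, 12.6]

# Handle missing values: here we simply drop them and count how many there were.
observed = [x for x in readings if x is not None]
missing_count = len(readings) - len(observed)

# Flag outliers with the common 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in observed if x < lower or x > upper]

print(f"missing values: {missing_count}")
print(f"outliers: {outliers}")
```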






•The effectiveness of models, algorithms, and procedures is assessed using statistical metrics and measures. These include accuracy, precision, recall, F1-score, and other performance metrics.

•Statistical analysis supports risk management by helping to measure and evaluate risks in a variety of industries, including banking, insurance, and healthcare.
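
The performance metrics mentioned above can be computed directly from their standard definitions. The labels in `y_true` and `y_pred` are hypothetical and stand in for the output of any binary classifier.

```python
# Hypothetical true labels and model predictions for a binary classifier
# (1 = positive class, 0 = negative class); illustrative values only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1_score:.2f}")
```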








Data Science

•Data science combines math and statistics, specialized
programming, advanced analytics, artificial intelligence
(AI) and machine learning with specific subject matter expertise to
extract knowledge and insights from structured and unstructured
data.
•Data science is an interdisciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from structured and unstructured data.


The Foundation of Data Science


1. Statistics: Discovering the patterns within data

•It serves as the bedrock of data science, providing the tools and
techniques to collect, analyze, and interpret data.
•It provides data scientists with the means to uncover patterns,
trends, and relationships hidden within complex datasets.
•By applying statistical concepts such as central tendency,
variability, and correlation, data scientists can gain insights into
the underlying structure of data

2. Python: Versatile data scientist’s toolkit
Python has emerged as a prominent programming language in the
data science realm due to its versatility, readability, and an expansive
ecosystem of libraries.
Popular Python libraries for data science include:
•NumPy is a library for numerical computation. It provides a fast
and efficient way to manipulate data arrays.
•SciPy is a library for scientific computing. It provides a wide
range of mathematical functions and algorithms.
•Pandas is a library for data analysis. It provides a high-level
interface for working with data frames.
•Matplotlib is a library for plotting data. It provides a wide range
of visualization tools.

3. Models: Bridging data and predictive insights
o Models are mathematical representations of a real-world problem, learned from data by applying a machine learning algorithm suited to the problem, such as classification, regression, or clustering.
o A model is also called a hypothesis.
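
As a minimal sketch of a model learned from data, the following fits a straight line by ordinary least squares, one simple regression technique among many. The data and the names `hours`, `marks`, and `predict` are hypothetical and chosen for illustration.

```python
# Hypothetical data: hours studied (x) and exam mark (y); illustrative only.
hours = [1, 2, 3, 4, 5]
marks = [52, 57, 61, 68, 72]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Ordinary least squares estimates for the line y = intercept + slope * x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
s_xx = sum((x - mean_x) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(x):
    """Return the mark the fitted line predicts for x hours of study."""
    return intercept + slope * x


print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"predicted mark for 6 hours: {predict(6):.1f}")
```

The fitted line is the "hypothesis" in the sense used above: a mathematical description of the relationship that can be applied to new inputs.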


4. Domain knowledge: Contextualizing data insights
Domain knowledge refers to knowledge of the discipline or field to which data science is applied.
An expert or specialist in a field such as biotech is said to possess domain knowledge of that industry.
•Domain knowledge involves understanding the specific industry or field to
which the data pertains. This contextual awareness aids data scientists in
framing relevant questions, identifying meaningful variables, and
interpreting results accurately.
•Collaboration between data scientists and domain experts fosters a holistic
approach to problem-solving. Data scientists contribute their analytical
expertise, while domain experts offer insights that can refine data-driven
strategies.

Data Science Lifecycle

Skills for Data Science.
1.Statistics and probability
2.Advanced mathematics (multivariate calculus and linear algebra)
3.Data visualization tools (translating information and data into visuals such as relationship maps, 3D plots, bar charts, histograms, line plots, and pie charts)
4.Python or R programming language
5.Data wrangling (the process of cleaning raw data, removing outliers, replacing null values, and turning the data into a format that can be used in different programs)
6.Machine learning (ML), AI, deep learning (DL), and natural language processing (NLP)
7.Data analysis
8.Problem-solving skills
9.Big data





Data Science Applications
Data science has a wide array of applications across various industries.

Healthcare:
•Predictive Analytics: Predicting disease outbreaks, patient readmissions, and individual health risks.
•Medical Imaging: Enhancing image recognition to diagnose conditions from X-rays, MRIs, and CT scans.
•Personalized Medicine: Tailoring treatment plans based on genetic information and patient history.
Finance:
•Risk Management: Identifying and mitigating financial risks through predictive modeling.
•Fraud Detection: Analyzing transactions to detect fraudulent activities.
Marketing:
•Customer Segmentation: Grouping customers based on purchasing behavior and preferences for targeted marketing.
•Sentiment Analysis: Analyzing customer comments and social media interactions to measure public sentiment.
•Predictive Analytics: Forecasting sales trends and customer lifetime value.




Retail:
•Inventory Management: Optimizing stock levels based on demand forecasting.
•Recommendation Systems: Providing personalized product recommendations to customers.
•Price Optimization: Adjusting prices dynamically based on market trends and consumer behavior.
Transportation:
•Route Optimization: Enhancing logistics by determining the most efficient routes.
•Predictive Maintenance: Forecasting equipment failures to schedule timely maintenance.
•Autonomous Vehicles: Developing self-driving cars using machine learning algorithms.

Education:
•Personalized Learning: Creating customized learning experiences based on student performance and preferences.
•Academic Analytics: Analyzing data to improve student retention and graduation rates.
•Curriculum Development: Using data to develop and refine educational programs.

Entertainment:
•Content Recommendation: Suggesting movies, shows, and music based on user preferences.
•Audience Analytics: Understanding audience behavior to improve content delivery.
•Production Analytics: Optimizing production schedules and budgets through data analysis.

Data Science Tools
1.The best programming language-driven tools:

1.1 R Language:

Features: Used for statistical analysis and graphics, with rich visualization and data manipulation libraries (ggplot2, dplyr, and the wider tidyverse) and strong domain-specific packages.

Benefits: Excellent for statistical modeling and exploratory data analysis, popular in academia and research.

Use Cases: Statistical analysis, bioinformatics, econometrics, data visualization, and creating research reports. R is one of the most essential data science skills to focus on.







1.2 Python:

Features: Used for statistical computing and data analysis; offers readability, versatility, extensive libraries (NumPy, Pandas, scikit-learn, etc.), and large community support.

Benefits: Python is suitable for both beginners and experienced programmers and is efficient for a variety of data science tasks, with abundant resources and support available.

Use Cases: Data manipulation and wrangling, statistical analysis, machine learning, deep learning, web development, and scientific computing.






2.Pandas



Features: DataFrames for structured data, flexible tools for data loading, cleaning, transformation, and analysis.

Benefits: Streamlines data manipulation, making it easier to work with tables
and perform common operations.

Use Cases: Data preprocessing, feature engineering, calculating summary
statistics.

How to Install pandas Library
>> pip install pandas
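
Once pandas is installed, a minimal sketch of the use cases above (data preprocessing and summary statistics) might look like the following. The student records and column names are hypothetical, and filling a missing value with the column mean is only one possible cleaning choice.

```python
import pandas as pd

# Hypothetical student records (illustrative values only).
df = pd.DataFrame({
    "student": ["Ali", "Sara", "Omar", "Lina"],
    "section": ["A", "A", "B", "B"],
    "mark": [72.0, None, 65.0, 88.0],
})

# Data cleaning: fill the missing mark with the mean of the observed marks.
df["mark"] = df["mark"].fillna(df["mark"].mean())

# Summary statistics: mean mark per section.
section_means = df.groupby("section")["mark"].mean()

print(df)
print(section_means)
```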


3.NumPy (known as Numerical Python)


Features: Focuses on multi-dimensional arrays (array manipulation), basic mathematical functions, and basic linear algebra routines.

Benefits: Efficient numerical computation, providing the foundation for
scientific computing in Python.

Use Cases: Matrix operations, numerical optimization, machine learning
(data representation).

How to Install NumPy Library
>> pip install numpy
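
A short sketch of the features listed above, using a small matrix and vector with made-up values:

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector); values are illustrative.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])

column_means = a.mean(axis=0)  # basic mathematical function, per column
product = a @ v                # matrix-vector multiplication
inverse = np.linalg.inv(a)     # basic linear algebra routine

print("column means:", column_means)
print("a @ v:", product)
print("inverse of a:\n", inverse)
```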

3.SciPy (known as Scientific Python)
Features: Built on top of NumPy, SciPy extends its functionality by adding high-level scientific and technical computing capabilities. It offers a broad spectrum of scientific tools, algorithms, and functions for a wide range of domains, including optimization, signal processing, statistics, and more.
Benefits: SciPy is organized into submodules, each catering to a specific scientific discipline. This modular structure makes it easier to find and use functions relevant to your specific scientific domain.
Use Cases: Solving complex differential equations, advanced function optimization using the "optimize" module, conducting statistical analysis, multidimensional image processing, and working with specialized mathematical functions through the "special" module.
How to Install SciPy Library
>> pip install scipy
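
As a brief sketch of two of the submodules mentioned above, the following uses `scipy.optimize` to minimize a simple function and `scipy.stats` to compute a correlation. The function `objective` and the data in `x` and `y` are hypothetical examples.

```python
from scipy import optimize, stats


def objective(x):
    """A simple quadratic with its minimum at x = 2."""
    return (x - 2.0) ** 2 + 1.0


# Optimization: find the x that minimizes the objective.
result = optimize.minimize_scalar(objective)

# Statistics: Pearson correlation between two illustrative variables.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r, p_value = stats.pearsonr(x, y)

print(f"minimum found at x = {result.x:.4f}, value = {result.fun:.4f}")
print(f"Pearson r = {r:.4f}, p-value = {p_value:.6f}")
```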

4. Most popular tools used for data visualization:

4.1 Matplotlib (for data visualization)



Features: Provides building blocks for creating static plots (line charts, scatterplots,
bar charts, etc.), full customization.

Benefits: Foundation for other visualization libraries, offers granular control over
plot elements.
Use Cases: Exploratory data analysis, creating simple, static, animated, interactive
and complex visualizations for research publications.

How to Install Matplotlib Library
>> pip install matplotlib
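
A minimal sketch of a static line plot follows. The data and the output file name `marks_vs_hours.png` are hypothetical, and the `Agg` backend is selected only so that the script can run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt

# Hypothetical data (illustrative values only).
hours = [1, 2, 3, 4, 5]
marks = [52, 57, 61, 68, 72]

fig, ax = plt.subplots()
ax.plot(hours, marks, marker="o", label="marks")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam mark")
ax.set_title("Marks against hours studied")
ax.legend()

# Save the figure to a file (hypothetical file name).
fig.savefig("marks_vs_hours.png")
```

In an interactive session, `plt.show()` would display the figure instead of saving it.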


4.2 Seaborn (for data visualization)





Features: High-level interface, built on Matplotlib, which focuses on statistical
graphics and attractive visual themes.

Benefits: Enables creation of informative and visually appealing plots with a few
lines of code.

Use Cases: Exploring relationships between variables, distributions of data, and
comparisons across groups.

How to Install Seaborn Library
>> pip install Seaborn

5. Popular tools used for Machine Learning:

5.1 scikit-learn (for Machine Learning)



Features: Wide range of machine learning algorithms (classification, regression,
clustering, etc.), model selection and evaluation tools.

Benefits: Provides a consistent interface for implementing standard machine
learning tasks.

Use Cases: Building predictive models, spam detection, and customer churn
analysis.

How to Install scikit-learn Library
>> pip install scikit-learn
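
The consistent interface mentioned above (fit, then predict, then evaluate) can be sketched with the iris dataset that ships with scikit-learn. The choice of logistic regression, the 25% test split, and the random seed are assumptions made for illustration and are not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in classification dataset.
X, y = load_iris(return_X_y=True)

# Hold out part of the data to evaluate the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a classifier, predict on the held-out data, and evaluate.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"test accuracy: {accuracy:.2f}")
```

Other estimators in the library follow the same `fit` and `predict` pattern, so swapping in a different algorithm usually changes only the line that constructs `model`.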


5.2 TensorFlow (for Machine Learning and Deep Learning)





Features: Low-level control over numerical computations, optimized for large-
scale neural networks, automatic differentiation.

Benefits: Flexibility in building deep learning architectures, and efficient training
on GPUs and TPUs.

Use Cases: Computer vision, natural language processing, building
recommendation systems.

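How to Install TensorFlow Library
>> python -m pip install --upgrade pip
>> pip install tensorflow
The separate tensorflow-gpu package is deprecated; in current releases, GPU support is provided through the main tensorflow package (see the official installation guide for platform details).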
6. Scrapy




Features: Designed for web scraping; it can also be used to extract data using APIs (such as Amazon Associates Web Services).

Benefits: Extracts the data you need from websites in a fast, simple, yet extensible way.

Use Cases: Web scraping, building general-purpose web crawlers, and extracting and storing collected data in databases.

How to Install Scrapy Library
>> pip install scrapy
7. Popular tools used for databases:

7.1 MySQL (for databases)



Features: Relational database (SQL), ACID compliance, scalability, security.

Benefits: Mature and reliable for storing structured data, supports standard SQL
queries.

Use Cases: Backend for web applications, storing customer data, inventory
systems.

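How to Connect to MySQL from Python
MySQL itself is a database server that is installed separately from Python. To work with it from Python, install a driver such as MySQL Connector/Python:
>> pip install mysql-connector-python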


7.2 MongoDB (for databases)





Features: Document-oriented NoSQL database, JSON-like storage, flexible
schema, scalability.

Benefits: Handles semi-structured and unstructured data well, and supports agile
development.

Use Cases: Real-time analytics, content management systems, and storing logs.

8. Popular tools used for business intelligence & data communication:
8.1 Tableau/Power BI (for business intelligence)

Features: Business intelligence tools for data exploration, visualization, and dashboard creation.

Benefits: Drag-and-drop interface for easy visualization creation, interactive dashboards for sharing insights. Familiarity with such tools is a valuable skill for a data scientist.

Use Cases: Creating intuitive data dashboards for decision-making, and communicating data stories to stakeholders.

9. Popular tools used for big data processing:

9.1 Hadoop/Spark (for big data processing)







Features: Open-source frameworks for distributed computing and big data
processing.

Benefits: Handle massive datasets efficiently across clusters of machines.

Use Case: Large-scale data analysis, processing real-time data streams.

Thank you