DataScienceandVisualization_Mod_1_ppt.pptx

AnithaCL1 · 78 slides · Jul 06, 2024

Slide Content

Doing Data Science Module 1

What is Data Science?
Big Data and Data Science Hype
Getting Past the Hype / Why Now?
Datafication
The Current Landscape (with a Little History)
Data Science Jobs
A Data Science Profile
Thought Experiment: Meta-Definition
OK, So What Is a Data Scientist, Really?
In Academia
In Industry

Big Data and Data Science Hype. Big Data: how big is big? Data Science: who is doing it? Academia has been doing this for years; statisticians have been doing this work all along.

Getting Past the Hype / Why Now? The Hype: understanding the cultural phenomenon of data science and how others were experiencing it, and studying how companies and universities are "doing data science." Why Now: technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, and a cultural acceptance of technology in the fabric of our lives. This wasn't true a decade ago. Consideration should be given to the ethical and technical responsibilities of the people responsible for the process.

Datafication. Definition: a process of "taking all aspects of life and turning them into data." For example: Google's augmented-reality glasses "datafy" the gaze, Twitter "datafies" stray thoughts, and LinkedIn "datafies" professional networks.

Current Landscape of Data Science. Drew Conway's Venn diagram of data science from 2010 has three circles: Hacking Skills, Math and Statistics Knowledge, and Substantive Expertise. Their pairwise overlaps are Machine Learning (hacking plus math/stats), Traditional Research (math/stats plus expertise), and the Danger Zone (hacking plus expertise); the three-way intersection is Data Science.

Data Science Jobs. Job descriptions ask for experts in computer science, statistics, communication, and data visualization, with extensive domain expertise. Observation: nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise; together, as a team, they can specialize in all those things.

Data Science Profile. A data science profile can be built from one's skill levels in different domains: computer science, math, machine learning, domain expertise, communication and presentation skills, and data visualization. As a data scientist, you can create such a visualization of yourself on a particular scale.

Data Science Profile

Data Science Team

What is Data Science, Really? Data scientist in academia? Who in academia plans to become a data scientist? Statisticians, applied mathematicians, computer scientists, sociologists, journalists, political scientists, biomedical informatics students, students from government agencies and social welfare, someone from the architecture school, environmental engineers, pure mathematicians, business marketing students, and students who already worked as data scientists. They were all interested in figuring out ways to solve important problems, often of social value, with data.

In Academia: an academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data, and must wrestle with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.

In Industry? What do data scientists look like in industry? It depends on the level of seniority. A chief data scientist sets everything up, from the engineering and infrastructure for collecting data and logging, to privacy concerns, to deciding what data will be user-facing, how data is going to be used to make decisions, and how it's going to be built back into the product.

A chief data scientist also manages a team of engineers, scientists, and analysts, and communicates with leadership across the company, including the CEO, CTO, and product leadership.

In Industry: someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. He or she spends a lot of time collecting, cleaning, and "munging" data, because data is never clean. This process requires persistence, statistics, and software engineering skills that are also necessary for understanding biases in the data and for debugging logging output from code.

Statistical Inference
Statistical Thinking in the Age of Big Data
Statistical Inference
Populations and Samples
Big Data Examples
Big Assumptions due to Big Data
Modeling

Statistical Thinking in the Age of Big Data. What is Big Data? First, it is a bundle of technologies. Second, it is a potential revolution in measurement. And third, it is a point of view, or philosophy, about how decisions will be, and perhaps should be, made in the future.

Statistical Thinking in the Age of Big Data. Prerequisites: massive skills! Math/computer science: statistics, linear algebra, coding. Analytical: data preparation, munging, modeling, visualization, communication.

Statistical Inference. The world is complex, random, and uncertain. As we commute to work on subways and in cars, shop, email, browse the Internet, watch the stock market, build things, eat things, and talk to our friends and family about things, all of these processes potentially produce data. Data are small traces of real-world processes; which traces we gather are decided by our data collection or sampling method.

This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference. More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.

Populations and Samples. Population: the population of India, or the population of the world? It could be any set of objects or units, such as tweets, photographs, or stars. If we could measure the characteristics of all those objects, we would have a complete set of N observations.
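
A minimal R sketch of the population-versus-sample idea; the simulated population of tweet lengths and the sample size n are illustrative, not from the slides:

    # Population: all N units we could in principle measure (here, simulated tweet lengths)
    population <- rpois(1e6, lambda = 80)
    N <- length(population)

    # Sample: the subset our data collection or sampling method actually gathers
    n <- 1000
    s <- sample(population, size = n)

    mean(s)           # sample estimate
    mean(population)  # the population value it estimates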

Modeling. What's a model? An attempt to understand the population of interest and represent it in a compact form that can be used to experiment, analyze, and study, and to determine cause-and-effect and similar relationships among the variables under study in the population. Data model. Statistical model: key variables and mathematical structure. Mathematical model: consists of mathematical expressions.

Model Building. Define your objective: first, define very clearly what problem you are going to solve. Collect data: gather data relevant to your objective. Clean your data: data cleaning is a critical step to prepare your dataset for modeling. Explore your data. Split your data. Choose a model. Train your model. Evaluate your model.

PROBABILITY DISTRIBUTION. A probability distribution gives the probability of each outcome of a random experiment or event. It is a function that describes the likelihood of obtaining the possible values that a random variable can assume.
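
A small R illustration of both views of a distribution, using the standard normal (dnorm, pnorm, and rnorm are base R functions):

    dnorm(0)                  # density (likelihood) of the value 0 under Normal(0, 1)
    pnorm(1.96)               # cumulative probability P(X <= 1.96), about 0.975
    draws <- rnorm(10000)     # 10,000 random draws from Normal(0, 1)
    hist(draws, breaks = 50)  # the empirical shape matches the theoretical distribution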

Probability Distributions (Page 31)

Fitting a model means estimating the parameters of the model using the observed data. Overfitting: the fitted model captures the sampled data well but isn't good at capturing reality beyond it.
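
A hedged illustration of overfitting in R: the data are simulated from a straight line, and a needlessly flexible degree-10 polynomial fits the training sample better but the held-out data worse. All names and numbers here are illustrative:

    set.seed(1)
    d <- data.frame(x = runif(40))
    d$y <- 2 * d$x + rnorm(40, sd = 0.2)          # the true relationship is linear
    idx   <- sample(nrow(d), 20)
    train <- d[idx, ]
    test  <- d[-idx, ]

    simple  <- lm(y ~ x, data = train)
    complex <- lm(y ~ poly(x, 10), data = train)  # far more flexible than the data warrant

    rmse <- function(fit, dat) sqrt(mean((dat$y - predict(fit, dat))^2))
    c(train = rmse(complex, train), test = rmse(complex, test))  # low train, high test error
    c(train = rmse(simple,  train), test = rmse(simple,  test))  # similar train and test error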

MODULE 2: Exploratory Data Analysis and the Data Science Process

Exploratory Data Analysis (EDA). "It is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there." (John Tukey) Traditionally presented as a bunch of histograms and stem-and-leaf plots.

Features. EDA is a critical part of the data science process. It represents a philosophy or way of doing statistics. There are no hypotheses and there is no model. The "exploratory" aspect means that your understanding of the problem you are solving, or might solve, is changing as you go.

Basic Tools of EDA: plots, graphs, and summary statistics. It is a method of systematically going through the data, plotting distributions of all variables. EDA is not just a set of tools; it's also a mindset, and the mindset is about your relationship with the data.
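
A sketch of these basic tools in base R, on a hypothetical data frame df (all names and numbers illustrative):

    df <- data.frame(age = rpois(500, 35), income = rlnorm(500, 10, 0.5))

    summary(df)              # summary statistics for every variable
    hist(df$income)          # distribution of a single variable
    plot(df$age, df$income)  # relationship between a pair of variables
    pairs(df)                # scatterplot matrix across all variables
    boxplot(df$income)       # outliers at a glance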

Philosophy of EDA. There are many reasons anyone working with data should do EDA: EDA helps with debugging the logging process, it helps assure the product is performing as intended, and it is done toward the beginning of the analysis.

Data Science Process

A Data Scientist's Role in This Process

Doing Data Science Chapter 3

What is an algorithm? A series of steps or rules to accomplish a task, such as sorting, searching, or graph-based computational problems. Because one problem can be solved by several algorithms, the "best" is the one that does it with the most efficiency and the least computation time.

Three Categories of Algorithms. 1. Data munging, preparation, and processing (sorting, MapReduce, Pregel): considered data engineering. 2. Optimization, i.e., parameter estimation: gradient descent, Newton's method, least squares. 3. Machine learning: predict, classify, cluster.
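
As a sketch of the optimization category, here is gradient descent minimizing a least-squares objective in R for a no-intercept line y ≈ bx; the data, learning rate, and iteration count are illustrative assumptions:

    set.seed(2)
    x <- runif(100)
    y <- 3 * x + rnorm(100, sd = 0.1)    # simulated data with true slope 3

    b <- 0          # initial guess for the slope
    alpha <- 0.01   # learning rate (step size)
    for (i in 1:1000) {
      grad <- -2 * sum(x * (y - b * x))  # derivative of RSS(b) with respect to b
      b <- b - alpha * grad              # step downhill
    }
    b   # converges near 3, the least-squares estimate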

Data Scientists. Good data scientists use both statistical modeling and machine learning algorithms. Statisticians want to apply parameters to real-world scenarios, provide confidence intervals with explicit uncertainty, and make explicit assumptions about data generation. Software engineers want to turn a model into production code without interpreting parameters; machine learning algorithms don't have notions of uncertainty and don't make explicit assumptions about the probability distribution; the assumptions are implicit.

Linear Regression (supervised). Determine whether there is causation, and build a model if we think so. Does X (the explanatory variable) cause Y (the response variable)? Assumptions: quantitative variables and a linear form.

Linear Regression (supervised). Steps: 1. Create a scatterplot of the data. 2. Ensure the data looks linear (maybe apply a transformation). 3. Find the least-squares fit line, the line with the lowest sum of squared residuals (actual values minus expected values). 4. Check your model for goodness of fit with R-squared, p-values, etc. 5. Apply your model within reason.

Suppose you run a social networking site that charges a monthly subscription fee of $25, and that this is your only source of revenue. Each month you collect data and count your number of users and total revenue. You've done this daily over the course of two years, recording it all in a spreadsheet. You could express this data as a series of points. Here are the first four: S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}
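
These four points lie exactly on a line through the origin with slope 25 (each user contributes $25 of revenue), which R's lm recovers:

    users   <- c(1, 10, 100, 200)
    revenue <- c(25, 250, 2500, 5000)

    fit <- lm(revenue ~ users)
    coef(fit)   # intercept 0, slope 25: revenue = 25 * users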

The names of the columns are total_num_friends, total_new_friends_this_week, num_visits, time_spent, number_apps_downloaded, number_ads_shown, gender, age, and so on.

Linear Line Equation: y = β₀ + β₁x. How do we find β₀ and β₁? By fitting the model.

Fitting the model

To find this line, you'll define the "residual sum of squares" (RSS), denoted RSS(β), to be: RSS(β) = Σᵢ (yᵢ − (β₀ + β₁xᵢ))², the sum of the squared vertical distances between the observed points and the line. The least-squares estimates are the values of β₀ and β₁ that minimize RSS(β).
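
A direct R transcription of this objective and its closed-form minimizers; the function names rss, b0_hat, and b1_hat are illustrative:

    # RSS as a function of the coefficients, for observed vectors x and y
    rss <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

    # The least-squares estimates that minimize RSS, in closed form (what lm computes)
    b1_hat <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0_hat <- function(x, y) mean(y) - b1_hat(x, y) * mean(x)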

Fitting a linear model in R: model <- lm(y ~ x)

Extending beyond least squares. We have a simple linear regression model that uses least-squares estimation to estimate the βs. This model can be extended in three primary ways: 1. adding in modeling assumptions about the errors, 2. adding in more predictors, 3. transforming the predictors.

Adding in modeling assumptions about the errors. If you use your model to predict y for a given value of x, your prediction is deterministic: y = β₀ + β₁x doesn't capture the variability in the observed data. To capture this variability in your model, you extend the model to: y = β₀ + β₁x + ϵ

The error term ϵ represents the actual error: the difference between the observations and the true regression line, which you'll never know and can only estimate with your fitted line. The noise is assumed to be normally distributed, which is denoted: ϵ ~ N(0, σ²)

The conditional distribution of y given x is then y | x ~ N(β₀ + β₁x, σ²). You need to estimate the parameters β₀, β₁, and σ² (the variance) from the data. You estimate the variance σ² of ϵ as the mean squared error: σ̂² = Σᵢ (yᵢ − ŷᵢ)² / (n − 2)

Evaluation metrics: R-squared and p-values. To see the p-values in R's regression output, look at the Pr(>|t|) column.
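
In R these quantities can be read off the fitted model like so (the simulated x and y are illustrative):

    x <- runif(50)
    y <- 1 + 2 * x + rnorm(50)         # illustrative data

    fit <- lm(y ~ x)
    summary(fit)                       # prints R-squared and a Pr(>|t|) column

    summary(fit)$r.squared             # R-squared on its own
    coef(summary(fit))[, "Pr(>|t|)"]   # the p-values of the coefficients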

Cross-validation
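
The slide gives only the heading; a minimal 5-fold cross-validation sketch in R, with illustrative simulated data, would look like this:

    set.seed(3)
    d <- data.frame(x = runif(100))
    d$y <- 2 + 3 * d$x + rnorm(100)
    folds <- sample(rep(1:5, length.out = nrow(d)))   # randomly assign each row to a fold

    cv_rmse <- sapply(1:5, function(k) {
      fit <- lm(y ~ x, data = d[folds != k, ])        # train on four folds
      held_out <- d[folds == k, ]
      sqrt(mean((held_out$y - predict(fit, held_out))^2))  # error on the held-out fold
    })
    mean(cv_rmse)   # average held-out error across the folds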

Other models for error terms: adding other predictors. y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ϵ. In R: model <- lm(y ~ x_1 + x_2 + x_3)

Transformations: a polynomial relationship. y = β₀ + β₁x + β₂x² + β₃x³
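
In R the same polynomial is fit by transforming the single predictor; the simulated x and y are illustrative:

    x <- runif(50)
    y <- 1 - x + 4 * x^3 + rnorm(50, sd = 0.1)  # illustrative cubic data

    model  <- lm(y ~ x + I(x^2) + I(x^3))       # explicit powers of x
    model2 <- lm(y ~ poly(x, 3))                # equivalent fit via an orthogonal basis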

k-NN

The intuition behind k-NN is to consider the most similar other items, defined in terms of their attributes, look at their labels, and give the unassigned item the majority vote. If there's a tie, you randomly select among the labels that have tied for first.

To automate it, two decisions must be made. First, how do you define similarity or closeness? Once you define it, for a given unrated item you can say how similar all the labeled items are to it, and you can take the most similar items and call them neighbors, each of whom gets a "vote." Second, how many neighbors should you look at, or "let vote"? This value is k.

Overview of the process: 1. Decide on your similarity or distance metric. 2. Split the original labeled dataset into training and test data. 3. Pick an evaluation metric (misclassification rate is a good one; we'll explain this more in a bit). 4. Run k-NN a few times, changing k and checking the evaluation measure. 5. Optimize k by picking the one with the best evaluation measure. 6. Once you've chosen k, use the same training set and create a new test set with the people's ages and incomes that you have no labels for, and predict their labels. A sketch of this workflow follows.
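
An end-to-end sketch of this workflow in R using knn() from the class package; the age/income data and the rule generating the labels are simulated for illustration:

    library(class)   # provides knn()

    set.seed(4)
    d <- data.frame(age = runif(200, 18, 70), income = rlnorm(200, 10, 0.5))
    d$label <- factor(ifelse(d$age + rnorm(200, sd = 5) > 40, "yes", "no"))

    # Put the attributes on a common scale so Euclidean distance is meaningful
    d[, c("age", "income")] <- scale(d[, c("age", "income")])

    idx   <- sample(nrow(d), 150)   # split into training and test data
    train <- d[idx, ]
    test  <- d[-idx, ]

    pred <- knn(train[, 1:2], test[, 1:2], cl = train$label, k = 5)
    mean(pred != test$label)        # misclassification rate, the evaluation metric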

Similarity or distance metrics: Euclidean distance, cosine similarity, Jaccard distance or similarity, Mahalanobis distance, Hamming distance, Manhattan distance.

Training and test sets: split the labeled data into a training set used to fit the model and a test set used to evaluate it.

Pick an evaluation metric. Sensitivity (true positive rate, or recall) is defined as the probability of correctly diagnosing an ill patient as ill. Specificity (true negative rate) is defined as the probability of correctly diagnosing a well patient as well.
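
A small R sketch computing both from a confusion matrix of predicted versus actual labels; the six example labels are illustrative:

    actual    <- factor(c("ill", "ill", "well", "well", "ill", "well"))
    predicted <- factor(c("ill", "well", "well", "well", "ill", "ill"))
    tab <- table(predicted, actual)

    sensitivity <- tab["ill", "ill"] / sum(tab[, "ill"])     # P(predicted ill | truly ill)
    specificity <- tab["well", "well"] / sum(tab[, "well"])  # P(predicted well | truly well)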

Choosing k: run k-NN a few times, changing k, and checking the evaluation metric each time.
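
Continuing the k-NN sketch above (train, test, and knn() as defined there), the search over k is a short loop; odd values of k avoid ties in a binary vote:

    ks <- seq(1, 21, by = 2)
    err <- sapply(ks, function(k) {
      pred <- knn(train[, 1:2], test[, 1:2], cl = train$label, k = k)
      mean(pred != test$label)   # evaluation metric for this k
    })
    ks[which.min(err)]           # the k with the best evaluation measure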

k-Nearest Neighbors / k-NN (supervised). Used when you have many objects that are classified into categories but also some unclassified objects (e.g., movie ratings).

k-Nearest Neighbors / k-NN (supervised). 1. Pick a k value (usually a low odd number, but the choice is up to you). 2. Find the k closest points to the unclassified point (using one of the distance measures above). 3. Assign the new point to the class where the majority of the closest points lie. 4. Run the algorithm again and again using different values of k.

k-means (unsupervised). The goal is to segment data into clusters or strata, which is important for marketing research where you need to determine your sample space. Assumptions: labels are not known, and you pick k (more of an art than a science).

k-means (unsupervised). 1. Randomly pick k centroids (centers of data) and place them near "clusters" of data. 2. Assign each data point to a centroid. 3. Move the centroids to the average location of the data points assigned to them. 4. Repeat the previous two steps until the data point assignments don't change.
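
A minimal sketch with base R's kmeans(), on two simulated clusters (all numbers illustrative); kmeans() repeats exactly these assign-and-recenter steps until the assignments stabilize:

    set.seed(5)
    pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                 matrix(rnorm(100, mean = 4), ncol = 2))   # two loose clusters

    km <- kmeans(pts, centers = 2)   # you pick k; here k = 2
    km$centers                       # final centroid locations
    table(km$cluster)                # how many points landed in each cluster
    plot(pts, col = km$cluster)      # visualize the segmentation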