A non-technical introduction to Data Science. Looking at Statistics, Computer Science, and Domain Knowledge.
Language: en
Added: Oct 09, 2024
Slides: 110 pages
Slide Content
Data Science Damian Gordon
Contents Data Science Overview and Definitions History and Key Milestones Main Application Areas
Data Science Data science is… “the science of Data” (!) Cao, L. (2017) Data science: a comprehensive overview. ACM Computing Surveys (CSUR), 50(3), pp.1-42.
Data Science What is Data? It’s a set of facts and figures
Data Science OK, so what is Data Science? Extracting insight and information from data sets to make better decisions. Kelleher, J.D., Tierney, B. (2018) Data Science. MIT Press.
Data Science There is a (possibly apocryphal) story that is often used to illustrate data mining, and it’s called the “Beers and Nappies” story.
Data Science The story goes that a large American supermarket (usually it’s Walmart) was exploring the sales data from its cash registers. The data is stored as one customer’s purchase after another, but when the supermarket mined the dataset, they looked at each product to see if it was commonly associated with any other products.
Data Science They found an unexpected pattern between the purchase of beers and the purchase of nappies. The supermarket started to place those two products right beside each other on the supermarket floor, and they made lots of money.
Data Science The explanation for the association between the products could not be deduced from the dataset, but the cashiers explained that if a couple with a baby has one partner at home minding the baby and one going to work, the partner who is going to work will pop into the supermarket after work to buy some nappies, and will decide to get themselves some beers as well ;-)
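To make the idea of "mining" the sales data a little more concrete, here is a minimal sketch in Python. It assumes each customer's purchase is stored as a set of product names and simply counts how often each pair of products appears in the same basket. The baskets are invented for illustration and are not real supermarket data; real association mining uses more refined measures than a raw count.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets for illustration only; not real supermarket data.
baskets = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"milk", "bread"},
    {"beer", "nappies", "milk"},
    {"bread", "crisps"},
]


def count_pairs(baskets):
    """Count how many baskets contain each pair of products."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a consistent order, e.g. ("beer", "nappies")
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_pairs(baskets)
most_common_pair, times_seen = pair_counts.most_common(1)[0]
print(most_common_pair, times_seen)
```

In this toy dataset the pair bought together most often is beer and nappies, which is the kind of pattern the story describes.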
Data Science We normally create visualizations of the products and their relationships, e.g.
Data Science We have lots and lots of data, more every year:
Data Science Data Science is a combination of: Statistics Computer Science Domain Knowledge
Data Science Statistics Computer Science Domain Knowledge
Statistics
Data Science: Statistics Statistics is all about collecting data, analyzing it, modelling it, and making predictions.
Data Science: Statistics What is a model?
Data Science: Statistics What is a model? OK, let’s try an easier question…
Data Science: Statistics How does a model train differ from a real train?
Data Science: Statistics Smaller size Simpler features Different materials Lower cost Different Power supply
Data Science: Statistics How is a model train similar to a real train?
Data Science: Statistics Overall appearance Similar movement Similar sounds Design elements Good approximation
Data Science: Statistics So a good model is really an approximation of something else. It’s not supposed to be a perfect representation.
Data Science: Statistics The map of the London Underground is a good example of this: it is not a geographically accurate map.
Data Science: Statistics Tube Map Real Locations
Data Science: Statistics Proposed Tube Map Real Locations
Data Science: Statistics So, statistics is about exploring a collection of data (a “dataset”) by creating different models of the data to represent the key features.
Data Science: Statistics The models can be represented by a formula:
Timeline of Data Science (Data Collection)
1100 BCE National Census in Egypt (Reported)
2 AD National Census in China (Preserved)
297 AD Roman Empire begins a 15-year cycle of Censuses
640 AD Second Rashidun caliph Umar begins a cycle of Censuses
1086 AD “Domesday Book”: Census of England and Wales
Timeline of Data Science (Data Collection) From the British Museum, Papyrus BM 10068, which may record the distribution of settlements in late New Kingdom Thebes (~1100 BCE).
Timeline of Data Science (Statistics)
5th century BCE Athenians estimate the height of the wall of Plataea (Mode)
1179 AD “Trial of the Pyx”: Sampling of coins from the Royal Mint (Sampling)
1364 AD “Nuova Cronica”: Year-by-year history of Florence (Forecasting)
1580s AD Tycho Brahe: Uses the mean to estimate the location of stars (Mean)
1599 AD Edward Wright: Uses the median to help in naval navigation (Median)
Data Science: Statistics Statistics is often considered to be part of Maths, but it really is more like Modelling than it is like Maths. Generally, in other forms of Maths there is one correct answer you are trying to calculate, and if you use the right technique, you will get the right answer.
Data Science: Statistics But in Statistics you are not looking for the one “right answer”; instead you are creating different statistical models to represent different aspects of the dataset, and based on these models you can make predictions about what additional future values might be. Let’s look at an example…
Data Science: Statistics Real Data This is a collection of values that are represented as points on a graph.
Data Science: Statistics Real Data This is a simple model to represent those points as a line with a slope.
Data Science: Statistics Real Data Predicted Data Which is the best prediction for new data, based on the real data: A, B, or C ? A B C
Data Science: Statistics Any of the three predictions (A, B, and C) are possible in terms of new data, so there is no “right answer”, but based on the linear model we have created for the existing data, the line B looks like the most likely predictor of any new data.
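For readers who would like to see what "a line with a slope" looks like in practice, here is a minimal sketch in Python. It fits a straight line to a handful of points using ordinary least squares (one common way to choose the line, assumed here since the slides do not name a method) and then uses that line to predict a new value. The points are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Use the fitted line to estimate y for a new x."""
    return slope * x + intercept


# Invented "real data" points, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("prediction for x = 6:", round(predict(6, slope, intercept), 2))
```

The prediction is only as good as the choice of model: a different model fitted to the same points would give a different prediction.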
Data Science: Statistics Real Data Instead of a line, let’s use a Sine wave to model the data
Data Science: Statistics Real Data
Data Science: Statistics Real Data So, what will the predicted data look like now?
Data Science: Statistics Real Data Predicted Data
Data Science: Statistics Real Data Predicted Data … or maybe these?
Data Science: Statistics So, the key takeaway here is that we choose the model, and different models will more accurately represent aspects of the existing data, and these different models will likely give different predictions for future data.
Data Science: Statistics Florence Nightingale (1820-1910) was a pioneer of Data Science. Her contributions as a nurse sometimes overshadow her work as an amateur statistician, in which she analysed data on the deaths of soldiers in hospitals during the Crimean War (1853-1856) and identified the key causes. Her visualizations brought about important changes in sanitation in hospitals.
Francis Galton Florence Nightingale Karl Pearson William Gosset Gertrude Cox Egon Pearson Some Famous Statisticians
Computer Science
Data Science: Computer Science Computer Science is all about creating a series of instructions for a computer to perform a particular task. It also involves comparing how effective different sets of instructions are to perform a specific task.
Data Science: Computer Science The set of instructions is usually implemented as a computer program:
Data Science: Computer Science … and Computer Scientists (who look at the efficiency of different approaches) model these computer programs as ALGORITHMS, which are not just computer programs, but any general set of step-by-step instructions (that terminates) to perform any type of task.
Data Science: Computer Science Examples of Algorithms: Musical Scores, Knitting Patterns, Recipes, Computer programs
Data Science: Computer Science And then it’s easy to compare algorithms:
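As one possible illustration of comparing algorithms, the sketch below counts how many items two different search algorithms have to look at before finding the same number in a sorted list. Linear search and binary search are chosen here purely as a familiar example; the slides do not say which algorithms were compared.

```python
def linear_search_steps(items, target):
    """Check items one by one from the start; return how many were checked."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Repeatedly halve a sorted list; return how many items were checked."""
    steps = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1   # target must be in the upper half
        else:
            high = mid - 1  # target must be in the lower half
    return steps


numbers = list(range(1, 1001))  # the sorted numbers 1 to 1000
print("linear search steps:", linear_search_steps(numbers, 1000))
print("binary search steps:", binary_search_steps(numbers, 1000))
```

Both algorithms find the same answer, but binary search needs far fewer steps on a sorted list, which is the kind of difference Computer Scientists care about.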
Data Science: Computer Science Top-Down Design (also known as stepwise design ) is breaking down a problem into steps. In Top-down Design an overview of the problem is described first, specifying but not detailing any first-level sub-steps. Each sub-step is then refined in yet greater detail, sometimes in many additional sub-steps, until the entire specification is reduced to basic elements.
Data Science: Computer Science e.g. Making a cup of tea…
Data Science: Computer Science 1. Organise everything together 2. Plug in kettle 3. Put teabag in cup 4. Put water into kettle 5. Turn on kettle 6. Wait for kettle to boil 7. Add boiling water to cup 8. Remove teabag with spoon/fork 9. Add milk and/or sugar 10. Serve
Data Science: Computer Science Step-wise refinement of step 1 (Organise everything together) 1.1 Get a cup 1.2 Get tea bags 1.3 Get sugar 1.4 Get milk 1.5 Get spoon/fork.
Data Science: Computer Science Step-wise refinement of step 2 (Plug in kettle) 2.1 Locate plug of kettle 2.2 Insert plug into electrical outlet
Data Science: Computer Science Step-wise refinement of step 3 (Put teabag in cup) 3.1 Take teabag from box 3.2 Put it into cup
Data Science: Computer Science Step-wise refinement of step 4 (Put water into kettle) 4.1 Bring kettle to tap 4.2 Put kettle under tap 4.3 Turn on tap 4.4 Wait for kettle to be full 4.5 Turn off tap
Data Science: Computer Science Step-wise refinement of step 5 (Turn on kettle) 5.1 Depress switch on kettle
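The tea example can also be sketched as a small Python program, where each top-level step becomes its own function and the refinements become the contents of that function. This is only one way to express Top-Down Design in code; the function names are made up for this illustration.

```python
def organise_everything():
    return ["Get a cup", "Get tea bags", "Get sugar", "Get milk", "Get spoon/fork"]


def plug_in_kettle():
    return ["Locate plug of kettle", "Insert plug into electrical outlet"]


def put_teabag_in_cup():
    return ["Take teabag from box", "Put it into cup"]


def put_water_into_kettle():
    return [
        "Bring kettle to tap",
        "Put kettle under tap",
        "Turn on tap",
        "Wait for kettle to be full",
        "Turn off tap",
    ]


def turn_on_kettle():
    return ["Depress switch on kettle"]


def make_tea():
    """The top-level overview: each line is one of the high-level steps."""
    actions = []
    for step in (
        organise_everything,
        plug_in_kettle,
        put_teabag_in_cup,
        put_water_into_kettle,
        turn_on_kettle,
    ):
        actions.extend(step())
    # Steps 6 to 10 have not been refined yet, so they stay as single actions
    actions += [
        "Wait for kettle to boil",
        "Add boiling water to cup",
        "Remove teabag with spoon/fork",
        "Add milk and/or sugar",
        "Serve",
    ]
    return actions


for number, action in enumerate(make_tea(), start=1):
    print(number, action)
```

Reading `make_tea` gives the overview, and reading each smaller function gives the detail, which mirrors how the design was worked out.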
Data Science: Computer Science Over to you…
Timeline of Data Science (Computer Science)
1843 AD Ada Lovelace: Publishes the first algorithm, for computing Bernoulli numbers
1936 AD Alan Turing: Publishes “On Computable Numbers”
1945 AD John von Neumann: Proposes the Stored-Program Architecture
1952 AD Grace Hopper: Writes the first compiler
1968 AD Donald Knuth: Publishes Volume 1 of “The Art of Computer Programming”
Data Science: Computer Science In the context of Data Science, we know that using statistics we can analyse the data sets, but computer programs permit us to do a lot of other exciting things with the data sets. One of those things is Data Cleaning…
Data Science: Computer Science DATA CLEANING (or Data Cleansing) is fixing or removing data that is incorrect (in some way) from the dataset.
Data Science: Computer Science Let’s imagine one of the columns of the dataset is a date, but different rows have different formats, e.g. 12-3-1992, 06/11/1946, 23rd November 2022
Data Science: Computer Science We can write a computer program to reformat all of these dates into one common format, e.g. DD-MM-YYYY. This is called Data Transformation.
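Here is a minimal sketch of such a program in Python. It assumes that in every input the day comes before the month (so 12-3-1992 is read as 12 March 1992) and that only the three formats from the example need to be handled; the function name is made up for this illustration.

```python
import re
from datetime import datetime

# The input formats we expect: 12-3-1992, 06/11/1946, 23 November 2022
KNOWN_FORMATS = ["%d-%m-%Y", "%d/%m/%Y", "%d %B %Y"]


def to_common_format(raw_date):
    """Convert a date written in a known format to DD-MM-YYYY."""
    # Drop "st", "nd", "rd", "th" after the day number, e.g. "23rd" -> "23"
    cleaned = re.sub(r"(\d+)\s*(st|nd|rd|th)\b", r"\1", raw_date.strip())
    for date_format in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, date_format).strftime("%d-%m-%Y")
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"Unrecognised date format: {raw_date!r}")


for raw in ["12-3-1992", "06/11/1946", "23rd November 2022"]:
    print(raw, "->", to_common_format(raw))
```

Anything that does not match one of the known formats raises an error rather than being guessed at, so unusual values can be checked by a person.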
Data Science: Computer Science Another issue might be that some of the rows of data are recorded multiple times. So we can write a program to scan for this kind of duplication. This is called Duplicate Elimination.
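A minimal sketch of duplicate elimination in Python might look like the following. It assumes each row is stored as a tuple of values and that two rows count as duplicates only when every value matches exactly; the rows are invented for illustration.

```python
def remove_duplicates(rows):
    """Keep the first copy of each row and drop any later exact copies."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)
    return unique_rows


# Invented rows for illustration: (name, city)
rows = [
    ("Ann", "Dublin"),
    ("Bob", "Cork"),
    ("Ann", "Dublin"),  # recorded a second time
]

print(remove_duplicates(rows))
```

In real datasets duplicates are often not exact copies (for example "Ann" and "ann"), so deciding what counts as the same row is part of the cleaning work.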
Data Science: Computer Science One more issue to mention is that if a column has text in it, we can write programs to check if the text is suitable. This is called Parsing .
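As a small illustration of checking whether text is suitable, the Python sketch below tests whether each value in a hypothetical "email" column at least looks like an email address. The rule used is deliberately loose and is an assumption made for this example, not a full definition of a valid email address.

```python
import re

# A deliberately loose rule: something, then "@", then something with a dot in it
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(text):
    """Return True if the text has the rough shape of an email address."""
    return EMAIL_PATTERN.match(text.strip()) is not None


# Invented column values for illustration
column = ["ann@example.com", "bob at example.com", "carol@example"]

for value in column:
    print(value, "->", "ok" if looks_like_email(value) else "needs checking")
```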
Data Science: Computer Science Another area the computer programs can help us with is in creating graphs to show trends in the data. This is called Data Visualisation .
Data Science: Computer Science These are just a few simple ways to use Computer Science in Data Science. There are many more, and much more sophisticated, ways that it can be used.
Alan Turing Ada Lovelace Donald Knuth John von Neumann Grace Hopper Edsger Dijkstra Some Famous Computer Scientists
Domain Knowledge
Data Science: Domain Knowledge Domain Knowledge simply means that if we are building a Data Science system for a particular group of people or for an organisation, we need to understand their domain, in other words, their models, their terminology, and their requirements.
Data Science: Domain Knowledge For example, if we were building a system for economic analysis, we would need to know the key models, such as: Cobb–Douglas model of production. Solow–Swan model of economic growth. Heckscher–Ohlin model of international trade. Ramsey–Cass–Koopmans model of economic growth. Gordon–Loeb model for cyber security investments.
Data Science: Domain Knowledge … And what terms mean, such as: Manufacturing Industry Output Price Index, Agricultural Output Price Index, Industrial Production Index, Retail Sales Value Index, Consumer Price Index, Trade Surplus, GDP and GNP
Data Science: Domain Knowledge So, it is very important to recognize that all domains have their own models and terms, and we need to know them to develop data science systems for them.
Data Science: Domain Knowledge It is also worth mentioning that some systems demand much greater attention than others, so working on developing a tool to help an online shopping website is important, but creating a tool to help doctors do medical diagnosis, or to help nuclear technicians control a nuclear power plant, should require a greater effort to understand the domain in a lot more detail.
Data Science: Domain Knowledge Domain knowledge is essential at all stages of the data science process, including: Understanding what the system is supposed to do Understanding what data needs to be collected Understanding what are the most important features of the data Understanding how the data should be modelled Understanding how to interpret the results generated by the system
Data Science: Domain Knowledge Some reasons why Domain Knowledge is important:
CONTEXT As we have discussed, if we don’t understand the context, we don’t know which data is most important.
QUALITY In every dataset, there is some data that is wrong (anomalies, outliers, and biases), and a domain expert knows how to spot it.
FEATURES In each dataset, there are variables (or features) that can be used, but which ones are the most important depends on the questions being asked.
MODELS Different models can be applied to a problem; knowing which one is best depends on combining the domain expert’s knowledge with our own.
Data Science
Data Science As we’ve said, Data Science is all about extracting insight and information from data sets to make better decisions. Kelleher, J.D., Tierney, B. (2018) Data Science. MIT Press.
Data Science In Data Science, one of the key formal processes (methodologies) that businesses follow is called CRISP-DM (CRoss Industry Standard Process for Data Mining), and it provides organisations with a step-by-step guide to using Data Science in businesses.
Timeline of Data Science
1962 AD John Tukey: Publishes “The Future of Data Analysis”
1974 AD Peter Naur: Introduces the term “Data Science”
1977 AD Founding of International Association for Statistical Computing
2001 AD William S. Cleveland: Publishes “Data Science: An Action Plan”
2006 AD Hadoop: First release of the open-source “Big Data” framework
Data Science ACTIVITY Interview the person beside you in class: ask them their name, their month of birth, and their favorite food and drink. We’ll write the months of birth, and the foods and drinks, on the board, and see if there are any patterns.
Data Science: Bias https://www.linkedin.com/pulse/detecting-data-distortions-three-types-biases-every-manager-eppler/
Data Science: Survivor Bias Let’s take one, Survivor bias (also known as both Survivorship bias and Survival bias). It is a type of sample selection bias that occurs when an individual mistakes a visible successful subgroup for the entire group. Let’s look at an example:
Data Science: Survivor Bias During WW II, Abraham Wald, a member of the Statistical Research Group (SRG) at Columbia University, was asked to examine the distribution of damage to aircraft returning after flying missions, to provide advice on how to minimize bomber losses to enemy fire.
Data Science: Survivor Bias Based on damage patterns to aircraft returning after flying missions, it had been suggested that the key areas to reinforce were the ones where the bullet damage was mainly observed. However, this would be a result of survivor bias because crucial data from fatally damaged planes was being ignored; those hit in other places did not survive.
Data Science: Survivor Bias Wald pointed out that the bullet holes in the returning aircraft represented areas where a bomber could take damage and still fly well enough to return safely to base. Therefore, he proposed that the Navy reinforce areas where the returning aircraft were unscathed.
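Wald's reasoning can be illustrated with a small simulation in Python. Everything in it is invented for illustration: the four sections of the plane, the assumption that each plane takes exactly one hit, and the chance that a hit in each section brings the plane down. It is not based on the real wartime data.

```python
import random

SECTIONS = ["wings", "fuselage", "tail", "engine"]
# Invented chance that a single hit in each section brings the plane down
LOSS_CHANCE = {"wings": 0.05, "fuselage": 0.05, "tail": 0.10, "engine": 0.80}


def simulate(missions, seed=0):
    """Return hit counts for all planes and for the planes that came back."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    all_hits = {section: 0 for section in SECTIONS}
    returned_hits = {section: 0 for section in SECTIONS}
    for _ in range(missions):
        section = rng.choice(SECTIONS)  # hits land evenly across the plane
        all_hits[section] += 1
        survived = rng.random() >= LOSS_CHANCE[section]
        if survived:
            returned_hits[section] += 1  # only these planes can be inspected
    return all_hits, returned_hits


all_hits, returned_hits = simulate(10_000)
for section in SECTIONS:
    print(section, "hits on all planes:", all_hits[section],
          "| hits seen on returning planes:", returned_hits[section])
```

Although hits are spread evenly across all planes, the returning planes show far fewer engine hits, because most planes hit there never came back. Looking only at the survivors would suggest the engine is the area least in need of reinforcement, which is the opposite of the truth.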
Data Science: Some Software Tools Python (with Pandas, NumPy, Scikit-learn, Matplotlib) TensorFlow Hadoop R Programming Language WEKA Tableau
Data Science: Main Application Areas
Data Science: Main Application Areas HEALTHCARE Data Science has been very beneficial to the healthcare sector where it can be used to model and predict the trajectory of a disease, and potentially intercept the onset of a disease (at a molecular level). It can also help doctors develop treatment plans for patients.
Data Science: Main Application Areas FINANCE Data Science is used in numerous applications in the financial sector, including fraud detection, where large datasets may show hidden correlations between user behavior and a likelihood of fraudulent actions, to help predict things like credit card fraud.
Data Science: Main Application Areas TRANSPORTATION By harnessing the potential of big data, transportation planners can tackle issues such as managing traffic congestion, optimizing public transportation routes, and reducing carbon emissions, by making more informed decisions.
Data Science: Main Application Areas MARKETING Marketers require a thorough understanding of their customers, including their wants, needs, challenges, and pain points. Data science provides the tools and techniques needed to collect, analyze and interpret customer behavior.
Data Science: Main Application Areas ACTIVITY There are four more example topics left: Energy Consumption, Sports, Genetics, Manufacturing. Pick any one of these four, and have a Google to see how Data Science is used in that field.
Claude Shannon Ronald Fisher John Tukey Leslie Kaelbling William S. Cleveland Corinna Cortes Some Famous Data Scientists