Softwares & Tools for Data Analytics Lecture 6: Introduction to Data Visualisation Myint Moe Chit
2 Lecture Outline
- Goals of visualisation
- Usefulness of visualisation
- Basic data visualisation with R and Python
- Visualising categorical variables
- Visualising numerical variables
- Visualising the relationship between variables
3 Visualisation in Data Analytics
"We cannot expect a small number of numerical values [summary statistics] to consistently convey the wealth of information that exists in data. Numerical reduction methods do not retain the information in the data." (William Cleveland, The Elements of Graphing Data)
"The simple graph has brought more information to the data analyst's mind than any other device." (John Tukey)
The use of graphics to examine data is called visualisation.
4 Visualisation in Data Analytics
An important step in the data science methodology is obtaining a visual representation of the data. This has multiple advantages:
- We are better at extracting information from visual cues, so a visual representation is usually more intuitive than a textual representation.
- A visualisation provides a concise snapshot and summarisation of the data.
The goal of data visualisation is to convey a story to the viewer. This story could be in the form of general trends about the data or an insight.
5 A picture is worth a thousand words This visualisation summarises the relationship between BMI and pulse and corresponding health status. What do you discover? “The greatest value of a picture is when it forces us to notice what we never expected to see.”
6 What Makes a Good Visualisation? The McCandless Method
Four elements are needed for a successful data visualisation:
- Information: the data you are working with must be accurate
- Story: a clear, compelling, interesting, and relevant concept
- Goal: a specific objective or function for the visual
- Visual form: an effective use of metaphor or visual expression
Source: https://www.informationisbeautiful.net/visualizations/what-makes-a-good-data-visualization/
7 Data Visualisation with R
R includes a basic graphing function, plot(), but we will use the ggplot2 package.
> install.packages("ggplot2")
> library(ggplot2)
> ggplot(data = dta, aes(sex)) + geom_bar(fill = "blue")
The three main components of a ggplot command are:
- Data: dta is the data set being summarised. We refer to this as the data component.
- Aesthetic mapping: the plot uses visual cues to represent the information in the data set. aes(sex) maps the sex variable from the data set to the x-axis. We refer to this as the aesthetic mapping component.
- Geometry: geom_bar indicates that the plot is a bar graph. We refer to this as the geometry component.
To use ggplot2 you will have to learn several functions and arguments. These are hard to memorise, so we highly recommend keeping the ggplot2 cheat sheet handy.
8 Summarising a continuous numerical variable
The first step in summarising a continuous numerical variable is to identify its distribution using a histogram or boxplot. A histogram reveals the overall shape of the distribution by showing the frequency of observations in each group (bin) of values. Suppose we want to visualise the distribution of the weight of the respondents in our sample. Using a bar graph as shown below has no explanatory power because weight is continuous: almost every distinct value gets a bar of its own.
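The difference between the two charts can be sketched in plain Python. This is only an illustration: the weights are invented, and histogram_counts is a hypothetical helper that uses equal-width bins, which is one common binning rule and not necessarily the one a plotting library applies.

```python
from collections import Counter

# Hypothetical weights in kg, invented for illustration only.
weights = [52.3, 58.1, 61.4, 61.9, 64.0, 66.7, 68.2, 70.5, 71.1, 73.8, 77.4, 84.6]

# A bar graph draws one bar per distinct value, so here every bar has height 1.
bar_heights = Counter(weights)


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


hist = histogram_counts(weights, n_bins=4)
print(len(bar_heights), "bars with heights", set(bar_heights.values()))
print("histogram counts per bin:", hist)
```

Grouping the values into bins is what makes the shape of the distribution visible; the bar heights on their own say nothing about where the observations concentrate.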
9 Data Visualisation with R
To summarise a numerical variable with a histogram:
> ggplot(dta, aes(x = height)) + geom_histogram(bins = 10, fill = "blue")
We can add more arguments, for example to overlay a density curve:
> ggplot(dta, aes(x = height, y = ..density..)) + geom_histogram(bins = 10, fill = "blue") + geom_density(color = "red", size = 1.2)
(In ggplot2 3.4.0 and later, after_stat(density) and linewidth are the preferred replacements for ..density.. and size.)
Using a boxplot to summarise a numerical variable:
> ggplot(dta, aes(weight)) + geom_boxplot() + coord_flip()
Note: For a variable with a right-skewed distribution and non-negative values (such as income or number of employees), we may need to use a logarithmic scale for the histogram or boxplot.
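To make the boxplot concrete, the following sketch computes the numbers a boxplot typically displays: the quartiles, the whisker ends and any outliers under the common 1.5 x IQR rule. The data are invented, boxplot_summary is a hypothetical helper, and plotting libraries may compute the quartiles slightly differently from statistics.quantiles.

```python
import statistics

# Hypothetical weights in kg, invented for illustration only.
weights = [52, 58, 61, 62, 64, 67, 68, 70, 71, 74, 77, 110]


def boxplot_summary(values):
    """Return the quartiles, whisker ends and outliers for a boxplot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are treated as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": outliers,
    }


summary = boxplot_summary(weights)
print(summary)
```

A single large value such as 110 appears as a separate point beyond the upper whisker, which is the kind of right-skew that may call for a logarithmic scale.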
10 Data Visualisation with R
Use a stacked or clustered bar graph to visualise the association between two categorical variables.
> ggplot(dta, aes(sex, fill = status)) + geom_bar(position = "stack")
> ggplot(dta, aes(sex, fill = status)) + geom_bar(position = "dodge")
> ggplot(dta, aes(sex, fill = status)) + geom_bar(position = "fill")
Here "stack" stacks the counts, "dodge" places the bars side by side (clustered), and "fill" stacks the bars and rescales each one to a height of 1 so that proportions can be compared.
Use multiple boxplots to visualise the association between a categorical and a numerical variable.
> ggplot(dta, aes(status, bmi)) + geom_boxplot() + coord_flip()
Use a scatterplot to visualise the association between two numerical variables. stat_smooth(method = "lm") adds a fitted linear trend line.
> ggplot(dta, aes(bmi, pulse)) + geom_point() + stat_smooth(method = "lm")
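The linear trend line drawn over a scatterplot is an ordinary least-squares fit. A minimal sketch of that calculation in plain Python follows; the BMI and pulse values are invented, and least_squares_line is a hypothetical helper, not part of any plotting package.

```python
import statistics

# Hypothetical BMI and pulse values, invented for illustration only.
bmi = [18.5, 21.0, 23.4, 25.1, 27.8, 30.2, 33.0]
pulse = [62, 66, 65, 70, 74, 73, 80]


def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(bmi, pulse)
# Residuals are the vertical distances between the points and the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(bmi, pulse)]
print(f"pulse is approximately {intercept:.1f} + {slope:.2f} * bmi")
```

A positive slope corresponds to an upward-sloping trend line, that is, higher pulse at higher BMI in this made-up sample.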
11 Data Visualisation with Python
We can use the pandas, matplotlib and seaborn packages for visualisation in Python. In pandas, append the "plot" accessor, followed by a suitable graph type, to a Series or DataFrame. To plot a bar chart for a categorical variable, apply it to the category counts. Additional arguments, such as the colour of the bars or the font size, go inside the parentheses.
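A bar chart of a categorical variable in pandas might look like the sketch below. The data frame is invented to stand in for dta, and the colour and font size are arbitrary choices; the Agg backend line is only there so that the script runs without a display and can be dropped in a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; not needed in a notebook
import pandas as pd

# Hypothetical stand-in for the lecture's dta data frame.
dta = pd.DataFrame({"sex": ["F", "M", "F", "F", "M", "F"]})

# Count the observations in each category, then plot the counts as bars.
counts = dta["sex"].value_counts()
ax = counts.plot.bar(color="steelblue", fontsize=12)
ax.set_xlabel("sex")
ax.set_ylabel("count")
```

value_counts() returns the categories ordered from most to least frequent, so the bars appear in that order.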
12 Data Visualisation with Python
To summarise a numerical variable with a histogram, use plot.hist():
dta["weight"].plot.hist()
To add a density line, use the seaborn package:
import seaborn as sns
sns.histplot(dta["weight"], bins=12, color="k", kde=True)
Using a boxplot to summarise a numerical variable:
dta["weight"].plot.box()
dta.boxplot(column=["weight"])
13 Data Visualisation with Python
To visualise the relationship between two categorical variables, append plot.bar to the cross-tabulation of the variables. Pass stacked=True for a stacked bar graph, and normalize="index" to show row proportions instead of counts.
pd.crosstab(dta.sex, dta.status).plot.bar()
pd.crosstab(dta.sex, dta.status).plot.bar(stacked=True)
pd.crosstab(dta.sex, dta.status, normalize="index").plot.bar(stacked=True)
Use multiple boxplots to visualise the association between a categorical and a numerical variable.
dta.boxplot(column=["pulse"], by="exercise", showmeans=True)
Use seaborn's regplot to visualise the association between two numerical variables with a trend line.
sns.regplot(x="BMI", y="pulse", data=dta)
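What a cross-tabulation normalised by row contains can be sketched without pandas. The records and the status labels below are invented, and row_proportions is a hypothetical helper that mimics the idea of dividing each cell by its row total.

```python
from collections import Counter

# Hypothetical (sex, status) pairs, invented for illustration only.
records = [
    ("F", "healthy"), ("F", "healthy"), ("F", "at risk"), ("F", "healthy"),
    ("M", "healthy"), ("M", "at risk"), ("M", "at risk"), ("M", "at risk"),
]


def row_proportions(pairs):
    """Divide each cell count by its row total, so every row sums to 1."""
    cell_counts = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    return {
        (row, col): count / row_totals[row]
        for (row, col), count in cell_counts.items()
    }


proportions = row_proportions(records)
for (sex, status), share in sorted(proportions.items()):
    print(sex, status, share)
```

Plotting these shares as stacked bars gives bars of equal height, which makes the status mix of the two groups directly comparable even when the group sizes differ.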
14 Scatter plot with a categorical variable
We can also add a third dimension to a scatter plot by giving the dots a different colour for each group of a categorical variable. For example, we can assign different colours to observations with different health status.
sns.scatterplot(x="BMI", y="pulse", data=dta, hue="status")
ggplot(dta, aes(bmi, pulse, colour = status)) + geom_point(size = 3)
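The effect of colouring by group can be reproduced by hand with pandas and matplotlib, which may help show what the colour mapping does. This is a sketch on invented data: the column names follow the Python example above, the status labels are assumptions, and the Agg backend line is only there so that the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the lecture's dta data frame.
dta = pd.DataFrame({
    "BMI": [19.2, 22.5, 24.1, 27.3, 29.8, 32.4],
    "pulse": [64, 66, 70, 75, 78, 83],
    "status": ["healthy", "healthy", "healthy", "at risk", "at risk", "at risk"],
})

fig, ax = plt.subplots()
# One scatter call per group: each call gets the next colour in the default cycle.
for status, group in dta.groupby("status"):
    ax.scatter(group["BMI"], group["pulse"], label=status)
ax.set_xlabel("BMI")
ax.set_ylabel("pulse")
ax.legend(title="status")
```

Each group becomes its own set of points with its own legend entry, so the third variable is read from the colour instead of an axis.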
15 Summary of the lecture
In this lecture, we covered:
- the goals of visualisation
- the usefulness of visualisation
- how to visualise categorical variables
- how to visualise numerical variables
- how to visualise the relationship between variables