Chapter 1: Data Governance At the end of this chapter, students should be able to understand: What is data governance? What are the guidelines for the ethical handling of data? What is data privacy?
What is Data Governance? Data governance can be thought of as a collection of people, technologies, processes, and policies that protect and help to manage efficient use of data. Through data governance, we can ensure that the quality and security of the data used is maintained. Data Governance covers the following aspects: Data Quality, Data Security, Data Architecture, Data Integration and Interoperability, and Data Storage.
Ethical Guidelines Ethics can be said to be the moral principles that govern the behavior or actions of an individual or a group. To begin with, we must make sure that qualities such as integrity, honesty, objectivity, and nondiscrimination are always part of the high-level principles incorporated in all our processes. Software products and data are not always used for purposes that are good for society. Some of the guidelines include the following. Keep the data secure. Be as open and accountable as possible. Use technologies that cause the minimum intrusion.
Data Privacy Data privacy is the right of any individual to have control over how his or her personal information is collected and used. Data privacy covers the following aspects. How personal data is collected and stored by organizations. Whether and how personal data is shared with third parties. Government policies regarding the storage and sharing of personal information.
Data Privacy Data privacy is not just about secure data storage. There could be cases where personally identifiable information is collected and stored securely in an encrypted format, but without any consent from the users regarding the collection of the data itself. In such cases, there is a clear violation of data privacy rules.
Chapter 2: Exploratory Data Analysis This chapter introduces the concepts of exploring and cleaning data before performing machine learning operations on it. At the end of this chapter, students should be able to understand: Univariate Analysis, Multivariate Analysis, Data Cleaning Techniques
Univariate Analysis
Univariate Analysis Univariate analysis techniques
Univariate Analysis Use of statistical techniques in univariate analysis Some statistical methods for univariate analysis include looking at the mean, median, mode, range, variance, maximum, minimum, quartiles, and standard deviation.
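The statistics listed above can be computed for a single variable with Python's standard `statistics` module. The following is a minimal sketch; the `describe` helper and the `ages` sample are hypothetical names and values chosen only for illustration.

```python
import statistics

def describe(values):
    """Return common univariate summary statistics for a list of numbers."""
    ordered = sorted(values)
    # Three cut points that divide the data into four equal-sized groups.
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "minimum": ordered[0],
        "maximum": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "variance": statistics.variance(ordered),  # sample variance
        "std_dev": statistics.stdev(ordered),      # sample standard deviation
        "quartiles": (q1, q2, q3),
    }

ages = [21, 22, 22, 23, 25, 27, 30]  # hypothetical sample of one variable
summary = describe(ages)
for name, value in summary.items():
    print(name, value)
```

Note that `variance` and `stdev` treat the data as a sample; `pvariance` and `pstdev` would be the choice if the data covered the whole population.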
Univariate Analysis Use of graphical techniques in univariate analysis Some graphical methods for univariate analysis involve preparing frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
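A frequency distribution table is the starting point for most of these charts. As a rough sketch, it can be built with `collections.Counter` and printed with a simple text bar next to each row; the `frequency_table` helper and the `grades` data are hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, relative_frequency) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in sorted(counts.items())]

grades = ["B", "A", "C", "B", "B", "A"]  # hypothetical categorical variable
table = frequency_table(grades)
for value, count, share in table:
    # One '#' per observation gives a rough text bar chart.
    print(f"{value:>3} {count:>3} {share:6.1%} {'#' * count}")
```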
Examples of graphical methods for univariate analysis: Scatter plot, Boxplot, Histogram
Bivariate Analysis
Bivariate Analysis What are the different methods to perform bivariate analysis? Bivariate analysis is usually done by using graphical methods like scatter plots, line charts, and pair plots.
Multivariate Analysis
Multivariate Analysis What are the different methods to perform multivariate analysis? Different methods to perform multivariate analysis are Canonical Correlation Analysis, Cluster Analysis, Contour plots, and Principal Component Analysis.
Data Cleaning Data cleaning has the following steps: remove duplicate observations, remove irrelevant observations, remove unwanted outliers, fix data type issues, and handle missing data.
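The five cleaning steps can be sketched on a small list of records using only the standard library. Everything specific here is an assumption made for illustration: the field names, the rule that only "student" rows are relevant, the valid age range of 5 to 100, and the choice of the median to fill missing ages. The data type fix is applied before the outlier check because ages must be numeric before they can be compared.

```python
import statistics

def clean(records):
    """Apply the five cleaning steps to a list of dict records."""
    # Remove duplicate observations (keep the first occurrence).
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Remove irrelevant observations (assumption: only students matter).
    relevant = [r for r in unique if r["role"] == "student"]

    # Fix data type issues: ages arrive as text; "" marks a missing value.
    typed = [{**r, "age": int(r["age"]) if r["age"] != "" else None}
             for r in relevant]

    # Remove unwanted outliers (assumption: ages outside 5..100 are errors).
    in_range = [r for r in typed if r["age"] is None or 5 <= r["age"] <= 100]

    # Handle missing data by filling in the median of the known ages.
    known = [r["age"] for r in in_range if r["age"] is not None]
    fill = statistics.median(known)
    return [{**r, "age": fill if r["age"] is None else r["age"]}
            for r in in_range]

raw = [
    {"name": "Ana", "role": "student", "age": "15"},
    {"name": "Ana", "role": "student", "age": "15"},   # duplicate
    {"name": "Raj", "role": "teacher", "age": "41"},   # irrelevant
    {"name": "Li", "role": "student", "age": "160"},   # outlier
    {"name": "Sam", "role": "student", "age": ""},     # missing
    {"name": "Zoe", "role": "student", "age": "17"},
]
cleaned = clean(raw)
```

Whether to drop, cap, or keep an outlier, and whether to fill or drop missing values, depends on the data set; the choices above are only one option.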
Chapter 3: Classification Algorithms I This chapter aims at helping students understand the concept of classification with Decision Trees. At the end of this chapter, students should be able to understand: What is a Decision Tree? Applications of Decision Trees How to create a Decision Tree?
Introduction to Decision Trees A decision tree is a diagrammatic representation of the decision-making process and has a tree-like structure.
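To make the tree-like structure concrete, here is a small hand-built tree stored as nested dictionaries, with a function that walks from the root to a leaf. The features (`outlook`, `humidity`, `windy`) and the decisions are hypothetical; this sketch shows only how a finished tree is used, not how one is learned from data.

```python
# Internal nodes ask about one feature; leaves (plain strings) hold the decision.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "feature": "windy",
            "branches": {True: "stay in", False: "play"},
        },
    },
}

def decide(node, sample):
    """Follow the branch matching the sample at each node until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node

print(decide(tree, {"outlook": "sunny", "humidity": "high"}))
print(decide(tree, {"outlook": "rainy", "windy": False}))
```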
Applications of Decision Trees
Creating a Decision Tree To create a decision tree, follow the steps below
Chapter 4: Classification Algorithms II This chapter aims at helping students understand another important classification algorithm – K Nearest Neighbors. At the end of this chapter, students should be able to understand: What is K Nearest Neighbors? Pros and Cons of K-NN What is Cross Validation?
Introduction to K-Nearest Neighbors The K-NN algorithm works on the principle that similar things exist in close proximity to each other.
How K-NN algorithm works
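A minimal sketch of the usual K-NN classification procedure: measure the distance from the new point to every labeled point, keep the k closest, and take a majority vote. Euclidean distance, k = 3, and the `train` data are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled points forming two well-separated groups.
train = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 5.5), "blue"), ((6.5, 7.0), "blue"),
]
print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.2)))
```

Because every prediction compares the query with all stored points, the cost of a prediction grows with the size of the training set.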
Pros and Cons of using K-NN
Cross Validation Steps involved in cross validation are as follows.
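As an illustration, the sketch below follows the common k-fold form of cross validation: split the data into k folds, hold out one fold for testing while training on the rest, repeat for every fold, and average the scores. This is an assumed variant, and the helper names and the `evaluate` callback are hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))  # in practice, shuffle these first
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)

for train_idx, test_idx in k_fold_splits(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

Here `evaluate` stands for whatever trains a model on the training indices and returns its score on the test indices.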
Chapter 5: Regression Algorithms I This chapter aims to introduce the concepts of regression to the students. By the end of this chapter, students should be able to understand: What is Linear Regression? What is Mean Square Deviation? What is Mean Absolute Error?
Introduction to Linear Regression Linear regression models the relationship between a dependent variable y and an independent variable x, so that y can be estimated from given values of x.
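For a single variable x, the best-fit line y = slope * x + intercept can be found with the ordinary least squares formulas. The sketch below assumes that fitting method; the `fit_line` helper and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # hypothetical data lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # estimate y for a new x = 5
print(slope, intercept, prediction)
```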
Mean Absolute Error Mean Absolute Error measures the average magnitude of the errors in a set of predictions.
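Written out, Mean Absolute Error is the sum of the absolute differences between actual and predicted values, divided by the number of predictions. A minimal sketch with hypothetical values:

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [3, 5, 7, 9]
predicted = [2.5, 5, 8, 9]   # hypothetical model output
mae = mean_absolute_error(actual, predicted)
print(mae)  # (0.5 + 0 + 1 + 0) / 4 = 0.375
```

Because the errors are not squared, every error contributes in proportion to its size.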
Root Mean Square Deviation
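Root Mean Square Deviation is the square root of the average squared difference between actual and predicted values. A minimal sketch with hypothetical values:

```python
import math

def root_mean_square_deviation(actual, predicted):
    """Square root of the mean of the squared errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [3, 5, 7, 9]
predicted = [1, 5, 7, 9]     # hypothetical: one prediction is off by 2
rmsd = root_mean_square_deviation(actual, predicted)
print(rmsd)  # sqrt((4 + 0 + 0 + 0) / 4) = 1.0
```

Squaring before averaging means a single large error raises this measure more than several small errors of the same total size.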
Chapter 6: Regression Algorithms II This chapter aims to dive deeper into regression concepts by teaching students about regression with multiple variables. At the end of this chapter, students should be able to understand: Multiple Linear Regression, Non-linear Regression
Multiple Linear Regression Multiple Linear Regression uses multiple independent variables to predict the outcome of a dependent variable.
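With several independent variables the model becomes y = b0 + b1*x1 + ... + bp*xp. One way to find the coefficients is to solve the least-squares normal equations; the sketch below does this in plain Python with a small Gauss-Jordan solver. The helper names and the data are hypothetical, and in practice a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bp] that minimise the squared prediction error."""
    design = [[1.0] + list(r) for r in rows]           # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict(coeffs, row):
    """Evaluate b0 + b1*x1 + ... + bp*xp for one observation."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)]
ys = [6, 8, 9, 13, 14]
coeffs = fit_multiple(rows, ys)
print(coeffs, predict(coeffs, (4, 4)))
```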
Non-linear Regression The graph of a non-linear regression follows the equation of a curve rather than a straight line.
Chapter 7: Unsupervised Learning This chapter aims at helping students understand the concepts of Unsupervised Learning. At the end of this chapter, students should be able to understand: What is Unsupervised Learning? Applications of unsupervised learning What is Clustering? What is k-means Clustering?
Introduction to Unsupervised Learning
Real world applications of Unsupervised Learning
Introduction to Clustering
Introduction to Clustering As shown in the diagram above, the input for a clustering algorithm is the original raw data and the output is a well-clustered data set with three distinct clusters.
Introduction to Clustering There are many ways to perform clustering. Here are some of the main clustering methods.
K-Means Clustering What is K-Means clustering?
K-Means Clustering The k-means clustering algorithm works as follows:
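As an illustration, here is a minimal sketch of the standard k-means iteration (often called Lloyd's algorithm): assign every point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the centroids stop changing. The starting centroids are passed in so that the example is repeatable; the `k_means` helper and the data are hypothetical.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Return (centroids, clusters) after iterating until the centroids settle."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # no centroid moved, so we are done
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two groups, with k = 2 starting centroids.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so implementations usually choose them at random and may run the algorithm several times.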