Introduction to Machine Learning for Complete Beginners
pythonforengineers.com
Steps to machine learning
Gather Data -> Clean Data -> Visualise Data -> Prepare input for ML -> Machine Learning Algorithm -> ML model -> Test model on real data
Gathering Data
Depending on the use case, this might be the
hardest part!
Data may have to be scraped from websites, or
manually collected (by doing surveys, or taking
measurements in a lab).
Data may be spread over hundreds of files, in
haphazard formats
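A minimal sketch of pulling many files into one list of rows, assuming the data is a folder of CSV files that all have a header row. The function name `load_all_csv`, the `source_file` field and the two tiny sample files are made up for illustration; real data will usually need more care than this.

```python
import csv
import tempfile
from pathlib import Path


def load_all_csv(folder):
    """Read every .csv file in a folder into one list of dict rows."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Remember where each row came from; useful when cleaning later
                row["source_file"] = path.name
                rows.append(row)
    return rows


# Two made-up files standing in for the "hundreds of files"
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.csv").write_text("name,age\nAnn,31\n")
    Path(folder, "b.csv").write_text("name,age\nBob,45\nCy,27\n")
    all_rows = load_all_csv(folder)

print(len(all_rows), "rows loaded")
```

Note that every value comes back as text (the age is "31", not 31), which is one reason the cleaning step is needed.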
Clean the data
Even when you gather the data, it may not be
easily usable
Missing fields, data in different units (inches
vs centimetres)
I have seen the same file have dates in 3
different formats: dd-mm-yy, mm-dd-yy and
yy-mm-dd
The data has to be made consistent and clear
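A minimal sketch of making such data consistent, using only the standard library. It assumes we already know which date layout and which unit each row uses (the `layout` and `unit` fields are hypothetical); a date like 01-02-21 cannot be told apart between dd-mm-yy and mm-dd-yy from the text alone. The sample rows are made up.

```python
from datetime import datetime

# The three layouts from the slide, mapped to strptime patterns
DATE_FORMATS = {
    "dd-mm-yy": "%d-%m-%y",
    "mm-dd-yy": "%m-%d-%y",
    "yy-mm-dd": "%y-%m-%d",
}

INCH_TO_CM = 2.54


def normalise_date(text, layout):
    """Convert a date string in a known layout to ISO format (yyyy-mm-dd)."""
    return datetime.strptime(text, DATE_FORMATS[layout]).date().isoformat()


def to_cm(value, unit):
    """Convert a length to centimetres; unit is 'cm' or 'in'."""
    if unit == "cm":
        return float(value)
    if unit == "in":
        return float(value) * INCH_TO_CM
    raise ValueError(f"unknown unit: {unit}")


# Made-up rows: the same day written three ways, one missing height
rows = [
    {"date": "25-01-21", "layout": "dd-mm-yy", "height": "70", "unit": "in"},
    {"date": "01-25-21", "layout": "mm-dd-yy", "height": "177.8", "unit": "cm"},
    {"date": "21-01-25", "layout": "yy-mm-dd", "height": "", "unit": "cm"},
]

cleaned = []
for row in rows:
    # Keep missing fields as None rather than guessing a value
    height = to_cm(row["height"], row["unit"]) if row["height"] else None
    cleaned.append({
        "date": normalise_date(row["date"], row["layout"]),
        "height_cm": height,
    })

for row in cleaned:
    print(row)
```

What to do with the missing height (drop the row, fill in an average, leave it empty) depends on the use case; the sketch only marks it clearly.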
Visualise Data
You do NOT always need machine learning algorithms!
Sometimes, just visualising the data will reveal
insights
Made up example:
Why did account cancellation jump in January?
What did we change in the service in that time?
[Bar chart: "Cancellations of Accounts". x-axis: November, December, January, February. y-axis: Num Cancel, 0 to 10.]
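A rough sketch of the same idea in code, drawing a text-only bar chart so it runs without any plotting library. The monthly counts are invented purely for illustration (the slide's example is made up too), and for a real chart a plotting library such as matplotlib would be the usual choice.

```python
# Invented numbers, only to illustrate the made-up example
cancellations = {"November": 2, "December": 3, "January": 9, "February": 4}

# One row of '#' characters per month
for month, count in cancellations.items():
    print(f"{month:<9} {'#' * count} {count}")

# Change compared with the previous month
months = list(cancellations)
jumps = {
    months[i]: cancellations[months[i]] - cancellations[months[i - 1]]
    for i in range(1, len(months))
}
biggest_jump = max(jumps, key=jumps.get)
print("Biggest increase:", biggest_jump)
```

No learning algorithm is involved; the chart alone points at the month worth investigating.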
Preparing for machine learning
We need to choose which inputs we will use for
our learning, and what the expected output is
Inputs + Expected output -> Machine Learning Algorithm -> model
Example
The Titanic dataset contains: name, age, address,
etc.
Are all these fields useful?
What are the inputs?
What is the expected output?
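One way to picture the split between inputs and expected output is sketched below. The field names (`name`, `age`, `sex`, `survived`), the choice of which ones to use, and the three passenger records are all assumptions made for illustration; they are not a recommendation of which fields are actually useful.

```python
# Made-up records in the style of a passenger list
records = [
    {"name": "Passenger A", "age": 29, "sex": "female", "survived": 1},
    {"name": "Passenger B", "age": 40, "sex": "male", "survived": 0},
    {"name": "Passenger C", "age": 8, "sex": "male", "survived": 1},
]

# Hypothetical choice: which fields feed the algorithm, and which one it should predict
INPUT_FIELDS = ["age", "sex"]
OUTPUT_FIELD = "survived"


def split_inputs_output(record):
    """Return (list of input values, expected output) for one record."""
    inputs = [record[field] for field in INPUT_FIELDS]
    return inputs, record[OUTPUT_FIELD]


X = []  # inputs, one list per passenger
y = []  # expected outputs, one value per passenger
for record in records:
    inputs, expected = split_inputs_output(record)
    X.append(inputs)
    y.append(expected)

print(X)
print(y)
```

Here the name is left out because it is unlikely to help a prediction on its own; whether that is the right call is exactly the kind of question to ask of every field.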
Problems we will face
Overfitting
The algorithm does an excellent job of prediction.
But it only works on the data we already have
The algorithm has only learnt how to predict
with our exact data
Like Astrologers!!
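An extreme, deliberately silly sketch of overfitting: a "model" that simply memorises every example it was shown. The `Memoriser` class and the tiny data sets are invented for illustration; real algorithms overfit in subtler ways, but the symptom is the same.

```python
# Made-up labelled examples: (input, expected output)
seen = [((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "yes")]
unseen = [((0, 0), "no"), ((2, 1), "yes")]


class Memoriser:
    """Remembers the exact examples it was given and nothing else."""

    def fit(self, examples):
        self.table = dict(examples)

    def predict(self, x):
        # Returns None for any input it has never seen before
        return self.table.get(x)


def accuracy(model, examples):
    """Fraction of examples where the prediction matches the expected output."""
    correct = sum(model.predict(x) == label for x, label in examples)
    return correct / len(examples)


model = Memoriser()
model.fit(seen)

print("Accuracy on seen data:  ", accuracy(model, seen))
print("Accuracy on unseen data:", accuracy(model, unseen))
```

The perfect score on the seen data says nothing about how the model behaves on anything new.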
Solutions
The available data is divided into a training set
and a test set
Only the training set is used to train the
algorithm
The test set is then used to check if the model
works for unseen data (as we know what the
expected output is for the test data)
Problem: The amount of data the algorithm has
to learn from is reduced
Engineering is about compromises
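A minimal sketch of dividing data into a training set and a test set, written by hand with the standard library. The helper name `split_train_test`, the 25% test fraction and the fixed seed are assumptions for illustration; ML libraries ship their own versions of this.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold back a fraction of them for testing."""
    rows = list(rows)                   # copy, so the caller's data is untouched
    random.Random(seed).shuffle(rows)   # fixed seed makes the split repeatable
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Stand-in data: 20 numbered rows
data = list(range(20))
train, test = split_train_test(data)

print(len(train), "rows for training")
print(len(test), "rows held back for testing")
```

The compromise is visible in the counts: every row held back for testing is a row the algorithm cannot learn from.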
Your assignment
Look at the dataset
Which fields will you be choosing?