01 — Training vs Test sets — Theory

The first thing to understand is that Machine Learning models learn from data. Because of this, it is useful to divide our dataset into Training Data and Test Data: one part to learn from, and one part to check what was learned.

Suppose we want to develop a program (model) that identifies whether an image shows a dog or not.

First we receive a set (sample) of labeled images, some of dogs and some not. We then take a part of this set (70% is a common choice) and give it to our model, which learns by identifying the characteristics that dogs have in common.

So we reserved 70% of our dataset (sample) for our algorithm to learn from. What about the other 30%? That is the test data. We pass the test data to our model and see how well it has learned. For example:

Is this a dog?

Our model then has to answer whether it is a dog or not, and we compare that answer with the known label.

See how useful it is to divide the dataset (sample) into training and test sets? Another example would be identifying a disease in patients. If we let our model learn from the entire dataset, how would we know whether it has learned well? There would be no unseen data left to check it against.

Therefore, the model learns from one part (70% in our case) and we reserve the other part (30% in our case) to test how well it has learned.
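To make the idea concrete, here is a minimal sketch of a 70/30 split written with the Python standard library only. The function name `split_dataset`, the file names, and the labels are hypothetical and serve only as an illustration.

```python
import random


def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)
    # Shuffle first so that the split does not depend on the original order.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset: 10 labeled image names (1 = dog, 0 = not a dog).
dataset = [(f"image_{i}.jpg", i % 2) for i in range(10)]

train, test = split_dataset(dataset)
print(len(train), len(test))  # 7 3
```

Every sample ends up in exactly one of the two lists, so the model is never tested on an image it has already seen during training.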

02 — Training & Testing sets with Scikit-Learn

Now let’s see how it works in practice:


Now we will comment only on the crucial parts that were used to divide the data into training and test sets. First, we import the train_test_split() function from sklearn.model_selection.

Then we pass the following arguments to this function:

  • 1st: the input data (the features, usually called X);
  • 2nd: the corresponding outputs (the targets, usually called y);
  • 3rd: the fraction of the data to reserve for testing: test_size = 0.30 = 30%.

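Note that the train_test_split() function shuffles the data by default and returns it already separated into training and test sets. With two inputs it returns four results, in this order: training features, test features, training targets, test targets.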
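The call described above can be sketched as follows. The toy data and the variable names are assumptions made for illustration, and `random_state=42` is added only to make the shuffle reproducible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus one label per sample.
X = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]

# Reserve 30% of the samples for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(len(X_train), len(X_test))  # 7 3
```

Without `random_state`, each run would produce a different random split, which is fine for experimenting but makes results harder to compare.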

Finally, we train our model with the training data only (as explained previously):
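A minimal sketch of this step, assuming the iris dataset bundled with scikit-learn and a LogisticRegression classifier as stand-ins. Neither choice comes from this article; any estimator with fit() and score() would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in dataset (an assumption): 150 iris flowers, 4 features each.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Stand-in model (an assumption). It only ever sees the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The held-out test data is used only to measure how well the model learned.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the test samples were never passed to fit(), the reported accuracy reflects how the model handles data it has not seen before.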

