Deep Neural Network for MNIST Classification

Here we use the MNIST dataset, the classic handwritten digit recognition problem.

  • A very visual problem
  • Extremely common
  • Large and already preprocessed

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

Import the relevant packages

TensorFlow includes a data provider for MNIST that we’ll use.

import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

# these datasets will be stored in C:\Users\*USERNAME*\tensorflow_datasets\...
# the first time you download a dataset, it is stored in the respective folder
# every other time, the local copy on your computer is loaded automatically

Data

That’s where we load and preprocess our data.

We want to create training, validation and test datasets.

MNIST comes with only training and test splits, so we need to prepare the validation data ourselves.

Load the Data

tfds.load(name) loads a dataset from TensorFlow datasets.

  • as_supervised = True will load the data in a 2-tuple structure [input, target].
  • with_info = True will also return a tuple containing information about the version, features, and number of samples of the dataset. Here we use mnist_info to store that information.
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)

Note: since the MNIST dataset has no predefined validation split, we will carve one out of the training data below.

Define the Number of Validation and Test Samples

To cast a variable into a given data type, use tf.cast(x, datatype).

mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']

# Take 10% of the training dataset to serve as validation
# use the info attributes to get the number of examples of training dataset.
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
# an integer times 0.1 may produce a float, so we cast the result back into an int
num_validation_samples = tf.cast(num_validation_samples, tf.int64)

# Do the same for the test dataset.
num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples, tf.int64)

Scale our Data

Normally we would also like to scale our data in some way to make the training more numerically stable. In this case we simply prefer inputs between 0 and 1.

Since the image values range from 0 to 255 (256 different shades of grey), we just need to divide every element by 255 to get values between 0 and 1.

Now we write a scale function, which will be mapped later.

def scale(image, label):
    # make sure the value is a float
    image = tf.cast(image, tf.float32)
    # divide by 255; the trailing '.' makes the result a float
    image /= 255.
    return image, label

To scale our dataset:

dataset.map(*function*) applies a custom transformation to a given dataset. It takes as input a function which determines the transformation.

# the method .map() allows us to apply a custom transformation to a given dataset
# we have already decided that we will get the validation data from mnist_train, so
scaled_train_and_validation_data = mnist_train.map(scale)

# finally, we scale and batch the test data
# we scale it so it has the same magnitude as the train and validation
# there is no need to shuffle it, because we won't be training on the test data
# there would be a single batch, equal to the size of the test data
test_data = mnist_test.map(scale)

Shuffle and Batch

We want the training data to be as randomly spread as possible.

Shuffling - keeping the same information but in a different order.

Batching - the process of splitting the dataset into n smaller batches. You choose how many samples go into each batch.

When we are dealing with enormous datasets, we can't shuffle all the data at once because it can't possibly fit in the memory of the computer.

  • Instead, we shuffle a buffer of BUFFER_SIZE samples at a time and use batching to split the data into batches.
BUFFER_SIZE = 10000

Note:

  • If 1 < BUFFER_SIZE < num_samples, we balance computational cost against shuffling quality
  • If BUFFER_SIZE = 1, no shuffling will actually happen
  • If BUFFER_SIZE >= num_samples, the whole dataset is shuffled at once (uniformly)

To shuffle our data, use data.shuffle(BUFFER_SIZE)

shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)

Then use data.take(*number*) to create validation_data.

Since what remains is the training set, use data.skip(*number*) to create train_data.

validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

# similarly, the train_data is everything else, so we skip as many samples as there are in the validation dataset
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

Now we set our batch size.

Note:

  • Batch size = 1 => Stochastic gradient descent (SGD)
  • Batch size = # of samples => (single batch) GD
  • 1 < Batch size < # of samples => mini-batch GD

We use mini-batch GD here.

Use dataset.batch(BATCH_SIZE) to combine consecutive elements of a dataset into batches.

BATCH_SIZE = 100

train_data = train_data.batch(BATCH_SIZE)

Note:

For the validation data, we won't be backpropagating, only forward propagating. Therefore we don't really need to batch it.

  • Remember that batching is useful because it updates the weights only once per batch.
  • When batching, we compute the average loss over the batch.

However, the model expects our validation set in batch form too.

So we just create a single batch (batch size = number of samples) and overwrite validation_data.

validation_data = validation_data.batch(num_validation_samples)

# We also don't need to batch the test_data.
# Just take the same approach we use with the validation set.
test_data = test_data.batch(num_test_samples)

Extract the Validation Inputs and Targets

Our validation data must have the same shape and object properties as the train and test data.

  • The MNIST data is iterable and comes in a 2-tuple format (because as_supervised=True).

Therefore we must extract and convert the validation data into inputs and targets appropriately.

  • next() returns the next item from an iterator.
  • iter() creates an object which can be iterated over one element at a time
    • e.g. in a for loop or while loop
# takes next batch (it is the only batch)
# because as_supervised=True, we've got a 2-tuple structure
validation_inputs, validation_targets = next(iter(validation_data))

Model

When thinking about a deep learning algorithm, we mostly imagine building the model.

Outline the model

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).

Since we don’t know CNNs yet, we don’t know how to feed such input into our net, so we must flatten the images.

  • Size of input (not a layer) = 28*28 = 784

and the goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

  • Size of Output layer = 10

When building the model, we will use tf.keras.Sequential() to stack the layers (laying down the model).

Note:

For the first layer (the Flatten layer):

each observation is 28x28x1 pixels, therefore it is a tensor of rank 3.
Since we don't know CNNs yet, we don't know how to feed such input into our net, so we must flatten the images.

  • tf.keras.layers.Flatten(input_shape=(original shape)) transforms a tensor into a vector
    • In this case it simply takes our 28x28x1 tensor and orders it into a (784,) vector
  • This allows us to actually create a feed-forward neural network
  • If the input doesn't need flattening, we can simply skip this layer (see the quick check below)
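
To see concretely what Flatten does, here is a quick check (a minimal sketch; x is just a dummy batch of one all-ones image):

x = tf.ones((1, 28, 28, 1))             # a batch containing one 28x28x1 'image'
flat = tf.keras.layers.Flatten()(x)     # flattens everything except the batch dimension
print(flat.shape)                       # (1, 784)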

For Hidden and Output layers

  • tf.keras.layers.Dense(size, activation='*activation function*')
    • takes the inputs provided to the model, calculates the dot product of the inputs and the weights, and adds the bias
    • this is also where we can apply an activation function
    • in practice, each neural network has a different optimal combination of activation functions

We pick the softmax activation for the output layer since we want a probability distribution over the 10 classes.

input_size = 784
output_size = 10
# In this case use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50

# define how the model will look
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)), # Flatten
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

This is what our model looks like. We can alter the model later if we want.
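
To double-check the architecture and the number of trainable parameters, we can print a summary (we do the same in the business case later):

model.summary()  # lists each layer with its output shape and parameter count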

Objective Function

The optimization process happens when the optimization algorithm varies the model's parameters until the loss function has been minimized.

Choose the optimizer and the loss function

We must specify the optimizer and the loss through the compile method we call on the model object.

  • model.compile(optimizer='*optimizer*', loss='*loss*', metrics=['accuracy']) configures the model for training.
    • adam (adaptive moment estimation) is one of the best optimizers.
    • cross-entropy will normally be our first choice for classification.
      • binary_crossentropy refers to the case where we've got binary encoding
      • categorical_crossentropy refers to the case where we've got one-hot encoding
      • sparse_categorical_crossentropy expects integer targets and applies the one-hot encoding for us (see the check after the compile call below)
    • We can also include metrics that we wish to calculate throughout the training and testing processes.
# we define the optimizer we'd like to use, 
# the loss function,
# and the metrics we are interested in obtaining at each iteration
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
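
As a sanity check on the loss choice, the snippet below (a minimal sketch with made-up toy values) shows that sparse_categorical_crossentropy on integer targets gives the same result as categorical_crossentropy on the equivalent one-hot targets:

# toy predictions for 2 samples over 3 classes (hypothetical values)
y_pred = tf.constant([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])
labels = tf.constant([0, 2])               # integer targets
one_hot = tf.one_hot(labels, depth=3)      # the same targets, one-hot encoded

sparse = tf.keras.losses.sparse_categorical_crossentropy(labels, y_pred)
dense = tf.keras.losses.categorical_crossentropy(one_hot, y_pred)
print(sparse.numpy(), dense.numpy())       # identical loss values per sample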

Training

That’s where we train the model we have built.

Setting Epochs

We use model.fit() to train the model.

When we fit the model, we need to specify the

  • training data
  • the total number of epochs
  • and the validation data we just created ourselves in the format: (inputs,targets)
# determine the maximum number of epochs
NUM_EPOCHS = 5

# fit the model
model.fit(train_data, epochs=NUM_EPOCHS, validation_data=(validation_inputs, validation_targets), verbose=2)

What Happens Inside an Epoch

  1. At the beginning of each epoch, the training loss will be set to 0.
  2. The algorithm will iterate over a preset number of batches, all from train_data.
  3. Essentially the whole training set will be utilized but in batches. Therefore the weights and biases will be updated as many times as there are batches.
  4. At the end of each epoch, we’ll get a value of the loss function indicating how the training is going.
  5. We will also see the training accuracy, because we listed accuracy in metrics (and verbose=2 prints it each epoch).
  6. At the end of the epoch the algorithm will forward propagate the whole validation data set in a single batch through the optimized model and calculate the validation accuracy.

When we reach the maximum number of epochs, the training will be over.
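
The number of batches per epoch follows directly from the split sizes (assuming the standard 60,000-image MNIST training split):

# 60,000 training images minus the 10% validation split, in batches of 100
num_train_samples = 60_000 - 6_000           # 54,000 samples remain for training
steps_per_epoch = num_train_samples // 100   # BATCH_SIZE = 100
print(steps_per_epoch)                       # 540, matching the 540/540 counter below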

The original notebook output is not reproduced here, but for the first epoch the logged values represent (see the reconstructed log line after this list):

  • Information about number of Epoch: 1/5
  • Number of Batches : 540/540
    • if we had a progress bar it would fill out gradually
  • Time taken per epoch: 4 seconds
  • Training loss: 0.4188
    • should be compared to the training loss across epochs
  • Training accuracy: 0.8816 = 88.16%
    • The accuracy shows in what % of the cases our outputs were equal to the targets
  • Validation loss: 0.2315
    • We keep an eye on the validation loss (or set early stopping mechanisms) to determine whether the model is overfitting
  • Validation accuracy: 0.9385 = 93.85%
    • We need to keep an eye on it. It is the accuracy of the model for the epoch
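
Reconstructed from the values listed above (and therefore approximate; exact formatting varies by TensorFlow version), the first epoch's log would have looked like:

Epoch 1/5
540/540 - 4s - loss: 0.4188 - accuracy: 0.8816 - val_loss: 0.2315 - val_accuracy: 0.9385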

Hyperparameters Matter

The model we trained has ~90% accuracy.

Is that good? Well, not really. In fact, it is pretty bad, because we're using a very simple model. With some changes to the hyperparameters, we can reach a better accuracy.

Once we train our first model, we fiddle with the hyperparameters.

Basically, the Steps in improving model:

  1. Create a model
  2. Fiddle with the hyperparameters
  3. Check the validation accuracy (make sure you don't overfit)
  4. Repeat steps 2 and 3 until you reach a good validation accuracy

Fiddle Hyperparameters

Once we train our first model, we fiddle with the hyperparameters.

There are several main adjustments you may try.

Please pay attention to the time it takes for each epoch to conclude.

  • The width (the hidden layer size) of the algorithm. Try a hidden layer size of 200. How does the validation accuracy of the model change? What about the time it took the algorithm to train? Can you find a hidden layer size that does better?

Solution

The validation accuracy is significantly higher (as the algorithm with 50 hidden units was too simple of a model).

Naturally, it takes the algorithm much longer to train (unless early stopping is triggered too soon).

A hidden layer size of 500 (and not only) works even better.

  • A wider hidden layer generally yields higher model accuracy, up to a point
  • The depth of the algorithm. Add another hidden layer to the algorithm. This is an extremely important exercise! How does the validation accuracy change? What about the time it took the algorithm to train? Hint: Be careful with the shapes of the weights and the biases.

Solution

Adding another hidden layer to the algorithm is done in the same way as in the lecture.

We simply add a new line in Sequential:

tf.keras.layers.Dense(hidden_layer_size, activation='relu')

We can see that the accuracy of the model does not necessarily improve. This is an important lesson for us. Fiddling with a single hyperparameter may not be enough. Sometimes, a deeper net needs to also be wider in order to have higher accuracy. Maybe you need more epochs?

ADDITIONAL TASK: Try this new model, but wider (200-500 hidden units). Basically, combine this and the previous exercises.

In any case, it takes longer for the algorithm to train.

  • The width and depth of the algorithm. Add as many additional layers as you need to reach 5 hidden layers. Moreover, adjust the width of the algorithm as you find suitable. How does the validation accuracy change? What about the time it took the algorithm to train?

Solution

This exercise is pretty much the same as the previous one. However, it will get us to a much deeper net. As we noted in the previous exercise, a deeper net may need to be wider to produce better results.

We tried with 1000 hidden units in each layer and 5 hidden layers.

The result is that our model's training was going very well, until it overfit, and it did so by quite a lot.

It took my personal computer around 5-6 minutes to train the model.

What if you have more epochs?

  • Fiddle with the activation functions. Try applying sigmoid transformation to both layers. The sigmoid activation is given by the string ‘sigmoid’.

Solution

Find the part where we stack layers (Sequential()).

Adjust the activations from ‘relu’ to ‘sigmoid’, as in the sketch below.
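
A sketch of the adjusted stack (same architecture as before; only the hidden activations change):

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(hidden_layer_size, activation='sigmoid'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='sigmoid'), # 2nd hidden layer
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])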

Generally, we should reach an inferior solution. That is because relu ‘cleans’ the noise in the data (think about it - if a value is negative, relu filters it out, while if it is positive, it takes it into account). For the MNIST dataset, we care only about the intensely black and white parts in the images of the digits, so such filtering proves beneficial.

The sigmoid does not filter the signals as well as relu, but still reaches a respectable result (around 95%).

Try using softmax activations for all layers. How does the result change? Can you explain why that happens?

  • Fiddle with the activation functions. Try applying a ReLu to the first hidden layer and tanh to the second one. The tanh activation is given by the string ‘tanh’.

Solution

Analogously to the previous exercise, we can change the activation functions. This time, though, we will use different activators for the different layers, as shown below.
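
Only the two hidden-layer lines in Sequential need to change:

tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='tanh'), # 2nd hidden layer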

The result should not be significantly different. However, with different width and depth, that may change.

Additional exercise: Try to find a better combination of activation functions.

  • Adjust the batch size. Try a batch size of 10000. How does the required time change? What about the accuracy?

Solution

Find the line that declares the batch size.

Change BATCH_SIZE from 100 to 10000.

BATCH_SIZE = 10000
  • A bigger batch size results in slower training.

A bigger batch size means fewer weight updates per epoch, so training progresses more slowly. That's what we expected from the theory; mini-batching gives us a speed increase precisely because updates are frequent yet still vectorized.

Notice that the validation accuracy starts from a low number and after 5 epochs finishes at a lower number than before. That's because there are fewer weight updates in a single epoch.

Try a batch size of 30,000 or 50,000. That's very close to single-batch GD for this problem. You will need to change the maximum number of epochs to 100 (for instance), as 5 epochs won't be enough to train the model. What do you think about the speed of optimization?

  • Adjust the batch size. Try a batch size of 1. That’s the SGD. How do the time and accuracy change? Is the result coherent with the theory?

Solution

Find the line that declares the batch size.

Change BATCH_SIZE from 100 to 1.

BATCH_SIZE = 1

A batch size of 1 gives us SGD. It takes the algorithm very little time to process a single batch (as it is one data point), but there are tens of thousands of batches (54,000 to be precise), thus the algorithm is actually slow. Remember that this also depends on the hardware you train on: with only one sample per batch there is little opportunity for parallelism, so a CPU with 4 or 8 cores sits mostly idle. The middle ground (mini-batching, such as 100 samples per batch) is optimal.

Notice that the validation accuracy starts from a high number. That's because there are lots of updates in a single epoch. Once the training is over, however, the accuracy is lower than for all other batch sizes (SGD is a noisy approximation).

  • Adjust the learning rate. Try a value of 0.0001. Does it make a difference?

Solution

First, we have to define a custom optimizer (as we did in the TensorFlow intro).

We create the custom optimizer with:

custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

Then we change the respective argument in model.compile to reflect this:

model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Since the learning rate is lower than normal, we may need to adjust NUM_EPOCHS (to, say, 50).

The result is basically the same, but we reach it much slower.

While Adam adapts to the problem, if the orders of magnitude are too different, it may not have enough time to adjust accordingly.

  • Adjust the learning rate. Try a value of 0.02. Does it make a difference?

Solution

First, we have to define a custom optimizer (as we did in the TensorFlow intro).

We create the custom optimizer with:

custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)

Then we change the respective argument in model.compile to reflect this:

model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

While Adam adapts to the problem, if the orders of magnitude are too different, it may not have time to adjust accordingly. We start overfitting before we can reach a neat solution.

Therefore, for this problem, even 0.02 is a HIGH starting learning rate. What if you try a learning rate of 1?

It's good practice to try 0.001, 0.0001, and 0.00001. If the choice makes no difference, pick whichever; otherwise it makes sense to fiddle with the learning rate (see the sweep sketch below).
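
A simple sweep could look like the sketch below (build_model is a hypothetical helper that re-creates the Sequential model from scratch, so each learning rate starts from fresh weights):

for lr in (0.001, 0.0001, 0.00001):
    model = build_model()  # hypothetical: returns a freshly initialized Sequential model
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_data, epochs=NUM_EPOCHS,
              validation_data=(validation_inputs, validation_targets),
              verbose=0)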

Combine all the methods above and try to reach a validation accuracy of 98.5+ percent.

Achieving 98.5% accuracy with the methodology we’ve seen so far is extremely hard. A more realistic exercise would be to achieve 98%+ accuracy. However, being pushed to the limit (trying to achieve 98.5%), you have probably learned a whole lot about the machine learning process.

Here is a link where you can check the results that some leading academics got on the MNIST (using different methodologies): https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results

Solution

After some fine tuning, I decided to brute-force the algorithm and created 10 hidden layers with 5000 hidden units each.

hidden_layer_size = 5000
batch_size = 150
NUM_EPOCHS = 10

All activation functions are ReLu.

There are better solutions using this methodology; this one is simply superior to the one in the lessons. Due to the width and the depth of the algorithm, it took my computer 3 hours and 50 minutes to train it.

Test the Model

After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

This is also the final stage of the machine learning process.

After we test the model, conceptually we are no longer allowed to change it.

  • The main point of the test dataset is to simulate model deployment.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.

The test is the absolute final instance. You should not test before you are completely done with adjusting your model.

  • Test data is a dataset that the model has truly never seen.
    • If you start changing the model after this point, the test data will no longer be a data set the model has never seen.
    • If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

  • The validation accuracy is a benchmark for how good the model is.

  • The test data set then is our reality check that prevents us from overfitting the hyperparameters.

Evaluate

Now, we can find out the final true accuracy of the model using the test data.

  • model.evaluate() returns the loss value and metrics values for the model in ‘test mode’.

  • Note: getting a test accuracy very close to the validation accuracy shows that we have not overfit.

test_loss, test_accuracy = model.evaluate(test_data)

# We can apply some nice formatting if we want to
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

Practical Business Example: Audiobooks

Our task is to create a machine learning algorithm that can predict if a customer will buy again.

  • Learn to load a csv file
  • Learn to deal with blank/missing values
  • Learn to deal with binary (Boolean) values
  • A real-life example

Examine Raw Data

Here is the data, provided as a csv file.

  • When we feed the csv file into TensorFlow, we want no text in the data, so the column headers must go.
  • Therefore we need to remove the first row. Just do it in the csv file.

The data covers 2 years and 6 months of customer activity: the first 2 years provide the inputs, and the final 6 months determine the targets.

Column Detail:

  • ID: just an identifier; we will skip it in the algorithm
  • Book length(mins)_overall: sum of the lengths of all purchases
  • Book length(mins)_avg: sum of the lengths of all purchases / number of purchases
  • Price_overall: overall price paid
  • Price_avg: average price paid
  • Review: a Boolean value showing whether the customer left a review (1 = true, 0 = false)
  • Review 10/10: the customer's review score, from 1 to 10
    • Note: for customers who didn't leave a review, replace the blank cell with the overall average review score (8.91 in this case), representing the status quo
  • Minutes listened: total minutes listened, a measure of engagement
  • Completion: total minutes listened / overall book length
  • Support Requests: the total number of support requests, also a measure of engagement
    • e.g. forgotten password, assistance
  • Last visited minus Purchase date: the difference between the last time a person interacted with the platform and their first purchase date
    • The bigger the difference, the better the engagement.
    • If the value is 0, the customer has never accessed what he/she has bought.

  • Targets: a Boolean value indicating whether the customer bought another book in the last 6 months of the data
    • If so, we count it as a conversion and the target is 1; otherwise it is 0

Preprocess the Data

Extract the data from csv

We will use the sklearn preprocessing library, as it will be easier to standardize the data.

  • Inputs: all columns except the first and the last
    • the arbitrary customer IDs contain no useful information
  • Targets: the last column
import numpy as np
from sklearn import preprocessing

# Load the data
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

# The inputs are all columns in the csv, except for the first one [:,0]
# and the last one [:,-1] (which is our targets)
unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1]

Note: we can also shuffle the indices before balancing (to remove any day-of-collection effects, etc.).

However, we still have to shuffle them AFTER we balance the dataset; otherwise the data would remain ordered, and most of the 1-targets could end up clustered in train_targets.

This code is suboptimal, since we shuffle again after balancing:

# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
targets_all = targets_all[shuffled_indices]

Balance the dataset

  • Main idea: the counts of the different target values must be equal.
    • In this case, the number of 1-targets must match the number of 0-targets.
  • Otherwise, we need to balance the dataset.
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []

# Count the number of targets that are 0.
# Once there are as many 0s as 1s, mark entries where the target is 0.
for i in range(targets_all.shape[0]): # loop through the targets
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            # add the excess 0-targets to the removal list
            indices_to_remove.append(i)

# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
  • ^ np.delete(array, obj, axis) is a method that deletes the specified object(s) along an axis.

Standardize the inputs

  • Standardizing the inputs will greatly improve the algorithm.

That’s the only place we use sklearn functionality. We will take advantage of its preprocessing capabilities.

At the end of the business case, you can try to run the algorithm WITHOUT this line of code.
The result will be interesting.

# It's a simple line of code, which standardizes the inputs.
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)

Shuffle the data

  • Since we will be batching, we must shuffle the data.
    • data should be as randomly spread as possible so batching will work.
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
  • ^ np.arange([start], stop, [step]) is a method that returns an evenly spaced array of values within a given interval.
  • ^ np.random.shuffle(X) is a method that shuffles the items in a given sequence in place.

Split the dataset

Now we split it into training, validation and test.

  • We want a 80-10-10 distribution of training, validation, and test.
    • Note the sample counts need to be integers!
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

Now that we have the sizes of the train, validation, and test sets, let’s create the datasets.

# Training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Validation
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Test
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

  • We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test sets were taken from a shuffled dataset. Check whether they are balanced, too. Note that you will get slightly different values each time, as the shuffle is random.
# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

^ Remember to check the proportion of 1s in each of the three datasets.

Save the data in .npz file

  • It is better to name them in a very semantic way so we can easily use them later.
# Save the three datasets in *.npz.
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)

Special Reminder

Note

  • Each time we run the code in the preprocessing section, we preprocess the data once again and generate completely new npz files.
    • the training, validation and test datasets will then contain different samples (see the note below)
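
If you want the splits to be reproducible across runs, one option (an assumption on my part, not part of the course code) is to fix NumPy's random seed before shuffling:

np.random.seed(42)  # hypothetical seed; any fixed integer gives repeatable shuffles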

Create the machine learning algorithm

From now on, we start a new notebook file and play with the npz files we obtained.

# we must import the libraries once again since we haven't imported them in this file
import numpy as np
import tensorflow as tf

Load the Data

# use a temporary variable npz that will store each of the three Audiobooks datasets
npz = np.load('Audiobooks_data_train.npz')

# we extract the inputs using the keyword under which we saved them
# we need to ensure that they are all floats, so we cast them
train_inputs = npz['inputs'].astype(float)
# targets must be int because of sparse_categorical_crossentropy (we want to be able to smoothly one-hot encode them)
train_targets = npz['targets'].astype(int)

# load the validation data in the temporary variable
npz = np.load('Audiobooks_data_validation.npz')
# we can load the inputs and the targets in the same line
validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

# load the test data in the temporary variable
npz = np.load('Audiobooks_data_test.npz')
# the test inputs and the test targets
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)
  • ^ np.ndarray.astype() creates a copy of the array, cast to a specific type (the old np.float and np.int aliases were removed from recent NumPy versions, hence the built-in float and int above).
  • Targets must be integers if we want to apply one-hot encoding.

Outline the model

  • We have 10 predictors, therefore the size of the input layer must be 10.

    • Note we don't have to define a separate input layer in our model.
    • Note that in the first hidden layer we need to specify the input shape.
  • The output size is 2 because there are only 2 kinds of targets.

  • This time we don't need to flatten anything, because we have already preprocessed our data.

  • tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)

  • The last layer is activated with softmax because our model is a classifier.

input_size = 10
output_size = 2
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50

# define how the model will look
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, input_shape=(input_size,), activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

model.summary()

Choose the optimizer and the loss function

We define the optimizer we’d like to use, the loss function, and the metrics we are interested in obtaining at each iteration.

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Training

That’s where we train the model we have built.

  • To prevent overfitting, we can set up early stopping using the callbacks argument.
    • tf.keras.callbacks.EarlyStopping(patience=2) configures the early stopping mechanism of the algorithm.
      • patience lets us decide how many consecutive validation-loss increases we can tolerate.
# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
# note that this time the train, validation and test data are not iterable
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions called by a task when the task is completed
          # the task here is to check whether val_loss is increasing
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose=2 # making sure we get enough information about the training process
          )

Note: Indicating the batch size in model.fit() will automatically batch the data.

With the model, we will be able to predict customers' future behavior.

  • We can use this information for our original purpose.
  • We can focus our marketing efforts only on those customers who are likely to convert again.

Test the Model

As we discussed, after training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.

The test is the absolute final instance. You should not test before you are completely done with adjusting your model.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

Evaluate

test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)

print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

Use the Model to Predict

  • model.predict() returns the model's outputs (here, class probabilities) for a batch of inputs, as sketched below.
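
A minimal sketch of generating predictions with the trained Audiobooks model (reusing test_inputs from above; the model outputs one probability per class, so argmax recovers the predicted class):

probabilities = model.predict(test_inputs)            # shape: (num_samples, 2)
predicted_classes = np.argmax(probabilities, axis=1)  # 1 = likely to convert again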

Save and Load Model

Tensorflow - Save and load models

Save Model

model.save("filename.h5")

Load Model

classifier = tf.keras.models.load_model('filename.h5')

Reference

The Data Science Course 2020: Complete Data Science Bootcamp