DNN - Code Examples Explained
Deep Neural Network for MNIST Classification
Here we use the MNIST dataset, the classic handwritten digit recognition problem.
- A very visual problem
- Extremely common
- Large and already preprocessed
The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).
The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.
Import the relevant packages
TensorFlow includes a data provider for MNIST that we’ll use.
```python
import numpy as np
import tensorflow as tf

# TensorFlow Datasets includes a data provider for MNIST
import tensorflow_datasets as tfds
```
Data
That’s where we load and preprocess our data.
We want to create training, validation and test datasets.
MNIST comes with only training and test datasets, so we need to prepare the validation data ourselves.
Load the Data
tfds.load(name) loads a dataset from TensorFlow Datasets.
- as_supervised=True loads the data in a 2-tuple structure (input, target).
- with_info=True also provides a tuple containing info about the version, features, and number of samples of the dataset. Here we use mnist_info to store that info.
```python
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)
```
Note: as mentioned, MNIST has no validation split, so we will take a fraction of the training set for validation.
To Alter the Number of Samples
To cast a variable into a given data type, use tf.cast(x, datatype).
```python
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']
```
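A minimal sketch of deriving the sample counts with tf.cast (taking 10% of the training set for validation is our choice, not something fixed by MNIST):

```python
# take 10% of the training set for validation (the 10% is our choice)
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
# the product is a float; cast it to an integer number of samples
num_validation_samples = tf.cast(num_validation_samples, tf.int64)

# store the number of test samples in a dedicated variable, cast the same way
num_test_samples = mnist_info.splits['test'].num_examples
num_test_samples = tf.cast(num_test_samples, tf.int64)
```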
Scale our Data
Normally we would also like to scale our data in some way to make the result more numerically stable. In this case we will simply prefer to have inputs between 0 and 1.
Since the image values range from 0 to 255 (256 different shades of grey), we just need to divide each element by 255 to get values between 0 and 1.
Now we write a scale function, which will be mapped over the dataset later.
```python
def scale(image, label):
    # cast to float and squeeze the 0-255 pixel values into the [0, 1] range
    image = tf.cast(image, tf.float32)
    image /= 255.
    return image, label
```
To scale our dataset:
dataset.map(*function*) applies a custom transformation to a given dataset. It takes as input a function which determines the transformation.
```python
# the method .map() allows us to apply a custom transformation to a given dataset
scaled_train_and_validation_data = mnist_train.map(scale)
test_data = mnist_test.map(scale)
```
Shuffle and Batch
We want the training data to be as randomly spread as possible.
Shuffling - keeping the same information but in a different order.
Batching - the process of splitting data into batches; the batch size is the number of samples in one batch.
When we are dealing with enormous datasets, we can't shuffle all the data at once, because we can't possibly fit it all in the memory of the computer.
- Instead, we shuffle a buffer of BUFFER_SIZE samples at a time, and use batching to split the data into batches.
```python
BUFFER_SIZE = 10000
```
Note:
- If 1 < BUFFER_SIZE < num_samples, we will be optimizing the computational power
- If BUFFER_SIZE = 1, no shuffling will actually happen
- If BUFFER_SIZE >= num_samples, shuffling will happen at once (uniformly)
To shuffle our data, use data.shuffle(BUFFER_SIZE)
```python
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)
```
Then use data.take(*number*) to create validation_data.
The remaining samples are the training data, so use data.skip(*number*) to create train_data.
```python
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)
```
Now we set our batch size.
Note:
- Batch size = 1 => Stochastic gradient descent (SGD)
- Batch size = # of samples => (single batch) GD
- 1 < Batch size < # of samples => mini-batch GD
We use mini-batch GD here.
Use dataset.batch(BATCH_SIZE) to combine the consecutive elements of a dataset into batches.
```python
BATCH_SIZE = 100
train_data = train_data.batch(BATCH_SIZE)
```
Note:
For the validation data we won't be backpropagating, only forward propagating, so we don't really need to batch it.
- Remember that batching is useful because we update the weights only once per batch.
- When batching, we find the average loss.
However, the model expects our validation set in batch form too.
So we just create a single batch (batch size = number of samples) to overwrite validation_data.
```python
validation_data = validation_data.batch(num_validation_samples)
# the test data must come in the same (batched) form, so single-batch it too
test_data = test_data.batch(num_test_samples)
```
Reshape Validation Data
Our validation data must have the same shape and object properties as the train and test data.
- The MNIST data is iterable and comes in a 2-tuple format (since as_supervised=True).
Therefore we must extract and convert the validation data into inputs and targets appropriately.
next() returns the next item from an iterator.
iter() creates an object which can be iterated over one element at a time (e.g. in a for loop or while loop).
```python
# takes next batch (it is the only batch)
validation_inputs, validation_targets = next(iter(validation_data))
```
Model
When thinking about a deep learning algorithm, we mostly imagine building the model.
Outline the model
The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).
Since we don’t know CNNs yet, we don’t know how to feed such input into our net, so we must flatten the images.
- Size of the input (not a layer) = 28*28 = 784
And the goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.
- Size of Output layer = 10
When building the model, we will use tf.keras.Sequential() to stack the layers (laying down the model).
Note:
For the first layer (the Flatten layer):
each observation is 28x28x1 pixels, therefore it is a tensor of rank 3, which we must flatten as noted above.
tf.keras.layers.Flatten(input_shape=(*original shape*)) transforms a tensor into a vector.
- In this case it simply takes our 28x28x1 tensor and orders it into a (784,) vector.
- This allows us to actually create a feedforward neural network.
- If the input doesn't need flattening, we can simply skip this layer.
For the hidden and output layers:
tf.keras.layers.Dense(size, activation='*activation function*')
- takes the inputs provided to the model, calculates the dot product of the inputs and the weights, and adds the bias.
- This is also where we can apply an activation function.
- In practice, each neural network has a different optimal combination of activation functions.
We pick the softmax activation for the output layer since we want a probability distribution.
```python
input_size = 784
output_size = 10
# the width of the hidden layers; 50 is the simple baseline used here
hidden_layer_size = 50
```
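Stacking the layers with tf.keras.Sequential() then looks like this. A sketch of the baseline net: two hidden layers of 50 ReLU units each (the exercises below refer to this two-layer, 50-unit baseline):

```python
model = tf.keras.Sequential([
    # flatten each 28x28x1 image into a (784,) vector
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    # two hidden layers with ReLU activations
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    # softmax output: a probability distribution over the 10 digits
    tf.keras.layers.Dense(output_size, activation='softmax')
])
```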
This is what our model looks like. We can alter the model if we want.
Objective Function
The optimization process happens when the optimization algorithm varies the model's parameters until the loss function has been minimized.
Choose the optimizer and the loss function
We must specify the optimizer and the loss through the compile method we call on the model object.
model.compile(optimizer='*optimizer*', loss='*loss*', metrics=['accuracy']) configures the model for training.
- adam (adaptive moment estimation) is one of the best optimizers.
- Cross entropy will normally be our first choice of loss for classification:
  - binary_crossentropy refers to the case where we've got binary encoding
  - categorical_crossentropy refers to the case where we've got one-hot encoded targets
  - sparse_categorical_crossentropy applies the one-hot encoding for us, so the targets can stay as integers
- We can also include metrics that we wish to calculate throughout the training and testing processes.
```python
# we define the optimizer we'd like to use,
# the loss function, and the metrics we are interested in obtaining at each iteration
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```
Training
That’s where we train the model we have built.
Setting Epochs
We use model.fit() to train the model.
When we fit the model, we need to specify the
- training data
- the total number of epochs
- and the validation data we just created ourselves in the format: (inputs,targets)
```python
# determine the maximum number of epochs
NUM_EPOCHS = 5
```
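The fit call then looks something like this (a sketch; verbose=2 produces the one-line-per-epoch output described below):

```python
model.fit(train_data,
          epochs=NUM_EPOCHS,
          validation_data=(validation_inputs, validation_targets),
          verbose=2)
```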
What Happens Inside an Epoch
- At the beginning of each epoch, the training loss will be set to 0.
- The algorithm will iterate over a preset number of batches, all from train_data.
- Essentially the whole training set will be utilized but in batches. Therefore the weights and biases will be updated as many times as there are batches.
- At the end of each epoch, we'll get a value for the loss function, indicating how the training is going.
- We will also see the training accuracy, since we requested it in metrics (and the verbose setting prints it).
- At the end of the epoch the algorithm will forward propagate the whole validation data set in a single batch through the optimized model and calculate the validation accuracy.
When we reach the maximum number of epochs, the training will be over.
In the first epoch of an example run, the printed values represent:
- Information about number of Epoch: 1/5
- Number of Batches : 540/540
- if we had a progress bar it would fill out gradually
- Time taken per epoch: 4 seconds
- Training loss: 0.4188
- should be compared with the training loss across epochs
- Training accuracy: 0.8816 = 88.16%
- The accuracy shows in what % of the cases our outputs were equal to the targets
- Validation loss: 0.2315
- We keep an eye on the validation loss (or set early stopping mechanisms) to determine whether the model is overfitting
- Validation accuracy: 0.9385 = 93.85%
- We need to keep an eye on it; it is the model's accuracy on the validation set for that epoch
Hyperparameters Matter
The model we trained has ~90% accuracy.
Is that good? Well, not really; for MNIST it is actually pretty bad. This is because we're using a very simple model. With some changes to the hyperparameters, we can reach a much better accuracy.
Once we train our first model, we fiddle with the hyperparameters.
Basically, the steps in improving a model are:
- Create a model
- Fiddle with the hyperparameters
- Check the validation accuracy (and make sure we don't overfit)
- Repeat steps 2 and 3 until the validation accuracy is good
Fiddle Hyperparameters
There are several main adjustments you may try.
Please pay attention to the time it takes for each epoch to conclude.
- The width (the hidden layer size) of the algorithm. Try a hidden layer size of 200. How does the validation accuracy of the model change? What about the time it took the algorithm to train? Can you find a hidden layer size that does better?
Solution
The validation accuracy is significantly higher (as the algorithm with 50 hidden units was too simple of a model).
Naturally, it takes the algorithm much longer to train (unless early stopping is triggered too soon).
A hidden layer size of 500 (and not only) works even better.
- A wider hidden layer generally gives higher model accuracy
- The depth of the algorithm. Add another hidden layer to the algorithm. This is an extremely important exercise! How does the validation accuracy change? What about the time it took the algorithm to train? Hint: Be careful with the shapes of the weights and the biases.
Solution
Adding another hidden layer to the algorithm is done in the same way as in the lecture.
We simply add a new line in Sequential:
```python
tf.keras.layers.Dense(hidden_layer_size, activation='relu')
```
We can see that the accuracy of the model does not necessarily improve. This is an important lesson for us. Fiddling with a single hyperparameter may not be enough. Sometimes, a deeper net needs to also be wider in order to have higher accuracy. Maybe you need more epochs?
ADDITIONAL TASK: Try this new model, but make it wider (200-500 hidden units). Basically, combine this exercise and the previous one.
In any case, it takes longer for the algorithm to train.
- The width and depth of the algorithm. Add as many additional layers as you need to reach 5 hidden layers. Moreover, adjust the width of the algorithm as you find suitable. How does the validation accuracy change? What about the time it took the algorithm to train?
Solution
This exercise is pretty much the same as the previous one. However, it will get us to a much deeper net. As we noted in the previous exercise, a deeper net may need to be wider to produce better results.
We tried with 1000 hidden units in each layer and 5 hidden layers.
The result is that our model's training was going very well, until it overfit. It did so by quite a lot.
It took my personal computer around 5-6 minutes to train the model.
What if you have more epochs?
- Fiddle with the activation functions. Try applying sigmoid transformation to both layers. The sigmoid activation is given by the string ‘sigmoid’.
Solution
Find the part where we stack layers (Sequential()).
Adjust the activations from ‘relu’ to ‘sigmoid’.
Generally, we should reach an inferior solution. That is because relu ‘cleans’ the noise in the data (think about it - if a value is negative, relu filters it out, while if it is positive, it takes it into account). For the MNIST dataset, we care only about the intensely black and white parts in the images of the digits, so such filtering proves beneficial.
The sigmoid does not filter the signals as well as relu, but still reaches a respectable result (around 95%).
Try using softmax activations for all layers. How does the result change? Can you explain why that happens?
- Fiddle with the activation functions. Try applying a ReLu to the first hidden layer and tanh to the second one. The tanh activation is given by the string ‘tanh’.
Solution
Analogically to the previous lecture, we can change the activation functions. This time though, we will use different activators for the different layers.
The result should not be significantly different. However, with different width and depth, that may change.
Additional exercise: Try to find a better combination of activation functions
- Adjust the batch size. Try a batch size of 10000. How does the required time change? What about the accuracy?
Solution
Find the line that declares the batch size.
Change BATCH_SIZE from 100 to 10000.
```python
BATCH_SIZE = 10000
```
A bigger batch size results in slower training. That's what we expected from the theory; mini-batching is what gives us the amazing speed increase.
Notice that the validation accuracy starts from a low number and, after 5 epochs, actually finishes at a lower number than before. That's because there are fewer weight updates in a single epoch.
Try a batch size of 30,000 or 50,000. That's very close to single-batch GD for this problem. You will need to change the max epochs to 100 (for instance), as 5 epochs won't be enough to train the model. What do you think about the speed of optimization?
- Adjust the batch size. Try a batch size of 1. That’s the SGD. How do the time and accuracy change? Is the result coherent with the theory?
Solution
Find the line that declares the batch size.
Change BATCH_SIZE from 100 to 1.
```python
BATCH_SIZE = 1
```
A batch size of 1 results in SGD. It takes the algorithm very little time to process a single batch (as it is one data point), but there are thousands of batches (54,000 to be precise), thus the algorithm is actually slow. Remember that this depends on the number of cores you train on: if you are using a CPU with 4 or 8 cores, you can only process 4 or 8 batches at once. The middle ground (mini-batching, such as 100 samples per batch) is optimal.
Notice that the validation accuracy starts from a high number. That's because there are lots of updates in a single epoch. Once the training is over, though, the accuracy is lower than for the other batch sizes (SGD is an approximation).
- Adjust the learning rate. Try a value of 0.0001. Does it make a difference?
Solution
First, we have to define a custom optimizer (as we did in the TensorFlow intro).
We create the custom optimizer with:
```python
custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
```
Then we change the respective argument in model.compile to reflect this:
```python
model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```
Since the learning rate is lower than normal, we may need to adjust the max_epochs (to, say, 50).
The result is basically the same, but we reach it much slower.
While Adam adapts to the problem, if the orders of magnitude are too different, it may not have enough time to adjust accordingly.
- Adjust the learning rate. Try a value of 0.02. Does it make a difference?
Solution
First, we have to define a custom optimizer (as we did in the TensorFlow intro).
We create the custom optimizer with:
```python
custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)
```
Then we change the respective argument in model.compile to reflect this:
```python
model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```
While Adam adapts to the problem, if the orders of magnitude are too different, it may not have time to adjust accordingly. We start overfitting before we can reach a neat solution.
Therefore, for this problem, even 0.02 is a HIGH starting learning rate. What if you try a learning rate of 1?
It's good practice to try 0.001, 0.0001, and 0.00001. If it makes no difference, pick any of them; otherwise it makes sense to keep fiddling with the learning rate.
Combine all the methods above and try to reach a validation accuracy of 98.5+ percent.
Achieving 98.5% accuracy with the methodology we’ve seen so far is extremely hard. A more realistic exercise would be to achieve 98%+ accuracy. However, being pushed to the limit (trying to achieve 98.5%), you have probably learned a whole lot about the machine learning process.
Here is a link where you can check the results that some leading academics got on the MNIST (using different methodologies): https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results
Solution
After some fine tuning, I decided to brute-force the algorithm and created 10 hidden layers with 5000 hidden units each.
```python
hidden_layer_size = 5000
BATCH_SIZE = 150
NUM_EPOCHS = 10
```
All activation functions are ReLU.
There are better solutions using this methodology; this one is simply superior to the one in the lessons. Due to the width and the depth of the algorithm, it took my computer 3 hours and 50 minutes to train it.
Test the Model
After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.
This is also the final stage of the machine learning process.
After we test the model, conceptually we are no longer allowed to change it.
- The main point of the test dataset is to simulate model deployment.
It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.
The test is the absolute final instance. You should not test before you are completely done with adjusting your model.
- Test data is a dataset that the model has truly never seen.
- If you start changing the model after this point, the test data will no longer be a data set the model has never seen.
- If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.
- The validation accuracy is a benchmark for how good the model is.
- The test dataset then is our reality check that prevents us from overfitting the hyperparameters.
Evaluate
Now, we can find out the final true accuracy of the model using the test data.
- model.evaluate() returns the loss value and metric values for the model in 'test mode'.
- Note: getting a test accuracy very close to the validation accuracy shows that we have not overfit.
```python
test_loss, test_accuracy = model.evaluate(test_data)
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy * 100.))
```
Practical Business example. Audiobooks
Our task is to create a machine learning algorithm that can predict if a customer will buy again.
- Learn to load a csv file
- Learn to deal with blank/missing values
- Learn to deal with binary (boolean) values
- A real-life example
Examine Raw Data
Here is the data, provided as a csv file.
- When we feed the csv file to the algorithm, we want no text in the data, so the column headers must not be included.
- Therefore we need to remove the first row. Just do it in the csv file.
The data represents 2 years and 6 months of activity; the inputs are taken from the first 2 years, while the targets come from the final 6 months (see Targets below).
Column Detail:
- ID: just a name; we will skip it in the algorithm
- Book length(mins)_overall: the sum of the lengths of all purchases
- Book length(mins)_avg: the average length of a purchase (the overall length divided by the number of purchases)
- Price_overall: the overall price paid
- Price_avg: the average price paid
- Review: a boolean value showing whether the customer left a review (1 = true, 0 = false)
- Review 10/10: the customer's review score, from 1 to 10
  - Note: for customers who didn't leave a review, we replace the blank cell with the overall average review score (8.91 in this case), representing the status quo
- Minutes listened: total minutes listened, a measure of engagement
- Completion: the completion rate of purchases (total minutes listened relative to the total book length, from 0 to 1)
- Support Requests: the total number of support requests (e.g. forgotten password, assistance), also a measure of engagement
- Last visited minus Purchase date: the difference between the last time a person interacted with the platform and their first purchase date
  - The bigger the difference, the better the engagement
  - If the value is 0, the customer has never accessed what they bought
- Targets: a boolean value indicating whether the customer bought another book in the last 6 months of data
  - If so, we count it as a conversion and the target is 1; otherwise it is 0
Preprocess the Data
Extract the data from csv
We will use the sklearn preprocessing library, as it will be easier to standardize the data.
- Inputs: all columns except the first one and the last one
  - arbitrary customer IDs contain no useful information
- Targets: the last column
```python
import numpy as np

# we will use the sklearn preprocessing library to standardize the data
from sklearn import preprocessing
```
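A sketch of the extraction step (the csv filename is an assumption; substitute the name of your own file):

```python
# load the csv; the header row has already been removed, so the file is purely numeric
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

# inputs: all columns except the ID (first) and the targets (last)
unscaled_inputs_all = raw_csv_data[:, 1:-1]
# targets: the last column
targets_all = raw_csv_data[:, -1]
```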
Note: we can shuffle the indices before balancing too (to remove any day effects, etc.).
However, we still have to shuffle them AFTER we balance the dataset, as otherwise all the 1-targets could end up bunched together in train_targets.
This optional pre-balance shuffle is somewhat suboptimal:
```python
# When the data was collected it was actually arranged by date
# (optional) shuffle once before balancing, to remove any day effects
shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)
unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
targets_all = targets_all[shuffled_indices]
```
Balance the dataset
- Main idea: the counts of all the different targets must be the same.
- In this case, the number of 1-targets must match the number of 0-targets.
- Otherwise, we need to balance the dataset.
```python
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))
```
- ^ np.delete(array, obj_to_delete, axis) is a method that deletes an object along an axis; we use it to drop the surplus 0-target rows (see the sketch below).
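A sketch of the full balancing step under this idea (variable names are illustrative): once we have matched the number of 1s, every further 0-target row is marked for removal.

```python
# keep as many 0s as there are 1s; mark the surplus 0-target rows
zero_targets_counter = 0
indices_to_remove = []
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# delete the marked rows from the inputs and the targets alike
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
```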
Standardize the inputs
- Standardizing the inputs will greatly improve the algorithm.
That’s the only place we use sklearn functionality. We will take advantage of its preprocessing capabilities.
At the end of the business case, you can try to run the algorithm WITHOUT this line of code.
The result will be interesting.
```python
# It's a simple line of code, which standardizes the inputs
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
```
Shuffle the data
- Since we will be batching, we must shuffle the data.
- The data should be as randomly spread as possible, so that batching works well.
```python
# When the data was collected it was actually arranged by date
# shuffle the indices, then use them to reorder inputs and targets together
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
```
- ^ np.arange([start], stop, [step]) is a method that returns evenly spaced values within a given interval.
- ^ np.random.shuffle(X) is a method that shuffles the elements of a given sequence in place.
Split the dataset
Now we split it into training, validation and test.
- We want an 80-10-10 distribution of training, validation, and test.
- Note the sample counts need to be integers!
```python
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]
```
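Given the total, the 80-10-10 counts can be computed like this (note the int() casts, since the counts must be integers):

```python
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
# whatever remains goes to the test set, so the three counts add up to the total
test_samples_count = samples_count - train_samples_count - validation_samples_count
```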
Now we have the size of train, validation and test, let’s create the dataset.
```python
# Training
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]
```
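The validation and test sets follow the same slicing pattern:

```python
# Validation
validation_inputs = shuffled_inputs[train_samples_count : train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count : train_samples_count + validation_samples_count]

# Test
test_inputs = shuffled_inputs[train_samples_count + validation_samples_count :]
test_targets = shuffled_targets[train_samples_count + validation_samples_count :]
```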
- We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test sets were taken from a shuffled dataset. Check whether they are balanced, too. Note that each run gives different values, as the data is shuffled randomly each time.
```python
# Print the number of targets that are 1s, the total number of samples,
# and the proportion of 1s for training, validation, and test
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)
```
^ Remember to check the proportion of 1-targets in each of the three datasets.
Save the data in .npz file
- It is better to name them in a very semantic way so we can easily use them later.
```python
# Save the three datasets in *.npz (the file names are our choice; keep them semantic)
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
```
Special Reminder
Note:
- Each time we run the code in the preprocessing section, we will preprocess the data once again and generate completely new npz files.
- The training, validation, and test datasets will then contain different samples.
Create the machine learning algorithm
From now on, we start a new notebook file and play with the npz files we obtained.
```python
# we must import the libraries once again since we haven't imported them in this file
import numpy as np
import tensorflow as tf
```
Load the Data
```python
# use a temporary variable npz that will store each of the three Audiobooks datasets
npz = np.load('Audiobooks_data_train.npz')

# extract the inputs as floats and the targets as integers
train_inputs = npz['inputs'].astype(np.float64)
train_targets = npz['targets'].astype(np.int64)
```
- ^ np.ndarray.astype() creates a copy of the array, cast to a specific type.
- Targets must be integers if we want to apply one-hot encoding.
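The validation and test data are loaded the same way, reusing npz (the file names match the ones chosen in the save step):

```python
npz = np.load('Audiobooks_data_validation.npz')
validation_inputs = npz['inputs'].astype(np.float64)
validation_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs = npz['inputs'].astype(np.float64)
test_targets = npz['targets'].astype(np.int64)
```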
Outline the model
- We have 10 predictors, therefore the size of the input layer must be 10.
  - Note we don't have to specify it in the model; Keras can infer the input shape when training starts.
  - Alternatively, we can specify the input shape explicitly on the first hidden layer.
- The output size is 2, because there are only 2 kinds of targets.
- This time we don't need to flatten, because we have already preprocessed our data.
- tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias).
- The last layer uses softmax because our model is a classifier.
```python
input_size = 10
output_size = 2
# the hidden layer width is our choice; 50 mirrors the MNIST baseline
hidden_layer_size = 50
```
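A sketch of the model itself: no Flatten layer this time, and two ReLU hidden layers (the depth and width are our assumptions, mirroring the earlier net):

```python
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
    # softmax output: probabilities for the 2 classes (convert / not convert)
    tf.keras.layers.Dense(output_size, activation='softmax')
])
```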
Choose the optimizer and the loss function
We define the optimizer we’d like to use, the loss function, and the metrics we are interested in obtaining at each iteration.
```python
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```
Training
That’s where we train the model we have built.
- To prevent overfitting, we can set up early stopping using the callbacks argument.
- tf.keras.callbacks.EarlyStopping(patience=2) configures the early stopping mechanism of the algorithm.
- patience lets us decide how many consecutive increases (in validation loss) we can tolerate.
```python
# set the batch size; 100 is a reasonable mini-batch choice here
batch_size = 100

# set a maximum number of epochs (early stopping will usually end training sooner)
max_epochs = 100
```
Note: indicating the batch size in model.fit() will automatically batch the data.
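A sketch of the full training call, wiring in the early stopping callback described above (patience=2 as noted; the other values were defined earlier):

```python
# stop training after 2 consecutive increases in validation loss
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_inputs, train_targets,
          batch_size=batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2)
```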
With the trained model, we will be able to predict customers' future behavior correctly most of the time.
- We can use this information for what we intended to.
- We can focus our marketing efforts only on those customers who are likely to convert again.
Test the Model
As we discussed, after training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.
It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.
The test is the absolute final instance. You should not test before you are completely done with adjusting your model.
If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.
Evaluate
```python
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy * 100.))
```
Use the Model to Predict
model.predict() returns the model's outputs (here, class probabilities) for the given inputs.
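A minimal sketch of using it on the test inputs (illustrative only):

```python
# one row of class probabilities per customer
predictions = model.predict(test_inputs)
# the predicted class is the one with the highest probability
predicted_classes = np.argmax(predictions, axis=1)
```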
Save and Load Model
Tensorflow - Save and load models
Save Model
```python
model.save("filename.h5")
```
Load Model
```python
classifier = tf.keras.models.load_model('filename.h5')
```
Reference
The Data Science Course 2020: Complete Data Science Bootcamp