Introduction
This post is designed to show the internal changes of an Artificial Neural Network (ANN/NN); it shows the outputs of the neurons from the beginning of a backpropagation run through to convergence.
The hope is to build a better understanding of why we use a second hidden layer, of local minima, of how many hidden nodes are required, and of their impact on the final solution.
The Dataset
| # | X | Y | Class | # | X | Y | Class | # | X | Y | Class |
|---|-----|-----|-------|----|-----|-----|-------|----|-----|------|-------|
| 1 | 0.1 | 0.1 | 1 | 13 | 0.9 | 0.5 | 1 | 25 | 0.7 | 0.4 | 0 |
| 2 | 0.3 | 0.1 | 1 | 14 | 0.8 | 0.7 | 1 | 26 | 0.3 | 0.5 | 0 |
| 3 | 0.5 | 0.1 | 1 | 15 | 0.9 | 0.7 | 1 | 27 | 0.2 | 0.55 | 0 |
| 4 | 0.7 | 0.1 | 1 | 16 | 0.1 | 0.9 | 1 | 28 | 0.4 | 0.55 | 0 |
| 5 | 0.9 | 0.1 | 1 | 17 | 0.3 | 0.9 | 1 | 29 | 0.1 | 0.65 | 0 |
| 6 | 0.1 | 0.3 | 1 | 18 | 0.5 | 0.9 | 1 | 30 | 0.2 | 0.65 | 0 |
| 7 | 0.3 | 0.3 | 1 | 19 | 0.7 | 0.9 | 1 | 31 | 0.3 | 0.65 | 0 |
| 8 | 0.5 | 0.3 | 1 | 20 | 0.9 | 0.9 | 1 | 32 | 0.4 | 0.65 | 0 |
| 9 | 0.9 | 0.3 | 1 | 21 | 0.7 | 0.2 | 0 | 33 | 0.5 | 0.65 | 0 |
| 10 | 0.1 | 0.5 | 1 | 22 | 0.6 | 0.3 | 0 | 34 | 0.2 | 0.7 | 0 |
| 11 | 0.5 | 0.5 | 1 | 23 | 0.7 | 0.3 | 0 | 35 | 0.3 | 0.7 | 0 |
| 12 | 0.7 | 0.5 | 1 | 24 | 0.8 | 0.3 | 0 | 36 | 0.4 | 0.7 | 0 |
| 37 | 0.3 | 0.8 | 0 |  |  |  |  |  |  |  |  |
The dataset is designed to look like holes in the ground and was chosen specifically to challenge the network to surround the blue circles and separate them.
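For readers who want to reproduce the experiments, the 37 points above can be written directly as arrays. The sketch below assumes plain NumPy (the post does not specify a language) and uses only the values from the table.

```python
import numpy as np

# Inputs (X, Y) and targets taken directly from the table above.
# Points 1-20 belong to class 1; points 21-37 (the "holes") to class 0.
inputs = np.array([
    [0.1, 0.1], [0.3, 0.1], [0.5, 0.1], [0.7, 0.1], [0.9, 0.1],   # 1-5
    [0.1, 0.3], [0.3, 0.3], [0.5, 0.3], [0.9, 0.3],               # 6-9
    [0.1, 0.5], [0.5, 0.5], [0.7, 0.5], [0.9, 0.5],               # 10-13
    [0.8, 0.7], [0.9, 0.7], [0.1, 0.9], [0.3, 0.9],               # 14-17
    [0.5, 0.9], [0.7, 0.9], [0.9, 0.9],                           # 18-20
    [0.7, 0.2], [0.6, 0.3], [0.7, 0.3], [0.8, 0.3],               # 21-24
    [0.7, 0.4], [0.3, 0.5], [0.2, 0.55], [0.4, 0.55],             # 25-28
    [0.1, 0.65], [0.2, 0.65], [0.3, 0.65], [0.4, 0.65],           # 29-32
    [0.5, 0.65], [0.2, 0.7], [0.3, 0.7], [0.4, 0.7], [0.3, 0.8],  # 33-37
])
targets = np.array([1] * 20 + [0] * 17, dtype=float)
```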
Ideal solution
The ideal solution is for the network to surround the blue circles perfectly, in a class of their own, and to avoid a connected solution like the purple enclosure.
After surrounding the holes, the main challenges are:
- Clean separation between the holes
- Commitment of resources to seal Region 1 cleanly
- Whether the grid position in Filler 1 will be a 1 or a 0
- How the network will represent Box 1: will it be closed or open? It is not designed to be a boundary and was left deliberately ambiguous.
Animations
The following animations show the output node as a red mesh and the hidden-layer nodes in various colours; some colours may be used more than once. Each animation shows the nodes of one hidden layer together with the output node, but not the hidden layers together. The animations run from the start of the algorithm, with the random starting weights, up to convergence; the bias nodes are not shown.
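The animations themselves cannot be reproduced in text, but a single frame of one such mesh can be sketched roughly as follows. The node weights, grid resolution, viewing angle and colour are illustrative assumptions; the sigmoid activation is assumed from the "S" shapes discussed later in the post.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_mesh(w, b, xx, yy):
    # Z output of a single first-layer node over a grid of (X, Y) inputs.
    return sigmoid(w[0] * xx + w[1] * yy + b)

# Evaluate one hidden node (illustrative weights) over the unit square.
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
zz = node_mesh(np.array([6.0, -4.0]), -1.0, xx, yy)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(xx, yy, zz, color="red", alpha=0.5)  # one mesh, output-node style
ax.view_init(elev=30, azim=-60)                      # elevation/azimuth of the view
plt.show()
```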
One hidden layer (2-15-1)
An NN of 2-10-1 and two NNs of 2-15-1 were tried before a solution was found. Despite the network learning the dataset to the desired convergence threshold of 10% total error, three hidden nodes were found to be practically unused.
Unused in the sense that, after all the backpropagation of errors had taken place, their Z output value did not rise above 11%; in fact, had the convergence threshold been set at 1% total error, I am convinced their contribution would have been even smaller. These are the Yellow, Dark Orange and Blue meshes.
They can be seen when the animation has an elevation of 0°; although these colours were used twice, the other instances had Z output values that were much higher. This suggests that a network with a single hidden layer of 12 nodes may be sufficient to learn the dataset.
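One way to check for "practically unused" nodes in this sense is to look at each hidden node's maximum Z output over the training inputs after training. The sketch below assumes the trained first-layer weights are available as a 2×15 matrix `W1` with bias vector `b1` (hypothetical names) and flags nodes whose output never rises above 0.11.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unused_hidden_nodes(W1, b1, inputs, threshold=0.11):
    """Return indices of hidden nodes whose Z output never rises above `threshold`.

    W1: weights of shape (2, n_hidden), b1: biases of shape (n_hidden,),
    inputs: training inputs of shape (n_samples, 2).
    """
    z = sigmoid(inputs @ W1 + b1)            # (n_samples, n_hidden) activations
    return np.where(z.max(axis=0) <= threshold)[0]
```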
Two hidden layers (2-6-3-1)
A network using two hidden layers, with 6 nodes in the first and 3 in the second.
2nd hidden layer + output node
1st hidden layer + output node
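As a point of reference, the forward pass of a 2-6-3-1 network of this shape could be sketched in plain NumPy as below. This is a sketch only: the sigmoid activation is assumed from the meshes described above, and the random starting weights stand in for the ones used in the animations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Random starting weights for a 2-6-3-1 network (biases included).
W1, b1 = rng.normal(size=(2, 6)), np.zeros(6)   # inputs     -> 1st hidden layer
W2, b2 = rng.normal(size=(6, 3)), np.zeros(3)   # 1st hidden -> 2nd hidden layer
W3, b3 = rng.normal(size=(3, 1)), np.zeros(1)   # 2nd hidden -> output node

def forward(x):
    h1 = sigmoid(x @ W1 + b1)     # first hidden layer (the visible "S" shapes)
    h2 = sigmoid(h1 @ W2 + b2)    # second hidden layer: a function of functions
    return sigmoid(h2 @ W3 + b3)  # output node (the red mesh)
```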
Two hidden layers (2-8-6-1)
A network using two hidden layers, with 8 nodes in the first and 6 in the second; the output node is animated separately this time, but inverted.
2nd hidden layer + output node
1st hidden layer + output node
Observations (so far)
The initial thinking was that a single hidden layer would not be able to learn the dataset because the zero output values were isolated, not continuous. The smallest network observed to have met the challenges and learnt the dataset contained two hidden layers of six and three nodes (2-6-3-1); unfortunately, that animation was not captured.
The sigmoid curve, the "S" shape, is clearly visible in the layer closer to the input nodes, but when a second hidden layer is used it is not noticeable there. This shows a function of functions developing in the second layer, allowing more complex models to be learned.
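In symbols (using the same sigmoid assumed throughout), what the second layer adds is a composition: each second-layer node applies the sigmoid to a weighted sum of first-layer sigmoids, so the output surface is no longer a single "S" shape.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
y = \sigma\!\Big(\sum_k w^{(3)}_{k}\,
      \sigma\Big(\sum_j w^{(2)}_{jk}\,
      \sigma\Big(\sum_i w^{(1)}_{ij} x_i + b^{(1)}_j\Big) + b^{(2)}_k\Big) + b^{(3)}\Big)
```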
Convergence seems to happen in three observable ways; a rough sketch of the stopping logic follows the list:
- The convergence threshold is met and the dataset is fully learnt; this seldom happens with minimal resources.
- Convergence on a local minimum. This can happen near the global minimum or quite early, such as when the network gets stuck in a local loop and oscillates between values; it can happen high up on the error surface if there is insufficient momentum to roll out of the local minimum, leaving the network simply trapped.
- Convergence due to the path taken by the network. Unused resources are cut off and the chosen path simply does not have enough resources to model the dataset; it runs out of road.
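The stopping behaviour behind these three outcomes can be sketched roughly as follows. The 10% total-error threshold is the one mentioned earlier (sum-of-squared error is assumed here, since the post does not spell out the definition); the learning rate, momentum coefficient and epoch cap are illustrative, and `forward` and `backprop_gradients` are hypothetical helpers.

```python
import numpy as np

def train(weights, inputs, targets, forward, backprop_gradients,
          lr=0.5, momentum=0.9, error_threshold=0.10, max_epochs=100_000):
    """Gradient descent with momentum, stopping on a total-error threshold.

    `forward(weights, inputs)` and `backprop_gradients(weights, inputs, targets)`
    are assumed helpers; `weights` is a list of arrays and the returned
    gradients match its shapes.
    """
    velocity = [np.zeros_like(w) for w in weights]
    for epoch in range(max_epochs):
        total_error = np.sum((forward(weights, inputs) - targets) ** 2)
        if total_error <= error_threshold:
            return weights, epoch          # threshold met: dataset fully learnt
        grads = backprop_gradients(weights, inputs, targets)
        for w, v, g in zip(weights, velocity, grads):
            v *= momentum                  # momentum can roll the weights out of shallow minima
            v -= lr * g
            w += v
    return weights, max_epochs             # stopped without meeting the threshold
```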
Other animations will be uploaded with observations, such as the variations in the number and types of local minima that occurred. The networks with a single hidden layer, even with sufficient hidden nodes, tended to get stuck in local minima more often and took much longer to train than the NNs with two hidden layers.