In Udacity’s self-driving car nanodegree I had to solve a really exciting problem: teaching a car to drive itself, using nothing but the incoming camera images (so far only from a simulated world). There was another requirement: I could use only deep learning techniques, so I couldn’t tell my car anything explicit about the world. Here’s the result anyway, and I’m quite proud of it ;)
But it’s still a little bit of a mystery: how does it work? It feels almost unrealistic that with ~1 GB of video data and a pretty simple neural network model I can teach a car to keep itself on the track.
So I decided to figure out how my car sees the world.
My neural network model is very simple. This is the structure:
- Incoming camera data (a 320 x 160 RGB image)
- A couple of image transformations (cropping, palette reduction, and normalization)
- 2 convolutional layers, each followed by a ReLU activation layer
- 2 fully connected layers
- A 1x1 output, which is the predicted steering angle
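The image transformations in the list above can be sketched roughly like this in NumPy. Note that the exact crop rows and the palette reduction step here are illustrative guesses, not the values used in the trained model:

```python
import numpy as np

def preprocess(image):
    """Crop, reduce the palette, and normalize a 160x320x3 RGB frame.

    The crop region and the palette step are assumptions for illustration,
    not the exact values from the real pipeline.
    """
    cropped = image[60:140, :, :]       # keep the road, drop the sky and the hood
    reduced = (cropped // 32) * 32      # coarser palette: 256 -> 8 levels per channel
    normalized = reduced / 127.5 - 1.0  # scale pixel values into [-1, 1]
    return normalized

# A dummy camera frame: height 160, width 320, 3 color channels
frame = np.random.randint(0, 256, size=(160, 320, 3), dtype=np.uint8)
result = preprocess(frame)
print(result.shape)  # (80, 320, 3)
```

Normalizing to a small, zero-centered range is a common choice because it keeps the early layers’ gradients well behaved during training.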
The trained network yields a good enough result, and I guess that should somehow be reflected in the inner layers. So let’s have a look!
When I fed this image into the network, the first convolutional layer had the following values (I apply a crop transformation first, so the car looks only at the road).
Apparently, our network tries to decompose the color channels in a way that makes some layers return a very high contrast between the road and its edges. That makes sense, as our network is trying to figure out the camera’s position relative to the road. But the interesting things happen on the next activation layer:
The network discards a lot of the pixels that are not part of an edge, and it does this on different color channels with different results. For this particular image a single channel would be enough to achieve a good result, but we should remember that this network is trained to recognize road edges under different colors and lighting conditions, so there is a kind of redundancy in the network.
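This conv-then-ReLU behavior can be illustrated on a toy example. The hand-made vertical-edge kernel below is an assumption for illustration (the real network learns its own kernels); it responds strongly at a road edge, and the ReLU then keeps only those responses:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# Toy road image: dark road (0.0) on the left, bright edge line (1.0) on the right
road = np.zeros((6, 8))
road[:, 4:] = 1.0

# A vertical-edge kernel; the trained network learns kernels like this on its own
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

activation = relu(conv2d(road, edge_kernel))
print(activation)  # nonzero only in the two columns around the edge
```

A kernel with the opposite signs would respond to the other side of the edge before the ReLU clips its negative values, which is one reason the network keeps several partly redundant channels.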
The second convolutional layer does the same thing as the first: it decomposes the incoming 9 channels into 18, so at the output of the second convolutional layer every original RGB channel has been decomposed into 6 channels.
The second activation layer apparently returns the most intense values at the edges; however, some decomposed channels show almost no response. These layers are probably activated by inputs with different color combinations.
After the second activation layer the network ‘flattens out’, creating a vector of 14,400 elements (on the plot it’s reshaped into two dimensions). It’s no longer human-readable, but we can see that there is a pattern. It’s probably heavily redundant, which makes sense: it should pass the same values to the next layer for very different input data (for example, the car should turn right whether there is a lake or a tree on the left).
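The flattening and re-plotting can be sketched like this. The 18 × 20 × 40 activation shape is my assumption; all the text tells us is that there are 18 channels and that the total multiplies out to 14,400 elements:

```python
import numpy as np

# Hypothetical activation volume after the second ReLU:
# 18 channels of 20 x 40 feature maps -> 14,400 values in total
activations = np.random.rand(18, 20, 40)

flattened = activations.reshape(-1)  # the 14,400-element vector fed to the dense layer
print(flattened.shape)               # (14400,)

# For plotting, reshape the vector into two dimensions, e.g. a 120 x 120 grid
plot_grid = flattened.reshape(120, 120)
print(plot_grid.shape)               # (120, 120)
```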
The next layer has only 10 cells. There is nothing here for a human to understand. The only important thing is that it passes a single final value, the steering angle, to the output layer.
There are a lot of fascinating things even in such a simple network. First, it is still almost unbelievable to me that this works after ~2 minutes of training (10 epochs, 10,000 samples per epoch).
Second, this network has 145,749 trainable parameters, which is not much for a neural network. In fact, it is a very tiny network: I can train it even on my laptop. Doing such a thing so easily was impossible 10 years ago.
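For the curious, here is one parameter breakdown that is consistent with the 145,749 total, assuming 3 × 3 kernels and the layer sizes mentioned above (the kernel size is my guess, not a stated fact):

```python
# Parameters per layer: weights + biases
conv1 = 3 * 3 * 3 * 9 + 9    # 3x3 kernels, 3 input channels -> 9 feature maps:      252
conv2 = 3 * 3 * 9 * 18 + 18  # 3x3 kernels, 9 input channels -> 18 feature maps:   1,476
dense1 = 14_400 * 10 + 10    # flattened 14,400-vector -> 10 cells:              144,010
dense2 = 10 * 1 + 1          # 10 cells -> 1 steering angle:                          11

total = conv1 + conv2 + dense1 + dense2
print(total)  # 145749
```

Almost all of the parameters sit in the first fully connected layer; the two convolutional layers together contribute fewer than 2,000.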
Third, and this is the most mind-blowing fact: using a very similar architecture (and a power plant to train it :) NVIDIA created a network that can drive a real car, in the real world.