Friday, December 30, 2016

Lernmi: Building a Neural Network in Ruby

Neural networks

A neural network is based somewhat on a real brain - the idea being that data comes into a bunch of neurons at the front, gets propagated through multiple layers, and an answer comes out the end. We've covered this a little in another blog post, so here we'll just focus on what goes on under the covers.

Math time

A neural network is a type of non-linear function that, given enough neurons, can approximate a very wide range of other functions. How well it can approximate is based on the number of links it has between its layers of neurons. In other words, more neurons and more layers generally give you a better (but slower) neural network. If you want to have a play with this idea, have a look at the TensorFlow Playground - if you can't get it to match, try a mix of more layers and more neurons per layer.

More math time

We have a non-linear function, and we need to make it approximate something using stochastic gradient descent. In other words, we get a bunch of samples as input, we adjust the network based on what the output *should* be, and the network changes to give us answers that are slightly closer to what we expect. If our training data is a diverse sample of our real data, then we can train a neural network on our training data, and it'll also work well with other data that it hasn't seen before.

This does seem kind of like magic though, right? What's going on under the covers?

The calculus bit

We generally use a method called Stochastic Gradient Descent, or a method related to it. The "stochastic" part means we randomly choose samples from our training data so that the model doesn't get biased, and all the little nudges we give it start to shape it in the way we want it to behave. The "gradient descent" part means for each sample, we look at how wrong the network currently is (the "error"), and calculate a gradient (or slope, if you like), and nudge the network in that direction.
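To make the "random sample, measure the error, nudge" loop concrete, here is a minimal sketch in plain Python. It is not taken from Lernmi: the function name, the toy data (points on the line y = 3x), the learning rate and the step count are all assumptions chosen for illustration, and the "network" is a single weight.

```python
import random

def sgd_fit_slope(samples, learning_rate=0.1, steps=2000, seed=0):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)              # seeded so the run is repeatable
    w = 0.0                                # start from an arbitrary guess
    for _ in range(steps):
        x, expected = rng.choice(samples)  # "stochastic": pick a random sample
        error = w * x - expected           # actual output minus expected output
        gradient = error * x               # slope of 0.5 * error**2 with respect to w
        w -= learning_rate * gradient      # nudge w a little way downhill
    return w

# Toy training data drawn from y = 3x (an assumption for this example)
samples = [(i / 10.0, 3.0 * i / 10.0) for i in range(-10, 11)]
learned_w = sgd_fit_slope(samples)
print(learned_w)
```

Each step only shrinks the error a little, but after a couple of thousand nudges the weight ends up very close to 3.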

This approach is robust and it works for a lot of things, but the hard part with a neural network is trying to find the gradient for every single weight in the network - if your network is 10 layers deep (a small network by modern standards), how does an error at the output adjust the weights 10 layers back at the start?


We use partial derivatives to find the gradient at each layer, and then use that gradient to adjust the network. At each neuron, we take the sum of all of the downstream gradients (they get "backpropagated" to us), multiplied by the derivative of the activation function applied to the forward-propagated data. We then take this number, multiplied by the forward-propagated data coming in over the previous link, and use a fraction of this (multiplied by the training rate) to adjust the current weight of that link. We also take this number multiplied by the weight of that link and backpropagate that to the previous neuron.

Simple... right?

This is some fiddly stuff if you're not currently a calculus student, and it took me a couple of weeks to get my head around. You can find it all on the Wikipedia page for backpropagation, and there are a couple of papers floating around the internet that explain it in different ways.

It shouldn't be this hard

The idea of multiplying partial derivatives is a simple one, and I spent a long time thinking about how to make this accessible. The architecture that I've settled on for Lernmi has made a couple of trade-offs, but I feel it breaks out the backpropagation into bite-sized pieces, and I hope that it makes it easier to understand.


I've published a neural network application in Ruby, hosted here. The logic is broken up between Neurons and Links, and instead of the word "gradient", I've used the word "sensitivity" as it better translates to what we're using it for - high sensitivity means we adjust the weight more, low sensitivity means we adjust it less.

When we propagate forwards, it's fairly simple. Each Link in a layer takes the output from its input Neuron, multiplies it by its weight, and inputs it to its output Neuron. Each Neuron sums all of its inputs, applies the activation function (the sigmoid function), and waits for the next Link to take the resulting value.
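As a rough illustration of that forward pass, here is a short Python sketch. It is not Lernmi's Ruby code: the function names and the example weights are made up, and the Links of one layer are represented simply as a list of weights per output Neuron.

```python
import math

def sigmoid(x):
    """The activation function: squashes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def forward_layer(inputs, weights):
    """Propagate one layer forwards.

    weights[j][i] is the weight of the Link from input Neuron i to
    output Neuron j (a layout assumed for this sketch).
    """
    outputs = []
    for link_weights in weights:
        # Each Link multiplies its input Neuron's output by its weight...
        total = sum(w * value for w, value in zip(link_weights, inputs))
        # ...and the output Neuron sums those and applies the activation.
        outputs.append(sigmoid(total))
    return outputs

# Two input Neurons feeding two output Neurons (example values only)
layer_output = forward_layer([1.0, 0.0], [[0.0, 0.5], [2.0, -1.0]])
print(layer_output)
```

Stacking layers is then just a matter of feeding one layer's output list in as the next layer's inputs.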

When we backpropagate, the logic is spread across the Neurons and Links. For the output layer, the sensitivity is just the error (the actual output minus the expected output). Each Link in the last layer takes the sensitivity from its output Neuron, and multiplies this by the sensitivity (the derivative) of the activation function - we call this the "output sensitivity". We do this because our neurons use their activation function when they propagate, so we need to find the sensitivity of that, and we multiply it by the sensitivity of the output Neuron to get the sensitivity of the Link. Perhaps a better name could be the "link sensitivity"? We'll call it the "output sensitivity" for now though.

We use the output sensitivity in two ways: when we update the weight, we multiply it by its input value (the larger our input value, the more we probably contributed to the error), and adjust the weight of this Link by that amount. We also need to send a sensitivity back to the input Neuron - we multiply the output sensitivity by our weight, so that the earlier layer knows how much this Link contributed to the error.
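The two uses of the output sensitivity can be sketched for a single Link as follows. This is a Python illustration rather than Lernmi's actual Ruby code, and the names are hypothetical. It assumes the sigmoid activation mentioned earlier, and it subtracts the adjustment from the weight because the sensitivity is defined as actual minus expected.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_link(weight, input_value, neuron_sum, neuron_sensitivity, learning_rate):
    """Backpropagate through one Link.

    neuron_sum is the summed input the output Neuron saw on the forward pass;
    neuron_sensitivity is what that Neuron received from downstream
    (for the output layer: actual output minus expected output).
    """
    out = sigmoid(neuron_sum)
    activation_slope = out * (1.0 - out)       # derivative of the sigmoid
    output_sensitivity = neuron_sensitivity * activation_slope

    # Use 1: adjust this Link's weight, scaled by how big its input was.
    new_weight = weight - learning_rate * output_sensitivity * input_value

    # Use 2: tell the input Neuron how much this Link contributed.
    back_sensitivity = output_sensitivity * weight
    return new_weight, back_sensitivity

# The Neuron output 0.5 but we wanted 1.0, so its sensitivity is 0.5 - 1.0
new_weight, back_sensitivity = backprop_link(
    weight=0.5, input_value=1.0, neuron_sum=0.0,
    neuron_sensitivity=-0.5, learning_rate=1.0)
print(new_weight, back_sensitivity)
```

Here the output was too low, so the weight on a positive input goes up, which is the direction we want.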

Whose fault is this?

Don't be surprised if this is confusing the first time you read through. I would love to find a way to simplify this further - the principle is as simple as asking "what part of the network is to blame for the error". The math is a bit frustrating, as we're ultimately multiplying partial derivatives along the length of the network, but the goal of it is to do lots of tiny adjustments until the network is a good-enough approximation of the underlying data.

What are we approximating though?

This is where it gets a little freaky. It turns out that all sorts of things - handwriting, faces, pictures of various different objects - have a mathematical approximation. With a big enough neural network, you can recognise people, animals, objects, and various different styles of handwriting and signwriting. Even in a board game like chess or Go, you can make a mathematical approximation of which moves are better, to the point where you can have a computer that can play at a world class level.

This isn't the same as intelligence; the computer isn't thinking, but at the same time, what is thinking anyway? Is what we call "thinking" and "judging" just a mathematical approximation in our heads?

Thursday, December 29, 2016

Reading handwritten numbers with a neural network

Computer vision with neural nets

We're going to do a quick dive into how to get started with neural networks that can read text and recognise things in images. There are already a ton of techniques for computer vision that rely on a bunch of clever math, but there's something alluring about a platform as easy as a neural network.

Neural networks

The idea is super super simple. You have a layer of input neurons, and they send their values along to the next layer, and this continues until you reach the end.

CC BY-SA 3.0

The secret sauce here is the "weights" between each layer - a weight of 1.0 means we pass through the value as is, a weight of -1.0 means we pass through the opposite. We also use an "activation function" which is a way to add some complexity to the numbers - ultimately this is what allows us to make neural networks process data in a meaningful manner.

When we talk about "training" a neural network, what we do is pass through some input data, look at the output, and then "backpropagate" the error. In other words, we tell the network what it should have given us as output, and then it goes back and adjusts all its weights a little bit as a result. We consider a network trained when it gives us the right answer most of the time. We consider a neural network "overtrained" when it returns the right answer for data it has been trained on, but still gives us the wrong answer for similar data that it hasn't seen before. This isn't something you need to worry about now, but it's a good thing to keep in mind if you're having trouble in the future.

The MNIST database

This is the best place to start. The MNIST database is a set of 70,000 handwritten digits split into two sets - 60,000 for training on, and 10,000 for testing on. You get a high score by training on the training set and then guessing as many of the testing set as possible. This means that you can't just memorise all the numbers - you need to have a program that can actually read digits and recognise them.

You can get a copy of the MNIST database from

Building a number reader

We'll be using Keras - it's a Python library that lets you build and use neural networks. There's already some sample code for training on MNIST, so let's just go through how that works:

For starters, download Keras

pip install keras

Then fire away


Leave it for a bit, and watch the output

Train on 60000 samples, validate on 10000 samples
Epoch 1/20
60000/60000 [==============================] - 9s - loss: 0.2453 - acc: 0.9248 - val_loss: 0.1055 - val_acc: 0.9677
Epoch 2/20
60000/60000 [==============================] - 9s - loss: 0.1016 - acc: 0.9692 - val_loss: 0.0994 - val_acc: 0.9676
Epoch 3/20
60000/60000 [==============================] - 10s - loss: 0.0753 - acc: 0.9772 - val_loss: 0.0868 - val_acc: 0.9741
Epoch 4/20
60000/60000 [==============================] - 12s - loss: 0.0598 - acc: 0.9818 - val_loss: 0.0748 - val_acc: 0.9787
Epoch 5/20
60000/60000 [==============================] - 12s - loss: 0.0515 - acc: 0.9843 - val_loss: 0.0760 - val_acc: 0.9792
Epoch 6/20
60000/60000 [==============================] - 12s - loss: 0.0433 - acc: 0.9873 - val_loss: 0.0851 - val_acc: 0.9796
Epoch 7/20
60000/60000 [==============================] - 11s - loss: 0.0382 - acc: 0.9884 - val_loss: 0.0773 - val_acc: 0.9820
Epoch 8/20
60000/60000 [==============================] - 11s - loss: 0.0342 - acc: 0.9900 - val_loss: 0.0829 - val_acc: 0.9821
Epoch 9/20
60000/60000 [==============================] - 11s - loss: 0.0333 - acc: 0.9901 - val_loss: 0.0917 - val_acc: 0.9812
Epoch 10/20
60000/60000 [==============================] - 12s - loss: 0.0297 - acc: 0.9915 - val_loss: 0.0943 - val_acc: 0.9804
Epoch 11/20
60000/60000 [==============================] - 11s - loss: 0.0262 - acc: 0.9927 - val_loss: 0.0961 - val_acc: 0.9823
Epoch 12/20
60000/60000 [==============================] - 11s - loss: 0.0244 - acc: 0.9926 - val_loss: 0.0954 - val_acc: 0.9823
Epoch 13/20
60000/60000 [==============================] - 12s - loss: 0.0248 - acc: 0.9938 - val_loss: 0.0868 - val_acc: 0.9828
Epoch 14/20
60000/60000 [==============================] - 12s - loss: 0.0235 - acc: 0.9938 - val_loss: 0.1007 - val_acc: 0.9806
Epoch 15/20
60000/60000 [==============================] - 12s - loss: 0.0198 - acc: 0.9946 - val_loss: 0.0921 - val_acc: 0.9837
Epoch 16/20
60000/60000 [==============================] - 15s - loss: 0.0195 - acc: 0.9946 - val_loss: 0.0978 - val_acc: 0.9842
Epoch 17/20
60000/60000 [==============================] - 15s - loss: 0.0208 - acc: 0.9946 - val_loss: 0.1084 - val_acc: 0.9843
Epoch 18/20
60000/60000 [==============================] - 14s - loss: 0.0206 - acc: 0.9947 - val_loss: 0.1112 - val_acc: 0.9816
Epoch 19/20
60000/60000 [==============================] - 13s - loss: 0.0195 - acc: 0.9951 - val_loss: 0.0986 - val_acc: 0.9845
Epoch 20/20
60000/60000 [==============================] - 11s - loss: 0.0177 - acc: 0.9956 - val_loss: 0.1152 - val_acc: 0.9838
Test score: 0.115194263857
Test accuracy: 0.9838

What just happened here? What do all those numbers mean? Is that good or bad?

Building a neural network

Let's have a look at the code. There's a bunch of imports and prep at the start, but the important stuff is buried right at the bottom

model = Sequential()
model.add(Dense(512, input_shape=(784,)))

We start off by creating a keras.Sequential() object, and then we add layers onto it. The first layer is a Dense layer that takes in a 784-point vector - this is the same size as the handwritten numbers. Each number is a 28x28 pixel greyscale image, and 28x28 = 784 pixels. We set this in the "input_shape" parameter, and the other parameter is the number 512 - that's how many neurons are in this layer.

A Dense layer is the bread and butter of neural nets - it connects every neuron in the layer before to every neuron in the layer after. You'll see there are 3 being used here - the first two have 512 neurons each, and the third one has 10 neurons - we'll find out why in a second.
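Because a Dense layer connects everything to everything, the number of weights grows quickly. The arithmetic can be sketched in a few lines of Python. The helper name is made up, and the sketch assumes each neuron also has one bias value (the Keras default for Dense) and that each layer is fed by the full output of the one before it.

```python
def dense_param_count(n_inputs, n_units):
    """Weights for every input-to-neuron connection, plus one bias per neuron."""
    return n_inputs * n_units + n_units

# 784 input pixels, then the three Dense layers described above
layer_sizes = [784, 512, 512, 10]
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)   # [401920, 262656, 5130] 669706
```

Most of the parameters sit in the very first layer, simply because the input is so wide.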

After the first Dense layer we have an Activation layer. The activation function we use here is "relu" - a Rectified Linear Unit. All it does is pass through any positive numbers, and round up any negative numbers to 0. These are an important part of neural networks - without a non-linear activation between them, a stack of Dense layers collapses into a single linear transformation, no matter how many layers you add.
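A tiny Python sketch shows both what relu does and why it matters. The one-weight "layers" and the weight values here are made up for illustration; real Dense layers use matrices, but the same argument applies.

```python
def relu(x):
    """Pass positive numbers through, turn negative numbers into 0."""
    return max(0.0, x)

def two_linear_layers(x, w1, w2):
    # No activation in between: this is just (w2 * w1) * x, a single layer in disguise.
    return w2 * (w1 * x)

def two_layers_with_relu(x, w1, w2):
    # With relu in between, the result is no longer one straight line.
    return w2 * relu(w1 * x)

print(relu(-2.0), relu(3.0))                       # 0.0 3.0
print(two_linear_layers(1.0, 2.0, 3.0),
      two_linear_layers(-1.0, 2.0, 3.0))           # 6.0 -6.0
print(two_layers_with_relu(1.0, 2.0, 3.0),
      two_layers_with_relu(-1.0, 2.0, 3.0))        # 6.0 0.0
```

The version without relu treats positive and negative inputs symmetrically, while the version with relu bends at zero, and it is bends like this that let a network build up more interesting shapes.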

After the Activation layer we have a Dropout layer. This is to fix a problem called "overfitting" - when your neural network memorises the training data but doesn't actually learn to recognise digits it hasn't seen. You can detect this when your training loss keeps dropping but your validation loss stays high.
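The idea behind dropout can be sketched in plain Python. This is an illustration of the general technique rather than Keras's implementation: the function name and the rate of 0.5 are assumptions, and it uses the common "inverted dropout" convention of scaling the surviving values up so their expected total stays the same.

```python
import random

def dropout(values, rate, training=True, rng=None):
    """Randomly zero a fraction `rate` of the values while training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in the range [0, 1)")
    if not training or rate == 0.0:
        return list(values)            # dropout is switched off when predicting
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)    # scale survivors so the average is unchanged
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]

activations = [1.0] * 1000
dropped = dropout(activations, 0.5, rng=random.Random(0))
print(dropped.count(0.0), "of", len(dropped), "values were dropped")
```

Because a different random set of neurons is silenced on every pass, the network can't rely on any single neuron to memorise a particular training image.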

So this is how we build our neural network, but where do the images come from? How do the images and labels fit into this?

Preparing your data

This is an important step. The input data is a set of images and corresponding labels, as follows:

A sample number 2
With Keras we can just import the dataset, but you'd normally load your images with a library like PIL and then convert them into numpy arrays by hand. First, let's look at the code we're using here:

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

This gives us four arrays: X_train, y_train, X_test, and y_test. The X_ arrays hold the images and the y_ arrays hold the labels; the _train pair is used for training and the _test pair for testing. We use numpy's reshape method to change the images from 2-dimensional 28x28 pixel images to 1-dimensional 784 pixel images, as we're just using a Dense layer next. We then convert from integers to float32, and scale from the 0-255 range to the 0.0-1.0 range as neural networks tend to work best with numbers in this range.
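If the reshape and scaling steps feel opaque, here is what they amount to for a single image, spelled out in plain Python with no libraries. The function name and the made-up test image are assumptions for illustration; numpy does the same job for all the images at once and far faster.

```python
def flatten_and_scale(image):
    """Turn a 2-D grid of 0-255 pixel values into a flat list of 0.0-1.0 floats."""
    return [pixel / 255.0 for row in image for pixel in row]

# A made-up 28x28 image: black everywhere, with one white pixel at row 0, column 1
image = [[0] * 28 for _ in range(28)]
image[0][1] = 255

flat = flatten_and_scale(image)
print(len(flat), flat[0], flat[1])   # 784 0.0 1.0
```

The rows are simply laid end to end, so the pixel at row r, column c ends up at position r * 28 + c in the flat list.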

This is all we need to do here - the rest just works!

What's next?

This is a fairly basic intro to Keras and neural networks, and there's a lot more you can do from here. We lose a lot of information by flattening the image to a 1-dimensional array, so we can get some improvements by using Convolution2D layers to learn a bit more about the shape of the numbers. We can also try making the network a bit deeper by layering more Convolution2D layers, and look at different training techniques.

That's all for now though, so get out there and start training some robots!

Tuesday, December 8, 2015

MPLS testbed on Ubuntu Linux with kernel 4.3

MPLS in the kernel

Linux 4.3 was released last month, and one of the long-awaited features was MPLS support in the kernel. There is still the odd bug to iron out, but you can get a working MPLS testbed with the current kernel source (plus a single patch to fix a showstopper).

Building the kernel

  1. Download the source of kernel 4.3 from here:
  2. Unpack the tarball (tar -xf linux-4.3.tar.xz)
  3. Enter the newly-created linux-4.3 directory, run make menuconfig, and enable lwtunnel support, mpls-iptunnel support, mpls-gso support, and mpls-router support.
  4. Apply the patch from (this fixes a problem with sending MPLS packets)
  5. Build the kernel: make -j `getconf _NPROCESSORS_ONLN`
  6. Once this has finished, build the debian packages: make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-mplsfix
  7. This will create a bunch of .deb files in the parent directory - copy both linux-image-4.3.0-mplsfix_amd64.deb and linux-headers-4.3.0-mplsfix_amd64.deb to the machine you want to install your new kernel on
  8. Install the kernel with dpkg -i [package name]
  9. Reboot, select Advanced options for booting Ubuntu, and choose your new kernel
  10. You are all ready to go!
edit: easier way with a docker container:

Enabling MPLS

The MPLS modules aren't loaded by default, so you'll need to load them yourself:

modprobe mpls_router
modprobe mpls_gso
modprobe mpls_iptunnel
sysctl -w net.mpls.conf.enp0s9.input=1
sysctl -w net.mpls.conf.lo.input=1
sysctl -w net.mpls.platform_labels=1048575

You'll need to set net.mpls.conf.[interface-name].input=1 for any other interfaces that you plan to receive MPLS packets on, otherwise the MPLS route table won't accept your routes.

Applying MPLS routes

The latest release of iproute2 isn't quite ready, so we'll need to live life on the bleeding edge and build this from source too

git clone git://
cd iproute2
sudo make install

Once this is done, we can see that iproute2 has a few more options available for us - try ip route help and see what is available.

Some route examples:

Routing to with label 100: ip route add encap mpls 100 via inet

Label swapping 100 for 200 and sent to ip -f mpls route add 100 as 200 via inet

Decapsulating label 300 and delivering locally: ip -f mpls route add 300 dev lo

Testbed setup

We're going to make use of network namespaces here to set up a couple of hosts. The plan is as follows:
  • Base machine: has veth0 (plugs into veth1) and veth2 (plugs into veth3)
  • Host1: Has veth1 (plugs into veth0)
  • Host2: Has veth3 (plugs into veth2)
We will use label 111 for traffic from host1 to host2, and label 112 for traffic from host2 to host1. We will use penultimate hop popping here (as opposed to label swapping), but feel free to play with this and get different results.

Setup (all executed as root):

ip link add veth0 type veth peer name veth1
ip link add veth2 type veth peer name veth3
sysctl -w net.mpls.conf.veth0.input=1
sysctl -w net.mpls.conf.veth2.input=1
ifconfig veth0 up
ifconfig veth2 up
ip netns add host1
ip netns add host2
ip link set veth1 netns host1
ip link set veth3 netns host2
ip netns exec host1 ifconfig lo up
ip netns exec host1 ifconfig veth1 up
ip netns exec host2 ifconfig lo up
ip netns exec host2 ifconfig veth3 up
ip netns exec host1 ip route add encap mpls 112 via inet
ip netns exec host2 ip route add encap mpls 111 via inet
ip -f mpls route add 111 via inet
ip -f mpls route add 112 via inet

Testing (executed as root due to netns):

ip netns exec host2 ping -I


tcpdump -envi veth0
tcpdump: listening on veth0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:14:14.687380 9a:08:f4:cf:aa:9c > 12:c7:db:9d:a5:25, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 53781, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 1359, seq 1, length 64
21:14:14.687404 12:c7:db:9d:a5:25 > 9a:08:f4:cf:aa:9c, ethertype MPLS unicast (0x8847), length 102: MPLS (label 112, exp 0, [S], ttl 64)
(tos 0x0, ttl 64, id 19009, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 1359, seq 1, length 64
21:14:15.701789 9a:08:f4:cf:aa:9c > 12:c7:db:9d:a5:25, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 53845, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 1359, seq 2, length 64
21:14:15.701810 12:c7:db:9d:a5:25 > 9a:08:f4:cf:aa:9c, ethertype MPLS unicast (0x8847), length 102: MPLS (label 112, exp 0, [S], ttl 64)
(tos 0x0, ttl 64, id 19246, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 1359, seq 2, length 64

tcpdump -envi veth2
tcpdump: listening on veth2, link-type EN10MB (Ethernet), capture size 262144 bytes
21:14:45.714220 8e:d5:9d:07:9a:5c > d6:8a:7c:5e:5b:0f, ethertype MPLS unicast (0x8847), length 102: MPLS (label 111, exp 0, [S], ttl 64)
(tos 0x0, ttl 64, id 55648, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 1363, seq 1, length 64
21:14:45.714251 d6:8a:7c:5e:5b:0f > 8e:d5:9d:07:9a:5c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 22394, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 1363, seq 1, length 64
21:14:46.717538 8e:d5:9d:07:9a:5c > d6:8a:7c:5e:5b:0f, ethertype MPLS unicast (0x8847), length 102: MPLS (label 111, exp 0, [S], ttl 64)
(tos 0x0, ttl 64, id 55848, offset 0, flags [DF], proto ICMP (1), length 84) > ICMP echo request, id 1363, seq 2, length 64
21:14:46.717570 d6:8a:7c:5e:5b:0f > 8e:d5:9d:07:9a:5c, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 22412, offset 0, flags [none], proto ICMP (1), length 84) > ICMP echo reply, id 1363, seq 2, length 64

It works!
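In the captures above, the MPLS frames are 102 bytes long while the plain IPv4 frames are 98 bytes. The 4-byte difference is a single MPLS label stack entry, which packs the label (20 bits), the exp/traffic class field (3 bits), the bottom-of-stack flag (1 bit) and the TTL (8 bits) into one 32-bit word. Here is a small Python sketch of that layout; the function names are made up for illustration, and the values are the ones tcpdump printed for label 112.

```python
import struct

MAX_LABEL = 2**20 - 1   # the label field is 20 bits wide, so 1048575 is the largest value

def encode_mpls_entry(label, tc, bottom_of_stack, ttl):
    """Pack one MPLS label stack entry into 4 bytes (network byte order)."""
    if not 0 <= label <= MAX_LABEL:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= tc <= 7:
        raise ValueError("traffic class (exp) must fit in 3 bits")
    if not 0 <= ttl <= 255:
        raise ValueError("ttl must fit in 8 bits")
    word = (label << 12) | (tc << 9) | (int(bool(bottom_of_stack)) << 8) | ttl
    return struct.pack("!I", word)

def decode_mpls_entry(data):
    """Unpack the first 4 bytes of data into (label, tc, bottom_of_stack, ttl)."""
    (word,) = struct.unpack("!I", data[:4])
    return (word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF)

# "MPLS (label 112, exp 0, [S], ttl 64)" from the veth0 capture
entry = encode_mpls_entry(112, 0, True, 64)
print(entry.hex())                 # 00070140
print(decode_mpls_entry(entry))    # (112, 0, True, 64)
```

The [S] flag in the tcpdump output is the bottom-of-stack bit, which is set here because each packet carries only one label.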

Next steps

We have software routers such as Quagga and BIRD, and these speak some of the more traditional protocols such as OSPF and BGP. We now need LDP daemons, and other Linux software to stand up l2vpn and l3vpn.

Thanks to the team on the netdev mailing list, who have been super responsive and helpful.

Thursday, July 9, 2015

Deepdream: What do all the layers do?

I spent last night getting my computer prepped for some deep dreaming, and it left me thinking: What do all the different layers do? There are over a hundred to choose from, so why not iterate through them all and see what happens?

I used this as my starting point:

My base picture is one I took from a plane out of Queenstown (munged version here), resized to 400px wide, and run through the layers as follows: