Saturday, December 20, 2014

How I built my own fountain code

What problem are we trying to solve?

Our phones and computers can only send data in very small chunks at a time. The internet deals with this by breaking up files (webpages, images, videos etc) into chunks and hoping that they can all be reassembled at the other end. This works fairly well, most of the time, and this is the reason that most of the internet runs on Transmission Control Protocol (TCP).

Isn't this good enough?

TCP has been "good enough" for a few decades now, but it would be nice to have some other options. Here are a couple of scenarios where it would be nice to have something different

Live content

When you're streaming a live event, you can't pause and wait for the download to catch up, or else it's no longer live anymore. TCP isn't suitable for this, as its way of dealing with packet loss is slowing down and resending the lost packets. There are codecs that can handle small amounts of packet loss, but it would be nice if we could trade off more bandwidth to hide any packet loss that might happen

Content Distribution Networks (CDNs)

Chances are that you downloaded this webpage from a different server to the one that I uploaded it to. The global nature of the internet means that requests for content are often directed to a server that's close to you. The way the internet currently works means you can only really download a file from a single server at a time - it's difficult to request it from multiple sources, and there's a chance you'll get multiple copies of the same chunks if you do.

How does a fountain code help?

We can describe TCP as being like a jigsaw puzzle, or a stamp collection. If you have a 1000 piece jigsaw puzzle, and lose a few of the pieces, you can't just get any other random pieces - they have to be the right pieces, or else the puzzle isn't complete. Similarly, the value of a stamp collection is generally in its completeness, rather than the sheer number of stamps there are. 1000 different stamps from a particular collection are valuable, whereas 1000 of the same stamp is rather boring.

So what is a fountain code?

Fountain codes are a type of forward error correction (FEC); specifically, a type of erasure code. An ideal fountain code is like filling up a glass of water at a fountain. You don't care which droplets land in the glass, but you do care about whether the glass is full. The current state-of-the-art fountain codes are based on Luby Transform Codes, which are based on a thing called the XOR operator.

XOR operators

If I have two numbers, 1010, and 1011, I can combine them using an XOR. 1010 xor 1011 = 0001. You line up the two numbers, and if the digits are the same, you write down a 0 - if the digits are different, then you write down a 1. XOR operators are cool because they work just as well in reverse - 1010 xor 0001 = 1011, and 1011 xor 0001 = 1010. We can use this when we break up a file to create extra pieces that can help us out if we get stuck (as a side note, this is how Stereo sound works on FM radio).

LT codes

For example, let's say we have a file that we break up into 5 pieces: a b c d e. We can also send other pieces, such as "a xor c", or maybe "b xor d xor e". If we send all 7, but the other person only receives 5, then they might be able to reconstruct the original message. For example, if you have pieces a, c, d, e, and "b xor d xor e", then you can xor d and e against "b xor d xor e" to extract b - this is the missing piece!

These sound cool - aren't they good enough?

Short answer: you need to pay to use LT codes or Raptor codes (except in New Zealand where software patents are illegal!) (I am not a lawyer btw). They're also the first practical implementation of fountain codes, and it stands to reason that there might be another way to build them that relies on different math.

They can also be a bit picky sometimes about *which* pieces you get. An ideal fountain code would work with any combination of pieces, but may have other issues that make it less practical. There's also a non-zero chance that there's a way cooler code out there, just waiting to be found!

... so I built my own

To quote my old math lecturer Geoff Whittle (congratulations by the way), "last night while I was lying awake thinking about math, I had an idea", so I got up and started writing down notes, and have spent the day re-learning field theory and cutting code. I wanted a system that didn't care at all which pieces you got, and managed to cobble something together in the end using linear equations.

Simultaneous equations

Remember doing these in high school? They usually looked something like this:

x + 2y = 25
3x + 4y = 9

The idea was that as long as you had as many equations as unknowns, you could usually solve the equations. The way you'd solve this was something along the lines of

(1) x + 2y = 25
(2) 3x + 4y = 9
(subtract (1) from (2) three times)
(3a) 0x - 2y = -66
(3b) -2y = -66
(divide by -2)
(4) y = 33
(substitute y for 33 in eqn 1)
(5a) x + 2*33 = 25
(5b) x + 66 = 25
(subtract 66)
(6a) x + 66 - 66 = 25 - 66
(6b) x = -41
x = -41
y = 33


There's no reason why x and y can't stand for multiple things though. Let's try this again:

x + 2y = 25, 49, 10, 58, 23, 95, 23, 47
3x + 4y = 9, 12, 67, 34, 23, 67, 98, 23

This looks kinda odd, but it still works. In addition, we've already done the work solving the first pair (25 and 9), so instead of solving again, we can just run the 6-steps for each of the other pairs and they'll solve too.

We can make this bigger if you like:

a + 31b + 4c + 2d + 45e + 5f = 12, 45, 65, 29 ..
3a + 24b + 2c + 23d + 23e + 11f = 56, 234, 92, 30 ..
6a + 45b + 4c + 65d + 56e + 89f = 39, 12, 6, 3 ..
5a + 6b + 34c + 4d + 78e + 56f = 34, 12, 67, 45 ..
23a + 23b + 12c + 43d + 12e + 23f = 4, 79, 20, 3 ..

It's worth pointing out that these equations are arbitrarily chosen - the receiver needs to get 5 of them to reconstruct the message, but if they miss a piece then we just choose a random new set of coefficients, calculate the answers, and send the packet down the line.

Results

It works! If you want 10 pieces, you can rebuild the message from *any* 10 pieces - I've run this about 50 times today and have yet to see decoding fail.

Improvements

You'll notice I haven't used any fractions or decimal points. There's not enough room here to get into this today, but I've used Galois Fields of order 2^8 with generator polynomial x^8 + x^4 + x^3 + x + 1. In short, if you redefine what addition and multiplication mean, and keep the numbers in the range of 0-255 (one byte, 8 bits) then you can avoid fractions while speeding up the calculations.

Problems

There's quite a bit of overhead involved. Each packet has to have the coefficients on the front, and needs one coefficient for each piece you've split the message into. If you have 100 pieces, you need 100 bytes of overhead on the front. If you have 1000 pieces, you have 1000 bytes of overhead on the front. If we keep in mind that an internet packet only holds about 1400 bytes, it means we have some serious limits here. If we want to send a 100KB message, we need to split it into about 100 pieces, and each piece has about 10% coding overhead. A 1MB message is out of the question. There are ways around this, where a 1GB file gets split up into 10,000 x 100KB messages, and you dump pieces onto the wire and the receiver lets you know once they've reconstructed parts, but this wasn't quite the ideal fountain code that I was hoping for.

In addition to this, the reference implementation really struggles to do the calculations once you get past 20-50 pieces. I suspect that most of this is due to the code being in python/numpy, and there are plenty of optimisations that could be made.

update 21/12/2014

The coefficients are random anyway, so I've changed things to make it just transmit a 4-byte seed for python's random.randint(). Now it'll handle any number of pieces with constant byte overhead, but 130 (for a 175KB file) pieces still takes a bit of effort to process (read: 2 minutes at each end to encode/decode). Given the encode time is similar to the decode time, I suspect the slowness is caused by the matrix multiplication code, or some double-copying of the GF(2^8) multiplication table, rather than my Gauss-Jordan implementation being dodgy. This gives me hope that I can speed things up by a few orders of magnitude and start testing with multiple MB files soon!

update 22/12/2014

It's been brought to my attention that this work has previously been covered by Muriel M├ędard at MIT. Looks like I have a bit more reading to do :)

Future work

There's still an ideal fountain code out there, waiting to be found. I sincerely hope somebody finds my work useful, but in any case, it's been a good day's work.

Code pls

Sunday, November 30, 2014

Bootstrapping LXCs behind an APT proxy

Had some fun getting lxc-create to do its stuff this morning, the /etc/default/lxc config file has half a hint but not a whole one:

# MIRROR to be used by ubuntu template at container creation:
# Leaving it undefined is fine
#MIRROR="http://archive.ubuntu.com/ubuntu"
# or 
#MIRROR="http://<host-ip-addr>:3142/archive.ubuntu.com/ubuntu"

# LXC_AUTO - whether or not to start containers symlinked under
# /etc/lxc/auto
LXC_AUTO="true"

# Leave USE_LXC_BRIDGE as "true" if you want to use lxcbr0 for your
# containers.  Set to "false" if you'll use virbr0 or another existing
# bridge, or mavlan to your host's NIC.
USE_LXC_BRIDGE="true"

# If you change the LXC_BRIDGE to something other than lxcbr0, then
# you will also need to update your /etc/lxc/lxc.conf as well as the
# configuration (/var/lib/lxc/<container>/config) for any containers
# already created using the default config to reflect the new bridge
# name.
# If you have the dnsmasq daemon installed, you'll also have to update
# /etc/dnsmasq.d/lxc and restart the system wide dnsmasq daemon.
LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.3.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.3.0/24"
LXC_DHCP_RANGE="10.0.3.2,10.0.3.254"
LXC_DHCP_MAX="253"

LXC_SHUTDOWN_TIMEOUT=120

The line that it's missing is "you'll also need to set the SECURITY_MIRROR field or else it dies halfway" - so add the following to your /etc/default/lxc and you're good to go:

MIRROR="http://IP-OF-PROXY:3142/archive.ubuntu.com/ubuntu"
SECURITY_MIRROR="http://IP-OF-PROXY:3142/security.ubuntu.com/ubuntu"


Tuesday, November 25, 2014

How to install Ubuntu Server 14.04 on a Mac

We have an old Mac tower in the office that I've been using as a headless server, and last time around I couldn't even get 12.04 server to install - I needed to install 12.04 desktop.

This time around, the Ubuntu server 14.04 install works fine... except it hangs at boot

Looking at the output, it has an issue with plymouth-upstart-bridge, then tries to mount swap, and then it does nothing...

I came across this post that had the answer - booting will work if you boot in recovery mode (grub -> advanced options -> ubuntu recovery) and then resume the boot. The permanent fix is to add "nomodeset" to grub.

Instructions from the beginning


  1. Copy ubuntu 14.04 amd64 server iso onto a USB drive (sudo dd if=ubuntu-14.04-server-amd64.iso of=/dev/sdX bs=1M) (*don't blame me* if you accidentally override your main harddrive)
  2. Plug usb key into server, boot up (if it doesn't boot off USB you may need to hold down ALT to get the magic screen and then choose EFI USB boot)
  3. Run install like you normally would
  4. Find that it hangs on boot
  5. Boot again but in recovery mode (at GRUB choose "advanced options" and boot into recovery mode)
  6. Log in
  7. sudo vim /etc/default/grub
  8. Edit the line GRUB_CMDLINE_LINUX_DEFAULT="" to say GRUB_CMDLINE_LINUX_DEFAULT="nomodeset"
  9. Save file
  10. sudo update-grub
  11. Reboot
  12. *magic it works yay*

Tuesday, August 5, 2014

Private servers on EC2

Build all the servers (or, how to setup a secure-ish private minecraft server on EC2)

At the time of writing, Amazon EC2 gives you a free virtual machine for the first 12 months of a new account. It's not huge, but it has a CPU and 1GB of RAM, and that gives you a few possibilities. A VM in the cloud is cool, but there's an obvious downside to not having it in your own house/office/parents' basement

Open servers on the internet

If you want to do anything cool with your VM, chance are you'll end up opening ports, and introducing security holes. You may say that you're not worried about this, that your server isn't super important, but if someone can own your VM, then they're one step closer to breaking into your own computer, and that's a bad thing.

In short, you need a way to lock things down, and that's what we're going to do here. But first, let's spin up an EC2 VM

Ubuntu VMs on EC2


Go to the Compute menu, and click on Launch Instance



Check the Free tier only box, and click on the Ubuntu 64 bit VM


You'll see the green highlighted free tier eligible text next to the smallest VM, select that and click Next: Configure Instance Details


Leave as is and click next


Nothing to see here, click next


You can add tags if you want, but we don't need them for anything in particular right now


You'll want to make some adjustments to the security groups - click Add Rule and make a custom TCP rule opening up port 1723 - this will allow PPTP VPN connections.


Your config should look like mine here


And now you're good to go, click Launch and head to the final step


Make a name for your keypair and download it



Annnnnnnd you're all good to go!

To log in (instructions for OSX/Linux, windows users can figure out how to do this in PuTTY themselves), use the -i switch to load the private key (make sure you chmod it to 600), and log in with the user ubuntu. On the "compute" screen you'll be able to see the IP for your server, log in as follows:

ssh -i keyfile.pem ubuntu@IP

And you're in! Feel free to have a play with this as is, or move onto the next step - setting up containers.

Linux containers (LXC)

Making them work

LXCs are the secret sauce that'll let you make faux-VMs to put things in. Install LXC with the following:

sudo apt-get update
sudo apt-get install lxc

This will load up everything you need to build an LXC. To build your LXC, use the following. We'll call our container "minecraft", for no particular reason:

sudo lxc-create -t ubuntu -n minecraft

This will take a long time, so get yourself a coffee or three. This only happens the first time, as it sets up the template - subsequent LXCs will take seconds when you make them.

Once this finishes, start it up and log in (CTRL+A followed by the Q key will get you out of the console)

sudo lxc-start -d -n minecraft
sudo lxc-console -n minecraft

Here's your container! If for some reason, you were wanting to install minecraft, you would install the JDK and download the minecraft JAR file from the appropriate website. The below lines will install version 1.7.10, you'll probably want whatever the latest version is.

sudo apt-get install default-jdk
wget https://s3.amazonaws.com/Minecraft.Download/versions/1.7.10/minecraft_server.1.7.10.jar

Making things a little nicer

Your LXCs will use DHCP when you spin them up, which is great for getting things going at first, but we want a little bit more control. Go back to the host machine (CTRL+A, then Q, if you aren't in it already) and open up /etc/default/lxc-net.conf and change the DHCP settings so we can open up some space - change the settings in there to the following:

LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.3.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.3.0/24"
LXC_DHCP_RANGE="10.0.3.200,10.0.3.240"
LXC_DHCP_MAX="40"

We'll also want to give our LXC a static IP, so get back into your LXC

sudo lxc-console -n minecraft

and edit /etc/network/interfaces (if you're still in the host you can edit this at /var/lib/lxc/minecraft/rootfs/etc/network/interfaces if you like). Remove the line

iface eth0 inet dhcp

and replace with the following

iface eth0 inet static
        address 10.0.3.250
        netmask 255.255.255.0
        gateway 10.0.3.1

Save and close, and power off your container for now

sudo poweroff

This is all we need from the LXC for now, let's set up the VPN.

PPTP VPNs

We're using PPTP today because you can get it working in minutes, and it has native clients in Ubuntu and OSX. As per the instructions on the Ubuntu website, we'll install as follows:

sudo apt-get install pptpd

Then edit /etc/pptpd.conf, scroll to the bottom, and set up addresses for clients. I've done this as follows:

localip 10.0.3.1
remoteip 10.0.3.2-22

Open /etc/ppp/pptpd-options, and add DNS

ms-dns 8.8.8.8
ms-dns 8.8.4.4

Then edit /etc/ppp/chap-secrets and set up accounts. This shouldn't need to be said, but just in case security isn't clear, DO NOT USE THESE LOGINS AS EVERYONE ON THE INTERNET KNOWS WHAT THEY ARE NOW

user1 * password1 *
user2 * password2 *

Now restart the PPTP server

/etc/init.d/pptpd restart

And give it a shot!

Putting it all together

Weird stuff can happen if you haven't shut everything down, so the safest way to apply your LXC settings is reboot the whole VM

reboot

Now log back in, start up your LXC, and then try to log in from the outside. Connect to your server with your PPTP account, then once the VPN is up, try to ping your LXC on 10.0.3.250. If you've installed a minecraft server on it, log into the LXC and fire it up with the following:

java -Xms512M -Xmx1G -jar minecraft_server.1.7.10.jar nogui

Edit the newly-generated eula.txt to say true, start up again, and test it out... remember to use the LXC's internal IP (10.0.3.250) when connecting though!

Troubleshooting


  • Try turning it off and on again. I am not joking.
  • Make sure you're fully upgraded (apt-get upgrade/apt-get dist-upgrade) - LXC has dependencies on newer versions of certain code, but doesn't automatically upgrade them for you.

What have we done?!

What's the point in putting a VM inside a VM in the cloud? The reason you might consider this, is because while cloud resources are cheap and plentiful, they're also hard to configure properly to allow full access to you, but completely lock out everyone else. I can't make any guarantees on the security of this - LXC is still new, and there are further things we can do to lock down the internal bridge - but I hope this has been a useful start to get people thinking about bridging out into private servers in public clouds such as EC2.

Tuesday, May 20, 2014

10G > 1G

10Gb/s ain't what it used to be

It was only a few years ago that 10Gb/s kit cost 10's of thousands of dollars and needed massive XenPaks to plug in as optics. It's now 2014, 10Gb/s SFPs cost about $200 each, and the closest I get to a XenPak is the broken one I use as a bottle opener.

Because it's so cheap, it's a no-brainer to put 10Gb/s NICs in your servers, but there's no guarantee that your network can support 10Gb/s the whole way through. You might think that a 1Gb/s bottleneck in your network isn't a big deal, and that TCP can fumble around and find the top speed for your connection, but you might be disappointed to hear that it's not that easy.

TCP is dumb

TCP doesn't have any parameters, internal or external, for how fast it's sending data. It has a window of how much unaccounted data has been sent, and this window moves along as acknowledgements are received. The size of this window sets an upper bound on the average speed (based on latency and packet loss, feel free to explore this), but not on the maximum speed. This becomes a problem as bandwidth and latency both increase.

Long fat networks

The TCP window keeps track of every byte "in flight" - in other words, all data that has been sent and not acknowledged. It can't send any more data until the first lot of data has been acknowledged, and it needs a buffer (window) to track this. The smallest this buffer can be is latency x bandwidth, and this number can get very big very quickly. If you're trying to send at 1Gb/s to a destination 160ms away, you need a window of 20MB - if you want to do this at 10Gb/s, you need a window of 200MB! Compared to the 128GB flash drives that clip onto your keyring, this doesn't seem like a huge amount, but to the switches and routers in your network, this is a lot to soak up if your traffic has to go across a slow part of the network

How 10 goes into 1

If your sender and receiver have 10Gb/s connections, but the network has a 1Gb/s segment in the middle, you can run into interesting problems. With 160ms of latency in the way, your sender dumps 20MB onto the network at 10Gb/s - in the time it takes to arrive at the start of the 1Gb/s segment, the 1Gb/s segment can only send 2MB - leaving 18MB to be dealt with. If you have big enough buffers, then this will eke out into the network at 1Gb/s and everything will be fine!

However, 20MB is a big buffer - it holds 160ms of data. We've all seen buffer bloat (when buffers fill up and stay full and add extra latency to the network), and hear about it being a bad thing, but this is an instance where buffers are *very* important. If you have no buffers, your TCP stream starts up and immediately drops 90% of its packets, and things go very bad, very fast.

Labbing this up

You can simulate this yourself between two Linux machines with tc and iperf. First, plug them into each other at 10Gb/s, make sure they're tuned (net.ipv4.tcp_rmem/wmem need to have the last parameter set to about 64MB), and test between them. Assuming sub-millisecond latency, you should see very close to 10Gb/s of TCP throughput (if not, the servers are mistuned, or underpowered).

box1: iperf -i1 -s
box2: iperf -i1 -c box1 -t300

Looking good? If not, you're out of luck in this instance - TCP tuning is outside the scope of this post, go and ask ESnet what to do.

Assuming this is all working, we'll add some latency as an egress filter on box1, and see what happens

sudo tc qdisc add dev eth0 netem delay 50ms limit 10000

Try the iperf again - is it still working well? If you're not averaging 7-8Gb/s then you might want to do some tuning, and come back when it's looking better.

Now we've got a known-good at 50ms, let's try simulating a 1Gb/s bottleneck in the network. Apply an ingress policer to box2 as follows:

sudo tc qdisc add dev eth0 handle ffff: ingress
sudo tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match ip src 0.0.0.0/0 police rate 1000mbit burst 100k mtu 100k drop flowid :1

Try your iperf again - how does it look? When I tested this, I couldn't get more than 10Mb/s - something is seriously wrong! Let's try and send some UDP through to see what's happening

box1: iperf -i1 -s -u -l45k
box2: iperf -i1 -u -l45k -c box1 -t300 -b800m

Odd... we can send 800Mb/s of UDP with little or no packet loss, but can't get more than 20Mb/s of TCP?!

The fix for this is adding a shaper to make sure nothing gets dropped by the policer. We can add this on box1 in this instance as follows:

sudo tc qdisc del dev eth0 root
sudo tc qdisc add dev eth0 root handle 1: tbf rate 900mbit burst 12000 limit 500m mtu 100000
sudo tc qdisc add dev eth0 parent 1: netem delay 50ms limit 10000

You'll notice we deleted and then re-added the latency - this is just a limitation of how we chain these qdiscs together. But give it a shot - try an iperf between the two boxes with TCP, and magic - you can get 850Mb/s now!

A sustainable fix

We're not going to add shapers everywhere just because we know that some parts of our network have bad bottlenecks. It's okay though - smart people are working on a fix. By adding a knob to TCP where we explicitly say how fast we want it to go, we can make TCP pace itself, instead of emptying out its window onto the network, and then getting upset when it doesn't get magically taken care of. This is still experimental, but I'm keen to hear if anyone has had any luck with it - this is a very important step forward for TCP, and will become gradually more important as our networks get longer and fatter.