Writing
A Simple CPU on the Game of Life - Part 4

by Nicholas Carlini 2021-12-30



This is the fourth article in a series of posts that I've been making on creating digital logic gates in the game of life. The first, couple of articles started out with how to create digital logic gates and use them in order to construct simple circuits. In this post we're going to actually build a first real computer: a (2-stage pipelined) unlimited register machine. And later on ([5]) we'll make an even better computer.

Pictured below is the execution of an unlimited register machine which is running a program that factors the number 15 it does this in just a few minutes. You might think that this is a simple task, but let me just remind you that it was only a few years ago that quantum computers managed this impressive feat.


 



Background

At this point I'm not going to explain what Conway's game of life is, or I use it to build circuits that run on the game of life. I've done that three times before and these backgrounds get boring to write. Fortunately for you, even though this is the fourth time I'm writing about this, in practice you only need to read the last post to get caught up enough for this one to make sense.


CPU Overview

Before you can reasonably understand why I'm making the desgn decisions I am, I finally have to tell you what a URM actually is. First described as a mathematical formaliziation of what is computable, a URM is a Turing complete, four-instruction CPU that should (in theory) have an unlimited numver of registers, each of which can hold unbounded integers. Because we're dealing with the real world, our CPU is going to allow just 16 registers, each of which will be able to hold four bits.

An unlimited register machine can do only three operations on the registers:

  1. Increment (INC Rx) a target register by one.
  2. Decrement (DEC Rx) a target register by one.
  3. Jump (JNZ Rx, Addr) to a target instruction if the current register is nonzero.
Now these three simple operations don't seem like much, but it turns out they're actually Turing complete. Which is kind of amazing. To see this let's first observe that we can use these primitives to emulate some other common instructions
  1. ZERO (ZRO Rx) a target register by continuously decrementing the register as long as it's not zero: DEC Rx; JMP Rx, -1.
  2. Unconditionally jump (JMP Addr) to a target instruction by reserving a temporary register Ry that is always positive, and running (JNZ Ry, Addr).

So, for example, if we wanted to add the contents of Register 1 to the contents of Register 2 you might implement something like

        ADD:
          jnz R1, END
          dec R1
          inc R2
          jmp ADD
        END:
      	
which will result in R2 now containing R1+R2 (and zeros R1 in the process). With that out of the way, behold, a CPU that implements this:

The overall layout of our computer is going to be entirely standard. In the upper left corner we have our clock, which works exactly as described last time. This is going to keep everything synchronized so we run each instruction one by one. To the right of the clock is the register file, which holds 16 four-bit registers using the D-flip-flops that we developed last time. This is the entirety of the RAM of this machine---64 bits. To the right of the register file is a little tiny Arithmetic Logic Unit (the ALU), which computes the actual mathematical functions the CPU supports. Because this is a URM, it can only increment. And below the register file is another series of D-flip-flops this time storing the program counter, which is what holds the current instruction we're executing. Below the program counter is the ROM which holds the program text that we're going to execute. This ROM can hold up to 128 different (four-bit) instructions. To the left of the program ROM is another collection of the flip-flops. These flip-flops actually allow this CPU to have a two-stage pipeline that can both fetch and execute instructions simultaneously. Finally at right is the output of the computer in the form of four 7-segment displays. These are almost exactly the same seventh segment displays that I built in the last post.

Efficient Circuit Design Criteria

The basic construction of a CPU is actually relatively simple, and we really only need to do just a few things to turn a collection of digital logic gates into a full computer. Now because this is an article that's meant to focus on designing the game of life computer and not on computers generally I don't want to spend (much) on the generic bits that would let you design a computer with normal logic gates. Other people have done that. Instead what I want to focus on is what changes when you're building a CPU on top of the game of life.

Let's start with the key difference. When building real computers, the ultimate design criteria is to try and minimize (something like) the number of total transistors that are used (or, more accurately, the depth of the circuit, but let's not get pedantic). The reason that this matters is because on physical hardware, the transistor is the item that is most expensive both in terms of real dollars but also in terms of latency. While electricity can flow through a wire at some thing almost like the speed of light, there is a fairly significant propagation delay when you go through a transistor and so deep circuits that have lots of transistors from end to end will be much slower than circuit only have just a few transistors from end to end.

In contrast, on the game of life the only thing that matters is the total length of the circuit along the longest path. It doesn't matter if this is going through what we would call a “wire” or if it's going through an AND gate or an OR gate: the only thing that matters is the total distance covered by the gliders. And it's roughly the same in the two cases.

And so our design is not going to be focused on minimizing the total number of invidvual compoennts that we have to use, because the cost of using a circuit is only very slightly larger than the cost of just having a wire go through empty space. As such there are lots of clever designs that are more efficient in actual hardware because they minimize the depth of the numbers of transistors but are significantly less efficient when building them on the game of life because they require more space to be laid out on a two dimensional grid.

The other important difference that allows us to design efficient circuits is everything is guaranteed to run completely determinstically. To repeat an example discussed in a previous post, an actual clock on a real piece of hardware is going to oscilate with almost the same frequency but might have some small amount of variance between 1 clock cycle and the next. In contrast or clock on the digital logic game of life is going to have exactly the same frequency every single time and will never drift. This allows us to rely on the exact length of the wire in order to determine the sequence of operations that are going to happen.

You might recall how earlier we used to fact to design an edge detector that would reliably determine whenever the signal has gone from 1 to 0 by building a small circuit that A AND (NOT B) and connecting the same input to both A and B, but because there is a slightly longer wire leading to B, it does what we want. An example of this is reproduced at right.

The ALU

Let's get started by looking at the ALU, because it's by far the simplest piece here. All it does is receive one 4-bit integer as input and either (a) increment the input, (b) decrement the input, or (c) do nothing to the input. But instead of building separate increment/decrement circuits, we take advantage of the fact that decrementing an integer is the same as bitwise inverting the value, then incrementing it, then bitwise inverting the result. So we just need to control whether or not we negate everything, and then have a single increment circuit.

Unfortunately, there's nothing more interesting to say about the ALU that makes it special to implement on the game of life. So we'll just move on.

 

The Register File

Again, remember that the objective here is to minimize the total footprint in terms of area of this circuit and not the total number of individual gates. More gates tightly packed is better than fewer gates loosly packed.

The overall design of the register file is fairly simple. The upper left-hand corner of the register file receives as input the current clock and a four bit selector that represents which register should be active. From the right side, the register file takes as input the data that should be written into one of the registers (which is chosen by the selection register coming in from the upper left). The only output is also on the right, which emits the current value of whatever register is currently selected.

The reason for this layout is to minimize minimize the total length of circuit again. Once we have an instruction to perform for example increment register five what we do is we need to retrieve the value of register five and then the value of a judge to five leaves on the right side of the circuit and goes into the LU. From here we perform the incremental operation and then immediately send it backwards in the opposite direction to go back to store the value that we just received.

Notice that this only allows us to access one register at a time for both reading and writing this severely limits the design of our program because it would not allow us to for example using x86 style instruction that adds from register one into register to. We only have one register that we can access in any clock cycle

 

One thing that I do here is find plaes where it would be helpful to design a new kind of logic gate that does something very peculiar but will allow us to lay out all circuits a little better. So for example the register file actually has two special purpose purpose logic gates. The first of these is a logic gate that computes the AND operation but instead of having just one output it has two outputs. The first output is exactly what you expect: A AND B. But then the second output is just exactly a copy of whatever the second input is. At right is a demonstration of this circuit actually being used: notice that whatever the value is that coming up from the bottom always continues up to the top whether or not the a signal is present but we only have output going to the right if both signals are present that as we had performed the end operation`

 

The other logic gate is one that takes a signal coming on from the left, and emits the negation of this signal going down, while still allowing the input to pass along the right. An example of this shown at left.

The reason these kinds of gates are useful is that it allows us to lay out a circuit more optimally without having to have additional wires routing signals around and duplicating them multiple times. Most related to the design of the register file, these two gates allow a very efficient 2x2 box that acts like a latch that I actually showed in the last post; and as a result of the small size of this latch, we can then build a D flip-flop with a total area 60x smaller than the initial designs.

 

The ROM

The ROM works by encoding individual bits as a two dimensional grid of AND gates (32 rows, and 32 columns), where the presence of an AND gate indicates the bit is set, and the absence indiciates the bit is not set. And while this idea does sound simple, the difficult piece is actually making it possible to read data out of this format.

To read out of this grid, we send a constant stream of 1s in from the right side heading left, and a cnostant stream of 1s from the top. So normally, whether or not an AND gate is present, the signal will just continue as normal and nothing interesting happens. In order to read from a particular column, we send a 0 down this column instead of a 1. This makes it so that the horizontal wires that are usually carying 1s now start to cary 0 in the positions where the AND gate is present, and 1 in the rest.

And because we only activate one column at a time, this partly-on and partly-off signal will reach the left end of the circuit unaffected going through all of the rest of the AND gates that have a 1 signal coming in from the top. Finally on the left we just OR together the result of the four groups of 32-bit signals.

 

Instruction Execution

Alright, with all the low-level details out of the way, the final part of the computer that needs describing is how we actually execute each instruction.

Each instruction is encoded as an 8-bit format, where the first 4-bits directly set the control wires in the CPU, and the second 4-bits are always the register that's being accessed (because every instruction accesses exactly one register). The jump instructions require an address, which is stored as a second 8-bit instruction following the jump instruction.

Instruction Writeback Jump Zero Decrement Register Index Extra Data
No-Op 0 0 0 0 Reg
Increment 1 0 0 0 Reg
Decrement 1 0 0 1 Reg
Set Zero 1 0 1 0 Reg
Conditional Jump 0 1 0 0 Reg 8-bit address
Unconditional Conditional Jump 0 1 1 1 Reg 8-bit address

Most of the normal instructions are easy to implement, and actually as you can see I've included instructions to also zero a register along with an unconditional branch because it's easy to do.

The way that the jump instructions work is that on the first clock cycle, they compute (and then store for one cycle in a latch) whether or not the jump should be taken on the next cycle. Then, when we load the next instruction, if the latch says that we're loading a jump that should be taken, we don't treat it as an instruction and instead write the contents of the ROM into the program counter.

2-Stage Pipeline

The final piece of complexity in our CPU is that it's actually pipelined in two stages. At a high level this is a classic 2-stage pipelined CPU where one stage does instruction decode/execute/writeback and the other stage fetches the next instruction from ROM.

Remarkably, this nearly doubles the speed of our computer. (Or, more precisely, it makes it possible to double the clock speed without causing any problems.) The reason here is that it turns out the path the gliders have to take to get from the program counter, through the ROM, and into the control logic on the left side of the circuit is nearly as long as the path from the control logic, through the register file, the ALU, and back.

There's one trouble with jumps and a 2-stage pipeline, and that's the fact that if a jump is taken, then we won't have that next isntruction yet ready in the pipeline. So we have two options. One would be to stall for a cycle and not do anything. This is fairly waistfull, though, and so I'm not going to do that. Instead, I'll steal a page from MIPS and declare that the instruction after a branch is always executed, regardless of whether or not the condition was true; if you have some work that should always happen it can go here, otherwise just stick a no-op there and pretend it was a stall.

FIN

So we're done! A full CPU on the game of life. What's next? Well, actually, I'm going to completely scrap everything we've built and do it all over again, to make things even more efficient. But that can wait for next year. (edit: It's me! After a year! Here you go with a link where I yet again redesign everything.)




If you want to be notified the next time I write something (maybe like this, maybe not, who knows) enter your email address here.
There's also an RSS Feed if that's your thing.