Archive for the ‘computer’ Category

What are all those pins for?

Wednesday, October 28th, 2009

I recently built myself a new computer using an Intel Core i7 920 CPU. This CPU has more pins (well, “lands” actually, since they are just flat conducting areas that touch pins in the socket) than any other yet produced, 1366 of them to be precise. I was wondering why so many were needed, so I grabbed the datasheet and made a map:

[Figure: colour-coded map of the 1366 lands. Legend groups: Power (VSS, VCC, VCCPLL, VTTA, VTTD, VDDQ); Memory (DDR0, DDR1 and DDR2, each split into data and other signals); Other (QPI data, QPI other, other, reserved).]

Idle speculation follows (I don’t have any background in CPU or motherboard design):

The pins roughly divide into six sections: two for memory data, one for other memory-related signals, one for power, one for the QPI bus and one that is mostly reserved.

That there are a lot of power pins is not surprising – this CPU can use as much as 145A of current, which is enough to vaporize any one of those tiny connections, so it has to be spread out amongst ~300 of them for each of power and ground. Having two very big pins for power would probably make the mechanical engineering of the CPU much more difficult and would push the responsibility for branching out that power onto the CPU, whereas it is better done by the motherboard.
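As a rough check on those numbers, the average current per land is easy to estimate. This is only a sketch: it assumes the 145A is shared perfectly evenly across about 300 lands per rail, which is an idealisation, and the names are made up for illustration.

```python
# Rough per-land current, assuming the load is shared evenly (an idealisation).
MAX_CURRENT_A = 145      # peak current quoted above
LANDS_PER_RAIL = 300     # approximate count for each of power and ground

def current_per_land(total_current_a, lands):
    """Average current through each land if the total is split evenly."""
    return total_current_a / lands

per_land = current_per_land(MAX_CURRENT_A, LANDS_PER_RAIL)
print(f"about {per_land:.2f} A per land")
```

That comes to roughly half an amp per land, compared with 145A through a single connection if there were only one.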

It’s interesting that the ground lands are mostly spread out but the power lands are mostly together. I’m not sure why that should be – I would expect them both to be spread out. Perhaps the 8 or 9 big groups of VCC on the north edge each correspond to a single “power line” on the motherboard (and hence are grouped together) while the distributed ground lands are needed to supply electrons for the signal lands.

Three DDR3 channels also use a lot of lands – 192 for data alone and almost as many again for addresses, strobes and clocks.

Another thing that surprised me is that there are so many reserved lands (~250 of them). Initially I thought that this was because the socket was designed before the designers knew how many pins they would actually need, so they made sure to design for the absolute maximum. However, a good chunk of the reserved lands are used by the Xeon 5500 CPUs, which use the same socket – in particular for memory error detection/correction and the second QPI bus (which is presumably in the northwest corner).

Top posting

Saturday, October 24th, 2009

Before I started working at Microsoft, I used to always reply to emails by quoting them, breaking up the quoted text into pieces and then replying to each of the pieces directly below, for example:

From: andrew@reenigne.org
To: xyz@example.com
Subject: Re: Hello

Xyz <xyz@example.com> wrote:
> Hello,

Hello!

> how are you today

I'm fine, thank you.

This style is called inline replying with trimming. This is a fine system because the person I’m replying to gets reminded of what they wrote, and I don’t have to write things like “In regards to the part of your email where you asked me how I was today,”.

The most common other system is top posting, which looks like this:

From: andrew@reenigne.org
To: xyz@example.com
Subject: Re: Hello

Hello! I'm fine, thank you.

Xyz <xyz@example.com> wrote:
> Hello, how are you today

This is the natural default with Microsoft Outlook. In the geek circles I had moved in before working at Microsoft, this style was greatly frowned upon. However, it is ubiquitous at Microsoft. I’m not sure whether this is because it’s the default style in Outlook or whether it’s the default style in Outlook because it is ubiquitous at Microsoft. However, once I had forced myself to “do as the Romans do” and top post, I found that it does actually make more sense in that environment. This is for two reasons:

  1. When the conversation falters due to lack of knowledge about something, it’s very common to “loop in” an expert to give their two cents by adding them to the CC line. In order for the expert to have some context, it’s useful to have the previous history of the conversation right there in the email, so he or she can read it (bottom to top).
  2. With each email carrying the entire thread, emails can get pretty long. It’s inconvenient to have to scroll all the way to the bottom of each email to see the latest reply (especially if you’re just a spectator rather than a contributor to a busy thread) so it’s better for the replies to be at the top than at the bottom.

It’s still useful to reply inline as well sometimes – at Microsoft this is done by quoting the email you’re replying to twice – once, in its entirety, at the bottom and once (suitably chopped and trimmed) inline. I used to do this quite frequently as it’s the best way I’ve found (pre-Argulator) of addressing each point individually. However, one of my managers once told me that if the conversation got sufficiently complex that I felt it was best to do that, I should instead “take it offline” and schedule a face-to-face meeting to hash out these issues. However, I felt (and still feel) that inline email replies are better than face-to-face meetings for such complicated issues – in face-to-face meetings there’s less time to think about your answer, and points can get lost – as the conversation progresses it can only follow one “branch” of the argument tree, and without explicitly maintaining a stack it’s very easy for branches to get forgotten about.

Does the human brain tap into a third form of computing?

Wednesday, October 21st, 2009

There are two forms of computing currently thought to be possible in our universe. One is the classical, deterministic computing that we all know and love. Many people think the human brain is a kind of (very large and complicated) classical computer. However, it is still unknown whether (and if so, how) a classical computer can give rise to consciousness and subjective experience.

The second form of computing is quantum computing, where you essentially run a pile of classical computers in superposition and allow their outputs to interfere in order to obtain the result. Anything quantum computers can do can also be done by classical computers (albeit much more slowly). The human brain might be a quantum computer, but (unless there’s something about quantum computing that we don’t yet understand) that still doesn’t solve the problem of consciousness.

A third form of computing is possible if you have a time machine. I’ve speculated before that the human brain could be a time travelling computer. These computers are faster still than quantum computers, but still can’t compute anything that can’t in principle (given long enough) be computed by a classical computer, so this still doesn’t solve the consciousness problem.

Could it be that by accident of evolution the human brain has tapped into a form of computing that is qualitatively different from classical computing, much as birds and bees have tapped into a qualitatively different method of flying (flapping) than the method used in our aeroplanes? While this smells of dualism, I think it’s a possibility that can’t be fully discounted without a complete theory of physics.

One such qualitatively different form of computing is the infinity machine. This can verify true things in finite time even if there is no finite proof that those things are true. Thus it can find completely new truths that are not provable by conventional mathematics.

It seems rather unlikely that the infinity machine is possible in our universe (quantum mechanics puts an absolute limit on clock speed) but there could be other forms of computation that we’ve just never thought of.

Penrose’s Orchestrated Objective Reduction theory is one such possibility.

CGA: Why the 80-column text mode requires the border color to be set

Saturday, October 17th, 2009

The original IBM Color Graphics Adapter has a curious quirk – it won’t by default display colour on the composite output in 80-column text mode. By looking at the schematics, I’ve figured out why this is, and what the CGA’s designers could have done differently to avoid this bug. The following diagram illustrates the structure of the various horizontal and vertical sync pulses, overscan and visible areas in the CGA.

There are two horizontal sync pulses – there’s the one generated by the 6845 (the 160-pixel wide red/grey/yellow band in the diagram) and there’s the one output to the monitor (the 64-pixel wide grey band within it). The CGA takes the 6845’s hsync pulse and puts it through various flip flops to generate the output hsync pulse (delayed by 2 LCLKs and with a width of 4 LCLKs) and also the color burst pulse (in yellow, delayed by 7 LCLKs and with a width of 2 LCLKs).

The 6845 can generate an hsync pulse anywhere from 1 to 16 clock ticks in width. The IBM’s BIOS sets it up at 10 ticks (as shown in the diagram). However, in 80-column text mode those ticks are only half as wide, so they only extend 3/4 of the way through the output hsync pulse. The 6845’s hsync pulse ends before the color burst pulse gets a chance to start, so it never happens and the display will show a monochrome image.

By changing the overscan color to brown, one can create one’s own color burst signal at the right point in the signal, and this was the usual way of working around the problem (possibly the only way that works reliably).

By changing the 6845’s pulse width to the maximum of 16, one could generate the first half of the color burst pulse (I think) and some monitors might recognize this as a color burst.

If the CGA’s designers had started the output hsync pulse at the beginning of the 6845’s hsync pulse (or delayed by only 1 LCLK instead of 2) then using the maximum pulse width would have been sufficient to generate the correct color burst. I guess they were just trying to center the output hsync pulse and the color burst within the 6845 pulse, without thinking of the high-res case.
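The timing argument above can be checked with a little arithmetic. The sketch below works in LCLKs and rests on two assumptions of mine: that the output pulses are simply clipped when the 6845's hsync pulse ends, and that the colour burst always starts 5 LCLKs after the output hsync does (so it moves if the hsync delay changes). The function names are hypothetical.

```python
# Timing sketch in LCLKs (one 40-column character clock). Names are hypothetical.
HSYNC_WIDTH = 4         # output hsync width, from the text
BURST_WIDTH = 2         # colour burst width, from the text
BURST_AFTER_HSYNC = 5   # burst starts at 7 LCLKs when hsync starts at 2

def gate(start, width, gate_end):
    """Clip a pulse to the 6845 hsync pulse; None if it never gets to start."""
    end = min(start + width, gate_end)
    return (start, end) if end > start else None

def cga_pulses(crtc_ticks, ticks_per_lclk, hsync_delay=2):
    """Return (output hsync, colour burst) as (start, end) pairs in LCLKs."""
    gate_end = crtc_ticks / ticks_per_lclk   # length of the 6845 pulse
    hsync = gate(hsync_delay, HSYNC_WIDTH, gate_end)
    burst = gate(hsync_delay + BURST_AFTER_HSYNC, BURST_WIDTH, gate_end)
    return hsync, burst

print(cga_pulses(10, 1))                 # 40-column mode, BIOS default width
print(cga_pulses(10, 2))                 # 80-column mode, BIOS default width
print(cga_pulses(16, 2))                 # 80-column mode, maximum width
print(cga_pulses(16, 2, hsync_delay=1))  # hypothetical redesign with a shorter delay
```

In this model the burst is missing in 80-column mode at the BIOS width, half present at the maximum width, and complete if the output hsync delay is shortened to 1 LCLK.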

The diagram also shows why interlaced mode doesn’t work on the CGA – the output vertical sync pulse is generated in a similar way to the output horizontal sync pulse, only it’s 3 lines instead of 4 LCLKs. It always starts at the beginning of an output hsync pulse, so a field can’t start halfway through a scanline.

CGA: Reading the current beam position with the lightpen latch

Friday, October 16th, 2009

Here is a little known trick that a genuine IBM Color Graphics Adapter can play, that I noticed when looking at its schematic recently. There are two ports (0x3db and 0x3dc) which are related to the light pen. A read from or write to 0x3db clears the light pen strobe (which you need to do after reading the light pen position so that you’ll be able to read a different position next time). A read from or write to 0x3dc sets the light pen strobe – what’s the point of that? One possibility might be to implement a light pen that signals the computer in a different way (via an interrupt) rather than being connected directly to the CGA card. That wouldn’t work very well though – the interrupt latency of the original IBM PCs was extremely high.

Another possibility is to allow the programmer to directly find the position of the beam at any moment, to an accuracy of 2 scanlines (in graphics modes) and one character width (1/40th of the visible screen width in graphics modes and 40-column text modes, 1/80th of the visible screen width in 80-column text modes). Read from 0x3db and 0x3dc and then read the light pen CRTC registers to find out where the beam was when you read from 0x3dc. This technique is so obscure it probably won’t work on non-IBM CGA cards, so its usefulness is rather limited. Might be useful for an oldskool demo, though. I’ll be sure to implement this technique when I finally get around to making my extremely accurate PC emulator.
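The sequence of port accesses might look something like the sketch below. The port numbers and the 6845 light pen registers (16 and 17, reached through the index/data ports at 0x3d4/0x3d5) are the standard ones, but everything else is an assumption: the function names are made up, `FakeCga` is only a stand-in so the code can run without hardware, and the conversion to a row and column only makes sense while the beam is in the visible area.

```python
# Port numbers are the standard CGA ones; the helper names are hypothetical.
CRTC_INDEX, CRTC_DATA = 0x3D4, 0x3D5   # 6845 register select / data
LPEN_CLEAR, LPEN_SET = 0x3DB, 0x3DC    # light pen strobe clear / set
LPEN_HIGH, LPEN_LOW = 16, 17           # 6845 light pen address registers

def read_beam_address(inb, outb):
    """Latch the beam position via the light pen strobe and read it back."""
    inb(LPEN_CLEAR)                  # re-arm the latch
    inb(LPEN_SET)                    # latch the current CRTC address
    outb(CRTC_INDEX, LPEN_HIGH)
    high = inb(CRTC_DATA) & 0x3F     # the address is 14 bits wide
    outb(CRTC_INDEX, LPEN_LOW)
    low = inb(CRTC_DATA)
    return (high << 8) | low

def address_to_cell(address, start_address, columns):
    """(row, column) of a latched address; only meaningful in the visible area."""
    return divmod((address - start_address) & 0x3FFF, columns)

class FakeCga:
    """Stand-in for the card so the sketch can run without hardware."""
    def __init__(self):
        self.beam_address = 0        # where the simulated beam currently is
        self.latched = 0
        self.strobe = False
        self.index = 0

    def inb(self, port):
        if port == LPEN_CLEAR:
            self.strobe = False
        elif port == LPEN_SET and not self.strobe:
            self.latched, self.strobe = self.beam_address, True
        elif port == CRTC_DATA and self.index == LPEN_HIGH:
            return (self.latched >> 8) & 0x3F
        elif port == CRTC_DATA and self.index == LPEN_LOW:
            return self.latched & 0xFF
        return 0xFF

    def outb(self, port, value):
        if port == CRTC_INDEX:
            self.index = value

cga = FakeCga()
cga.beam_address = 0x0123
address = read_beam_address(cga.inb, cga.outb)
print(hex(address), address_to_cell(address, 0, 80))
```

On real hardware `inb` and `outb` would be the machine's port I/O primitives, and the start address and column count would come from the current video mode.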

Building 3D chips

Thursday, October 15th, 2009

In the not-too-distant future, we’ll hit a limit on how small we can make transistors. The logical next step from there will be to start building up – moving from chips that are almost completely 2D to fully 3D chips. When that happens, we’ll have to figure out a way to cool them. Unlike with a 2D chip, you can’t just stick a big heatsink and fan on top because it would only cool one surface, leaving the bulk of the chip to overheat. What you need is a network of cooling pipes distributed throughout the chip, almost like a biological system.

I suspect these pipes would work best if they go straight through the chip and out the other side. At small scales, fluid is very viscous and trying to turn a corner would probably slow down the flow too much. So suppose you have a cubic chip with lots of tiny pipes going in one face and coming out the opposite face. The next problem is that, if the fluid is all moving the same way, one side of the chip (the “outgoing fluid” side, where the fluid has already warmed up) would get much hotter than the other. The effect could be mitigated somewhat by having some of the pipes flowing in the opposite direction. Ideally you’d want fluid coming in on all 6 faces to maximize cooling. Another possibility is pipes that split up within the chip. A wide pipe of cold fluid will have a similar effect to several smaller pipes of warmer fluid (the increase in fluid temperature is offset by the extra surface area). It would be an interesting puzzle to try to model the heat flows and come up with optimal pipe configurations. In doubling the side of the chip, one probably has to increase the proportion of chip volume dedicated to cooling by some factor 2^n – I wonder what this fractal dimension is.

For most efficient cooling, one would probably want to take the cooling fluid from the CPU and any other hot parts of the system and compress it (just like the coolant in a fridge), allowing it to expand inside the CPU. Then rather than having lots of noisy fans one has one noisy compressor (which would probably be easier to acoustically isolate – maybe even by putting it outside). Fans are a big problem for noise and reliability – my main desktop machine (at the time of writing) has five of them, of which two have failed and a third is on its last legs.

Another major problem that will need to be solved is pluggable cooling lines. People expect to be able to build their own computers, which means that it must be possible to plug together a CPU, motherboard, graphics card and cooling system without an expensive machine. That means we’ll need some kind of connector for plugging the coolant lines from the CPU (and other hot components) into the cooling system. Ideally it will be easy to connect up and disconnect without the possibility of introducing dirt or air into the coolant lines, and without the possibility of coolant leaks. I suspect that whoever invents such a connector will make a lot of money.

Sometimes, doing things incrementally hurts more than it helps

Monday, October 12th, 2009

Usually the best way to make a major change to a piece of code is to try to break it down into small changes and to keep the code working the same after each such small change. The idea is that if you make too many changes and break the code too badly, you might never get it working again. Without working code it can be difficult to figure out what the next step should be.

But sometimes, incremental changes just don’t work. In particular, if you’re making major architectural changes, trying to construct something that is 90% original architecture and 10% new architecture is going to involve just as much extra work to try to make the incompatible pieces fit. In these cases, sometimes the only thing you can do is take the whole thing to pieces and put it back together again the way you want it.

Scaling/scanlines algorithm for monitor emulation

Monday, October 12th, 2009

For my TV emulation, I wanted to render scanlines nicely and at any resolution. xanalogtv does vertical rescaling by duplicating rows of pixels, which unfortunately makes some scanlines appear wider than others. Blargg’s NTSC filters don’t do any vertical rescaling at all.

The first thing I tried was a sinc interpolation filter with the kernel scaled such that the scanline only covered 70% of the pixels vertically (essentially modelling the scanlines as long thin rectangles). This worked great except that it was far too slow because of the sinc function’s infinite extent (I was doing a multiplication for each combination of horizontal position, vertical position and scanline). So I windowed the kernel with a Lanczos window. I got annoying aliasing effects using fewer than 3 lobes. With 3 lobes it was still too slow because each pixel was a weighted sum of 3-4 separate scanlines. Also, because of the negative lobes I needed extra headroom which meant I either had to reduce my colour resolution or use more than 8 bits per sample (which would also be slow).

The next thing I tried was a Gaussian kernel. This has several nice features:

  1. The Fourier Transform of a Gaussian is also a Gaussian, which is also a better approximation of a scanline than a rectangle (the focussing of the electron beam isn’t perfect, so to a first approximation their distribution around the beam center is normal).
  2. It dies off much more quickly than the sinc function.

The Gaussian kernel also gave a good image, so I kept it.

The next thing I wanted to do was improve the speed. I still had several scanlines contributing to every pixel. However, that doesn’t make much physical sense – the scanlines don’t really overlap (in fact there is a small gap between them) so I figured I should be able to get away with only using the highest coefficient that applies to each pixel. I tried this and it worked beautifully – no difference in the image at large sizes and it sped the program up by a factor of several. The downside was at small sizes – the image was too dark. This is because the filter was set up so that each pixel would be the average of several scanlines, but if only one scanline is contributing then the brightness is 1/several. To fix this I just divided all the coefficients by the largest. There’s no mathematical justification for this, but it looks fine (apart from the fact that some of the scanlines don’t contribute to the picture at all).
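To make the “one scanline per pixel” idea concrete, here is a minimal sketch of how the per-row coefficients could be computed. It is not the code from my emulator: the coordinate mapping, the `sigma` value and the function name are all assumptions chosen for illustration. Since a Gaussian falls off with distance, the highest coefficient for a row always belongs to the nearest scanline.

```python
import math

def scanline_weights(output_rows, scanlines, sigma=0.25):
    """For each output row, return (scanline index, weight in 0..1).

    Each row takes only the nearest scanline (the one with the highest
    Gaussian coefficient), then all weights are divided by the largest.
    sigma is in scanline units and is an arbitrary illustrative value.
    """
    picks = []
    for y in range(output_rows):
        s = (y + 0.5) * scanlines / output_rows   # row centre in scanline units
        index = min(int(s), scanlines - 1)        # scanline containing that point
        d = s - (index + 0.5)                     # distance from the scanline centre
        picks.append((index, math.exp(-d * d / (2 * sigma * sigma))))
    largest = max(w for _, w in picks)
    return [(i, w / largest) for i, w in picks]

for index, weight in scanline_weights(6, 2):
    print(index, round(weight, 3))
```

With more output rows than scanlines, each scanline is brightest in the middle and fades towards the gap. With fewer output rows than scanlines (for example `scanline_weights(2, 4)`), some scanlines are never picked, which is the drawback mentioned above.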

If each pixel is only in one scanline, lots more optimizations are possible – for example, one can generate the image progressively, a scanline at a time, which helps keep data in the caches.

Finally, I still needed it to be faster so I moved all the rescaling (vertical and horizontal) to the GPU. I came up with a devilishly clever hack to implement the same scanline algorithm on the GPU. No shader is needed – it can be done just using textures and alpha blending. There are two passes – the first draws the actual video data. The second alpha-blends a dark texture over the top for the scanlines. This texture is 1 texel wide and as many texels high as there are pixels vertically.

One other complication is that I wanted the video data texture to be linearly interpolated horizontally and nearest-neighbour interpolated vertically. This was done by drawing this texture on a geometry consisting of a number of horizontal stripes, each of which has the same v-texture-coordinate at its top as at its bottom.
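The stripe geometry can be generated along these lines. This is a sketch only: it assumes normalised 0..1 coordinates for both position and texture, one stripe per scanline, and a made-up function name, and it produces plain vertex tuples rather than calling any particular graphics API.

```python
def stripe_quads(scanlines):
    """Return one quad per scanline as four (x, y, u, v) vertices in 0..1 units."""
    quads = []
    for i in range(scanlines):
        top, bottom = i / scanlines, (i + 1) / scanlines
        v = (i + 0.5) / scanlines   # same v at top and bottom: no vertical blending
        quads.append((
            (0.0, top, 0.0, v), (1.0, top, 1.0, v),
            (1.0, bottom, 1.0, v), (0.0, bottom, 0.0, v),
        ))
    return quads

for quad in stripe_quads(2):
    print(quad)
```

Because v is constant across each stripe and sits at a texel centre, linear filtering only ever blends horizontally, while u still varies from 0 to 1 so the horizontal interpolation is unaffected.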

Sometimes, figuring out the right architecture is half the battle

Sunday, October 11th, 2009

I went through quite a few design revisions to get to the pipeline architecture I described yesterday. Some ideas I tried out and then abandoned:

  • Giving the filters the responsibility of keeping the data they needed.
  • Filters telling other filters how many samples they should consume or produce.
  • A FilterGraph object which held all the buffers and which had methods to make and break connections.
  • All the reader and writer methods being on the buffers.
  • Lookahead reader methods for filters.
  • Filters encapsulating other filters that they communicate with.
  • Having separate consume() and produce() methods.
  • Having the Reader and Writer functionality as part of the Consumer and Producer classes (this introduced a surprisingly significant overhead)
  • A Connection object to encapsulate the buffer.

It’s rather difficult to tell what’s going to work well and what isn’t until you actually write some code. And then it takes some trial and error work to hit upon the right pattern. Assumptions must be called into question. Prejudices must be discarded. Darlings must be killed. But you know when you’ve got it right because the rest then practically writes itself.

Pipeline architecture

Saturday, October 10th, 2009

My software TV/monitor emulator is best thought of as a bunch of filters for transforming a signal in certain ways, connected together in a pipeline:

  • Decoding composite signals to YIQ
  • Transforming YIQ signals to RGB
  • Horizontal rescaling
  • Ghosting due to signal reflection in the cable
  • Adding noise

Because that’s the best way to think about it, that’s how I’d like to implement it. Then it will be easier to remove/replace filters that I don’t need or that I want to implement in a different way. The filters.h file in the crtsim source implements this architecture.

When you have a pipeline, there are two ways to drive it. One is “consumer pulls” and the other is “producer pushes”. In this case, the consumer is the code that actually renders the window to the screen. In “consumer pulls” mode, this code will fire at probably 60 times per second (potentially depending on the refresh rate of the monitor on the host machine) and each time it does, it will ask the filter which supplies its data for enough data to render one frame (or field, if we’re doing interlaced signals). This filter will then in turn ask the next one along the chain for data and so on up the chain until we get to the code that actually generates the composite signal.

In “producer pushes” mode, the producer generates data at a constant rate (possibly fixed by some other source of time in the system such as the audio device outputting at the correct rate). This data is then pushed from each filter to the next in the chain until it gets to the consumer. When the consumer has collected enough data to render a frame, a frame is rendered.

So for the purposes of emulating a TV or monitor as part of a microcomputer system emulator, “consumer pulls” and “producer pushes” modes can be thought of as “video rate driven” and “audio rate driven” modes respectively. Most emulators are hard-coded to do one or the other. But which one is best is determined by the user’s hardware and what they’re doing with the emulated system (video driven mode will generally give smoother graphics while audio driven mode will generally give more stable audio). So ideally we’d like to make the choice user-selectable.
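Here is a toy illustration of the two ways of driving the same chain. None of these classes correspond to what is in filters.h; the names and the trivial `Gain` filter are invented for the example. The point is only that the filters are identical in both modes and the difference is which end decides how much data moves and when.

```python
class Source:
    """Produces a ramp of samples; stands in for the composite signal generator."""
    def __init__(self):
        self.next_value = 0

    def produce(self, count):
        start = self.next_value
        self.next_value += count
        return list(range(start, self.next_value))

class Gain:
    """A trivial filter that scales every sample."""
    def __init__(self, factor):
        self.factor = factor

    def process(self, samples):
        return [s * self.factor for s in samples]

class Screen:
    """Collects samples and counts a frame each time it has enough of them."""
    def __init__(self, samples_per_frame):
        self.samples_per_frame = samples_per_frame
        self.pending = []
        self.frames = 0

    def consume(self, samples):
        self.pending.extend(samples)
        while len(self.pending) >= self.samples_per_frame:
            del self.pending[:self.samples_per_frame]
            self.frames += 1

def run_chain(source, filters, screen, count):
    """Move `count` samples from the source through the filters to the screen."""
    samples = source.produce(count)
    for f in filters:
        samples = f.process(samples)
    screen.consume(samples)

def pull_frame(source, filters, screen):
    """Video rate driven: the renderer asks for exactly one frame of data."""
    run_chain(source, filters, screen, screen.samples_per_frame)

def push_chunk(source, filters, screen, count):
    """Audio rate driven: the producer emits `count` samples on its own schedule."""
    run_chain(source, filters, screen, count)

screen = Screen(samples_per_frame=4)
source, filters = Source(), [Gain(2)]
pull_frame(source, filters, screen)      # one call, exactly one frame
push_chunk(source, filters, screen, 3)   # not yet enough for another frame
push_chunk(source, filters, screen, 3)   # completes a frame, 2 samples left over
print(screen.frames, len(screen.pending))
```

Making the mode user-selectable then amounts to choosing which of the two driver functions the main loop calls, and what triggers the call.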

A third possibility is for the producer code to decide when to draw a frame and to call for a window redraw, which causes a data pull through the filter chain. However, I’ve discounted this idea because that is an incorrect placement of responsibility. The producer doesn’t (and shouldn’t) know about the state of the monitor. Even if it has just produced a vsync pulse it doesn’t necessarily mean it’s time for a new frame (if the monitor is “rolling” as it will do momentarily when the signal timebase changes, it won’t be).

There is another factor in pipeline design which is how the data stream corresponds to function calls. The simplest way would be to have each sink call the corresponding source each time it needs a sample (in pull mode) or each source call its corresponding sink each time a sample is available (in push mode). However, there are potentially quite a few filters and (because they are all replaceable at run time) each call from one filter to the next will be a virtual function call. That means that the compiler can’t inline the code and the CPU’s pipelining will get screwed up. According to this one can expect a virtual function call overhead of maybe 13 nanoseconds compared to inlined code (a crude test, but sufficient for order-of-magnitude calculations). Since most of our samples will be at 14MHz (4 times the NTSC color carrier frequency) that’s only about 5 virtual function calls per sample before we’ve used up all our CPU.
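The arithmetic behind that estimate, as a quick sketch. The 13 nanosecond figure is the crude estimate quoted above; the NTSC colour carrier is taken as 315/88 MHz (about 3.58MHz), and the variable names are just for illustration.

```python
NTSC_COLOR_CARRIER_HZ = 315e6 / 88            # about 3.58 MHz
SAMPLE_RATE_HZ = 4 * NTSC_COLOR_CARRIER_HZ    # about 14.3 MHz
VIRTUAL_CALL_NS = 13                          # the crude estimate quoted above

sample_period_ns = 1e9 / SAMPLE_RATE_HZ       # time budget for one sample
calls_per_sample = sample_period_ns / VIRTUAL_CALL_NS
print(f"{sample_period_ns:.1f} ns per sample, {calls_per_sample:.1f} calls")
```

Each sample has a budget of roughly 70 nanoseconds, so a little over 5 virtual calls per sample would consume all of it before any actual filtering is done.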

So each function call really needs to transfer a pile of samples, not just one, and we will need to have circular buffers in between the filters to keep the data. How many samples should we transfer at once? A good rule of thumb for figuring that out is that one’s hottest code and data should fit in L1 cache (which is maybe 64KB on modern CPUs). Divide that up by the number of steps in the pipeline and we’re looking at low-single-digit numbers of KB to be passed at once. A scanline’s worth of data (910 samples, give or take) is probably about right. That reduces the virtual function call overhead by three orders of magnitude which puts it well into the negligible range. Conceivably one could try benchmarking with lots of different “samples per call” values and then pick the one with the best overall performance (taking into account both call overhead and cache misses). I’ll probably do this at some point.
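A minimal sketch of the kind of circular buffer that could sit between two filters, moving a scanline's worth of samples per call. This is not the buffer from filters.h: the class name, the error handling and the choice of a capacity of two scanlines are assumptions made for the example.

```python
class CircularBuffer:
    """Fixed-capacity FIFO of samples with batched writes and reads."""
    def __init__(self, capacity):
        self.data = [0] * capacity
        self.read_pos = 0    # index of the oldest buffered sample
        self.count = 0       # number of samples currently buffered

    def write(self, samples):
        if self.count + len(samples) > len(self.data):
            raise OverflowError("buffer full")
        pos = (self.read_pos + self.count) % len(self.data)
        for s in samples:
            self.data[pos] = s
            pos = (pos + 1) % len(self.data)   # wrap around at the end
        self.count += len(samples)

    def read(self, n):
        if n > self.count:
            raise ValueError("not enough samples buffered")
        out = []
        for _ in range(n):
            out.append(self.data[self.read_pos])
            self.read_pos = (self.read_pos + 1) % len(self.data)
        self.count -= n
        return out

SAMPLES_PER_CALL = 910   # one scanline, as suggested above
buf = CircularBuffer(2 * SAMPLES_PER_CALL)
buf.write(list(range(SAMPLES_PER_CALL)))   # one call from the upstream filter
line = buf.read(SAMPLES_PER_CALL)          # one call from the downstream filter
print(len(line), 13 / SAMPLES_PER_CALL)    # samples moved, call overhead in ns per sample
```

With 910 samples per call, a 13 nanosecond call costs about 0.014 nanoseconds per sample, which is where the three orders of magnitude come from.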

One disadvantage of the pipeline architecture is that it introduces some variable amount of latency – not enough to normally be visible to end users, but this does complicate one thing that I want to emulate – light pens. A light pen is just a fast light sensor that can be placed anywhere on the screen. When the electron beam passes underneath it, it sends a signal to the computer. The computer knows where the beam is supposed to be at any given moment, so it can figure out where the light pen is. However, for an emulator to have proper lightpen support, it needs to have very low latency between the screen and the machine emulation. For this reason, I might abandon the pipeline architecture and just hard-code all the signal munging effects I care about in the CRT simulator itself, processing a line at a time and stopping when the horizontal sync pulse is found. Then, if the lightpen is anywhere on the next line the CRT will be able to tell the machine emulation exactly when the lightpen is going to be triggered.