Archive for the ‘computer’ Category

The ALFE type system

Friday, October 12th, 2012

I want to write a series of posts about the rich type system I want to implement for ALFE. This post will serve as an index to them.

I recently tried doing a bit of programming in Haskell, and although I didn't produce anything worth showing off, I did get a bit of a feel for the language and how it works. One of my favorite things about Haskell is the rich type system, and I want to use some of its ideas in ALFE. Some of the ALFE types below are inspired by Haskell types.

Given types Foo, Bar and Baz I would like ALFE to have the following types:

  • Foo* - a pointer type.
  • Foo(Bar) - a function type (a function taking a Bar and returning a Foo, i.e. Bar -> Foo in Haskell notation).
  • Foo | Bar - a "sum" type.
  • Foo? - a "Maybe" type.
  • (Foo, Bar) - a "product" type.
  • [Foo] - a "sequence" type.
  • Foo[2] - a fixed length array type.
  • Bottom - the "bottom" type.
  • Variant - the "variant" type.
  • Label - the "label" type.
  • Complex<Foo> - a complex number type.
  • Rational<Foo> - a rational type.
  • String - the string type (usually represented in memory as UTF-8 in some form).
  • CodePoint - a type capable of holding any Unicode code point.
  • Fixed<Foo, Bar> - a fixed-point fractional type.
  • Boolean - the boolean type.
  • Int1, Int2, Int4, Int8, Byte, Int16, Int32, Int64, Int128, Int256, Int512, UInt1, UInt2, UInt4, UInt8, UByte, UInt16, UInt32, UInt64, UInt128, UInt256, UInt512, NInt1, NInt2, NInt4, NInt8, NByte, NInt16, NInt32, NInt64, NInt128, NInt256, NInt512 - fixed-width integer types.
  • Word, Int, HalfWord, QuarterWord, DoubleWord, QuadWord, UWord, UInt, UHalfWord, UQuarterWord, UDoubleWord, UQuadWord, NWord, NInt, NHalfWord, NQuarterWord, NDoubleWord, NQuadWord, FastInt1, FastInt2, FastInt4, FastInt8, FastByte, FastInt16, FastInt32, FastInt64, FastInt128, FastInt256, FastInt512, FastUInt1, FastUInt2, FastUInt4, FastUInt8, FastUByte, FastUInt16, FastUInt32, FastUInt64, FastUInt128, FastUInt256, FastUInt512, FastNInt1, FastNInt2, FastNInt4, FastNInt8, FastNByte, FastNInt16, FastNInt32, FastNInt64, FastNInt128, FastNInt256, FastNInt512 - machine-dependent integer types.
  • WordString, Integer, Unsigned - arbitrary-precision integer types.
  • Float16, Float32, Float64, Float128 - fixed-length floating-point types.
  • Float - the machine-dependent floating-point type.
  • Floating - the arbitrary-precision floating-point type.
  • Concrete - a type for dimensioned physical quantities (possibly also specializations like Length, Time, Area, MagneticFluxDensity etc.)

There's still a lot about this that isn't finalized. For example, I might use some templates to avoid having all those fixed-width and machine-dependent types.
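
To make a few of these concrete, here's roughly how they map onto familiar types from other languages. This is just my mental model expressed as C++ equivalents - Foo and Bar are stand-in types, and none of this is ALFE syntax:

    // C++ analogues for some of the ALFE types above (illustration only).
    #include <array>
    #include <optional>
    #include <utility>
    #include <variant>
    #include <vector>

    using Foo = int;      // stand-in example types
    using Bar = double;

    using PointerToFoo = Foo*;                    // Foo*
    using MaybeFoo     = std::optional<Foo>;      // Foo? - a Foo or nothing
    using FooOrBar     = std::variant<Foo, Bar>;  // Foo | Bar - a tagged sum
    using FooAndBar    = std::pair<Foo, Bar>;     // (Foo, Bar) - a product
    using FooSequence  = std::vector<Foo>;        // [Foo] - a sequence
    using TwoFoos      = std::array<Foo, 2>;      // Foo[2] - fixed-length array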

Government as singleton

Thursday, October 11th, 2012

Sometimes I read things on the internet written by libertarians. I used to read reddit a lot and that has always been a bit of a libertarian stronghold. Occasionally I read things on reason.com. I think I'm pretty squarely on the left side of most political systems but libertarianism is one part of conservatism that I do have some sympathy for. I like the idea of legalizing all drugs, for example, not getting into wars without a really good reason, and generally not imposing unnecessary rules on people or companies.

Those who describe themselves as libertarians often seem to take this principle to extremes, though. One sentiment in particular that I've noticed being expressed repeatedly is that all taxation is theft. I disagree with this - theft is illegal while taxes are legal. Also, one has a say in taxes (via voting and other forms of participation in the political process) but no say in getting robbed. Finally, one doesn't get anything back from getting robbed but taxes pay for useful government services and infrastructure. To me the accusation seems like an instance of the worst argument in the world.

Another thing that libertarians say is that government has a monopoly on (legal) aggression and violence. That sounds like a bad thing (because monopolies are bad, and violence and aggression are bad). But in this case two bads make a good - you don't want multiple violent organizations competing, as that would not tend to minimize the amount of violence. Since we can't eliminate legalized violence altogether (since otherwise we would have no way to arrest an uncooperative murder suspect) it's best that a single (accountable) organization has that monopoly.

There are other things that government has a natural monopoly over - things that benefit society as a whole but won't get done by the markets because there isn't any profit in being the one to do them - things like making sure the poor have enough to eat and access to life-saving and preventative medical care. Another case is where having multiple competing organizations would cause practical difficulties: I don't want my house to be connected to six different electrical networks, six different water supplies, six different sewers, six different telephone networks, have driveways connecting to six different road networks, have six different garbage collectors coming by, and I don't want my town to be served by six incompatible rail networks. The basic provision of utilities like these is best done by monopoly - even if building, servicing and billing can have competing providers (here in the UK I can choose from many different electricity companies to bill me each month, but they all have the same number to call if there is a power cut). Similarly, the military is probably best run centrally, otherwise different militaries representing different interests of the same country might end up fighting each other!

It makes sense to have one single organization take care of all the things that need to be done that can't or won't be taken care of by markets for one reason or another, and that organization is what we call government.

There's a similar concept in software engineering called a singleton object - a single chunk of memory where you put all the things that there is only one of in the program. It's bad engineering practice to put stuff in the singleton that doesn't really need to be there, because it leads to inflexible programs that allow you to have only one at a time of something that you might want to have multiple instances of. Similarly, it's bad practice for the government to do stuff that the market can do better - I wouldn't want government to get into the business of designing and making laptops, for example, since the market does a great job at that.
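
For the non-programmers, the pattern looks something like this in C++ (a minimal sketch of the idiom, not code from any particular program):

    class Government {
    public:
        // Everybody shares the one instance; it's created on first use.
        static Government& instance() {
            static Government theInstance;
            return theInstance;
        }
        Government(const Government&) = delete;             // no copies...
        Government& operator=(const Government&) = delete;  // ...allowed
    private:
        Government() {}  // nobody else can construct one
    };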

XT reset switch

Wednesday, October 10th, 2012

I've mentioned before that I have a system for remotely running programs on my XT. Of course, since I'm running native code (mostly written in assembly code) on a machine without any kind of memory protection (and especially since I'm sometimes doing crazy things like disabling DRAM refresh), these programs will inevitably sometimes crash the entire machine. I don't want to have to go and reset it manually every time this happens, so I needed a remote reset switch.

I found this post on the Vintage Computer Forums describing almost exactly the modification I wanted to make, except that it's connected to a manual switch. It's trivial to go from a switch that grounds a line to grounding that line programmatically, though, so I just hooked it up to the Arduino-based keyboard port manufacturing test device hack I already had, and now sending a particular byte over the serial line causes the Arduino to ground the power good line for 250ms and thereby reset the XT.
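
The Arduino side of this is tiny. Here's a sketch along the lines of what I described - the pin number and trigger byte here are placeholders rather than the values I actually used:

    const int powerGoodPin = 2;      // wired to the XT's POWER GOOD line
    const byte resetCommand = 0x7f;  // hypothetical trigger byte

    void setup() {
        // Leave the pin floating so the power supply drives the line.
        pinMode(powerGoodPin, INPUT);
        Serial.begin(9600);
    }

    void loop() {
        if (Serial.available() > 0 && Serial.read() == resetCommand) {
            // Ground the line for 250ms, then let it float again.
            pinMode(powerGoodPin, OUTPUT);
            digitalWrite(powerGoodPin, LOW);
            delay(250);
            pinMode(powerGoodPin, INPUT);
        }
    }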

Computer-controlled TV

Tuesday, October 9th, 2012

Before I bought my XT (but when I was planning on getting one) I knew I'd need a monitor for it. Fortunately for me, the region I was living in had recently transitioned from analog to digital TV signals, so a lot of people were disposing of TVs that were perfectly good apart from not having a built-in ATSC decoder. So I didn't have to walk past very many yard sales before I found an ideal little 13" Orion TV1334A. The only downside was that it didn't have a remote control, which I discovered I needed in order to change the brightness and contrast (without adjusting those, the old-style CGA has rather poor composite output). I tried to make a fake remote using an Arduino, an IR LED that I pulled out of an old DVD remote, IRremote and LIRC's remotes database, but unfortunately I found that the latter did not contain codes for the TV1334A.

Fortunately I was able to find a suitable remote on eBay for a reasonable price (albeit still slightly more than I paid for the TV, which annoys me slightly). If I had just done that to start with that would probably be the end of it, but now that I've gone to the trouble of writing some IR code for the Arduino I want to be able to use it! This isn't as useless as it might sound, because it would be nice to be able to control the TV from the computer - to turn it on from another room, run a CGA program remotely, view the result over a webcam and then turn it off again.
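
For reference, the transmitting side with IRremote looks something like this. The timings below are made-up placeholders - the real ones would be captured from the actual remote, since (as I mentioned) the LIRC database doesn't have this TV:

    #include <IRremote.h>  // the classic (pre-3.x) IRremote API

    IRsend irsend;  // transmits on pin 3 on most AVR-based Arduinos

    // Placeholder mark/space timings in microseconds (not real TV1334A codes).
    unsigned int powerButton[] = {9000, 4500, 560, 560, 560, 1690, 560, 560};

    void setup() {
        Serial.begin(9600);
    }

    void loop() {
        // Hypothetical protocol: any byte arriving over serial sends "power".
        if (Serial.available() > 0) {
            Serial.read();
            irsend.sendRaw(powerButton, 8, 38);  // 38kHz carrier
        }
    }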

Improved composite mode support for DOSBox

Sunday, October 7th, 2012

I recently contributed to a DOSBox patch to make composite output work in all CGA graphics modes. Most CGA composite games use BIOS mode 6 (port 0x3d8 value 0x1a, aka 640x200 1bpp mode), which gives a nice range of colours. However, there are a rare few games which use a 2bpp mode but have 3.58MHz vertical lines which are quite obviously designed to yield a different palette on the composite output. DOSBox currently shows the output of such games as they would appear on a digital (aka TTL or RGBI) monitor, which isn't always what the game author intended - the example that started the thread was Fooblitzky.

I had already written code to simulate composite CGA in 2bpp mode (and indeed all modes) but there is a complication - DOSBox is based around a 256-colour palette and my code assumes 24-bit colour. The first 16 colours are reserved for the digital CGA colours so that the palette entries don't have to be reloaded when switching between composite and digital, leaving only 240 colours for composite CGA. The current DOSBox CGA composite implementation uses 80 palette entries - 16 colours (one for each bit pattern) times 5 brightness levels (0, 1, 2, 3 or 4 pixels lit). I realized that actually only 16 of these palette entries are really needed, since the same information is used to index both the "bit pattern" table and the "brightness" table. Also, DOSBox's current implementation isn't quite right, since some of the colour fringes are desaturated - look at the top Tapper screenshot (the one from DOSBox) here and compare it to the more correct version below - look in particular at the middle of the "SODA" sign in the window, at the right edge of the D and the left edge of the O, which is grey in DOSBox (missing its fringing entirely). Since DOSBox uses a (1, 1, 1, 1) kernel for its NTSC filter, there are actually only 16 possible colour combinations (though they are permuted depending on the colour carrier phase).

I realized that a similar technique might be possible for the 2bpp modes. Each output composite pixel depends on the colours of four consecutive pixels of hdot width. These pixels cover at most 3 consecutive ldots, so any given pixel position depends on at most 6 bits of video data. It also depends on 2 bits of x position, so we would need 8 bits of palette index, or 256 colours - just slightly too many. There's an easy way to reduce the number of palette entries, though - half of the output hdots depend on only 2 consecutive ldots (since the sampled hdots exactly cover 2 ldots). So even hdots have 16 possible colours and odd hdots have 64, for a total of 16+64+16+64 = 160 colours - plenty to spare. What's more, the "render" part of the new algorithm is even faster than it used to be (although the mode-setting part is probably much slower - however, it runs sufficiently rarely that there's no need to optimize it).
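
To make the counting concrete, here's a sketch of one possible layout of those 160 palette entries. This just illustrates the 16+64+16+64 arithmetic - it's not necessarily the layout the patch actually uses:

    #include <cassert>
    #include <cstdint>

    // Palette slots grouped by x position (hdot mod 4). Even hdots depend
    // on 4 bits of video data (2 ldots), odd hdots on 6 bits (3 ldots).
    static const int slotBase[4] = {0, 16, 80, 96};  // 16+64+16+64 = 160

    int paletteSlot(int x, uint8_t videoBits) {
        int slots = (x & 1) ? 64 : 16;  // 2^6 or 2^4 possible bit patterns
        assert(x >= 0 && x < 4 && videoBits < slots);
        return slotBase[x] + videoBits;
    }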

ISA bus sniffer

Friday, October 5th, 2012

One of the trickiest parts of writing my cycle-exact 8088 emulator is going to be figuring out exactly when each part of each instruction is executed - in particular, at what point during each instruction's execution is each of its bytes removed from the prefetch queue? And (for instructions which do IO) at which points during execution are those IO requests sent from the Execution Unit to the Bus Interface Unit?

I was originally thinking that I would have to devise a clever set of experiments to find out - make a hypothesis about the timings, devise an experiment which behaved differently depending on whether that hypothesis was true or not (existence proof: if such an experiment were not possible I wouldn't care about the result for emulation purposes), rinse and repeat until the observed behavior of the emulator stops deviating from the observed behavior of the actual machine.

However, I have learned that there is an easier way to go about it. It turns out that the CPU outputs a couple of bits of information concerning the state of the prefetch queue on two of its pins (QS0 and QS1), allowing us to distinguish between 4 possible operations which can occur on each cycle: first byte of opcode removed, subsequent byte of opcode removed, queue emptied, and no change. Being able to read that information (along with exactly what the bus is doing) would make figuring all this out much easier. I didn't want to use a logic probe to do this because (among other reasons) I wanted to be able to set up a large number of experiments and run them all automatically. So instead I have designed an ISA card which (completely transparently to the PC or XT it's plugged into) uses a microcontroller to sample the state of many lines and transmit the results to another PC over a serial connection.
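
The QS1:QS0 encoding is documented by Intel, so decoding the sampled bits on the host side is trivial - something like this (a sketch, not the actual sniffer software):

    // 8088 queue status pins, sampled once per CPU cycle.
    enum QueueOp {
        queueIdle       = 0,  // QS1=0 QS0=0: no queue operation
        queueFirstByte  = 1,  // QS1=0 QS0=1: first byte of opcode removed
        queueFlush      = 2,  // QS1=1 QS0=0: queue emptied (e.g. by a jump)
        queueSubsequent = 3   // QS1=1 QS0=1: subsequent byte removed
    };

    QueueOp decodeQueueStatus(bool qs0, bool qs1) {
        return static_cast<QueueOp>((qs1 ? 2 : 0) | (qs0 ? 1 : 0));
    }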

Compared to a real logic probe we can only sample a few lines at once, only gather a couple of KB of samples at a time and can't sample very fast (though I think 4.77MHz should be possible), but the experiments I care about are all deterministic, so we can just repeat the experiment enough times to gather all the data I want. Here's the schematic for the bus sniffer and here's what the board layout looks like:

I've ordered a PCB from BatchPCB (the first time I've actually had a PCB professionally made) so we'll see how it works!

Adventures in CRTC lockstep

Monday, October 1st, 2012

Once I had achieved CGA lockstep, I tried some test programs. This image was made by cycling through the possible palette register values as quickly as possible (i.e. running a big unrolled loop of "INC AX; OUT DX,AL" targeting the palette register):

That worked great, except that in making it I noticed that the pattern wasn't always starting the same way - half the time the first visible scanline had a different set of colours. Somehow a bit of state was leaking through my lockstep routine!

After a while I figured out that it was due to the way I was getting the CRTC into lockstep with the CGA and CPU. The smallest frame that the 6845 CRTC can do is two character clocks (1 character by 2 scanlines - a 1 scanline high frame doesn't work with that CRTC). I thought I could get around this by going into high-res mode - then 1 character clock is 1 hchar so a frame would be 1 lchar and we'd be in a known place in the frame once we were in a known place in the CGA cycle.

Have you spotted the problem yet? The problem is that I don't know what the phase relationship is between the CGA clock and the CRTC clock - the first hchar of the frame could be either the left or the right hchar within the lchar! And in fact, which one it turned out to be was decided at random on startup.

With a bit of fiddling I eventually came up with a way to get the CRTC into lockstep as well. The trick hinges on the fact that if we set up the CRTC parameters so that one of the scanlines is displaying a normal visible image and one is overscan, we can tell which scanline is which by reading the display enable bit of the CGA status register. Then we delay an odd number of lchars if the display enable bit is set one way and an even number of lchars if it's set the other way (it doesn't matter which is which). Because we want to keep the CGA and CPU in lockstep as well, the difference in the codepath lengths must also be a multiple of 3 lchars, so delaying for X lchars one way and X+3 the other works fine.
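
In pseudo-C the core of the routine looks like this. The readDisplayEnable() and delayLchars() functions are hypothetical stand-ins for the real assembly, which polls the CGA status register at port 0x3da and runs counted delay loops:

    bool readDisplayEnable();  // true if the beam is in the visible region
    void delayLchars(int n);   // burn exactly n lchars of CPU time

    void alignCrtcPhase() {
        const int x = 1;  // only the parity relationship matters
        if (readDisplayEnable())
            delayLchars(x);      // one delay on one scanline...
        else
            delayLchars(x + 3);  // ...and one differing by 3 lchars on the
                                 // other: an odd difference, and a multiple
                                 // of 3 lchars to preserve CGA/CPU lockstep
    }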

That's about all there is to it. The full lockstep routine is on github. Once lockstep is entered it'll persist until you wait for a time that depends on an external event (such as reading from disk/serial/parallel/ethernet/joystick or waiting for a keystroke). That doesn't mean that lockstep mode games and trackmos are impossible, though. The keyboard can be read by polling (pretty much all PC software directly or indirectly uses an interrupt for keyboard access, but it isn't compulsory and I've done it by polling a few times). You just have to make sure the code paths are the same length no matter whether a key was pressed or not, and no matter which key was pressed if there is one, which can be done by adding suitable delays. Disk access is a bit more difficult, since there's going to be a DMA bus access at some unpredictable point, and after it's happened you'll be out of lockstep. I think the solution is to HLT after the disk access is complete and restart execution on a timer interrupt. In the event that lockstep between CGA and PIT isn't possible, regaining lockstep once the timer interrupt has occurred should be possible by delaying for N ccycles for some N between 0 and 15, plus a CGA memory access. Another possible way is to make sure the CPU is running code that is either:

  1. BIU-bound with no wait states, or
  2. EU-bound and never exhausts the prefetch queue

for the entire time that the accesses might be happening. That way the time taken to run the code doesn't depend on exactly when the accesses occur.

Adventures in CGA lockstep

Sunday, September 30th, 2012

As part of my project to emulate an IBM PC or XT with cycle accuracy, I need to be able to get the machine into a completely known and consistent state, which I call lockstep. That way I can run a program many times and be sure of getting exactly the same result each time.

This is a bit tricky, because while all the PC's clocks are derived from a single 14.318MHz crystal, they divide it in different ways. The CPU clock is made by dividing this frequency by 3, the PIT clock is made by dividing it by 12 and the CGA clock is made by dividing it by 16.

Getting the CPU clock in lockstep with the CGA clock is the difficult bit, since the CGA clock is in lockstep with the PIT clock by definition. (That's assuming such a lockstep is possible at all - I'm not sure offhand whether the phase relationship between the PIT and the CGA clock is always the same or whether it's randomized on startup. The latter would make it more complicated to use the PIT in lockstep mode, but that's not really a big problem, since the point of lockstep mode is to be able to do timing statically.)

Since the CGA clock and the CPU clock have periods which are relatively prime numbers of hdots, it's definitely possible to get them into lockstep. Once I had a rough idea of what the CGA wait states were, I realized that achieving lockstep ought to be possible with a combination of delays and CGA accesses. The algorithm would be:

  1. Do a CGA memory access, reducing the number of possible relative phases from 16 to 3.
  2. Delay for A ccycles.
  3. Do a CGA memory access, reducing the number of possible relative phases from 3 to 2.
  4. Delay for B ccycles.
  5. Do a CGA memory access, reducing the number of possible relative phases from 2 to 1.

Delaying for 16 ccycles gives the same relative phases as delaying for 0 ccycles, so the problem boils down to finding A mod 16 and B mod 16. That's only 256 possibilities (and probably quite a few of those will work) so trial and error works fine. Delaying for particular numbers of ccycles is okay too - the 8-bit MUL instruction takes 69 ccycles plus 1 ccycle for each set bit in AL, so as long as you don't mind waiting for at least 69 ccycles you can get a delay of any number of ccycles you like.
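
The delay generator can therefore be modelled like this (a model of the timing rule just quoted, not emulator code):

    #include <cstdint>

    // 8-bit MUL on the 8088: 69 ccycles plus 1 per set bit in AL, so any
    // delay from 69 to 77 ccycles is available by choosing AL appropriately.
    int mulDelayCcycles(uint8_t al) {
        int setBits = 0;
        for (int i = 0; i < 8; ++i)
            setBits += (al >> i) & 1;
        return 69 + setBits;
    }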

But there's a more fundamental problem - how do we recognize when we've succeeded? The definition of lockstep involves consistency - ending up in a known end state no matter what the initial state. So in order to determine whether we're in lockstep or not, we really need to be able to control the initial state - in other words, in order to figure out whether we're in lockstep or not we first need to be in lockstep! That's a bit of a chicken-and-egg problem.

If I had known exactly what the CGA wait states were at this stage I could have figured out the right A and B values on purely theoretical principles, but I didn't - my examinations of the CGA schematics left some questions (particularly in areas involving how the 8088 treats READY signals occurring at different clock phases, and how some apparent race conditions actually turn out in real hardware). It was only in the course of achieving lockstep that I discovered what the CGA wait states actually are.

I had a few false starts involving identifying the 6 behavior classes for the 27 possible transition tables involved in the long-term behaviors of repeated sections of code. For example, if a piece of repeated code has 3 possible long-term behaviors depending on the relative phase at the start, I know that the repeated section must leave the relative phase alone.

But that was getting rather complicated, and I wasn't really getting anywhere until I hit on a better way - I realized that I could visualize exactly how long a piece of code was taking by running it and then changing the CGA palette register, which has an immediate effect and so marks the position on the screen where the electron beam was pointing when the register changed.

That's only useful if the transition happens in exactly the same place on the screen in each frame (otherwise you don't get a stable image and can't see what's going on). Which sounds like we're back to the chicken-and-egg problem again. But it's a more limited kind of lockstep that we need for this particular experiment - we don't need absolute lockstep, just a way of getting code to run at a consistent place relative to the raster beam, from frame to frame. That is to say, it doesn't matter if next time we run the program the image appears in a different place on the screen.

Fortunately, there's a way to do this on the PC even if we don't have full lockstep, since we can use the PIT to introduce a lockstep that's just consistent from frame to frame, not from run to run. If we set a timer to go off every 262*912/12 = 19912 PIT cycles, it'll occur at exactly the same point in each frame. That's not quite enough though, because interrupts don't have a quite instantaneous effect on the CPU - the CPU finishes whatever instruction it's currently executing before starting an interrupt. So you have to make sure it's not executing any instructions - i.e. it is in the halt state. Another complication is that I had to disable all other interrupts and the DRAM refresh in order to avoid them messing up the timings, which meant that I had to access each DRAM row within a certain period of time lest the capacitors discharge and the memory contents decay - which meant that I couldn't leave the CPU halted for too long!
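
Spelling out that timer calculation (all the numbers are from above - 262 scanlines of 912 hdots per frame, with the PIT ticking once every 12 hdots):

    const int scanlinesPerFrame = 262;
    const int hdotsPerScanline = 912;
    const int hdotsPerPitCycle = 12;
    const int pitCyclesPerFrame =
        scanlinesPerFrame * hdotsPerScanline / hdotsPerPitCycle;
    static_assert(pitCyclesPerFrame == 19912, "one timer period per frame");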

Once I had a stable image I was able to generate the 16 different CGA/CPU relative phases with multiply instructions, and made a diagonal line that advanced 3 hdots (1 ccycle) on each line just by cycle counting. Then by placing CGA memory accesses between this diagonal line and the palette write I was able to see exactly what the CGA wait states were:

It took a while to get this image because whenever I added some code to a line I had to change the delay code at the end of the line to get the start of the next line to line up correctly, so there was a fair amount of trial and error involved.

In this image, every other scanline displays (just using normal 640-pixel graphics mode) a pattern that repeats every 16 pixels, so that I can see where the lchar boundaries are.

Once I had the CGA wait states, getting CGA/CPU lockstep was relatively easy. Here's a photo I took when I finally got it:

Note that the lower 16 black horizontal lines all end at the same position mod 48 hdots (you'll have to take my word that before the lockstep code they were at 16 different relative phases, like the lines further up which make a diagonal pattern).

Phew, that's a lot of complications for such a tiny piece of code!

Tomorrow we'll look at how to get the CRTC in lockstep as well.

The CGA wait states

Saturday, September 29th, 2012

As part of my project to emulate an IBM PC or XT with cycle accuracy, I also wanted to emulate the CGA card with cycle accuracy. That meant figuring out exactly what the wait states are when accessing CGA memory. Here's what I found out.

When talking about this stuff it helps to have a common terminology to talk about the several units of timing involved. This is the terminology I use:

  • 1 hdot = ~70ns = 1/14.318MHz = 1 pixel time in 640-pixel mode
  • 1 ldot = 2 hdots = ~140ns = 1/7.159MHz = 1 pixel time in 320-pixel mode
  • 1 ccycle = 3 hdots = ~210ns = 1/4.77MHz = 1 CPU cycle
  • 1 cycle = 4 hdots = ~279ns = 1/3.58MHz = 1 NTSC color burst cycle
  • 1 hchar = 8 hdots = ~559ns = 1/1.79MHz = 1 character time in 80-column text mode
  • 1 lchar = 16 hdots = ~1.12us = 1/895kHz = 1 character time in 40-column text mode

The wait state algorithm for the original IBM CGA is basically "wait 1 hchar, then wait for the next lchar, then wait for the next ccycle". That works out at between 3 and 8 ccycles depending on the relative phase of the CPU and CGA clocks. There are actually 16 possible relative phases (one for each of the hdots within the lchar at which the CPU cycle starts).

One relative phase has a 3 ccycle wait state and there are 3 relative phases for each of the other 5 possible wait state lengths (4, 5, 6, 7 and 8 ccycles respectively). 1+3+3+3+3+3=16. So the average wait state is (3+4*3+5*3+6*3+7*3+8*3)/16 = 5.8125 ccycles, but you might measure a different average depending on how your piece of code ends up synchronizing with the CGA clock.
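
That distribution is easy to check by simulating the rule above for each of the 16 phases - here's a quick simulation of my description (a sanity check, not emulator code):

    #include <cstdio>

    int main() {
        int histogram[9] = {0};
        for (int phase = 0; phase < 16; ++phase) {  // hdot within the lchar
            int hdots = 8;                          // wait 1 hchar...
            hdots += 16 - ((phase + hdots) % 16);   // ...then to the next lchar...
            int ccycles = (hdots + 2) / 3;          // ...then round up to a ccycle
            ++histogram[ccycles];
        }
        // Prints one phase with 3 ccycles and three with each of 4..8.
        for (int c = 3; c <= 8; ++c)
            printf("%d ccycles: %d phase(s)\n", c, histogram[c]);
    }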

In a way it's rather unfortunate because with a slight hardware modification I think the 1 hchar wait state could have been eliminated, making the average wait state about 3 ccycles shorter and roughly doubling the average speed of the CGA memory access.

Also unfortunately, "rep stosw" gives almost the worst possible wait state behavior. I haven't tried it yet, but I suspect that it would be possible to write CGA code that self-synchronizes to get the best possible wait states (though of course that would probably only improve performance on machines that were cycle exact with the machine that it was tuned for).

A third unfortunate thing is that the wait states are the same wherever the raster is on the screen - they aren't disabled during the retrace interval or anything like that. There's a good reason for that, though - the CRTC continues to strobe through the CGA RAM throughout the overscan/retrace areas for dynamic RAM refresh, and allowing the CPU access to the full memory bandwidth could result in loss of video RAM data, since the CGA doesn't participate in the system DRAM refresh cycles (which is a good thing, because otherwise all those wait states would propagate to the entire memory system).

Multifunction gates

Friday, September 28th, 2012

Recently, I came across an interesting article about unusual electronic components. One of the components that article talks about is the multifunction gate - a 6-pin integrated circuit that can act as one of several different 2-input logic gates depending on which input pins are connected to which incoming signal lines, and whether the ones that aren't connected are pulled high or low.

However, the gates mentioned by that article can't act as every 2-input logic gate, only some of them. The 74LVC1G97 can't act as NAND or NOR, the 74LVC1G98 can't act as AND or OR, and neither of the devices can act as XOR or XNOR. Their truth tables are as follows:

Inputs    74LVC1G97    74LVC1G98
 0 0 0        0            1
 0 0 1        0            1
 0 1 0        1            0
 0 1 1        1            0
 1 0 0        0            1
 1 0 1        1            0
 1 1 0        0            1
 1 1 1        1            0

That got me wondering if it's possible for such a 6-pin device to act as any 2-input logic gate. Such a 6-pin device is naturally equivalent to a single 3-input logic gate, since two of the pins are needed for power. There are only 256 (2^8) possible 3-input logic gates (since the truth table for a 3-input logic gate contains 2^3 = 8 bits of information). So it's a simple matter to just write a program that enumerates all the possible 3-input logic gates, tries all the possible ways of connecting them up, and sees what the results are equivalent to.

There are 16 possible 2-input logic gates: 0, 1, A, B, ~A, ~B, A&B, A|B, ~(A&B), ~(A|B), A^B, ~(A^B), A&(~B), (~A)&B, A|(~B) and (~A)|B. Discounting the trivial gates and the gates that ignore one of their inputs, and treating two gates as identical if we can get one from the other by swapping the inputs, gives us 8 distinct gates: AND, OR, NAND, NOR, XOR, XNOR, AND-with-one-inverting-input and OR-with-one-inverting-input.

There are 4 possibilities for connecting up each of the three input pins: low, high, incoming signal line A and incoming signal line B. I originally thought that connecting the output pin to one of the input pins might also be useful, but it turns out that it isn't - either the value of that input pin makes no difference (in which case it might just as well be connected to one of the other four possibilities) or it does. If it does, then either the output is the same as the input pin it's connected to (in which case the circuit is underconstrained, and not a function of the other two input pins), or the opposite (in which case it's overconstrained, and also not a function of the other two input pins).

So we have 256 possible 3-input logic gates, times four possibilities for each of the three input pins, times two possible states for each of the two incoming signal lines - that's 65536 circuit evaluations to try, which a computer program can run through in a timespan that is indistinguishable from instantaneous by unaided human senses.
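
Here's my rendition of such a program in C++ - a sketch, so the program I actually ran may have been structured differently, but the search space is the same:

    #include <cstdio>

    int main() {
        for (int gate = 0; gate < 256; ++gate) {  // the 3-input truth table
            bool achieved[16] = {};               // reachable 2-input tables
            int w[3];
            for (w[0] = 0; w[0] < 4; ++w[0])      // each pin: 0=low, 1=high,
            for (w[1] = 0; w[1] < 4; ++w[1])      // 2=signal A, 3=signal B
            for (w[2] = 0; w[2] < 4; ++w[2]) {
                int table = 0;
                for (int ab = 0; ab < 4; ++ab) {  // both incoming signals
                    int index = 0;
                    for (int pin = 0; pin < 3; ++pin) {
                        int v = (w[pin] < 2) ? w[pin] : ((ab >> (w[pin] - 2)) & 1);
                        index |= v << pin;
                    }
                    table |= ((gate >> index) & 1) << ab;
                }
                achieved[table] = true;
            }
            int count = 0;
            for (int t = 0; t < 16; ++t)
                count += achieved[t];
            if (count >= 14)  // print the best performers (results described below)
                printf("3-input gate %02x yields %d of the 16 tables\n", gate, count);
        }
    }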

Running the program didn't find any 3-input gates which can be configured to act as all of the 16 2-input gates, but it did find something quite interesting - a dozen 3-input gates which can be configured to make 14 of them. Since swapping the inputs around gives equivalent gates, there are actually just two different gates, which for the purposes of this essay I'll call Harut and Marut.

Inputs    Harut    Marut
 0 0 0      0        1
 0 0 1      0        1
 0 1 0      1        0
 0 1 1      1        0
 1 0 0      1        0
 1 0 1      0        1
 1 1 0      0        1
 1 1 1      1        0

Note that the truth tables for Harut and Marut are quite similar to those of the 74LVC1G97 and 74LVC1G98 respectively, just with two of the rows swapped or inverted. I haven't been able to find Harut and Marut gates on Mouser - it would be interesting to know if anybody makes them. One possible downside is that the 3-input gate isn't as useful in its own right (the 3-input gates for the 74LVC1G97 and 74LVC1G98 are just a 2-input multiplexer and the same with an inverted output, respectively).

I think it should be possible to design a 6-pin device that can yield any 2-input gate by making the supply lines take part in the configuration as well. For example, if you had a Harut gate and a Marut gate which give high-impedance outputs when power is not applied, you could put diodes on their supply lines and connect them up in parallel with the supply lines interchanged. Then applying power in the normal way would yield a Harut gate and reversing the polarity would yield a Marut gate. There are probably better ways to do it.