By Bruce S.
The PDP10-KI went down sometime in the fall, maybe October. This is the machine just to the right of the CDC6500 as you come into the second-floor computer room. I noticed fairly quickly and tried to reboot it several times, but it hung every time.
OK, it must be time to run diagnostics, which I proceeded to do. It passed all the diagnostics from DBKAA to DBKAH, which are the ones on paper tape, but the TD10 DECTape controller, just to the right of the console, had quit again, as it has done almost every time I have needed to use it.
This TD10 and I just do not get along very well. It always kicks me around the block several times before it will let me know what is wrong so I can fix it. Because of this history, it sometimes takes a while for me to generate the gumption to work on it. This time was no exception, since I was busy trying to get the KA working. If you don’t know about gumption and gumption traps, you should read the classic “Zen and the Art of Motorcycle Maintenance”, which doesn’t really talk much about motorcycles or Zen.
Back to our story: We fixed the KA, moved it down to the second-floor computer room, fixed it again, and had a small gathering to boot it the first time. The CDC was being its normal self, but it was refusing to be down when I arrived at work; finding it down is my signal to drop everything and work on it.
I was running out of reasons to avoid working on the KI.
Contrary to normal behavior, it only took a few hours to figure out the problem with the TD10, so I could run the diagnostics that are very awkward to load from paper tape.
All the normal diagnostics ran except for DBKAL. I think it took a few days to remember that there is a diagnostic for which the binary we have is wrong, and I have to patch a couple of locations after loading in order for it to work. Now DBKAL and DBKAM work. On to some more obscure ones. All the CPU ones seem to pass. Does it boot now? No, it still hangs after the OS is loaded and we type “GO” to fire up timesharing.
What else can we test? We ran DDRHA, which tests the RH10 disk controller, but it passed. We then ran DDRPI, which tests the disk drives. Now, we don’t really use the disk drives it is expecting to test; we use our MDE (Massbus Disk Emulator). We have been using this MDE for about 5 years now, but there could still be a bug hiding in there somewhere.
With DDRPI, everything looked fine for about 20 minutes while it was doing register tests, seek tests, and all-ones and all-zeros tests. When it got to testing the surface of the disk, things started to go wrong. It would get an error where it looked like the data was misplaced, as if it were reading the wrong sector or something like that.
How could that be? This thing has been working fine for over 5 years. In fact, when we ran DDRPI from the KA, using the KI’s memory, RH10s, DAS33, and MDE, EVERYTHING was FINE, even the surface test!
How about the memory? It passes my little MARCH memory test, from the KI or the KA. Dragging out the DECTape again, we loaded up DDMMD, which is one of the memory diagnostics. We fired it up, and it ran fine for about 15 minutes, whereupon it started spewing out errors. We have run this a BUNCH, and the errors usually seem to start at location 374000, and the data seems to be inverted from what it should be. The test complains about address bit 24.
We run the same test from the KA against the KI’s memory, and it works fine! It is handy to have the KA right across the room to enable this kind of testing.
What is really going on here? Let’s look at the console:
OK, TN=2, which means it is doing the “Address” test. AS=F24 turns out to mean that it was doing fast addressing on address bit 24. “What does that mean?” you may ask. I did! After much grovelling over the DDMMD listing, and consulting with Rich Alderson, I found that they fill memory with the address and its complement, then go through reading each location, verifying it is correct, and writing the complement into it, then go back and read and verify that every location holds the complement. Ah, but what about that F24 part? When they are reading and complementing the data, they start skipping locations by changing which address bit they increment first. The first time through, they use bit 35 as the LSB, but the next time they shift the LSB over to bit 34, then 33, and so on. When they get to bit 24, we run into this problem.
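For readers who think better in code, here is a minimal Python sketch of the fast-address idea as I understand it from the description above. It is not the DDMMD code. The memory size, the exact data pattern (address in the right half, complement in the left), and the end-around carry used to visit every location are all my assumptions, and every name in it is made up for illustration.

```python
ADDR_BITS = 12                      # assumed toy memory: 4096 words
SIZE = 1 << ADDR_BITS
ADDR_MASK = SIZE - 1
HALF_MASK = 0o777777                # 18 bits
WORD_MASK = (1 << 36) - 1           # 36-bit PDP-10 word


def pattern(addr):
    """Assumed fill pattern: complement of the address in the left half,
    the address itself in the right half."""
    return ((~addr & HALF_MASK) << 18) | (addr & HALF_MASK)


def addresses(shift):
    """Yield every address exactly once, incrementing at bit position
    `shift` first (0 means the ordinary LSB). The counter is rotated left
    by `shift`, so carries wrap around into the low bits (an assumption)."""
    for count in range(SIZE):
        rotated = (count << shift) | (count >> (ADDR_BITS - shift))
        yield rotated & ADDR_MASK


def fast_address_pass(mem, shift):
    """Check-and-complement in the shifted order, then verify bottom to top.
    Returns the list of addresses that read back wrong."""
    errors = []
    for addr in addresses(shift):
        if mem[addr] != pattern(addr):
            errors.append(addr)
        mem[addr] = pattern(addr) ^ WORD_MASK
    for addr in range(SIZE):
        if mem[addr] != pattern(addr) ^ WORD_MASK:
            errors.append(addr)
    return errors


def run_address_test():
    """Run one pass per shift; shift n corresponds to PDP-10 bit 35 - n."""
    results = {}
    for shift in range(ADDR_BITS):
        mem = [pattern(addr) for addr in range(SIZE)]   # fill bottom to top
        results[35 - shift] = fast_address_pass(mem, shift)
    return results


if __name__ == "__main__":
    for bit, errors in run_address_test().items():
        print(f"F{bit}: {len(errors)} errors")
```

With 12 address bits the last pass is the F24 one. On this simulated memory every pass is clean, of course; the point is only to show the order in which locations get touched.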
How do I figure out what is really going on here? I decided to write a version of my MARCH that does this, MRCHFA. It took a while, but I finally got it to work on the KA, and tried it on the KI. Unfortunately, the KI passed it too! What else are they doing differently? OK, more grovelling over the listing: They are stuffing all their inner loops down in the fast ACs to speed them up. On to MRCHF3, which pushes the inner loops down to the fast ACs. Does the KI fail that one? Nope!
I’m running out of ideas. Where do we go from here? I decide to just watch it for a while and see what happens AFTER it starts to fail. I see it fail bit 24 from 374000 to 377777, then both bit 23 and bit 22 over the same range, and then it starts failing location 700000 in the same way. Shortly thereafter, the program gives up in disgust, stops printing the results, and starts ignoring the errors. Now I just watch the lights on the ARM10 memory.
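A side note on those numbers, offered only as arithmetic and not as a diagnosis: with bit 35 as the address LSB, bit 24 has a weight of 4000 octal, which is also the length of the failing range. A small Python check (the helper name is mine):

```python
def bit_weight(bit):
    """Weight of PDP-10 bit `bit`, where bit 35 is the least significant."""
    return 1 << (35 - bit)


first, last = 0o374000, 0o377777
span = last - first + 1             # number of words in the failing range

print(oct(bit_weight(24)))          # 0o4000
print(oct(span))                    # 0o4000
```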
I got used to the way the lights blink while working on MRCHFA and MRCHF3: as the LSB moves up the address, the slow-blinking address bit follows it.
But wait, that isn’t what I am seeing: I see the lights increment from the bottom, do the FA thing, then increment from the bottom again, and then shift the LSB over. What is going on here? Is there anything on the ARM10 that will tell me anything? Yes, there are the read and write lights. While writing, both lights come on, but a read only lights the read light. After the FA thing, it just reads! Back to grovelling over the listing some more.
OK, what they do is: fill memory from the bottom to the top, check and complement using the shifting LSB, and then check from the bottom to the top. So I wrote another new test, called THEIRS, which of course doesn’t catch the problem either. I am running out of hair to pull out here!
As I write this, both machines are happily running DDMMD against each other’s memory, with no errors. No happy ending… YET!