ZpqrtBnk

Learning Computers Learning

Posted on December 20, 2015 in life and edited on December 27, 2017

The call came in around 10am this morning. Straight from the factory's manager. Mildly worried. Since the factory started at 5am, all they had been building was a certain model of Swedish-localized PC. Stacking nearly a thousand of them now, and still no sign of a different model. For this experienced man, piloting the work of about a thousand workers, the situation had a strange smell... and would we please check that everything was OK with the systems?

A quick glance at the manufacturing software showed that it was receiving consistent requests from the order-management system to build the same model of Swedish-localized PC, again and again. The order-management system, in turn, had been locked on one particular order of two hundred PCs of that model, to be shipped to a retailer in Sweden. Waiting for the manufacturing to complete.

Something was wrong indeed, and the factory was probably going to build the same PC again and again until it ran out of supplies or...

Stop!

Call the factory's manager. Stop the factory. Immediately. Investigate.

"You know we lose about one million per hour when the factory is not running, do you?" - he asks. "And are you telling me we probably need to disassemble about eight hundred PCs, too? You know we lose..." Yes. We will deal with disassembling later: at the moment, the number one priority is to understand what is going on and restart the factory. There are one thousand workers waiting idly there - and costing money.

So... the order-management system is locked on one order, and the manufacturing system keeps getting requests to build the same PC, over and over. Obviously something in between is failing. That would be the middleware that interfaces the two systems, a brilliant piece of software written in COBOL, which for obscure reasons maintains the quantity as a PIC 9(2), aka a two-digit number. That is, a number that goes 00, 01, 02... 98, 99, 00. Yes, it loops. No, it cannot reach 100.

The order-management system is supposed to be aware of this, and to split orders into batches of at most 99 items each. Could it be that... yes! A developer made a change the day before, which somehow bypasses the split, and the order-management system is trying to send a request for a batch of 200 PCs to be built... a number that would never be reached!

The modern equivalent would be:

for (byte b = 0; b < 400; b++) { ... }
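For the curious, here is a minimal Python sketch of the same trap. It is not the actual middleware (that was COBOL, and I am not reproducing it here): the function names and the `safety_limit` guard are made up for illustration, and the guard only exists so that the example terminates.

```python
MAX_QUANTITY = 99  # largest value a two-digit field (COBOL PIC 9(2)) can hold


def next_count(count: int) -> int:
    """Advance a two-digit counter: 00, 01, ... 98, 99, then back to 00."""
    return (count + 1) % (MAX_QUANTITY + 1)


def units_built_before_stop(target: int, safety_limit: int = 1000) -> int:
    """Keep building until the two-digit counter equals `target`.

    `safety_limit` is a hypothetical guard standing in for the phone call
    from the factory: without it, the loop never ends when target > 99.
    """
    count = 0
    built = 0
    while count != target and built < safety_limit:
        built += 1
        count = next_count(count)
    return built


def split_order(quantity: int, batch_size: int = MAX_QUANTITY) -> list:
    """Split an order into batches that fit in the two-digit field."""
    full, rest = divmod(quantity, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


print(units_built_before_stop(50))   # stops on its own
print(units_built_before_stop(200))  # only the guard stops it
print(split_order(200))              # what should have been sent instead
```

With the split in place, an order of 200 becomes batches of 99, 99 and 2, each of which the counter can actually reach.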

Revert the changes. Rebuild. Restart. Call the factory manager. Now is the time to...

Restart!

It is 11:15am, and the factory has restarted. A small team there is already setting up an assembly line to work in reverse mode and disassemble the eight hundred PCs that were never ordered. The total cost of the incident will probably amount to around two million.

Meanwhile, a post-mortem meeting is organized between support, developers and testers. Processes are to be adapted, decisions to be made. But make no mistake: within the next ten days, one support engineer will get the dreaded call on his pager, one night around 4am, because some system will refuse to work, and he will spend the next 40 hours in front of an amber terminal, digging for errors, tracing through code, racing against time and the money loss counter.

Learning, the hard way

Once you have read code at 3am trying to figure out what it is doing exactly, you know precisely what type of comments "good comments" are. Or what "elegant code" means: not some over-convoluted regular-expression magic, no, no, just nicely indented and formatted blocks of instructions, meaningful names, things like that. Things you can read and understand at 3am.

And when writing code, you would check, and check again, that it has a chance of running correctly. Maybe not with the same paranoia as when NASA built software for the Shuttle, but nevertheless. You learned that errors had consequences, maybe not blowing up an Ariane 5 rocket, but well, being awake at 3am is painful enough.

And now I am wondering whether learning to code on "the web" is a Bad Education. It teaches you that if it does not work, you can always fix it. It teaches you that everything eventually works. Ship fast, fix fast. Ship anything that has a reasonable chance to work, and then tweak it until it actually works.

There was a time we thought we would be able to prove programs using clever mathematical tools, the way we prove theorems. And the computer would do exactly what it was designed for.

Learning, a new way

We are not exactly there, and yet I do not feel like turning this post into an endless rant.

It is quite fascinating to see cars drive themselves, or DeepMind's AlphaZero learn how to play Go or chess, without anyone actually being able to describe how they do it. They "learn" and then they "know". An entire field of software development is growing, where we essentially, awkwardly, try to nurture computers with enough tools so that they can figure things out by themselves.

Computers will become intelligent in a way we certainly do not expect them to.
