High-technology solutions can trap even the smartest folks. The best advice: Trust, but verify.
Programming errors that weren’t detected during testing cost Knight Capital Group Inc. nearly half a billion dollars within two hours when its computers bought stocks for which it had no buyers. The company received a financial lifeline, but its stock lost 75 percent of its value.
Bad operational risk management can kill your company. Let's learn how.
I learned my first risk management rule while spending my college summers working in a cement and tile plant. Early one morning, I was underneath a 600-pound cement step hanging from the end of a forklift while I guided the step toward a flatbed trailer.
Seeing me standing carelessly under the cement step, the grizzled 50-year-old yard foreman yelled, "Don't stand under the steps."
In trying to assure the yard foreman that I, the very smart college student, was perfectly safe, I yelled back, "It's got a chain wrapped around it."
Now unable to contain his frustration with the idiot college kid, the foreman pointed his morning cigar at me and thundered: "Chains break!"
This was an "Aha" moment for me. How could I have been so careless that a foreman had to teach me that "chains break"? Simple. I had lost all fear of the risk because I was mesmerized by that thick chain.
So today, let's go find some other people who have been equally mesmerized by modern-day chains and find out what happened to them when their chains broke.
Knight Capital Group Inc., a key player servicing trades for the New York Stock Exchange, was badly wounded early this month, and required $400 million of rescue money to stay in business. Per the Wall Street Journal, "Seventeen-year-old Knight Capital was considered a pillar of the stock market, matching buyers and sellers for some $20 billion of trades a day. Knight uses complex algorithms to trade swiftly ... for small retail customers."
This describes a company whose mission-critical risk is the perfect execution of its electronic trading platform. However, in less than two hours on a single day, a failure in that platform caused a $440 million loss that was nearly fatal to the firm.
How could this happen? Here's how. On the day of the incident, Knight's computers, talking to those of the New York Stock Exchange, placed thousands of orders for which there wasn't a customer. Because of a glitch in Knight's brand-new system, its computers were placing orders with the stock exchange's computers on their own, with no real customers behind them.
By the time Knight's IT engineers caught the problems, the firm "owned" thousands of trades for which it had no customer. Additionally, once Knight's real customers were alerted to the problem, they pulled their business -- similar to a run on the bank. Because Knight was stuck holding all those trades, it needed a giant rescue package to keep from going out of business.
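For readers who want to see what "verify" might look like in software, here is a minimal sketch of a check that refuses to send an order unless a known customer stands behind it, and that halts everything after a few failures. It is only an illustration, not a description of Knight's system: the names `Order`, `OrderGate`, `known_customers` and `max_rejects` are all invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Order:
    """A hypothetical outgoing order; field names are assumptions."""
    order_id: str
    symbol: str
    quantity: int
    customer_id: Optional[str]  # None means no customer is behind the order


class OrderGate:
    """Hypothetical guard that sits between the trading system and the exchange."""

    def __init__(self, known_customers, max_rejects=3):
        self.known_customers = set(known_customers)
        self.max_rejects = max_rejects  # rejects tolerated before halting
        self.rejects = 0
        self.halted = False

    def allow(self, order):
        """Return True only if the order may be sent to the exchange."""
        if self.halted:
            # Once halted, nothing goes out until a human investigates.
            return False
        if order.customer_id in self.known_customers:
            return True
        # No real customer behind this order: count it and maybe halt.
        self.rejects += 1
        if self.rejects >= self.max_rejects:
            self.halted = True
        return False


gate = OrderGate(known_customers={"C1"}, max_rejects=2)
orders = [
    Order("1", "ABC", 100, "C1"),   # real customer
    Order("2", "ABC", 100, None),   # nobody behind it
    Order("3", "XYZ", 50, None),    # second orphan order triggers the halt
    Order("4", "ABC", 10, "C1"),    # real customer, but the gate is halted
]
results = [gate.allow(o) for o in orders]
print(results)
```

The design choice worth noticing is the halt: the gate does not simply drop bad orders and carry on, it stops trusting the system feeding it until someone checks the chain.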
How, you ask, did this happen in the first place? First the nice technical explanation and then the real reason. Technical reason: Knight's IT folks designed a new computer system to integrate with the new system that the New York Stock Exchange was installing on the very day of the fiasco. Unfortunately, the two systems didn't play well with each other.
Now here is the real reason: human error. Some very well-educated IT professional failed to "validate the modification in a non-production environment." In lay terms, the engineers never tested the change outside the live system before switching it on.
Let's take a look at some other high-tech examples from the files of a risk manager I know to learn how being mesmerized by high technology has contributed to the broken-chain syndrome.
Health care system. The company believed it had an automatic guard against all human error. Unfortunately, some human managed to accidentally shut down 2,000 servers, affecting thousands of users and millions of customers. The level of staff disruption eventually made its way to patient service. The health care rule of "First do no harm" took second place to "Don't touch that dial, Dave."
Food sales. A system change resulted in no product going out the door because customers could not be identified. So as perishable foods began stacking up on the dock, customers quickly turned to other sources. Millions of dollars lost directly and many more millions lost as the company tried to win back its old customers.
Just-in-time manufacturer. The company believed it was safe because it had a backup system. (Think -- "It has a chain around it.") On the fateful day, the backup system kicked in when the main plant went down. Unfortunately, the backup system quickly began falling further and further behind because it was never designed to handle a typical full day's workload. (Think -- "Chains break.")
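The arithmetic behind "falling further and further behind" is simple enough to sketch. The figures below are made up for illustration (1,000 jobs arriving per hour against a backup that can handle 600), and the function name `backlog_over_day` is my own; the story above gives no actual numbers.

```python
def backlog_over_day(arrivals_per_hour, backup_capacity_per_hour, hours):
    """Return the unprocessed backlog at the end of each hour."""
    backlog = 0
    history = []
    for _ in range(hours):
        backlog += arrivals_per_hour  # new work arriving this hour
        processed = min(backlog, backup_capacity_per_hour)
        backlog -= processed          # the backup clears what it can
        history.append(backlog)
    return history


# Hypothetical figures: the backup handles only 60 percent of the normal load.
history = backlog_over_day(arrivals_per_hour=1000,
                           backup_capacity_per_hour=600,
                           hours=8)
print(history)  # [400, 800, 1200, 1600, 2000, 2400, 2800, 3200]
```

Under these assumed numbers the backlog grows by 400 jobs every hour and never recovers. Running the same sum with a realistic full-day workload before the main plant fails is one way to verify a backup rather than trust it.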
There are three main sources of error for operational risk:
• Individual decisions to minimize cost irrespective of reliability -- the penny-wise, pound-foolish error.
• Human error -- usually relying on memory or failing to validate new procedures. "That ought to do it" is not an engineering solution.
• Lack of contingency plans -- a common engineering hubris.
In dealing with operational risk, my personal advice to avoid your own "broken-chain syndrome" is to create a better management chain.
My favorite rule, used by President Ronald Reagan when negotiating with the Soviet Union, is "Trust, but verify."