After losing verdict, Toyota settles in sudden acceleration case

walter Lee · Oct 28, 2013

bwilson4web said: ↑

That is why many of the comments had more insight than the article author. But the 800 page report might be interesting. Also, they mentioned some sort of simulator.

I would be more impressed if they'd mounted the computers on a test fixture with stand-in sensors and then shown how to exploit the faults. Instead, they toss an 800-page technical report at a typical juror who may barely know the 'three-finger salute.'

Bob Wilson

Having multiple error bits flagged on the ECU device control block suggests a cascading system error: a primary error might trip several secondary errors, which in turn trip several additional errors, and so forth. Without time-sequence stamps, cascading system errors make it difficult to track down the original primary error unless one has the source code, which tells the test engineer the interrupt priority (the error-trapping sequence/hierarchy). Finding the primary error in a cascading error situation is like the proverbial "finding a needle in a haystack."

Embedded computer systems like those found in motor vehicles are often written in machine code because of memory restrictions and the need for high operating speeds at low power levels. Custom embedded systems usually have custom ROM built in to hold the software, so even with the correct disassembler to read the code, understanding it can be problematic unless you know the *entire* memory map of where all the embedded ROM locations are. For example, machine code writing to a memory location may look totally useless until you realize it is writing to a ROM location to test whether the code has been illegally copied.

FL_Prius_Driver · Oct 28, 2013

My main day job is overseeing the design of manned and unmanned flight control electronics. I've watched how, in a poorly designed flight control system, an SEU (cosmic ray) completely destroyed a UAV in under a second. I'm talking huge fireball, not a slight glitch. The entire engineering approach to these flight safety systems requires a design culture very, very different from Toyota's and every other car maker's approach to vehicle electronics. kbeck gave a pretty good description of what is done at the low engineering level, so I'll talk about the high-level company culture and approach that ultra-safe, ultra-reliable flight controls require:

1) Avionics starts with a nationally or internationally recognized standard for developing the system design, hardware design, and software design. Basically, the organization has to set, meet, and prove it meets the safety and reliability standards that industry avionics experts have proven to be valid and achievable. These are demanding. In the avionics world, this translates into three or more independent units calculating the flight control algorithm and a selection architecture ensuring all agree. When they do not agree, the odd unit out is ignored and must be fixed before flying again. When a flight control unit like this fails, it had better be due to a meteorite hitting it, or a lot of planes will be grounded quickly.

2) The three independent units must be proven not to have "common faults". An example would be a software bug that would occur simultaneously in all three units if they were running the same software. So of the three or more units, each should have a completely different real-time operating system, the software for each would be written by independent teams, and three different compilers would be used to compile the source code into an executable. Then three different processors would be desirable. In reality, this may not be possible (there are not that many aviation processor chip makers), so there is some streamlining, but the intent can be accomplished if the design teams are disciplined.

3) The logging and data storage of all the processing data is an integral part of the design. When an SEU occurs, exactly what happened and how it was handled must be recorded completely. At least in the fireball described above, the point where the SEU inverted the flight control algorithm was located exactly thanks to real-time telemetry, so it was determined why the UAV turned into a pile of ashes.
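
As a rough illustration of the voting idea in point 1, here is a minimal Python sketch of a two-out-of-three voter. The function name `vote`, the tolerance, and the sample values are made up for this example; a real flight control selection architecture is far more involved than this.

```python
from typing import Optional, Sequence, Tuple


def vote(outputs: Sequence[float], tolerance: float = 0.01) -> Tuple[float, Optional[int]]:
    """Pick the agreed value from three redundant channels (hypothetical sketch).

    Returns (selected value, index of the odd unit out, or None if all agree).
    Raises RuntimeError when no two channels agree within `tolerance`.
    """
    if len(outputs) != 3:
        raise ValueError("this sketch handles exactly three channels")
    # The median of three is always backed by a majority if any majority exists.
    median = sorted(outputs)[1]
    outliers = [i for i, v in enumerate(outputs) if abs(v - median) > tolerance]
    if len(outliers) > 1:
        raise RuntimeError("no two channels agree")
    return median, (outliers[0] if outliers else None)


# Channel 2 has gone off on its own; it is ignored and flagged for repair.
value, odd_unit = vote([10.0, 10.005, 42.0])
print(value, odd_unit)
```

The voter only helps if the three inputs fail independently, which is why point 2 insists on different operating systems, teams, and compilers.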

I see none of these lessons applied in any car ECU. (However, Elon Musk has figured out just how critical ECU recording is, so at least one car maker has a clue.)

kbeck · Oct 28, 2013

jdcollins5 said: ↑

I am not trying to sugar coat anything here. I am just asking an obvious question about the software. If it was a software problem why have we not seen many more such sudden acceleration accidents?

Most of what you said above has nothing to do with the software.

Don't get me wrong, I am not a Toyota fan and think their response was totally wrong. I just do not understand why we have not seen additional incidents.

So, let me make up off the top of my head why zillions of Toyotas aren't accelerating like crazy all over the landscape:

If it was that obvious and common, it wouldn't have made it out of R&D over at Toyota, or out of the testing platform. (Unless it was a one-off that wasn't replicated. Those sometimes get "smoothed" over. It's hard to troubleshoot something that's not there.)

Think about my Cosmic Ray hypothesis which, as I said before, I wasn't kidding about. We're talking about Total Electronic Disruption a couple of micrometers across and a couple of miles long at random angles, at random times, and not that high a density. (If it was a high enough density to be really obvious, we'd all be dead of radiation poisoning. As it is, cosmic rays explain why, for example, it takes ~5 pregnancies before a viable fetus is formed, the rest all spontaneously aborting - defects in the genome caused in large part by cosmic rays kind of kill the cell.) So, once in a great while, a Cosmic Ray comes down and hits the right bit at the right time. And the strength of the Ray and the strength of the RAM cell (or whatever) have to be compatible. For all we know, there may be Toyotas out there that, when driven through a veritable Cosmic Ray storm, show no ill effects, and others that stop working the moment the sun burps. In any case, given enough Toyotas out there, the laws of random numbers are such that it would be a near certainty that some car, somewhere, is going to get zapped eventually.

So: Random events. Highly unlikely. But, given the software, these things will occur.
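
The "near certainty" part is just arithmetic. If each car independently has some tiny probability p of an upset over a given period, the chance that at least one of N cars gets hit is 1 - (1 - p)^N. A small Python sketch follows; the one-in-a-million rate and the fleet size are purely illustrative assumptions, not measured figures.

```python
import math


def prob_at_least_one(p_per_car: float, n_cars: int) -> float:
    """P(at least one event among n independent cars) = 1 - (1 - p)^n."""
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n_cars * math.log1p(-p_per_car))


# Hypothetical inputs: a one-in-a-million chance per car, five million cars.
print(prob_at_least_one(1e-6, 5_000_000))
```

With those made-up inputs the result comes out at roughly 0.99: rare for any one car, close to certain for the fleet.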

KBeck

jdcollins5 · Oct 28, 2013

^Is that really the best answer you can give?

Mike500 · Oct 28, 2013

It's like the lottery, and the odds are even greater.

It can even be nearly impossible to reproduce.

FL_Prius_Driver · Oct 28, 2013

jdcollins5 said: ↑

... So if this was the fault of Toyota's software would not one expect many more events such as this?

Keep in mind that most drivers are extremely competent and handle it well. A lot of transient problems are totally resolved by a restart or reboot. Nearly all computers depend on this for crash recovery. I would rather have a software event like this any day over a tire blowing out while on the interstate. Likewise, most crashes are due to the driver doing something very wrong, with blaming the car as a routine excuse. So what the actual situation is can be mind-numbingly hard to determine conclusively.

To truly track down and determine how many events are due to internal malfunctions requires very good internal data logging. To the extent that Toyota, or any car maker, forgoes intensive logging of error conditions and events, they do hold responsibility. They must provide the hard data showing what the ECU commands were during a critical event. Once they start doing that, big changes would be forthcoming, the first of which would be an international standard for ECU reliability.

a_gray_prius · Oct 28, 2013

kbeck said: ↑

So, let me make up off the top of my head why zillions of Toyotas aren't accelerating like crazy all over the landscape:

If it was that obvious and common, it wouldn't have made it out of R&D over at Toyota, or out of the testing platform. (Unless it was a one-off that wasn't replicated. Those sometimes get "smoothed" over. It's hard to troubleshoot something that's not there.)

Think about my Cosmic Ray hypothesis which, as I said before, I wasn't kidding about. We're talking about Total Electronic Disruption a couple of micrometers across and a couple of miles long at random angles, at random times, and not that high a density. (If it was a high enough density to be really obvious, we'd all be dead of radiation poisoning. As it is, cosmic rays explain why, for example, it takes ~5 pregnancies before a viable fetus is formed, the rest all spontaneously aborting - defects in the genome caused in large part by cosmic rays kind of kill the cell.) So, once in a great while, a Cosmic Ray comes down and hits the right bit at the right time. And the strength of the Ray and the strength of the RAM cell (or whatever) have to be compatible. For all we know, there may be Toyotas out there that, when driven through a veritable Cosmic Ray storm, show no ill effects, and others that stop working the moment the sun burps. In any case, given enough Toyotas out there, the laws of random numbers are such that it would be a near certainty that some car, somewhere, is going to get zapped eventually.

So: Random events. Highly unlikely. But, given the software, these things will occur.

KBeck

This is demonstrably false. There are repair mechanisms in all cells which repair many, many defects in the genome and checkpoints that prevent cell division until these defects are fixed (although some beyond repair will undergo apoptosis). However, these processes don't always work. It's clear that they cleaned out all the "shallow bugs" but the deep, rare ones are always going to be hard to find in any software.

Mike500 · Oct 28, 2013

Then again, the bug might NEVER surface throughout the entire life of the program.

Whirldy · Oct 28, 2013

Rude person's said: ↑

Juries and courts do not always reach the correct scientific decision.

History will show many examples of humans making decisions that are not technically supportable.

+1 «As the thinker thinks, the prover proves.»

fuzzy1 · Oct 28, 2013

jdcollins5 said: ↑

... I am just asking an obvious question about the software. If it was a software problem why have we not seen many more such sudden acceleration accidents? ...

I'm not aware of any minimum (non-zero) occurrence rate. Various bug occurrence rates should fill the entire spectrum between 'too fast to count' and 'so rare it probably won't happen even once'. The frequent ones will get far more attention and debugging effort than the extremely rare ones.

A certain type of electrical hardware error can theoretically be reduced to any arbitrary finite level above zero, but not to zero itself. In an imaging system where it just causes sparkles on a video display, it needs only to be kept low enough not to be much of a distraction. But in other uses where it can upset a system, it can theoretically be pushed down to once per second, once per month, once per warranty period, once per device lifetime, or even to once for the entire product build over the age (so far) of the Universe. At a cost, of course. But it cannot be completely eliminated, even in the absence of cosmic rays.

So just because an alleged bug doesn't materialize at a significant rate doesn't mean it must be just a figment of some lawyer's imagination. But I also suspect that when the extremely rare bug does happen, it most likely will get lost under the flood of 'pilot errors', not landing on the desk of the ambulance chaser who knows about it with a convincing client available.

bwilson4web · Oct 29, 2013

Perhaps "never" might be a little strong:

austingreen said: ↑

. . .

Gen III Prius never had any of the reported problems. Although some drivers did report unintended acceleration, none was validated, and AFAIK no fatalities occurred from a Gen III Prius UI acceleration.

Our significant Gen III problem was the brake system:

Software update recall - resolved or diminished the "brake pause" problem.

Accumulator replacement - the first 80,000 apparently had a defective, metal bellows replaced.

The rule of thumb I follow with intermittent problems is that they are like the Sith:

Always two, there are. No more, no less. A master, and an apprentice.

Source: Yoda, Episode I: The Phantom Menace

A major intermittent problem often provides cover for a second, less frequent one.

We're also seeing reports of Gen III in taxi service having higher failure rates: traction battery, inverter, power steering, and transaxle. Sad to say, we have not found a Gen III taxi driver willing to collaborate on getting some metrics.

As for software bugs, my other rule of thumb is that moving a body of software from one computer system or language to another will often expose previously hidden errors. It can be as simple as a language change or changing processors and OS. So this weekend, a latent defect in a program was found by integration and test of the Perl code on Solaris 10 and Red Hat. Code that appeared to work on one failed on the other, but it was a legitimate bug. Just one system did not exhibit the failure symptom.

Bob Wilson

kbeck · Oct 29, 2013

jdcollins5 said: ↑

^Is that really the best answer you can give?

Here's a better one: I spent a good chunk of last night going through the couple-hundred-page in-court transcript of Mr. Barr's testimony in front of the jury. Barr Testimony. That includes his testimony elicited by the friendly plaintiff lawyer and the cross-examination by the not-so-friendly defendant lawyer. He blew the latter away.

My hair is on fire. Forget cosmic ray events for the moment:

Stack was 91% full, semi-worst case, with recursive functions present but not called. No stack overflow check code. And the OS code/data area that controls which subroutines run is located just above the stack. Holy *****. Analysis showed that if the stack overflowed, the main "X" (they called it that) subroutine would stop, period. Toyota made a goof-ball error on Day 1 and did not account for the OS stack usage. (What!?!)

They had fun stopping this routine with a car on a dynamometer. Car runaway resulted and, initially, scared the heck out of the tester.

The watchdog timer function was called abysmal, a characterization Toyota did not challenge. From my reading of the unchallenged testimony, the primary reason it didn't work was that Toyota had overloaded this processor with too much to do. As a result, functions that might have safely reset the ECU were not present - had they been present, the ECU wouldn't have been shippable.

Error codes from the OS were ignored. (Safety critical function - they're ignoring error codes!!!! @#$!%)

DTC codes for problems with the throttle control were set by the same routine that controls the throttle - so, when the throttle process stopped, no DTC codes were saved. Worse - analysis and testing proved that, even when DTC codes should have been present in some cases (in the airbag system), they weren't. Bugs.

The safety CPU that was supposed to check on the main CPU was faulty. On the one fault that was exercised by the Barr group, said CPU inadvertently (i.e., not by design) detected the fault. And, in this one case, could stall the engine - but only if the driver came all the way off the brake, then reapplied the brake, then waited a few seconds. Urghgh.

No MR system at Toyota. Period. I cannot overstate just how flat-out evil this is when running a software development operation. This could easily have been challenged by Toyota's lawyer, but was not.
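
On the watchdog item above: one common way to make a watchdog notice a dead task is to service the timer only when every monitored task has checked in during the current cycle. Here is a minimal Python sketch of that idea. The class name, method names, and task names are hypothetical, and this is a toy model of the general technique, not a description of what is in Toyota's ECU.

```python
class TaskWatchdog:
    """Toy model: the watchdog is serviced only if every task has checked in."""

    def __init__(self, task_names):
        self._alive = {name: False for name in task_names}

    def check_in(self, name: str) -> None:
        # Each task calls this once per cycle to prove it is still running.
        if name not in self._alive:
            raise KeyError(name)
        self._alive[name] = True

    def service(self) -> bool:
        """Return True if the hardware timer would be kicked this cycle.

        False means some task went silent; in a real system the un-kicked
        hardware timer would then expire and reset the processor.
        """
        healthy = all(self._alive.values())
        # Clear the flags so every task must check in again next cycle.
        for name in self._alive:
            self._alive[name] = False
        return healthy


wd = TaskWatchdog(["throttle", "diagnostics"])
wd.check_in("throttle")
wd.check_in("diagnostics")
print(wd.service())   # both tasks reported in this cycle

wd.check_in("diagnostics")
print(wd.service())   # the throttle task never reported in
```

The point of the design is that a watchdog kicked from a timer interrupt, or from one task that happens to survive, tells you nothing about whether the task that matters is still running.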

There's other fun stuff in there. One of the worst: NASA asked Toyota if the CPU had ECC (error-correcting code) memory. Toyota responded with a "yes". Toyota lied. As a result, NASA did not analyze a whole slew of possible failure modes. On the report, eventually made public, Toyota redacted every mention of ECC.

At the time, ECC processors were more expensive than non-ECC processors. But ECC processors are much, much better at detecting cosmic ray events. Urgh. Double Urgh. My hair is on fire.
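
For anyone who hasn't seen how an error-correcting code catches a flipped bit, here is a toy Hamming(7,4) example in Python. It is only a sketch of the principle: real ECC memory does this in hardware with wider codes, and nothing here reflects the actual parts in any Toyota ECU. The function names are made up for the example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    # Codeword layout (1-indexed positions): p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the position of the flipped bit.
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1   # flip it back
    return [c[2], c[4], c[5], c[6]], error_position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                         # simulate a single-bit upset
print(hamming74_decode(word))        # data recovered, flipped position reported
```

This toy code corrects any single flipped bit per codeword but will miscorrect a double flip; that is the "orders of magnitude, but not zero" behavior of ECC in miniature.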

I take back my comment from before that Barr had been doing cut-and-paste in his 800 page report. From the sounds of it, he didn't need to - there were that many errors.

By the way: Barr was editor-in-chief of Embedded Systems Programming magazine, has written three books on the subject, and spent 1.5 years pulling teeth out of Toyota's code, with three to six helpers.

It may take me a day or three to calm down. Read the testimony.

KBeck

a_gray_prius said: ↑

This is demonstrably false. There are repair mechanisms in all cells which repair many, many defects in the genome and checkpoints that prevent cell division until these defects are fixed (although some beyond repair will undergo apoptosis). However, these processes don't always work. It's clear that they cleaned out all the "shallow bugs" but the deep, rare ones are always going to be hard to find in any software.

Fine. I agree that there are repair mechanisms. I'll agree that they fix a large percentage of errors introduced by $RANDOM problem, be it free radicals, cosmic rays, or whatever.

But I'll also posit that they don't fix everything. If these processes don't fix everything then there are defects that remain. The implication is that if the uncorrected damage happens in the ovaries/testes then that damage will be present in the genome of an embryo.

As it happens, the stuff I use around here is heavy on error-correcting codes. Error-correcting codes can reduce the error rate by orders of magnitude - but they can't make it zero, which is very similar in overall function to Life.

There are examples of highly rad-hard bacteria with multiple copies of their own genome and genome correction processes that make what humans (and most of the rest of DNA/RNA based life) do look silly. Said bacteria have few competitors in their environment, for obvious reasons. Put said bacteria in a nicer environment and they get out-competed by other bacteria that don't have to put out the same effort to just stay alive. Likewise, our correction processes are Good Enough - but not as good as said bacteria's. Hence, mutations (rarely) and cell death (a lot more likely). I've read in multiple places that it takes, on average, five attempts to get a viable embryo, and it's DNA defects that kill off the misses.

KBeck

hill · Oct 29, 2013

Many Cars Have Over a Million Lines of Code | MIT Technology Review

.

bwilson4web · Oct 29, 2013

kbeck said: ↑

Here's a better one: I spent a good chunk of last night going through the couple-hundred-page in-court transcript of Mr. Barr's testimony in front of the jury. Barr Testimony. That includes his testimony elicited by the friendly plaintiff lawyer and the cross-examination by the not-so-friendly defendant lawyer. He blew the latter away.

My hair is on fire. Forget cosmic ray events for the moment:
. . .

Your summary is enough to get my hair 'smoldering' and I haven't read the PDF, yet. Being an old guy, I'll print it up later tonight. I don't like to tie up a printer during the business day.

You've done good!!

Anything on the full report?

Thanks,
Bob Wilson

austingreen · Oct 29, 2013

hill said: ↑

Many Cars Have Over a Million Lines of Code | MIT Technology Review

.

We did find out during the congressional testimony that Toyota did not actually do the type of testing some of us do on mission-critical software.

Lines of code is a bad metric. A good metric would be no logged incidents. Toyota until very recently refused to properly log information and to read it. We therefore should have severe doubts about any investigations with black boxes that do not record properly or are not read. IIRC all 2012 and later Toyotas have proper logging, and we can see if we get incidents. I am absolutely sure the software has changed, which means this does not validate the software in older cars.

kbeck · Oct 29, 2013

austingreen said: ↑

We did find out during the congressional testimony that Toyota did not actually do the type of testing some of us do on mission-critical software.

Lines of code is a bad metric. A good metric would be no logged incidents. Toyota until very recently refused to properly log information and to read it. We therefore should have severe doubts about any investigations with black boxes that do not record properly or are not read. IIRC all 2012 and later Toyotas have proper logging, and we can see if we get incidents. I am absolutely sure the software has changed, which means this does not validate the software in older cars.

Read the testimony.

Barr's group has proof that the logging doesn't actually work in all cases.

The main, "X" process, which controls throttle position, was nicknamed by his group the "kitchen sink" process, in that a ton of stuff was in there - including sending DTC codes to the black box. (Not present on the 2005 Camry that was the subject of the lawsuit, but was present on the 2008 Camry code that he also inspected.)

A single bit flip of a particular unprotected RAM location could, can, and apparently has, in the field, stopped the "X" process in its tracks. At which point, on these Camrys:

The throttle is stuck in whatever position it was in before the bit was flipped. If one was accelerating, one keeps on accelerating.

The badly designed watchdog does not figure out that the blame throttle control process has died.

The badly designed supervisory CPU does not figure out that the blame throttle control process has died or, for that matter, even pay attention to the fact that the throttle is open and the driver is trying to brake the car.

The bit in question is just above the stack space. So, a stack overflow, which is very possible, given that Toyota badly underestimated how much stack space they needed, and, additionally, used recursive functions that fill up stack space dramatically, is highly likely.

There is no ECC RAM, which would have detected such a cosmic ray bit flip and led to an ECU reset. (ECC RAM is present in later versions of the Camry.) However, given that Toyota was not mirroring OS variables (of which that bit was one), ECC RAM would not have helped in the case of a functional stack overflow condition.
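
Mirroring a critical variable, as mentioned in the last point, generally means storing a second copy (often the bitwise complement) and checking the pair on every read. Here is a minimal Python sketch of that technique. The class name and the 8-bit width are assumptions for illustration; it says nothing about how Toyota's code or its OS actually laid out data.

```python
class MirroredByte:
    """An 8-bit value stored alongside its bitwise complement (hypothetical sketch)."""

    def __init__(self, value: int):
        self.write(value)

    def write(self, value: int) -> None:
        if not 0 <= value <= 0xFF:
            raise ValueError("value must fit in 8 bits")
        self.primary = value
        self.mirror = value ^ 0xFF          # inverted copy

    def read(self) -> int:
        # A healthy pair always XORs to all ones.
        if (self.primary ^ self.mirror) != 0xFF:
            raise RuntimeError("mirror mismatch: corruption detected")
        return self.primary


task_enabled = MirroredByte(0b0000_0001)
print(task_enabled.read())

task_enabled.primary ^= 0b0000_0001         # simulate a single bit flip in RAM
try:
    task_enabled.read()
except RuntimeError as err:
    print(err)                              # the caller would now go to a fail-safe state
```

The check catches a flip in either copy but cannot tell which copy is right, so the sensible response is a fail-safe action such as a reset rather than carrying on.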

It's not the lines of code metric. Barr's group ran other tools to determine the complexity and testability of the code. Of the various piles of code they examined, a large number dinged up as "untestable" and a smaller, but critical and very scary pile dinged up as "unmaintainable".

And what has my hair on fire is that there was no MR system for tracking errors.

There is no, I repeat, no excuse for software development practices as listed above on man-safety critical hardware and software. This was simply pure, organization-driven massive stupidity. Toyota deserves every bit of the bitch-slapping they are about to get from other injured, maimed, and the estates of dead plaintiffs. And the people who allowed this to occur should be fired - but I don't expect that that will happen in insular Japan.

If you guys want to see some real furor, skip the EE Times article and go to EDN magazine. There's a lot more coverage and some really pissed-off embedded systems programmers hanging out there.

KBeck

austingreen · Oct 29, 2013

kbeck said: ↑

Read the testimony.

Barr's group has proof that the logging doesn't actually work in all cases.


Please re-read what I wrote. I said that in 2012 they finally started logging well; we know that they likely logged badly on purpose in the mid 2000s. That came out in the congressional testimony.

I know your hair is on fire, I'm not trying to put it out ;-) but explain to the non-technical people out there that lines of code are a bad metric. Poor logging is a symptom of likely bugs in the code. You pointed out analysis of sloppy programming, also an area that could likely cause problems. Do we know if these practices led to any of the fatalities? No! But they certainly would leave doubt in my mind that Toyota was truthful, and lead to findings against the company.

FL_Prius_Driver · Oct 29, 2013

First, thanks for the explicit details. Discussing facts is so much more productive than discussing opinions.

kbeck said: ↑

And the people who allowed this to occur should be fired - but I don't expect that that will happen in insular Japan.....

The engineering culture of Japan is almost entirely inverted from the engineering culture of the US of A. Specifically, for most of my career, "manufacturing" engineering is considered what you do if you cannot do "real" engineering. As a telling example, I interviewed a whole series of HW engineers and ranked them. The HR department in turn passed the resumes of the top third to the RF engineering department, the middle third to the Digital Engineering department, and the bottom third to the Manufacturing department. It's total garbage, but very hard to change.

Meanwhile in Japan, Manufacturing Engineering is considered the top of the pyramid and software engineering is at the bottom. Don't expect any Software powerhouses to come out of Japan. Toyota is probably the extreme of this culture since they view themselves as a manufacturing company. This is strongly based on the legacy of Taiichi Ohno. Following in his footsteps is the Japanese engineering ideal there. As for a Japanese software legend, the pickings are very slim.

fuzzy1 · Oct 31, 2013

kbeck said: ↑

Read the testimony.

Barr's group has proof that the logging doesn't actually work in all cases.

The main, "X" process, which controls throttle position, ...
A single bit flip of a particular unprotected RAM location could, can, and apparently has, in the field, stopped the "X" process in its tracks. ...


I haven't yet been able to read the testimony, but the discussion in this thread still leaves a glaring omission.

Alleged SUA victims were claiming that the service brake failed. The NASA report stated:

The NESC team did not find an electrical path from the ETCS-i that could disable braking.

Has this changed? If the electronic throttle control fails in a WOT state (or other bad condition) the way the old mechanical controls could, is there a path that could disable the service brake? Or can the transmission be stuck in Drive, unable to shift to Neutral, with enough engine power to override the brake?

Sure, the throttle control firmware needs much better design. But my hair isn't going to be set on fire if the brakes still work as intended, especially when the faulty electronic controls cause fewer engine runaways than did the old mechanical controls.

bwilson4web · Oct 31, 2013

I am only half-way through the testimony and wanted to share some progress and early impressions:

Barr's testimony alone would make an excellent introduction to the art of programming. I'm looking forward to buying his book.

He introduces and explains various bugs and reports that they were 'found in the code' but not demonstrated.

Perhaps static code analysis has advanced far enough that this now works. My past experience with code testing software (40 years ago?) was not positive.

His testimony would be more compelling IF he said 'we replicated this problem in 1-2 cases' with the debugger.

The first demonstration with the Camry showed the effect of "task x" being manually killed. The symptoms, and the mitigation details about 'riding the brake,' were impressive.

I am bothered that they manually killed the task. I'll be looking for a description of how task failure could be replicated in operation.

Implicit in the testimony is that both the 'watchdog' and the monitor processor failed to detect the task x failure and mitigate the loss. I look forward to how these were tested.

Like I mentioned, static code tools and techniques 40 years ago were underwhelming. So I prefer seeing faults replicated or detected using a debugger or other tool. Still, I quite agree with your recommendation and found it good, not as perfect as I might want but good enough:

kbeck said: ↑

Read the testimony. . . .

I will probably get his book.

Bob Wilson

ps. The code testing software I remember seemed to be a lot of 'smoke' but not so much heat. I like using compiler-based flags, but 3rd-party code analyzers, well, I probably need to do some research. Now if they could just fix the famous web site . . .
