
Been there, done that, stinkin T-shirts

Discussion in 'Fred's House of Pancakes' started by bwilson4web, Dec 12, 2014.

  1. bwilson4web

    As soon as Danny said 'database,' I had a lot of sympathy for what he has been through.

    Earlier this year, one of our remote systems had a database problem that could not be repaired. We could get some reports but one that reported on the health of that system stopped working. So I scheduled a repair action to wipe and restore only to discover the database backup would not restore. There was no helping it. So we had to restart from day zero.

    Today I have another system, #132, whose database backup won't restore. It is polling, recording data, and reporting, but an accumulation of soft errors means it only appears to work and cannot be trusted. Worse, there is an unrelated issue that requires an upgrade to the next version. The upgrade requires a wipe and reload, but we know the backup won't restore.

    Fortunately, I have a parallel system, #58, that we wiped and restarted in August when the restore from #132 failed. Since then I have moved production over to the parallel system. Over the coming Christmas and New Year's holidays, I will wipe and reload the older system, #132, and try to restore it from the #58 backup. If the restore works, great. If not, we'll run them in parallel starting in January 2015.

    The true test of a database is restoration of the backup to a fully operational system. If the restore does not work, a database may look OK but it is just waiting for a Murphy moment . . . and the joy of wearing the same T-shirt for a couple of days.
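A restore test like the one described above can be sketched in a few lines. This is only a minimal illustration, assuming a SQLite database and Python's standard `sqlite3` module; the systems in this thread are not described in enough detail to know what they actually run. The names `table_counts`, `restore_test`, `demo`, and the `readings` table are made up for the example.

```python
import os
import sqlite3
import tempfile


def table_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}


def restore_test(live, backup_path):
    """Back up `live` to a file, reopen that file cold, and compare it to the original."""
    # Step 1: take the backup (hypothetical stand-in for the real backup job).
    dest = sqlite3.connect(backup_path)
    live.backup(dest)
    dest.close()

    # Step 2: "restore" by opening the backup file as a fresh database.
    restored = sqlite3.connect(backup_path)
    try:
        # Step 3: the restored copy must pass SQLite's own integrity check
        # and hold the same tables and row counts as the live system.
        healthy = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        return healthy and table_counts(restored) == table_counts(live)
    finally:
        restored.close()


def demo():
    """Build a tiny sample database and run the restore test against it."""
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
    live.executemany("INSERT INTO readings (value) VALUES (?)",
                     [(1.5,), (2.5,), (3.5,)])
    live.commit()
    with tempfile.TemporaryDirectory() as tmp:
        return restore_test(live, os.path.join(tmp, "backup.db"))
```

Comparing row counts is a weak check; a real test would also run the reports that matter against the restored copy, since a backup that restores but cannot produce the health report is still a failure.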

    Bob Wilson

    ps. For two years, I have requested enough disk storage to be able to collect and quality check the databases. But it wasn't until noon today that I finally got a 'left-over' system with enough disk space to do quality tests. Starting in 2015 I'll be able to do database quality checks before the Murphy moment.
     
    #1 bwilson4web, Dec 12, 2014
    Last edited: Dec 12, 2014
  2. Reminds me of something that happened to Fastmail a while back. They were prepared for the rare event of 2 hard drives failing in a short period of time... and what happened? 3 hard drives failed in a row.

    http://blog.fastmail.com/?s=%22Server+4%22
     
  3. ny_rob

    Sucks to be the IT guy when things go south!
    Been there- done that :mad:

    Whenever they ask me at work, "Why are you replacing those hard drives? Don't you make backups?", I tell them I do make backups and drive images (and keep spare complete working motherboards populated with CPU & RAM, PSUs, etc.). Then I usually mention that I wouldn't stake my job on a complete restore actually working 100%. I have seen too many restores fail for one reason or another to trust them 100% on critical systems.

    Wasn't it Fastmail that had a complete outage for a full day when that server co-location facility went offline a few years back?
     