Computers and disk arrays
- crfriend
- Master Barista
- Posts: 15173
- Joined: Fri Nov 19, 2004 9:52 pm
- Location: New England (U.S.)
- Contact:
Last week, one of the disk drives therein went into "predictive failure". Now, for those in the know, "predictive failure" is the equivalent of the little red "oil pressure" light on one's car: it's already too late, and you're hosed (as, likely, is your motor). Duly, the OS picked up on this, and tried to perform the swap to the hot-spare. This worked for a while until the machine ran across a "miscorrected data error", gave up the rebuild, and faulted the array. This is not supposed to happen with professional-level kit, which this is, albeit rather "aged" (those who know me know my fondness for old iron) at about ten years of age. After the second failure, the system flagged one filesystem unsafe, errored it, and crashed the database that was running on the host. Collateral damage was the monitoring system that keeps me informed of the overall health of my home computing environment and the spam-filters.
So, here I am with an array that the system is convinced is unfixable, and every time I try to rebuild it it balks. OK, I think, I just need to swap the *other* bad disk and rebuild the array from scratch. Where's the backup? You got it; I didn't have one -- I don't have any sort of media that can back up 167 GB in any sane length of time, much less restore it. That's why I was conservative and allocated a hot-spare instead of grabbing the extra 18 gigs of space for storage. {Insert unprintable commentary here.}
"I'm not dead yet!" Sure enough, after a reboot of the system, the status page listed a dead drive (the wrong one, of course) but the one that "miscorrected" an error showed up as OK. The logs all showed the error to be in the same blasted block so I went digging. I hammered on the same block time after time and tried to get it to correct the error successfully (A symptom of insanity is performing the same behaviour and expecting a different outcome; I learnt this from Windows, but with Windows it usually works.). Clearly it was time to take a different tack, so I teased the filesystem -- and the RAID setup -- apart and found that the error block was in an unused extent of the filesystem. Great! All I need to do is run the array in degraded mode, copy everything off, and then rebuild it from scratch. How much free space do I have on all the other systems in the house to back up this 167 gig monster? 122 gigs total. {Insert unprintable commentary here.}
Now what? How much stuff can I afford to lose from the array (i.e. how much of it do I have elsewhere)? Well, it turns out I could lose most of my music collection because I back-packed that to work on a laptop quite some time ago, and I'd been keeping the two collections in sync. That got me below the free-space number in the house and I spent the past several days copying everything off the failed array I could -- and then double-checked to make sure I'd gotten everything.
"# metaclear -f d127". For those who have issued such a command, you understand the need to have "all your ducks in a row" before pulling the trigger. The array -- and all the data thereon -- vanished without so much as a whimper.
Since I was dealing with several potentially-failing disks, I figured I'd just replace the lot of 'em and be done with. Try finding 18-gig disks these days. One can't, so I replaced them with 72-giggers (which actually hold about 67 gig -- one has to love marketing types) and decided to go with a modern RAID system.
Step 1 -- grab the array and give it a thorough cleaning (unscrewing the attachment points for the cable instead of the cable in the process).
Step 2 -- replace the cable-attachment points {insert assorted cursing here}
Step 3 -- install the new drives (and a hot-spare 18-gigger for the system's internal drives) and see if they all spin up. Miracle of miracles, they all do.
Step 4 -- take it all back downstairs again and hook it up
Step 5 -- create the main RAID entity -- double-parity this time, and a hot-spare; I do not want to have to do this again for several years.
Step 6 -- create all the filesystems and get them mounted in the right places.
Step 7 -- restore all the data. This is still in progress, and I'm hoping we don't take a power-cut. I figure this'll take about 2 days over the 10Mb/s network I have in the house.
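The two-day estimate in Step 7 holds up against back-of-envelope arithmetic. Here is a minimal sketch, assuming roughly 122 GB to move (the free-space figure quoted earlier, taken as an upper bound), decimal units, and a guessed 60% effective use of the 10Mb/s link. The function name and the efficiency factor are illustrative assumptions, not measurements.

```python
def transfer_hours(size_gb, link_mbit_per_s, efficiency=1.0):
    """Hours needed to move size_gb (decimal gigabytes) over a link."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_gb * 1e9 * 8
    usable_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    return bits / usable_bits_per_s / 3600


if __name__ == "__main__":
    # 1.0 = theoretical wire speed; 0.6 = assumed protocol and disk overhead.
    for eff in (1.0, 0.6):
        hours = transfer_hours(122, 10, eff)
        print(f"efficiency {eff:.0%}: {hours:.1f} h ({hours / 24:.1f} days)")
```

At wire speed this comes to about 27 hours, and at the assumed 60% about 45 hours, which is in line with "about 2 days".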
Film at 23:00, perhaps in three days' time. I love computing!
Re: Computers and disk arrays

I have a small Seagate FreeAgent Go which is powered via any available
USB port. It holds 320GB and can back up my 2 80GB drives in about
10 minutes. Now this is files only--no programs. The unit was about
$80.00 + tax. You may want to investigate this as an 'option' for you.
Uncle Al



Grand Musician of the Grand Lodge, I.O.O.F. of Texas 2008-2025
When asked 'Why the Kilt?'
I respond-The why is F.T.H.O.I. (For The H--- Of It)
- crfriend
Re: Computers and disk arrays
Uncle Al wrote: Carl - I feel for ya'!
Thanks, but at the moment I have more pressing details on my mind than computers. I just put a whole load of what I thought was potting soil into Sapphire's "pallet garden" (it said "Garden Soil" on the label). Sapphire informs me that it was manure. I bite my fingernails. Use your imagination.
On the computer front, the database engine is back on-line and is performing well. The lightweight virtual-machines that I use to do software development and SkirtCafe prototyping are in the process of restoring and should be hale in another couple of hours. My vast collection of archived GOES images is restoring, as are the animations therefrom; I have animated satellite imagery from back before Hurricane Katrina. Also in process is my collection of weather observations going back to about 2002 with one-hour granularity. Music comes next, followed by the various "backup" spaces and my own home directory.
At least I'm not going to be at risk of being swept overboard with this as I've got my backside firmly planted in a chair instead of hanging onto a mast in 35-knot winds trying to furl a sail, which is what happened yesterday.
A chat with another good pal of mine points up that USB drives in the Terabyte range are available for reasonably short money. Since the "new and improved" array is slightly larger than half a terabyte, two (or even four) of those, rotated weekly and stored off-site, sound like a good idea.
-
- Member Extraordinaire
- Posts: 170
- Joined: Tue Oct 21, 2003 10:09 pm
- Location: Mountain View, CA
Re: Computers and disk arrays
Not sure what system you are using but " # metaclear -f d127 " I recognize as a Solaris Volume Manager command.
Sounds like you fell in the " raid5 write hole". Have you looked into Solaris 11 ZFS RAID-Z system? Here is a quick overview: https://blogs.oracle.com/bonwick/entry/raid_z. Of course there is a gotcha: the problem is doing the whole pool backup and restore. Solaris 11 is free to download for development, testing, etc., but not free in a production environment. It also needs newer hardware to run, your 5-20 year old beater pc will not work.
--Brandy
- crfriend
Re: Computers and disk arrays
Brandy wrote: crfriend;
Not sure what system you are using but " # metaclear -f d127 " I recognize as a Solaris Volume Manager command.
It's Solaris 10 Update 8 using SVM. Good call.
Brandy wrote: Sounds like you fell in the " raid5 write hole".
It could be. There was a potentially contributing event just preceding the entire fiasco where one of our cats managed to knock the cable leading from the array to the DVD/ROM drive asunder which momentarily hosed the termination. No, we did not take the cat to the local Chinese restaurant for dinner.
However, I could provoke the error by poking at a single drive in the array, and this indicates, to me, a drive failure.
Brandy wrote: Have you looked into Solaris 11 ZFS RAID-Z system?
The new setup is ZFS raidz2. ZFS (Zettabyte File System, for the uninitiated) is new to me, so when I first built the array I used what was familiar to me with the intent of learning ZFS later. "Later" has arrived.
Brandy wrote: It also needs newer hardware to run, your 5-20 year old beater pc will not work.
The iron in question is a Netra T1 - 105 with a half-gig of mainstore. The mainstore will be getting upped to a full gig in the coming week; it turns out that ZFS is a bit of a memory-pig, and with it active I cannot boot either of the zones that contain (1) the prototype I use for SkirtCafe upgrades and (2) my Icinga development environment.
As far as Solaris 11 goes, the newer kit is (1) too expensive for my budget, (2) too restricted in what one can do with it, and (3) the newer hardware is so loud that I would not want it running in the room with me.
Oracle buying Sun was just a tragedy, and seems to be further fuelling folks' departure from Solaris-atop-SPARC in favour of Linux-atop-Intel. That's certainly the case where I work, even though the Solaris systems are virtually bullet-proof. I suspect that the purchase was part of Larry Ellison's fantasy of out-doing IBM: Bauxite and sand in one end and full-featured "appliances" out the other. Unfortunately, that means that everybody else who had an interest in the environment is going to suffer and, ultimately, go elsewhere.
- crfriend
Re: Computers and disk arrays
The first hint:
Code: Select all
Jun 3 07:38:18 t1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1,1/scsi@2/sd@8,0 (sd7):
Jun 3 07:38:18 t1 Error for Command: read(10) Error Level: Informational
Jun 3 07:38:18 t1 scsi: [ID 107833 kern.notice] Requested Block: 27425122 Error Block: 27425122
Jun 3 07:38:18 t1 scsi: [ID 107833 kern.notice] Vendor: IBM-PSG Serial Number: 01440040UCH5
Jun 3 07:38:18 t1 scsi: [ID 107833 kern.notice] Sense Key: Soft Error
Jun 3 07:38:18 t1 scsi: [ID 107833 kern.notice] ASC: 0x5d (LUN failure prediction threshold exceeded), ASCQ: 0x2, FRU: 0x0
Code: Select all
Jun 3 09:19:47 t1 scsi: [ID 365881 kern.info] /pci@1f,0/pci@1,1/scsi@2 (glm0):
Jun 3 09:19:47 t1 Cmd (0x30008e44d60) dump for Target 8 Lun 0:
Jun 3 09:19:47 t1 scsi: [ID 365881 kern.info] /pci@1f,0/pci@1,1/scsi@2 (glm0):
Jun 3 09:19:47 t1 cdb=[ 0x28 0x0 0x1 0x2a 0x22 0x62 0x0 0x0 0x20 0x0 ]
Jun 3 09:19:47 t1 scsi: [ID 365881 kern.info] /pci@1f,0/pci@1,1/scsi@2 (glm0):
Jun 3 09:19:47 t1 pkt_flags=0x4000 pkt_statistics=0x61 pkt_state=0x7
Jun 3 09:19:47 t1 scsi: [ID 365881 kern.info] /pci@1f,0/pci@1,1/scsi@2 (glm0):
Jun 3 09:19:47 t1 pkt_scbp=0x0 cmd_flags=0x8e1
Code: Select all
Jun 3 09:19:50 t1 Error for Command: read(10) Error Level: Retryable
Jun 3 09:19:50 t1 scsi: [ID 107833 kern.notice] Requested Block: 19538530 Error Block: 19538530
Jun 3 09:19:50 t1 scsi: [ID 107833 kern.notice] Vendor: IBM-PSG Serial Number: 01440040UCH5
Jun 3 09:19:50 t1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Jun 3 09:19:50 t1 scsi: [ID 107833 kern.notice] ASC: 0x29 (bus device reset message occurred), ASCQ: 0x3, FRU: 0x0
Jun 3 09:20:36 t1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1,1/scsi@2 (glm0):
Jun 3 09:20:36 t1 Resetting scsi bus, <null string> from (8,0)
Code: Select all
Jun 3 09:23:07 t1 md_raid: [ID 104909 kern.warning] WARNING: md: d127: /dev/dsk/c0t8d0s0 needs maintenance
.
.
.
Jun 3 09:23:09 t1 md_raid: [ID 241980 kern.notice] NOTICE: md: d127: hotspared device /dev/dsk/c0t8d0s0 with /dev/dsk/c0t2d0s0
Some time goes by, and then this pops up:
Code: Select all
Jun 3 11:03:28 t1 scsi: [ID 107833 kern.warning] WARNING: /pci@1f,0/pci@1,1/scsi@2/sd@b,0 (sd10):
Jun 3 11:03:28 t1 Error for Command: read(10) Error Level: Retryable
Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] Requested Block: 14413090 Error Block: 14413144
Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] Vendor: COMPAQ Serial Number: B0193370
Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] Sense Key: Media Error
Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] ASC: 0x11 (miscorrected error), ASCQ: 0xa, FRU: 0x0
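For anyone wading through a long messages file, the interesting fields in reports like the one above can be pulled out mechanically. This is a minimal sketch, assuming the line layout shown in these excerpts. The regular expressions, the summarise helper and the derived "offset" field are illustrative, not part of any Solaris tool.

```python
import re

BLOCKS = re.compile(r"Requested Block:\s*(\d+)\s+Error Block:\s*(\d+)")
SENSE_KEY = re.compile(r"Sense Key:\s*(.+?)\s*$")
ASC = re.compile(r"ASC:\s*(0x[0-9a-fA-F]+)\s*\((.*?)\),\s*ASCQ:\s*(0x[0-9a-fA-F]+)")


def summarise(lines):
    """Collect block numbers and sense data from one error report."""
    info = {}
    for line in lines:
        m = BLOCKS.search(line)
        if m:
            info["requested_block"] = int(m.group(1))
            info["error_block"] = int(m.group(2))
            # How far into the request the drive got before failing.
            info["offset"] = info["error_block"] - info["requested_block"]
        m = SENSE_KEY.search(line)
        if m:
            info["sense_key"] = m.group(1)
        m = ASC.search(line)
        if m:
            info["asc"] = int(m.group(1), 16)
            info["asc_text"] = m.group(2)
            info["ascq"] = int(m.group(3), 16)
    return info


# Lines copied from the sd10 report above.
SAMPLE = [
    "Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] Requested Block: 14413090 Error Block: 14413144",
    "Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] Sense Key: Media Error",
    "Jun 3 11:03:28 t1 scsi: [ID 107833 kern.notice] ASC: 0x11 (miscorrected error), ASCQ: 0xa, FRU: 0x0",
]

if __name__ == "__main__":
    print(summarise(SAMPLE))
```

On this report the error block sits 54 blocks past the start of the request, unlike the earlier reports on sd7, where the requested block and the error block are identical.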
Backups? We ain't got no backups. We don't need to show you any steenking backups!
{Insert unprintable commentary here}
A reboot of the system cleared the "Last Erred" state and at least allowed me to read data from it, and -- fortunately the luck of the Irish was with me -- the error block on disk 11 was in the middle of unallocated space. Sysadmin wipes brow in light of this.
Once I got most of the important data off the array, I decided to poke at the problem area a little bit. I used "dd" (It's supposed to mean "convert and copy", but a better mnemonic is "diddle and duplicate") to zero out the block that was reporting the error (it was all zeroes to begin with) to see if I could fix the "miscorrected error" problem. No joy. Confusing the matter further, I could make the error come and go by varying the size and length of an access. The net result was one very unamused sysadmin.
Realising that I was fighting a losing battle on this, I slurped everything else off the array, stashed it in assorted dark corners on disks on all the other systems in the house, and "forklifted" (from "forklift upgrade" -- swapping one entire device for a newer one) it.
There was one blessing in this, and that was that I got to clean the inside of the array during the disk-replacement process -- and it was filthy! We have cats, so some fur is to be expected; there are also chickens in the room with the array. Now, chickens are superb at producing an insanely fine dust that gets onto -- and into -- everything in the vicinity, and if there is any air motion it "goes with the flow". The net result was an almost completely choked-off plenum in and around the disks and on the grille that leads rearward to the power supply.
I didn't get a picture of the innards of the array, but I just took one of one of the disks that was extracted from it which I'll attach to this missive later.
So, the new array is in place with double-disk parity and a hot-spare, and I left the restores running over the course of the night. I figured all the activity would keep my laptop awake, but at 03:00 the laptop felt lonely and figured it'd go to sleep: The restores, needless to say, once their controlling session went away, terminated. I restarted them a bit after 08:00 after I got up and saw what happened. I suspect they'll run for the rest of the day; I just need to tickle the laptop a bit periodically to keep it awake -- or do the smart thing and put the sessions on a non-sleeping device.
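One way to take the laptop out of the loop is to start each restore on the server itself, detached from the login session, with its output going to a log file (much the same idea as nohup). Below is a minimal sketch, assuming a POSIX host with Python 3 available. The launch_detached helper and the log file name are made up, and the command shown is a harmless stand-in for whatever actually performs the restore.

```python
import subprocess
import sys


def launch_detached(argv, log_path):
    """Start argv in its own session so a dropped terminal does not stop it."""
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX: the child calls setsid()
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor
    return proc


if __name__ == "__main__":
    # Stand-in command; replace with the real restore job.
    job = launch_detached(
        [sys.executable, "-c", "print('restore done')"], "restore.log"
    )
    print("started pid", job.pid)
```

Because the child runs in a new session with no controlling terminal, the hangup that follows a lost login session should not reach it; progress can then be checked from any machine by reading the log.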
Re: Computers and disk arrays
Regarding the "potting soil" incident. Yes, I thought it was garden soil as well, but it sure smelled like manure and handled like manure. As for your "potty mouth", you get no sympathy from me. You've spent tons of time around horses and chickens and can't recognize the smell? Anyway, didn't your Mom teach you to wash your hands after playing in the dirt?
-------Lazarus Long
Re: Computers and disk arrays
Thanks for the details. I'm sure they either bored or lost most people, but I enjoyed seeing them. OK, Solaris 10 5/08: I have some systems at work running on that version. ZFS raidz2 for your data array should be pretty good. I'll leave the backup scheme up to you, but as mentioned, USB drives are pretty cheap these days.
Actually backing up the data pool means taking a snapshot and sending the stream to a storage device. For a restore, recreate the storage pool and then receive the stream from the storage device. Or just copy the data off to another device. There is a lot of information at OTN (Oracle Technology Network).
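The snapshot-and-stream procedure described above might look roughly like this. It is a sketch only: it assembles and prints the command lines rather than running them. The pool name pool0, the snapshot label and the /mnt/usb path are placeholders, and it assumes a ZFS release whose zfs send supports -R (replication streams). Check zfs(1M) on the actual system before relying on any flag.

```python
import shlex


def backup_commands(pool, label, stream_file):
    """Shell lines that snapshot a pool and save it as one stream file."""
    snap = f"{pool}@{label}"
    return [
        # Recursive snapshot of every filesystem in the pool.
        f"zfs snapshot -r {shlex.quote(snap)}",
        # -R includes the snapshots of all descendant filesystems.
        f"zfs send -R {shlex.quote(snap)} > {shlex.quote(stream_file)}",
    ]


def restore_commands(pool, stream_file):
    """Shell line that replays a saved stream into a re-created pool."""
    # -d keeps the dataset names recorded in the stream; -F allows the
    # receive to overwrite the freshly created pool's top-level filesystem.
    return [f"zfs receive -d -F {shlex.quote(pool)} < {shlex.quote(stream_file)}"]


if __name__ == "__main__":
    for line in backup_commands("pool0", "weekly", "/mnt/usb/pool0.zfs"):
        print(line)
    for line in restore_commands("pool0", "/mnt/usb/pool0.zfs"):
        print(line)
```

The restore line presumes the pool has already been re-created with zpool create, which is the "recreate the storage pool" step mentioned above.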
Yes, from a user's point of view Oracle buying Sun is a disaster. A former Sun, now Oracle, employee mentioned that he was happy to see the buyout, as Sun was hemorrhaging money and would shortly have been out of business.
Have you had a look at Oracle VirtualBox (https://www.virtualbox.org/)? I use it and like it, and it is free. It will run Solaris 10 or 11 as a guest.
--Brandy
- crfriend
Re: Computers and disk arrays
So, it looks like most of this saga is behind me. The boxful of 18gig drives has been replaced by a boxful of 72gig drives with double-parity and a hot spare.
Code: Select all
t1:carl >. /usr/sbin/zpool status
  pool: pool0
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool0        ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c0t2d0   ONLINE       0     0     0
            c0t3d0   ONLINE       0     0     0
            c0t4d0   ONLINE       0     0     0
            c0t5d0   ONLINE       0     0     0
            c0t8d0   ONLINE       0     0     0
            c0t9d0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
        spares
          c0t14d0    AVAIL

errors: No known data errors
t1:carl >. /usr/sbin/zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
pool0                90.9G   439G  50.2K  /pool0
pool0/Music          13.2G   439G  13.2G  /export/Music
pool0/bonnie         5.88G   439G  5.88G  /backup/bonnie
pool0/cache          50.2K   439G  50.2K  /local/squid/cache
pool0/mancini        8.25G  9.75G  8.25G  /backup/mancini
pool0/mysql-5.0.51b   640M   439G   640M  /usr/local/mysql-5.0.51b
pool0/orator         1.95G   439G  1.95G  /backup/orator
pool0/raid           6.23G   439G  6.23G  /export/raid
pool0/skirtcafe      1.75G  3.25G  1.75G  /zones/skirtcafe
pool0/syzygy         21.4G   439G  21.4G  /backup/syzygy
pool0/t1a            3.28G   439G  3.28G  /zones/t1a
pool0/www            28.4G   439G  28.4G  /var/www
t1:carl >.
One 18gig drive remained as the hot spare for the disks that are internal to the processor itself. This will become a 73 at some point, as will the two internal disks.
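The figures in the listing can be sanity-checked with a little arithmetic. This is a rough sketch, assuming each "72 gig" drive holds a nominal 72 x 10^9 bytes (the real drives may differ slightly) and ignoring ZFS metadata and reservations. The helper name raidz_usable_gib is invented for illustration.

```python
GIB = 2 ** 30  # bytes in one binary gigabyte


def raidz_usable_gib(disks, parity, disk_bytes):
    """Rough usable space of one RAID-Z group, ignoring ZFS overhead."""
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * disk_bytes / GIB


if __name__ == "__main__":
    per_disk = 72e9 / GIB  # a "72 GB" drive expressed in binary units
    usable = raidz_usable_gib(10, 2, 72e9)  # ten disks, double parity
    print(f"per disk: {per_disk:.1f} GiB, raidz2 usable: {usable:.0f} GiB")
    # prints: per disk: 67.1 GiB, raidz2 usable: 536 GiB
```

Ten disks with two disks' worth of parity leave eight for data, or about 536 GiB. The listing shows 90.9G used plus 439G available, roughly 530G, which is a little under the estimate, as one would expect once filesystem overhead is taken out.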
- floatingmetal
- Active Member
- Posts: 79
- Joined: Sun Nov 18, 2007 11:30 am
- Location: London, England
- Contact:
Re: Computers and disk arrays
If the backups had been working as they were supposed to, there would have been no problem of course.
The joys of IT...
-
- Member Extraordinaire
- Posts: 4769
- Joined: Fri Sep 17, 2010 11:01 pm
- Location: North East Scotland.
Re: Computers and disk arrays
All the computer stuff is a foreign language to me. The "garden soil", however, we call "sharn" or "dung" dependent on its colour and smell.
Believe it or not, there is an accepted Scottish tweed colour, "Sharnie Green", very descriptive and much beloved of the "landed gentry"
Steve.