| Thread ID: 103929 | 2009-10-11 02:04:00 | AirNZ + DR | somebody (208) | PC World Chat |
| Post ID | Timestamp | Content | User |
| 819149 | 2009-10-11 21:45:00 | I have a bit of an inside line... I am 80% sure they did switch to DR and it just took longer than expected. Looks like they run some kind of Unix and had issues with disks getting full, and some hardware faults too. | robsonde (120) |
| 819150 | 2009-10-11 23:19:00 | Poor IBM: www.stuff.co.nz It wasn't that many years ago that they were slammed for the INCIS debacle - now this. | somebody (208) |
| 819151 | 2009-10-12 00:25:00 | Quote (somebody): "Poor IBM: www.stuff.co.nz It wasn't that many years ago that they were slammed for the INCIS debacle - now this." It's very easy to blame the provider, isn't it? So often it comes down to the SLAs in place, as they (should) explicitly spell out where the vendor's responsibilities start and stop. Personally, I'd never rely on IBM here in NZ - and all our servers are IBM, and most of our desktops too. Even if the server drops a disk and the tech orders a replacement, it has to come out of Aussie, and can still take a day or two. An operation of that size should have all the N+1 infrastructure they need. And if Robsonde's inside word is true, I wonder how often they test things to simulate an outage? | nofam (9009) |
| 819152 | 2009-10-12 00:55:00 | Isn't it n-1? | shermo (12739) |
| 819153 | 2009-10-12 01:21:00 | Quote (shermo): "Isn't it n-1?" No - N+1 commonly refers to clustered environments, where N is the number of nodes in the cluster, plus 1 acting as a hot-spare standby; should one of the nodes fail, the hot spare takes over (usually in a secondary role). In High-Availability clustering it often takes over near-instantly, with little or no data loss, hence the term 'sub-second' switching. In DR plans it usually means having redundancy (spares) for every known point of failure in your infrastructure. N-1 refers to RAID arrays, and the relative space efficiency of each RAID type. From Wikipedia: space efficiency is given as the amount of storage space available in an array of n disks, in multiples of the capacity of a single drive. For example, if an array holds n=5 drives of 250GB and efficiency is n-1, then available space is 4 times 250GB, or roughly 1TB. | nofam (9009) |
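A minimal sketch of the two ideas in the post above, using only the 5 x 250GB figures from the quoted Wikipedia example (nothing here reflects what Air NZ actually runs): N+1 counts nodes, i.e. the N machines needed to carry the load plus one hot spare, while n-1 describes RAID-5 space efficiency, where roughly one drive's worth of capacity is consumed by parity.

```python
# Illustrative only: the two "plus/minus one" ideas from the post above.
# Figures are the 5 x 250 GB example quoted from Wikipedia, nothing Air NZ-specific.

def cluster_nodes_to_provision(nodes_needed_for_load: int, spares: int = 1) -> int:
    """N+1: provision one hot-spare node on top of the N nodes the load requires."""
    return nodes_needed_for_load + spares

def raid5_usable_capacity_gb(drives: int, drive_size_gb: float) -> float:
    """RAID-5 space efficiency is n-1: one drive's worth of capacity goes to parity."""
    if drives < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    return (drives - 1) * drive_size_gb

if __name__ == "__main__":
    # N+1 clustering: 4 nodes carry the load, a 5th sits as hot spare.
    print(cluster_nodes_to_provision(4))        # -> 5

    # n-1 RAID efficiency: 5 x 250 GB drives leave ~1 TB usable.
    print(raid5_usable_capacity_gb(5, 250))     # -> 1000.0, i.e. roughly 1TB
```

Running it prints 5 and 1000.0: five nodes to provision for a four-node load, and roughly 1TB usable from five 250GB drives, matching the example in the post.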
| 819154 | 2009-10-12 01:24:00 | Quote (nofam): "No - N+1 commonly refers to clustered environments, where N is the number of nodes in the cluster, plus 1 acting as a hot-spare standby; should one of the nodes fail, the hot spare takes over (usually in a secondary role). In High-Availability clustering it often takes over near-instantly, with little or no data loss, hence the term 'sub-second' switching. In DR plans it usually means having redundancy (spares) for every known point of failure in your infrastructure. N-1 refers to RAID arrays, and the relative space efficiency of each RAID type. From Wikipedia:" Man, that's mega geek stuff, nofam. Why can't they just control-alt-delete the thing and go back to a backup and start again? | prefect (6291) |
| 819155 | 2009-10-12 01:34:00 | Quote (prefect): "Man, that's mega geek stuff, nofam. Why can't they just control-alt-delete the thing and go back to a backup and start again?" Yeah sorry man - when I re-read it, it was. But it's a pretty precise concept, and really neat when it works. We're just commissioning a Red Hat Linux cluster here; very cheap to set up. We looked at the IBM HA systems for AIX (their version of Unix - pretty sure Air NZ were mentioned as current users), but from memory they started at around $250k :eek: | nofam (9009) |
| 819156 | 2009-10-12 01:46:00 | Quote (nofam): "No - N+1 commonly refers to clustered environments, where N is the number of nodes in the cluster, plus 1 acting as a hot-spare standby; should one of the nodes fail, the hot spare takes over (usually in a secondary role). In High-Availability clustering it often takes over near-instantly, with little or no data loss, hence the term 'sub-second' switching. In DR plans it usually means having redundancy (spares) for every known point of failure in your infrastructure." It'll be very interesting to hear what sort of setup AirNZ have, given the huge financial consequences of any downtime (although I suspect we'll never find out). | somebody (208) |
| 819157 | 2009-10-12 02:24:00 | Quote (somebody): "It'll be very interesting to hear what sort of setup AirNZ have, given the huge financial consequences of any downtime (although I suspect we'll never find out)." Agreed - it's a PR disaster for a company already unpopular with a lot of people. My guess: their backend fell over, the backup system tried to come on and failed, techs were scrambling to find the cause, the cause was a bit of a gray area in terms of the SLA, IBM were a little slow to respond, and eventually the systems were only able to be brought back online in a piecemeal fashion. It's a very complex setup they have, no doubt, and even if the backend servers came up reasonably quickly, there could've been a host of networking issues to contend with; even if the hardware layer was unaffected, the next few layers up could've needed reconfiguring. But there are solutions and systems available to prevent this. I hope they open the books on it and produce a whitepaper so we can all learn from it. Think I feel a Tui billboard coming on... :rolleyes: | nofam (9009) |
| 819158 | 2009-10-13 06:28:00 | They were doing UPS maintenance, had bypassed the UPS, and had the data centre on generator power. The generator died (about 9:20am), and the servers died. Power was restored quite quickly (about 9:30am), but clean-up of the servers took longer than expected, so management called for DR about 10:30am. Prod clean-up took till 10pm Sunday; not sure when they switched back from DR. FYI, I don't work for Air NZ or IBM, but I do have friends in the right places to get an overview of what went wrong. | robsonde (120) |