GreenPages Blog


Questions Around Uptime Guarantees

Several manufacturers have recently made a splash with a “five nines” uptime guarantee, so I thought I’d provide some perspective. Most recently, I came across Hitachi’s guarantee. I quickly checked whether a few other manufacturers (e.g., Dell EqualLogic) offer the same guarantee for their storage arrays, and many do. Realistically, though, no one can guarantee uptime, because “uptime” really needs to be measured from the host or application perspective. Read on for additional factors that affect storage uptime.

Five nines (99.999%) allows 5.26 minutes of downtime per year, or 25.9 seconds per 30-day month.

Four nines (99.99%) allows 52.6 minutes per year, which is roughly one hour of maintenance.
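For anyone who wants to check the arithmetic behind these figures, here is a minimal sketch. It assumes an average year of 365.25 days and a 30-day month; the function and constant names are my own, chosen for illustration.

```python
# Downtime allowed by an availability target expressed as a number of nines.
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year with leap days
SECONDS_PER_MONTH = 30 * 24 * 3600      # assumption: a 30-day month


def downtime_seconds(nines: int, period_seconds: float) -> float:
    """Downtime budget in seconds for the period, e.g. nines=5 means 99.999%."""
    return period_seconds * 10 ** -nines


for n in (3, 4, 5):
    per_year_min = downtime_seconds(n, SECONDS_PER_YEAR) / 60
    per_month_s = downtime_seconds(n, SECONDS_PER_MONTH)
    print(f"{n} nines: {per_year_min:.2f} min/year, {per_month_s:.1f} s/month")
```

Under those assumptions, the five nines row comes out at 5.26 minutes per year and 25.9 seconds per month, and each additional nine cuts the budget by a factor of ten.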

Array controller failover in EQL and other dual-controller modular arrays (EMC, HDS, etc.) is automated to eliminate downtime, but that is really just the beginning of the story. The discussion with my clients often comes down to clarifying what uptime means. Besides uninterrupted connectivity to storage, people often associate uptime with protection from data loss (due to corruption, user error, drive failure, etc.), but that is really a completely separate issue.

What are the teeth in the uptime guarantee? If the array does go down, does the manufacturer pay the customer money to make up for downtime and lost data?

There are other array considerations that impact “uptime” besides upgrade or failover.

  • Multiple drive failures are a real possibility, since most drives are purchased in batches. How does the guarantee cover this?
  • Very large drives must be in a suitable RAID configuration to improve the chances that a RAID rebuild completes before an unrecoverable read error (URE) occurs. How does the guarantee cover this?
  • Dual controller failures do happen to all the array makers, although I don’t recall this happening with EQL. Even a VMAX went down in Virginia once, in the last couple of years. How does the guarantee cover this?
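To put a rough number on the URE concern, the sketch below estimates the chance of hitting at least one URE while a rebuild reads every surviving drive in full. It is a simplified model, not a vendor formula: it assumes errors are independent, and the URE rates (1 per 10^14 and 1 per 10^15 bits read), drive size, and drive count are illustrative assumptions rather than figures from any specific array.

```python
import math


def p_ure_during_rebuild(surviving_drives: int, drive_tb: float,
                         ure_per_bit: float) -> float:
    """Probability of at least one URE while reading all surviving drives."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8  # decimal TB to bits
    # 1 - (1 - p)^bits, computed via log1p/expm1 to limit rounding loss
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical single-parity group: 8 drives of 4 TB each, one has failed,
# so the rebuild must read the 7 surviving drives.
for rate in (1e-14, 1e-15):
    p = p_ure_during_rebuild(surviving_drives=7, drive_tb=4, ure_per_bit=rate)
    print(f"URE rate {rate:g} per bit: {p:.1%} chance of a URE during rebuild")
```

The probability climbs with both drive size and drive count, which is why very large drives are usually paired with double parity or mirroring rather than single parity.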

 

The uptime "promise" doesn't include all the connected components. Nearly every environment has something with a single path, a single point of failure (SPOF), or some other configuration issue that must be addressed to ensure uninterrupted storage connectivity.

  • Are applications, hosts, network and storage all capable of automated failover at sub-10 ms speeds? For a heavily loaded Oracle database server to keep working through a controller "failure" on a dual-controller array (which is what an upgrade resembles), it must be connected to the array via multiple paths and must use all available paths.
  • Some operating systems (Windows) don't support an automatic retry of paths, nor do all applications resume processing automatically without IO errors, outright failures or reboots.
  • You often need to make temporary changes to OS and iSCSI initiator configurations to support an upgrade, e.g. changing a timeout value.
  • Also, the MPIO software makes a difference. Dell EQL MEM helps a great deal in a VMware cluster to ensure proper path failover, as do EMC PowerPath and Hitachi Dynamic Link Manager. Dell offers a MS MPIO extension and DSM plugin to help Windows recover from a path loss in a more resilient fashion.
  • Network considerations are paramount, too.
    • Network switches often take 30 seconds to a few minutes to come back up after a power cycle or reboot.
    • If non-stacked switches are used, RSTP must be enabled. If RSTP is not enabled and anything else is misconfigured, connectivity to storage will be lost.
    • Flow Control must be enabled, among other considerations (disable unicast storm control, for example), to ensure that the network is resilient enough.
    • Link aggregation, if not using stacked switches, must be dynamic, or the iSCSI network might not support failover redundancy.
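To illustrate why an array-only guarantee doesn't translate into uptime as seen from the host, here is a rough sketch that multiplies the availability of each component in a single, non-redundant path and compares a switch reboot against the yearly five nines budget. The host and switch availability figures are hypothetical, and 180 seconds is my stand-in for "a few minutes."

```python
from math import prod

SECONDS_PER_YEAR = 365.25 * 24 * 3600           # assumption: average year
FIVE_NINES_BUDGET_S = SECONDS_PER_YEAR * 1e-5   # roughly 315.6 s per year


def series_availability(components: dict[str, float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components.values())


def budget_used(outage_seconds: float) -> float:
    """Fraction of the yearly five nines budget consumed by one outage."""
    return outage_seconds / FIVE_NINES_BUDGET_S


# Hypothetical figures for one path with no redundancy
path = {"host": 0.9999, "switch": 0.9999, "array": 0.99999}
end_to_end = series_availability(path)
print(f"end-to-end availability: {end_to_end:.5%}")
print(f"downtime: {(1 - end_to_end) * SECONDS_PER_YEAR / 60:.1f} min/year")

for outage in (30, 180):  # switch reboot durations in seconds
    print(f"{outage} s outage uses {budget_used(outage):.0%} of the budget")
```

A chain is never more available than its weakest link, so a five nines array behind a four nines switch and host yields less than four nines end to end. Redundant paths change that math, which is the point of the multipathing and network items above.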

 

Nearly every array manufacturer will say that upgrades are non-disruptive, but that is true only at the most simplistic level. Upgrades to a unified storage array, for example, will almost always involve disruption to file system presentation. Clustered or multi-engine frame arrays (HP 3PAR, EMC VMAX, NetApp, Hitachi VSP) offer the best hope of achieving five nines, or even better. We have customers with VMAX and Symmetrix that have had 100% uptime for a few years, but those arrays are multi-million dollar investments. Dual-controller modular arrays, like those from EMC and HDS, can’t really offer that level of redundancy, and that includes EQL.

If the environment is very carefully and correctly set up for automated failover, as noted above, then five nines can be achieved, but not really guaranteed.

 
