Plan the Unexpected: Uptime & Downtime

This is the first in a series of posts summarizing the feedback and debate on Cloud contract definitions that we received from stakeholders on both sides of the equation: Cloud Providers and Cloud Adopters. We collected the relevant requirements of various scenarios to take into account when defining the core of the SLALOM models for Cloud SLAs: our legal and technical specifications and the common SLA reference.

All these aspects are included in the SLALOM initial positioning paper, which you can find here. The document gives a detailed description of our findings, an analysis of the expected pros & cons of both perspectives (Providers vs Adopters), and the final balanced positioning of SLALOM outcomes.

This deliverable is the starting point for codifying all the aspects included in the SLALOM model terms, both legal and technical. So stay tuned for more updates in this series of posts, and don’t forget to comment with your own views and experience.

Today we start by summarizing the feedback and debate on the contract definitions of “Uptime” and “Downtime” as seen from the provider and service user (adopter) perspectives, which inform SLALOM’s suggested approach.

UPTIME

ISO 046 Availability component [/ Uptime]
ISO 047 Availability component [/ Uptime percentage]

Uptime is one of the main aspects of availability, and cloud adopters are advised to pay particular attention to the way it is calculated. According to best practice, SLAs are typically calculated monthly, and the calculation of uptime versus downtime considers only the period during which the cloud adopter was a client of the cloud provider. However, some cloud providers do not follow this best practice. Instead, they calculate availability over a period starting 12 months before the client joined the service, on the assumption that the previous 12 months were delivered at 100% uptime (whether or not the cloud adopter was a client then) [24]. The SLA should therefore also describe how uptime is calculated.
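To make the difference between the two calculation methods concrete, here is a minimal Python sketch. The figures are purely illustrative assumptions (a 30-day month and 432 minutes of downtime), and the function name uptime_percentage is hypothetical; it is not taken from any SLALOM specification or provider contract.

```python
def uptime_percentage(window_minutes: float, downtime_minutes: float) -> float:
    """Return uptime as a percentage of the measurement window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes


# Assumed figures, for illustration only.
MINUTES_PER_MONTH = 30 * 24 * 60   # a 30-day month
downtime = 432                     # minutes of downtime in the client's first month

# Best practice: measure only the month in which the adopter was a client.
monthly = uptime_percentage(MINUTES_PER_MONTH, downtime)

# Padded window: the same month plus 12 earlier months assumed to be 100% up.
padded = uptime_percentage(13 * MINUTES_PER_MONTH, downtime)

print(f"monthly window: {monthly:.2f}%")
print(f"padded window:  {padded:.2f}%")
```

Under these assumed figures, the same outage reads as 99.0% uptime when measured over the month actually served, but roughly 99.92% once twelve assumed-perfect months are added to the window.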

Another issue that needs to be dealt with is that it may be very difficult for an end-user/consumer or a small firm to demonstrate that a cloud provider has not worked with reasonable care and skill to achieve a certain level of availability, in particular the amount of uptime per month [15]. The Ministry of Justice guidance on Cloud Computing and CJSM [29] points in the same direction, i.e. the need to understand how the percentage of availability is calculated. It additionally emphasises the importance of the definition of "up": for example, a cloud system may be "up" according to an SLA even if a number of features are unresponsive, provided that core systems are available.

SLALOM clearly needs to consider the period of calculation, the definition of ‘up’, and how it is calculated, monitored and reported.

DOWNTIME

ISO 048 Availability component [/ Allowable downtime]
ISO 049 Availability component [/ Downtime]

Whilst the literature sources note the obvious necessity of system maintenance, which might result in downtime, there is discussion of how it should be approached. The Cloud Standards Customer Council’s practical guide suggests that cloud adopters ensure there is a mechanism to inform them of changes and, if not, amend their contract to put the onus on the provider to give reasonable advance notice of updates, so that they are aware of the downtime of their service [20]. Again, the importance of defining the measurement window is raised.

Dimension Data’s white paper [24] looks beyond hard downtime and also poses questions about performance degradation. In particular, the white paper states that under 'best practices' "cloud SLAs should cover both unavailability (hard downtime), as well as performance degradation. Many providers offer clear SLA language explaining what happens if their infrastructure goes completely offline, but fail to mention whether performance degradation is also considered an SLA violation." It suggests that cloud adopters ensure that performance degradation and unavailability are both covered in the SLA.

Moreover, it advises cloud adopters to be cautious about issues arising from the complexity of cloud architectures: "A simple server-only uptime SLA fails to address one of the most important components of cloud architecture, namely the network. Consider the implication to your business should network performance vary widely. Is this something you can easily architect around? You should have a clear understanding as to whether the provider guarantees network performance as well as uptime performance." The same white paper points out that some cloud providers require numerous ‘availability zones’ or ‘regions’ to fail before they consider the failure an SLA violation, and proposes that cloud adopters also ask for an SLA covering failures of their cloud product in a single location.

Feedback from SLALOM respondents confirmed these problems. Among the cloud service providers (CSPs), one commented: 'We had to make the Availability SLAs a quarterly measurement instead of monthly.' 'Customers don't often realize the difference in actual downtime. Performance SLAs are very difficult to meet due to many aspects of the network that are beyond the SaaS provider's control.'
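The remark about 'the difference in actual downtime' between quarterly and monthly measurement can be illustrated with a short Python sketch. The 99.9% target and the 30- and 90-day windows are assumptions chosen for illustration, not figures reported by the respondents, and allowed_downtime_minutes is a hypothetical helper name.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int) -> float:
    """Downtime budget, in minutes, implied by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


TARGET = 99.9  # assumed availability target, in percent

monthly_budget = allowed_downtime_minutes(TARGET, 30)    # 30-day window
quarterly_budget = allowed_downtime_minutes(TARGET, 90)  # 90-day window

outage = 120  # a single hypothetical two-hour outage, in minutes

print(f"monthly budget:   {monthly_budget:.1f} min, breached: {outage > monthly_budget}")
print(f"quarterly budget: {quarterly_budget:.1f} min, breached: {outage > quarterly_budget}")
```

With these assumed numbers, a single two-hour outage would breach a monthly 99.9% commitment (about 43 minutes allowed) but stay within a quarterly one (about 130 minutes allowed), even though the percentage in the contract is identical.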

Indeed, responsibility for aspects outside their sphere of influence was repeatedly raised by CSPs: 'The reliability of third party service - such as connectivity.' … 'Managing client expectations. Many operate imperfect legacy systems and yet expect cloud technology and its purveyors, integrators and SaaS providers to provide 100% up time.' … 'You can be asked to put something under SLA that you would never be asked in traditional process - over-expectation'.

One CSP saw the demand for end-to-end SLAs as detrimental overall: [the] 'Customer wants to measure end user experience, which is highly variable. So we have to set the bar low, yet only a small percentage of users really fall under this category of likely poor performance'.

Downtime and Uptime are clearly related aspects. It appears fair to allow CSPs to define a certain level of downtime for planned maintenance and to exclude this from the uptime window. Effectively, uptime SLOs would then be a commitment to avoid unplanned maintenance (outages and so on), whilst downtime SLOs would be a commitment to optimise and reduce planned maintenance. Conditions could be placed on the downtime component (such as advance warning) that separate it from unplanned outages and maintenance. As for uptime, the period, calculation, definition, monitoring and reporting should all be considered.
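One possible way to express this split is sketched below in Python. It is only an illustration under stated assumptions, not a SLALOM formula: the Outage record, the 48-hour notice threshold and all the figures are hypothetical. Planned maintenance is excluded from the uptime window only when the assumed notice condition is met; otherwise it counts as ordinary downtime.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    planned: bool
    notice_hours: float = 0.0  # advance warning given to the adopter


def uptime_excluding_planned(window_minutes, outages, min_notice_hours=48.0):
    """Uptime % with qualifying planned maintenance removed from the window."""
    # Only planned maintenance announced early enough is excluded.
    excluded = sum(o.minutes for o in outages
                   if o.planned and o.notice_hours >= min_notice_hours)
    # Everything else (unplanned, or planned with short notice) counts as downtime.
    counted = sum(o.minutes for o in outages) - excluded
    measured = window_minutes - excluded
    if measured <= 0:
        raise ValueError("measurement window is empty after exclusions")
    return 100.0 * (measured - counted) / measured


window = 30 * 24 * 60  # assumed 30-day month, in minutes
outages = [
    Outage(minutes=120, planned=True, notice_hours=72),  # qualifies for exclusion
    Outage(minutes=60, planned=True, notice_hours=2),    # notice too short
    Outage(minutes=30, planned=False),                   # unplanned outage
]

with_exclusion = uptime_excluding_planned(window, outages)
# For comparison: every minute of downtime counted against the full window.
naive = 100.0 * (window - sum(o.minutes for o in outages)) / window

print(f"with exclusion: {with_exclusion:.2f}%")
print(f"naive:          {naive:.2f}%")
```

In this example the well-announced two-hour maintenance slot does not count against the uptime figure, while the poorly announced one does, giving about 99.79% against about 99.51% when every outage is counted.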

Read our paper and Follow us on Twitter