by Tim Heagarty
October 16, 2007
[Ed. note: The author is a highly experienced computer and internet security expert, certified as an Information Systems Security Professional (CISSP) and a Certified Information Systems Auditor (CISA). In addition, the author is a Microsoft Certified Systems Engineer (MCSE).]
Introduction
Availability can have quite a bit of mathematics associated with it: calculating uptime, time working versus time down for maintenance, and so on. Availability in the business community often brings up topics like business continuity planning and disaster recovery--the latter being how to get going again and the former how to keep running regardless of what's going on around you. Much traditional thinking about availability centered on "nines": how many nines of availability you were required to provide for your client. In my industry, we have customers that we call "five nines." These customers have been promised, in the form of a service-level agreement, that our service will be available 99.999 percent of the time, which means no more than 5.26 minutes of unscheduled downtime in a year. That's high availability.
Unfortunately, for many users of highly popular social networking applications created by Web 2.0 start-up companies--companies that had no prior experience with successfully coping with the resource demands of hundreds of millions of customers--many-nines availability is the stuff of dreams. It simply doesn't exist in everyday experience for the users of these new applications.
Availability for the security community refers to having valid information ready for use by the proper authenticated and authorized consumer of that information. We will focus in this article on how to keep availability of information at the highest level possible while protecting ourselves from intentional and accidental downtime. As Web 2.0 allows our users to rely more and more on online systems, the availability of those systems becomes more important to our clients and to us, if our businesses are to survive and become useful, profitable entities.
There are traditionally three primary threats to a service's uptime or availability: hardware failure, software failure, and human failure. You can probably guess that human failure leads to the other two most of the time. Hardware most often fails when it has not been maintained according to the manufacturer's recommendations. Software failures can usually be traced back to a person who cut corners on procedure or didn't fully test the results of a patch or change.
Definition of Availability
Availability can be defined in various ways, but the most common is to take the total time that a service is supposed to be available, subtract the time that it is not available, and divide that result by the original total time:
A = (T - D) / T
In words, availability A is total time T less downtime D, divided by T. If a server should be up all week and it is down for one hour on Sunday morning, then we can use the following calculation: 168 hours of possible uptime minus one hour unavailable equals 167 hours of service. Divide 167 by 168 and you get roughly 99.40 percent uptime, or availability.
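The formula above, and the "nines" arithmetic from the introduction, can be checked with a few lines of Python. This is only a minimal sketch: the function names `availability` and `downtime_budget` are invented for illustration, and the yearly figure assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def availability(total_time: float, downtime: float) -> float:
    """A = (T - D) / T, with T and D in the same unit."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= downtime <= total_time:
        raise ValueError("downtime must be between 0 and total_time")
    return (total_time - downtime) / total_time


def downtime_budget(total_time: float, target: float) -> float:
    """Downtime allowed in total_time while still meeting a target availability."""
    return total_time * (1 - target)


# One hour down in a 168-hour week.
print(f"{availability(168, 1):.4%}")

# "Five nines" over a year, in minutes.
print(f"{downtime_budget(MINUTES_PER_YEAR, 0.99999):.2f} minutes")
```

Running the second function with other targets (0.999, 0.9999) shows how quickly the allowed downtime shrinks with each added nine.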
Keep in mind though that availability doesn't always have to be expressed in terms of hours per week. If your design goals say that you will process 1000 transactions a day and that can be done in six hours, then you have 18 hours of available downtime while still maintaining 100 percent availability. Also note that I said the hour of downtime was on Sunday morning. That may be a time of the day when you can perform maintenance and have a planned outage without disrupting the real business. Unless, of course, your service is taking payments or tracking attendance at church services; then Sunday morning might not be the best time for maintenance.
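One way to make this idea concrete is to measure availability only against the hours when the business actually needs the service. The sketch below is an illustration, not a standard metric: the name `windowed_availability` is hypothetical, times are hours since Sunday 00:00, and it assumes the required windows do not overlap one another and the outages do not overlap one another.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals, or 0 if they are disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def windowed_availability(required, outages):
    """Availability counted only over the required service windows.

    required and outages are lists of (start, end) pairs in hours.
    """
    required_total = sum(end - start for start, end in required)
    lost = sum(
        overlap(r_start, r_end, o_start, o_end)
        for r_start, r_end in required
        for o_start, o_end in outages
    )
    return (required_total - lost) / required_total


# Service is required Monday 00:00 through Saturday 24:00 (hours 24 to 168).
business_week = [(24, 168)]

# A planned outage on Sunday 08:00-09:00 costs nothing.
print(windowed_availability(business_week, [(8, 9)]))

# The same one-hour outage on Tuesday 10:00-11:00 does count.
print(windowed_availability(business_week, [(58, 59)]))
```

If Sunday morning is when your service matters most, the required window simply changes and the same outage becomes expensive.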
Several other factors intertwine to affect your service's availability to your end consumer. Serviceability applies where a component of your service is provided by a third party: it is the availability you can expect from that component. We get into serviceability a lot in Web 2.0, as we're all providing various services to each other. Since you had the good sense to partner with one of the largest and most reliable user networks in the world (AOL), you have done what was expected to assure serviceability of your user identification and credentialing services.
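A rough way to see why third-party serviceability matters: if your service needs every one of its components to be up, and their failures are independent, the combined availability is the product of the individual figures. The sketch below rests on both of those assumptions, and the component names and numbers are invented purely for illustration.

```python
import math


def serial_availability(component_availabilities):
    """Availability of a service that needs ALL components up at once.

    Assumes component failures are independent of one another.
    """
    return math.prod(component_availabilities)


# Hypothetical figures, not measurements of any real provider.
components = {
    "our own web tier": 0.9995,
    "identity provider": 0.9999,
    "third-party map service": 0.999,
}

combined = serial_availability(components.values())
print(f"combined availability: {combined:.4%}")
```

The combined figure is always lower than the weakest component, which is why each dependency you add deserves the same scrutiny as your own servers.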
Reliability is the time for which a software or hardware component can be expected to perform under nominal conditions without failing. Your ISP should be able to provide you with outage numbers. If they refuse or don't know, then it's time to find another ISP.
Recoverability is the time it should take to restore a component to its operational state after a failure. Recoverability circles back to our previous discussion regarding backups and maintaining data integrity. You must be able to recover from a failure in a timely manner to provide the greatest availability that you can, because availability is directly affected by how fast you can recover from an outage. Let's face it: you will have an outage sooner or later, probably sooner. If you can get your service back on its feet quickly, you can still achieve incredibly high availability statistics for your clients.
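Reliability and recoverability combine into a widely used steady-state estimate: availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to repair. Mapping the article's "reliability" to MTBF and "recoverability" to MTTR is an assumption made here for illustration, and the numbers below are made up.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate, two different recovery times (hypothetical values).
slow_recovery = steady_state_availability(mtbf_hours=1000, mttr_hours=4)
fast_recovery = steady_state_availability(mtbf_hours=1000, mttr_hours=0.5)

print(f"4 h to recover:   {slow_recovery:.4%}")
print(f"0.5 h to recover: {fast_recovery:.4%}")
```

With the failure rate held constant, cutting recovery time is the only lever left, and the comparison shows how much it can buy.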
Maintainability is the ease with which a component can be maintained, whether the maintenance is remedial or preventative. Make sure your software design and coding skills are the best that they can be. Refrain from making "quick and dirty" fixes to your programs. They never help anybody and will most assuredly show up again in another outage.
We can no longer rely strictly on uptime numbers when discussing availability, especially in the world of information security. Imagine a network like AOL's, with all of these wonderful new Web 2.0 services that we are writing every day. The AOL network may have very close to 100 percent availability, but that doesn't mean that our particular service will always be available to our client. Availability now means being there for your customer whenever she decides that she needs you. This means that you have to branch your service out to other platforms and ISPs to eliminate as many single points of failure as you can.
Web 2.0 brings another new wrinkle: some of our services rely on other services to function. You might have a client that combines maps with restaurant reviews and movie locations for a full night out on the town. If one of those services is not available to you, is your service still "available"? Availability is now in the eye of the customer. If you can degrade your service gracefully and still provide the other information, you may survive the "outage" with little or no brand damage.
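Graceful degradation for a mash-up like the night-out example might look like the sketch below. Everything in it is hypothetical: `build_page` and the three fetch functions stand in for calls to real upstream services, and a production version would add timeouts, logging, and narrower exception handling.

```python
from typing import Callable, Dict, Optional


def build_page(sources: Dict[str, Callable[[], str]]) -> Dict[str, Optional[str]]:
    """Collect each section of the page, leaving failed sections empty.

    A failing upstream service costs the user one section, not the whole page.
    """
    page: Dict[str, Optional[str]] = {}
    for name, fetch in sources.items():
        try:
            page[name] = fetch()
        except Exception:
            # Degrade: record the gap so the UI can show a placeholder instead.
            page[name] = None
    return page


# Stand-ins for real upstream calls (assumptions for illustration).
def fetch_map() -> str:
    return "map of downtown"


def fetch_reviews() -> str:
    raise TimeoutError("review service unavailable")


def fetch_movies() -> str:
    return "movie showtimes"


page = build_page({"map": fetch_map, "reviews": fetch_reviews, "movies": fetch_movies})
print(page)
```

The customer still gets the map and the showtimes, so from her point of view the service was available even though one dependency was not.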
Methods of Maintaining Availability
Let's see what we can do to provide the highest possible level of availability to our customers. Providing multiple sources of our information will be key, as will protecting our information and service from accidental, environmental, and deliberate denial-of-service outages. Being able to recover from the eventual outage will at least reduce the amount of time that we're off the air, increasing our overall availability.
Redundant, Redundant, Redundant
One of the best things that you can do to achieve high availability of your systems to your user base is to have multiple sources of the information that you are mashing up to provide your service. A "round robin" system will distribute your client requests across multiple access paths. Contract with multiple, varied ISPs around the world, assuming that you have a worldwide audience for your information, to carry your data objects and databases. In the same way, use multiple, varied name servers to supply DNS information so that clients can find your service. DNS has been carrying a tremendous load lately due to the explosive growth of Web 2.0 and social networking services. If your DNS is having trouble, it doesn't matter whether your data is intact or not: your clients can't reach it.
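The round-robin idea can be sketched in a few lines. This is an application-level illustration with invented names (`RoundRobinClient`, `isp_a`, and so on); in practice the rotation is often done by DNS or a load balancer rather than in client code.

```python
from itertools import cycle
from typing import Callable, Iterable


class RoundRobinClient:
    """Rotate requests across redundant sources, skipping any that fail."""

    def __init__(self, sources: Iterable[Callable[[str], str]]):
        self._sources = list(sources)
        self._rotation = cycle(self._sources)

    def fetch(self, request: str) -> str:
        last_error = None
        # Try each source at most once per request, continuing the rotation.
        for _ in range(len(self._sources)):
            source = next(self._rotation)
            try:
                return source(request)
            except Exception as exc:
                last_error = exc
        raise RuntimeError("all sources failed") from last_error


# Hypothetical sources hosted with three different providers.
def isp_a(request: str) -> str:
    return f"isp_a:{request}"


def isp_b(request: str) -> str:
    raise ConnectionError("isp_b is down")


def isp_c(request: str) -> str:
    return f"isp_c:{request}"


client = RoundRobinClient([isp_a, isp_b, isp_c])
results = [client.fetch("menu") for _ in range(4)]
print(results)
```

With one of the three sources down, every request is still answered; the failed source is simply passed over on its turn.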
Data Corruption
Let's say that we have covered the redundancy issues. Now we need to be sure that the distributed information is not corrupted, and if it becomes corrupted, that we can do something about it. Most of the controls that can be used here are detective controls, meaning that we will watch for corruption to occur and then fix it. About the only way to repair data corruption is to restore the data from a backup. Hopefully you also have transaction logging available and turned on in your database, so you can run those logs against the backed up data and come right back to where you were when you lost the data.
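The detect-then-restore cycle can be illustrated with a toy in-memory "database". This is a sketch under loud assumptions: the log format, the `digest` and `restore` helpers, and the use of a SHA-256 checksum as the detective control are all invented for illustration. A real database provides its own backup and transaction-log replay tools.

```python
import copy
import hashlib
import json


def digest(state: dict) -> str:
    """Checksum of the data, used here as a simple detective control."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def restore(backup: dict, transaction_log: list) -> dict:
    """Rebuild current state from a backup plus the log written since that backup."""
    state = copy.deepcopy(backup)
    for operation, key, value in transaction_log:
        if operation == "set":
            state[key] = value
        elif operation == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown log operation: {operation!r}")
    return state


backup = {"alice": 10}
transaction_log = [("set", "bob", 5), ("set", "alice", 12), ("delete", "bob", None)]

# The live data as it should be, and its recorded checksum.
live = {"alice": 12}
known_good = digest(live)

# Simulate corruption, detect it, then rebuild from backup and log.
live["alice"] = -999
if digest(live) != known_good:
    live = restore(backup, transaction_log)

print(live, digest(live) == known_good)
```

Running the same restore against a scratch copy on a regular schedule is a cheap way to confirm that the backup and the logs are actually usable.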
It's a good idea to periodically rebuild your database onto a test box using your backups and transaction logs. This lets you prove to yourself and your auditors (if you have them) that you can in fact quickly and cleanly rebuild the lifeblood of your service, the data.
Because the confidentiality, integrity, and availability triad is a three-legged stool, the other two legs always come into play to support the third. Being sure of who your users are, by identifying and authenticating them through the AOL service, will help you avoid denial-of-service attacks by unknown clients. Maintaining data integrity by restricting what your users can do to your data will also help assure availability of your service to your entire user population.
Conclusion
The availability of services is the hallmark of Web 2.0: everything, all the time. The distributed nature of the Web will help us to apply many of the availability techniques that we have discussed in this article. Avoiding single points of failure and not relying on a single service provider for critical infrastructure components will help you to provide your service to your customers for as long as possible.
I thank you for following along in this series on confidentiality, integrity, and availability. I believe that we live in incredibly exciting and challenging times with potential for success beyond any time before us. Good luck in creating the new level of services that the world has come to expect from us and from Web 2.0.
References
- http://en.wikipedia.org/wiki/Uptime - Wikipedia article on uptime and availability specification using the "nines" system.
- "Web 2.0 Confidentiality": the first article in this series
