Availability Concept
Name: Nabeela Kausar
Roll no: 22034156-020
Agenda
Introduction
Calculating availability
Availability patterns
Sources of unavailability
Availability
Availability is the time during which system resources are available to do their work. Everyone expects their infrastructure to be available all the time, but a 100% guaranteed availability of an infrastructure is impossible. No matter how much effort is spent on creating highly available infrastructures, there is always a chance of downtime. It's just a fact of life.
Availability percentage
Availability is always given as a percentage of uptime over a given time period, which is usually one year. The table shows the downtime that each availability percentage allows per year, per month, and per week.

Availability            Downtime per year   Downtime per month   Downtime per week
99.8%                   17.5 hours          86.4 minutes         20.2 minutes
99.9% (three nines)     8.8 hours           43.2 minutes         10.1 minutes
99.99% (four nines)     52.6 minutes        4.3 minutes          1.0 minutes
99.999% (five nines)    5.3 minutes         25.9 seconds         6.1 seconds
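The downtime figures follow directly from the percentage. The sketch below is only an illustration of that arithmetic; the function name `downtime_minutes` and the period lengths (a 365-day year, a 30-day month, a 7-day week) are assumptions, not something the slides define.

```python
# Downtime allowed by an availability percentage over common periods.
# Period lengths are assumptions: 365-day year, 30-day month, 7-day week.
MINUTES_PER_PERIOD = {
    "year": 365 * 24 * 60,
    "month": 30 * 24 * 60,
    "week": 7 * 24 * 60,
}


def downtime_minutes(availability_percent: float, period: str) -> float:
    """Return the maximum downtime in minutes for the given period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_PERIOD[period] * unavailable_fraction


# Print the downtime budget for the availability levels discussed here.
for pct in (99.8, 99.9, 99.99, 99.999):
    budget = {p: round(downtime_minutes(pct, p), 2) for p in MINUTES_PER_PERIOD}
    print(f"{pct}% -> {budget} (minutes)")
```

With a different definition of a month (for example 365/12 days) the monthly figures shift slightly, so the period lengths should be agreed on before the numbers are compared.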
Most requirements used today are 99.9% (three nines) or 99.95% for a full IT system. 99.999% (five nines) is also known as carrier grade; this availability level originates from telecommunication components that need very high availability.
Although 99.9% availability allows about 525 minutes of downtime a year, this downtime should not occur in a single event, nor should there be 525 one-minute downtime events in a year. In other words, unavailability intervals must be defined.
Sample unavailability intervals:

Unavailability (minutes)   Number of events per year
0-5                        <=35
5-10                       <=10
10-20                      <=5
20-30                      <=2
>30 and <=35               <=1
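One way to use such an interval table is to check a year of recorded outages against it. The following is a minimal sketch under stated assumptions: the name `check_outages` is hypothetical, each band is read as "longer than the lower bound, up to and including the upper bound", and any outage longer than 35 minutes is treated as a violation.

```python
# Sample limits from the table: (lower bound, upper bound, max events per year).
# Reading each band as lower < duration <= upper is an assumption.
INTERVAL_LIMITS = [
    (0, 5, 35),
    (5, 10, 10),
    (10, 20, 5),
    (20, 30, 2),
    (30, 35, 1),
]


def check_outages(durations_minutes):
    """Return a list of violation messages for one year of outage durations."""
    violations = []
    counts = [0] * len(INTERVAL_LIMITS)
    for duration in durations_minutes:
        for i, (low, high, _limit) in enumerate(INTERVAL_LIMITS):
            if low < duration <= high:
                counts[i] += 1
                break
        else:
            # No band matched, so the outage is longer than any allowed interval.
            violations.append(f"outage of {duration} min is outside every allowed band")
    for (low, high, limit), count in zip(INTERVAL_LIMITS, counts):
        if count > limit:
            violations.append(f"{count} outages of {low}-{high} min, limit is {limit}")
    return violations


# Hypothetical outage log for one year, in minutes.
print(check_outages([3, 4, 12, 25, 25, 25]))
```

In this hypothetical log the three 25-minute outages exceed the limit of two events for the 20-30 minute band, even though the total downtime is far below 525 minutes.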
Calculating availability
Availability can neither be calculated nor guaranteed upfront. It can only be reported on afterwards, when a system has run for some years. Over time, much knowledge and experience has been gained on how to design highly available systems, using different availability patterns.
MTBF and MTTR
The factors involved in calculating availability are:
Mean Time Between Failures (MTBF), which is the average time that passes between failures.
Mean Time To Repair (MTTR), which is the average time it takes to recover from a failure.
Some calculation examples
Decreasing MTTR and increasing MTBF both increase availability. Dividing MTBF by the sum of MTBF and MTTR gives the availability, expressed as a percentage:
Availability = MTBF / (MTBF + MTTR) * 100%
Two arrangements of components are considered: serial components and parallel components.
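The formula can be tried out with a few numbers. This is a small sketch, not a measurement: the function name `availability_percent` and the MTBF and MTTR values are made up for illustration.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR) * 100%."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100.0


# Hypothetical component: fails every 5000 hours on average, 8 hours to repair.
base = availability_percent(5000, 8)
faster_repair = availability_percent(5000, 4)    # MTTR halved
more_reliable = availability_percent(10000, 8)   # MTBF doubled

print(f"base:          {base:.4f}%")
print(f"faster repair: {faster_repair:.4f}%")
print(f"more reliable: {more_reliable:.4f}%")
```

Both variations print a higher value than the base case, which is the point made above: availability improves either by repairing faster or by failing less often.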
Serial components
One defect leads to downtime.
Example: the above system's availability is: (each component's availability is at least 99.99%)
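For components in series, the usual calculation multiplies the availabilities of the individual components, because the system is up only when every component is up. The sketch below assumes that failures are independent; the function name `serial_availability` and the chain of five components are hypothetical and do not reproduce the system on the slide.

```python
from math import prod


def serial_availability(component_availabilities):
    """All components must work: multiply the availabilities (as fractions)."""
    return prod(component_availabilities)


# Hypothetical chain of five components, each 99.99% available.
chain = [0.9999] * 5
total = serial_availability(chain)
print(f"serial availability: {total * 100:.3f}%")
```

The result is lower than the availability of any single component, which is why a long chain of serial components limits the availability of the whole system.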
Parallel components
One defect: no downtime! But beware of SPOFs!
Calculate availability: Total availability = 99.99%
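For redundant components in parallel, the usual calculation is one minus the probability that all components are down at the same time. The sketch below assumes independent failures and a perfect switch-over between components; the function name `parallel_availability` is hypothetical. The slide does not state the availability of each component, so the 99% used here is an assumption, chosen because two such components give a total of 99.99%.

```python
def parallel_availability(component_availabilities):
    """Redundant components: the system is down only if all are down at once."""
    all_down = 1.0
    for availability in component_availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


# Assumption: two redundant components, each 99% available.
two_in_parallel = parallel_availability([0.99, 0.99])
print(f"parallel availability: {two_in_parallel * 100:.2f}%")
```

Any shared component in front of or behind the redundant pair is still in series with it, which is the SPOF warning above.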
Availability patterns
Single point of failure (SPOF)
Redundancy
Failover
Fallback
Single point of failure
A single point of failure (SPOF) is a component in the infrastructure that, if it fails, causes downtime to the entire system. SPOFs should be avoided in IT infrastructure, as they pose a large risk to the availability of the system. For example, in most storage systems, the failure of one disk does not affect the availability of the storage system: technologies like RAID (Redundant Array of Independent Disks) handle the failure of a single disk, eliminating disks as a SPOF. Server clusters, double network connections, and dual datacenters are all meant to eliminate SPOFs.
Redundancy
Redundancy is the duplication of critical components in a single system to avoid a SPOF. In IT infrastructure, redundancy is usually implemented in power supplies (a single component has two power supplies; if one fails, the other takes over), network interfaces, and SAN HBAs (host bus adapters) for connecting storage.
Failover
Failover is the (semi-)automatic switch-over to a standby system or component, either in the same or another datacenter, upon the failure or abnormal termination of the previously active system or component.
Examples:
Windows Server failover clustering
VMware
Oracle Real Application Clusters (RAC)
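The switch-over logic behind failover can be sketched in a few lines. This is a simplified illustration, not how any of the products named above work: the `Node` class, the `is_healthy` probe, and the `select_active` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


def select_active(nodes: List[Node], current: Optional[Node]) -> Optional[Node]:
    """Keep the current node while it is healthy; otherwise switch to a standby."""
    if current is not None and current.is_healthy():
        return current
    for node in nodes:
        if node is not current and node.is_healthy():
            return node
    return None  # no healthy node left: the service is down


# Simulated health status, flipped by hand to imitate a failure.
status = {"primary": True, "standby": True}
primary = Node("primary", lambda: status["primary"])
standby = Node("standby", lambda: status["standby"])
nodes = [primary, standby]

active = select_active(nodes, None)
print("active:", active.name)
status["primary"] = False           # the primary fails
active = select_active(nodes, active)
print("active after failure:", active.name)
```

A real cluster also has to detect failures reliably and prevent two nodes from being active at once; those parts are left out of this sketch.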
Fallback
Fallback is the manual switch-over to an identical standby computer system in a different location, typically used for disaster recovery. There are three basic forms of fallback solutions:
Hot site
Warm site
Cold site
Hot site
A hot site is a fully configured fallback datacenter, fully equipped with power and cooling. The applications are installed on the servers, and data is kept up to date to fully mirror the production system. Staff and operators should be able to walk in and begin full operations in a very short time (typically one or two hours). This type of site requires constant maintenance of the hardware, software, data, and applications to be sure the site accurately mirrors the state of the production site at all times.
Warm site
A warm site can best be described as a mix between a hot site and a cold site. Like a hot site, the warm site is a computer facility readily available with power, cooling, and computers, but the applications may not be installed or configured. External communication links and other data elements that commonly take a long time to order and install will be present. To start working in a warm site, applications and all their data need to be restored from backup media and tested. This typically takes a day. The benefit of a warm site compared to a hot site is that it needs less attention when not in use and is much cheaper.
Cold site
A cold site differs from the other two in that it is ready for equipment to be brought in during an emergency, but no computer hardware is available at the site. The cold site is a room with power and cooling facilities, but computers must be brought on-site if needed, and communication links may not be ready. Applications will need to be installed and current data fully restored from backups. Although a cold site provides minimal fallback protection, if an organization has very little budget for a fallback site, a cold site may be better than nothing.
Sources of unavailability
Human errors
Software bugs
Planned maintenance
Physical defects
Environmental issues
Complexity of the infrastructure
Business continuity
Although many measures can be taken to provide high availability, the availability of the IT infrastructure can never be guaranteed in all situations. In case of a disaster, the infrastructure could become unavailable, in some cases for a longer period of time. Business continuity is about identifying the threats an organization faces and providing an effective response. Business Continuity Management (BCM) and Disaster Recovery Planning (DRP) are processes to handle the effect of disasters.
RTO and RPO
Two important objectives of disaster recovery planning are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). The RTO is the maximum duration of time within which a business process must be restored after a disaster in order to avoid unacceptable consequences (like bankruptcy). The RTO is only valid in case of a disaster; it is not the acceptable downtime under normal circumstances. Measures like failover and fallback must be taken in order to fulfill the RTO requirements. The RPO is the point in time to which data must be recoverable, that is, the maximum amount of data loss, measured in time, that is acceptable after a disaster.
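Whether a recovery met its objectives can be checked from three timestamps. The sketch below takes the RPO in its usual sense, the maximum tolerable data loss measured in time; the function name `meets_objectives`, its parameters, and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(disaster_at, restored_at, last_recoverable_copy_at, rto, rpo):
    """Return (rto_met, rpo_met) for one disaster-recovery event or drill."""
    downtime = restored_at - disaster_at                       # time until restoration
    data_loss_window = disaster_at - last_recoverable_copy_at  # work that is lost
    return downtime <= rto, data_loss_window <= rpo


# Hypothetical drill: RTO of 4 hours, RPO of 1 hour.
rto_met, rpo_met = meets_objectives(
    disaster_at=datetime(2024, 1, 10, 9, 0),
    restored_at=datetime(2024, 1, 10, 12, 30),
    last_recoverable_copy_at=datetime(2024, 1, 10, 7, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print("RTO met:", rto_met, "RPO met:", rpo_met)
```

In this made-up drill the service is back within the RTO, but the last recoverable copy is two hours old, so more data is lost than the RPO allows.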
THANK YOU
Reference book: IT Infrastructure Architecture (third edition) by Sjaak Laan