
Achieving mainframe reliability with distributed scale

  • Andrew Oliver 

About 70% of the Fortune 500 use mainframes for core business functions, according to BMC Software. There is good reason for that.

Mainframes were designed for both raw processing power and reliability, with redundant components, error correction, journaling, and other key features that together provide what IBM calls “RAS”: Reliability, Availability, and Serviceability. However, new challenges have emerged: data volumes are growing, 79% of mainframe users can’t find the talent they need (according to Deloitte), and four out of five users report they need more frequent application updates. While not all roads lead to microcomputing or the cloud, some certainly do. The question is how to achieve mainframe reliability with distributed scale.

Redundancy is Critical

Mainframes achieve reliability through redundancy, and so do distributed environments; they just do it differently. Rather than building multiple redundancies into the hardware and the operating system, distributed environments rely on redundant physical machines, locations, and networks. However, this redundancy is not as automated as it is in mainframe systems. Just as z/OS can be configured to provide full or no redundancy at multiple layers, including the network and storage layers, so can a distributed environment. But in distributed systems, redundancy comes only from deliberate choices about software and configuration.

Both public and private cloud environments usually provide various levels of network redundancy in terms of switches and routers. In a distributed environment, DNS and a load balancer decide which virtual (and physical) machine receives a request. If that machine is down, the request should be sent elsewhere. This means that multiple VMs must be set up with the same service so that any of them can receive the request. In modern environments, these are usually configured as “containers” and managed by Kubernetes, which can also be configured to restart them when they fail. Multiple containers running the same software, with redundant DNS and load balancers routing requests over a reliable network backbone, are the equivalent of a mainframe running multiple VMs on redundant hardware. Kubernetes provides the same kind of functionality as z/OS in terms of managing these separate virtual machines.
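As a rough illustration of the routing behavior described above, here is a minimal Python sketch of a round-robin balancer that skips replicas marked unhealthy. The `Replica` and `RoundRobinBalancer` classes and the pod names are hypothetical, and the `healthy` flag stands in for a real health probe; this is not how Kubernetes or any particular load balancer is implemented.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One copy of a service; `healthy` stands in for a real health probe."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Rotate across replicas, skipping any that are marked unhealthy."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._next = 0  # index of the replica to try first on the next request

    def route(self):
        # Try each replica at most once, starting where the last request left off.
        for offset in range(len(self.replicas)):
            index = (self._next + offset) % len(self.replicas)
            replica = self.replicas[index]
            if replica.healthy:
                self._next = (index + 1) % len(self.replicas)
                return replica.name
        raise RuntimeError("no healthy replica available")


replicas = [Replica("pod-a"), Replica("pod-b"), Replica("pod-c")]
balancer = RoundRobinBalancer(replicas)
replicas[1].healthy = False  # simulate a failed container
served_by = [balancer.route() for _ in range(4)]
print(served_by)
```

With one replica down, requests simply alternate between the two that remain; only when every replica is unhealthy does the request fail, which is the case redundancy is meant to make rare.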

Redundant containers on different machines are useless if the data is not there too. In cloud computing environments, this often means some form of network storage is needed, such as AWS’s Elastic Block Store (EBS), which provides fault tolerance at the disk layer. However, this may not be sufficient by itself to provide reliability for a database. Distributed SQL databases provide semantics similar to those of Db2 but ensure data is replicated across multiple machines and even across multiple data centers (called availability zones in cloud computing parlance) or geographic regions. By ensuring the network, services, and data all have multiple redundancies, distributed environments can match mainframe characteristics.
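To make the replication idea concrete, the following Python sketch shows a simplified majority-quorum rule of the kind that consensus-based replication relies on. It assumes one replica per availability zone; the zone names and function names are hypothetical, and this is not the protocol of any specific database.

```python
from typing import Mapping


def quorum_size(replica_count: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replica_count // 2 + 1


def write_committed(acks: Mapping[str, bool]) -> bool:
    """Treat a write as durable once a majority of replicas acknowledge it."""
    return sum(acks.values()) >= quorum_size(len(acks))


# Hypothetical zone names; one replica per availability zone.
acks = {"zone-a": True, "zone-b": True, "zone-c": False}  # zone-c unreachable
survives_one_zone_outage = write_committed(acks)

acks_two_down = {"zone-a": True, "zone-b": False, "zone-c": False}
survives_two_zone_outage = write_committed(acks_two_down)

print(survives_one_zone_outage, survives_two_zone_outage)
```

Under this rule, three replicas tolerate the loss of one zone but not two, which is why the number and placement of replicas is a deliberate design decision rather than a default.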

Distributed environments can provide the same level of redundancy as a mainframe with a bit more configuration and planning, and they can go even further by replicating services and data across multiple “availability zones” (data centers that are near each other) and geographic regions. This ensures that services stay up even if connectivity to a data center is lost or a regional event makes an entire area unavailable. For instance, a service can fail over to Ohio if a hurricane renders the data centers in Virginia inoperative.

Managing the Tyranny of Choice

While in the mainframe world there is frequently a “right answer,” the distributed world has many right answers—especially in software. This can be a bit confusing and complicated. A few new technologies can make this easier.

Software as a service (SaaS) vendors can provide a managed version of a database or similar service without requiring as much configuration and administration. These offerings are not created equal, and some are “black boxes” that turn out not to provide actual redundancy or high availability. It is important to understand how a SaaS vendor provides high availability and how it manages upgrades and their associated outages.

GitOps provides an agent-based, version-controlled configuration management system. Changes are checked into revision control (usually Git) and are applied automatically by software agents. Changes can be rolled back either manually or automatically if they cause unavailability.
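The apply-and-roll-back cycle can be sketched in a few lines of Python. This is an illustrative toy, not the behavior of any particular GitOps tool: `reconcile`, the revision IDs, and the `replicas` setting are all hypothetical, a plain list stands in for the Git history, and a dictionary stands in for the live cluster.

```python
def reconcile(history, cluster, is_healthy):
    """Apply the newest committed revision, rolling back while unhealthy.

    history: list of revisions, oldest first (stands in for the Git log).
    cluster: dict holding the live configuration (stands in for the cluster).
    is_healthy: callable that probes the cluster after each apply.
    """
    for revision in reversed(history):
        # Replace the live configuration with this revision's configuration.
        cluster.clear()
        cluster.update(revision["config"])
        if is_healthy(cluster):
            return revision["id"]
    raise RuntimeError("no revision produced a healthy cluster")


history = [
    {"id": "rev-1", "config": {"replicas": 3}},
    {"id": "rev-2", "config": {"replicas": 0}},  # bad change: no capacity
]
cluster = {}
active = reconcile(history, cluster, lambda c: c["replicas"] >= 2)
print(active, cluster)
```

Here the agent tries the newest revision first, finds the service unhealthy, and falls back to the previous revision, so the bad change never stays in effect.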

Flox is a package manager built on the Nix package management system. It allows software to be described and configured declaratively. These Nix packages can be published as containers and installed in Kubernetes.

Part of providing reliability is ensuring correct configuration and administration. The complexity of distributed systems increases the chance of misconfiguration. Using software as a service reduces the overall burden, at a cost, whereas modern packaging and configuration management systems can reduce the complexity of maintaining a distributed system.

By using modern tools and configuring multiple levels of redundancy, modern distributed systems can achieve mainframe-level reliability. It is critical to consider a modern database, such as a distributed SQL database, to ensure that the data backing those services is also reliable. Ultimately, making these choices yields better business value for companies across industries.