How to Manage a Large UCS Environment
If your job includes managing a large number of Cisco UCS servers, or will in the future, this post shares some tips, tools, and lessons learned from managing an environment with over 250 UCS blade servers. Oh, and a small number of UCS rack-mount "pizza box" servers as well. There is an enormous number of configuration options and automation features available. Some may be helpful to you, others not so much.
What is Cisco UCS
Cisco Unified Computing System (UCS) was developed and introduced by Cisco Systems in 2009. This came as a bit of a surprise to many in the industry, because prior to this, Cisco Systems had been primarily a networking company. Cisco saw an opportunity to build a more integrated and scalable system with compute, networking and storage resources built in or easily attached. Cisco UCS has managed to capture about 25% of the blade server market since then (but only a bit over 4% of the rack-mounted server market). This speaks volumes about the quality, reliability and manageability of the UCS blade server system.
Cisco UCS is designed around a pair of Cisco multi-protocol switches with specialized firmware and software. These switches can connect to both data center Ethernet networks and Fibre Channel SANs, to provide networking and storage to the blade "chassis" and/or rack-mounted servers. In the case of a blade chassis, which holds eight half-width blades or four full-width blades, the networking and SAN connectivity is shared among the blade servers via backplane connections to Input/Output Modules (IOMs). This permits more stable structured cabling, since no cabling changes need to be made to a configured chassis to add an additional blade or add more connectivity to an existing blade.
Know Your Hardware
Even if you are unlikely to ever touch or even see your Cisco UCS equipment, it is good to be familiar with the hardware, what it looks like, and what the different options are. For some people, it helps to be able to visualize the system, and if there is a technician working on the hardware, you may be asked to provide guidance and answer questions.
Knowing which data center rack or which chassis a specific blade is in, what the various LEDs indicate, and sometimes the layout of the components inside the blade enclosure will be useful. Cisco has ample documentation on the UCS hardware. If your company maintains a Configuration Management Database (CMDB), which most do these days, familiarize yourself with the information it holds on your UCS equipment as well.
Cisco UCS Architecture and the Domain
In Cisco UCS architecture, the concept of a domain is an important one. A UCS domain consists of two Fabric Interconnect (FI) switches, plus at least one blade chassis to hold blades (B-Series servers) or one rack-mounted "pizza box" server (C-Series server). This represents a "bare minimum" configuration. Typically, a domain contains multiple blade chassis, multiple C-Series servers, or a combination of these. As for upper limits, the most significant one at present is probably this: the maximum combined number of blade and rack servers per Cisco UCS domain is 160. There is also a limit of 20 chassis per domain. Because the current chassis models have eight half-width slots, and a full-width blade consumes two of them, the use of full-width blades reduces the total number of blades supported accordingly.
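To make the arithmetic behind these limits concrete, here is a minimal Python sketch. It only encodes the three numbers quoted above (160 servers, 20 chassis, eight half-width slots per chassis). The function names are my own, invented for illustration, and this is not an official Cisco sizing tool.

```python
import math

# Limits quoted in the text above
MAX_SERVERS_PER_DOMAIN = 160
MAX_CHASSIS_PER_DOMAIN = 20
HALF_WIDTH_SLOTS_PER_CHASSIS = 8


def chassis_needed(half_width: int, full_width: int = 0) -> int:
    """Minimum number of chassis for a mix of blades.

    A full-width blade consumes two half-width slots.
    """
    slots = half_width + 2 * full_width
    return math.ceil(slots / HALF_WIDTH_SLOTS_PER_CHASSIS)


def fits_in_one_domain(half_width: int, full_width: int = 0,
                       rack_servers: int = 0) -> bool:
    """Check a planned server mix against both domain limits."""
    servers = half_width + full_width + rack_servers
    return (servers <= MAX_SERVERS_PER_DOMAIN
            and chassis_needed(half_width, full_width) <= MAX_CHASSIS_PER_DOMAIN)


print(fits_in_one_domain(half_width=160))   # True: 20 full chassis
print(fits_in_one_domain(half_width=161))   # False: over the server limit
print(fits_in_one_domain(0, full_width=80)) # True: 160 slots, 20 chassis
print(fits_in_one_domain(0, full_width=81)) # False: would need 21 chassis
```

With only full-width blades, the chassis limit is reached at 80 blades, well before the 160-server limit applies.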
So, 160 servers is a lot, right? That depends on what kind and size of business we are discussing. There are customers with thousands of servers… even tens or hundreds of thousands in some cases. How is it possible to scale up to numbers like that and still be able to manage it all? We will discuss some of the tools and design decisions that make this possible.
How to Manage a UCS Domain
Let's start with a single UCS domain. While it is possible to make some configuration changes to the blades or switches by logging into them directly, a Cisco UCS Domain is managed via UCS Manager in a web browser. UCS Manager is reached via a "Virtual IP Address" that is assigned to the FI pair during initial setup. There is a pair of FI switches for redundancy, and the virtual IP address is routed to the FI that is "Primary" at the time. The Primary FI can be changed either manually or as an automatic failover.
UCS Manager gives the administrator access to a variety of tools to manage the FI switches, chassis, blades, networking and SAN connectivity. Among the most powerful features available in UCS Manager are the ability to create pools of resources such as Ethernet MAC addresses, to create policies for many settings, and to combine and apply these pools and policies to blades in what is called a service profile. To multiply this power even more, service profile templates can be created, and service profiles created from a common template are linked to it. This translates into the ability to make a configuration change in the template that is propagated to all the linked profiles automatically.
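For readers who think in code, the template-to-profile link can be pictured with a small Python model. This is purely conceptual: the class names and policy keys are hypothetical, and it does not reflect how UCS Manager stores or exposes its objects. The point is only that a linked profile refers to its template rather than holding a private copy of the settings.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceProfileTemplate:
    name: str
    policies: dict = field(default_factory=dict)


@dataclass
class ServiceProfile:
    name: str
    template: ServiceProfileTemplate  # a link to the template, not a copy

    @property
    def policies(self) -> dict:
        # A linked profile always reads its settings through the template
        return self.template.policies


# Hypothetical policy names and values, for illustration only
template = ServiceProfileTemplate("EXAMPLE_TEMPLATE", {"firmware": "version-A"})
profiles = [ServiceProfile(f"EXAMPLE_{n:02d}", template) for n in range(1, 4)]

# One change in the template is visible in every linked profile
template.policies["firmware"] = "version-B"
print([p.policies["firmware"] for p in profiles])
```

After the single change to the template, all three profiles report "version-B".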
Now we can discuss a scenario where these tools will be used to simplify your configuration work. For simplicity, we'll assume that we have two Cisco 6248UP FI switches and two UCS 5108 chassis, each holding eight blades of the same exact model (let's say, Model B200-M5), that all have the exact same number of CPUs, amount of memory (RAM), internal hard drives, and interface modules. You can create a pool of Ethernet MAC addresses, a pool of Fibre Channel World Wide Port Names (WWPNs), and policies for firmware version, for mirroring the internal hard drives as a boot device, for CPU settings, and other available settings. Next you would create a service profile template, configured to use all of these policies and pools. Let's call it "B200-M5_TEMPLATE".
The next step is elegantly simple: You create 16 Service Profiles from the B200-M5_TEMPLATE, with a base name of "B200-M5_" and a starting sequence number of "01". UCS Manager will create these profiles named "B200-M5_01", "B200-M5_02", and so on, up to "B200-M5_16", all linked to the template. After this, you can either manually associate a profile with a specific blade, or if the order doesn't matter to you, you can create a server pool and UCS will automatically assign profiles to the blades as needed.
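The naming scheme is easy to reproduce outside of UCS Manager, which is handy when you want to pre-fill a spreadsheet or an inventory before the profiles exist. The Python sketch below generates the same 16 names and pairs them with chassis and slot positions. The helper name is my own, the two-digit zero padding simply mirrors the names shown above, and the pairing order is an assumption for illustration: when a server pool is used, UCS Manager decides the actual assignment.

```python
def profile_names(prefix: str, start: int, count: int, width: int = 2) -> list[str]:
    """Build sequential names such as B200-M5_01 ... B200-M5_16."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


names = profile_names("B200-M5_", start=1, count=16)

# Two chassis with eight half-width slots each, as in the scenario above
blades = [(chassis, slot) for chassis in (1, 2) for slot in range(1, 9)]

# Illustrative one-to-one pairing in chassis/slot order (an assumption)
assignment = dict(zip(names, blades))

print(names[0], names[-1])       # B200-M5_01 B200-M5_16
print(assignment["B200-M5_09"])  # (2, 1): chassis 2, slot 1
```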
As a profile becomes associated with a blade, a "Discovery" process takes place: firmware is downloaded onto the blade if necessary, and all the other policies in the template, and hence in the profile, are applied to the blade.
There are several methods of installing an operating system onto the blades. The network boot method, "Preboot eXecution Environment" (PXE), is convenient to use but requires some advance setup and a PXE install server. The most often used method in my experience is mounting an ISO image as a virtual CD/DVD to the blade via the KVM Console. If you have physical access, you can also attach the KVM "Dongle" cable adapter to the blade and use the USB port it provides. Regardless of which method you choose, the Service Profile must be configured accordingly for the correct first boot device.
My preferred method is to use a server or a virtual desktop running on a system in the same data center as the blade, and have the ISO image I wish to use for the installation stored either on that server or virtual desktop, or on a network share that is "nearby" networking-wise. Trying to do this from a laptop at home or far away over the Internet may be very slow, and could potentially cause corruption issues if interrupted at a crucial point. I would definitely look into the PXE method if I needed to do a large number of OS installs in a short period of time, though.
When to Use Multiple Domains
It should be mentioned that there are some compelling reasons to split up a large number of chassis and blades into multiple UCS domains, even if they might all fit within one domain. Potential outage issues arise when new firmware needs to be applied to the UCS components, which may only be once every year or two, depending on what bugs or vulnerabilities might exist in the current code or what desirable features have been added in the new code. This activity requires rebooting the FI switches one at a time, restarting IOMs in the chassis, and rebooting blades. Depending on the operating system(s) running on the blades, this can either be simple, or a major challenge.
Technically speaking, whether the blades are running Windows Server, Linux, or VMware ESXi, if networking and FC SAN connectivity is configured properly with redundant A/B adapters, rebooting an FI or IOM will not cause an outage. Rebooting a blade obviously causes an outage to that particular blade, but if clustering methods are in use it may not require an outage to a database or application.
However, speaking business-wise, many application owners and management stakeholders will not agree to the risk of outage during these activities, even on evenings and weekends. Because upgrades such as this are done to an entire domain, this leads us to look at the multi-domain model. This may not help much with Windows or Linux servers, but if the blades are running VMware ESXi, this introduces the opportunity to migrate virtual machines (VMs) off of the blades in the domain being upgraded. The additional cost of a pair of FI switches and the associated cabling, network and SAN switch ports needed must be considered, but the expense may easily be judged worth the reduction in risk.
Let's imagine a scenario where we have 90 half-width blades, all running VMware ESXi. We could have these all in one domain with two FI switches and a minimum of 12 chassis. Instead, what if we add four more FI switches and split this up into three domains with 30 blades in each? We would have four chassis in each domain, with two empty slots remaining in each domain. Next, let's assume that the blades are in three vCenter clusters of 30 blades each, with 10 blades of each cluster in each domain, so they are evenly distributed.
As long as enough free "overhead" exists in the compute and memory capabilities of these clusters to support moving all VMs onto 20 ESXi hosts temporarily, all of the blades in one domain could be evacuated during the UCS firmware upgrades. This is also a great opportunity to patch or upgrade the ESXi hosts, which may need to happen much more often than UCS upgrades.
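The numbers in this scenario are worth checking before committing to a design. The Python sketch below redoes the chassis math and estimates how much headroom a cluster needs to lose a third of its hosts. It assumes identical hosts and ignores any extra reserve you may want for HA or admission control. The function names and the memory figures in the example are hypothetical.

```python
import math

SLOTS_PER_CHASSIS = 8  # half-width slots per chassis, as described earlier


def chassis_for(blades: int) -> int:
    """Minimum number of chassis for a count of half-width blades."""
    return math.ceil(blades / SLOTS_PER_CHASSIS)


def max_utilization_before_evacuation(hosts: int, evacuated: int) -> float:
    """Highest cluster-wide utilization that still fits on the remaining hosts."""
    return (hosts - evacuated) / hosts


def can_evacuate(used: float, per_host: float, hosts: int, evacuated: int) -> bool:
    """True if the current usage fits on the hosts that stay in service."""
    return used <= per_host * (hosts - evacuated)


print(chassis_for(90))                                    # 12 chassis in one domain
print(chassis_for(30), chassis_for(30) * 8 - 30)          # 4 chassis, 2 empty slots
print(f"{max_utilization_before_evacuation(30, 10):.0%}")  # 67%

# Hypothetical cluster: 30 hosts with 768 GB RAM each, 14,000 GB in use
print(can_evacuate(used=14_000, per_host=768, hosts=30, evacuated=10))  # True
```

In other words, each 30-host cluster has to run at or below roughly two-thirds of its total capacity for a full domain evacuation to be possible.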
If there is not enough free capacity in one or more of the clusters, at least some of the ESXi hosts could be evacuated while being upgraded and rebooted. This doesn't mitigate the risk of something going wrong during the FI switch and IOM upgrades, but again, most operating systems and applications can handle this if configured properly. If the environment contains a lot of busy databases or sensitive applications, and the business will not tolerate risk during UCS upgrades, it may take several weekends of moving some things around, doing some of the upgrades, and moving them back on another weekend. Having multiple UCS domains increases your flexibility.
UCS Central versus Intersight
One limitation of UCS Manager is that it can only manage one domain. If you have multiple domains, you can still manage them with UCS Manager, but you will need to log in to each domain separately, in a different browser window or CLI session. Thankfully there are some additional tools available that can monitor and manage multiple UCS domains simultaneously.
UCS Central has been around since 2012. According to Cisco, "Cisco UCS Central does not replace Cisco UCS Manager, which is the basic engine for managing a Cisco UCS domain. Instead, it builds on the capabilities provided by Cisco UCS Manager and works with Cisco UCS Manager to effect changes in the individual domain." Please note that UCS Central 2.0 was released in April 2017, and has not been updated since then. I believe Cisco's focus going forward is on the Intersight Cloud Platform, which we will discuss next.
Considered a Software as a Service (SaaS) platform, Cisco Intersight was developed to be a modular central control point for not only Cisco UCS, but also for Cisco HyperFlex hyperconverged systems, networking, storage, and external cloud services providers such as Amazon Web Services. In describing Intersight, Cisco says: "A unified, secure SaaS platform comprising modular services that bridge applications with infrastructure, Intersight provides correlated visibility and management across bare-metal servers, hypervisors, Kubernetes, and serverless and application components, helping you transform with AIOps to reach the scale and velocity your stakeholders demand."
With regard to UCS, Intersight offers some basic features for free but requires the purchase of licensing for more advanced features. Administrators with even relatively small UCS environments may find Intersight's features, even at the free Base level, quite useful. Companies with larger installations should seriously consider purchasing licensing for one of the tiers offered for Intersight.
PowerTool for PowerShell
Last but not least, Cisco provides a free PowerShell module called Cisco UCS PowerTool Suite. Any UCS administrator who is familiar with Microsoft PowerShell is likely to find this module quite useful. It has a variety of commands that can connect to UCS Manager or UCS Central, query certain aspects of the hardware, firmware, software and configuration, or make changes to the configuration, either interactively at the command line or within scripts. It is highly recommended that UCS administrators at least investigate and evaluate this tool. Its interactive use does not require much knowledge of PowerShell syntax, but knowing the syntax will enable you to create useful scripts or modify those that others have created to meet your needs.
I would like to give an example of what the use of PowerTool commands has enabled me to do. Over time, I have created and updated a PowerShell script that uses both PowerTool commands and VMware PowerCLI commands to collect and correlate information on the ESXi hosts in my VMware vCenter environments, and the Cisco UCS environments that house the actual hardware. This has become extremely useful on a day-to-day basis, especially with at least eight UCS domains and over 250 blades and rack-mounted C-Series servers. It includes information such as the Cisco UCS domain name, the chassis and slot a blade is in, its physical rack location in the data center, its Service Profile name, and the blade model and serial number, correlated with the vCenter name, cluster name, ESXi hostname, IP address, build number of ESXi, and so on.
This information is exported from the script as a comma-delimited .CSV file that is imported into Microsoft Excel, allowing easy filtering and searches. In addition, the spreadsheet this data is imported into contains a sheet with data on the UCS hardware collected from the company's Configuration Management Database, which includes information such as the blade serial number that can also be cross-referenced. Finally, there is a sheet with a rough physical representation of the chassis and domains, with formulas that populate the chassis names and slot locations with the actual name of the ESXi host running on each blade. In practice, when a host needs emergency hardware maintenance, for example, having this information at hand in one spreadsheet can reduce what might have been a long, manual exercise of logging into systems and cross-referencing information to a simple, one-step lookup.
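My actual script is written in PowerShell, but the cross-referencing idea is independent of the language. Here is a minimal Python sketch of the same join on blade serial number, using only the standard library. The column names, serial numbers, and hostnames are invented for illustration, and a real export would have many more fields.

```python
import csv
import io

# Hypothetical export from the collection script (UCS plus vCenter data)
HOSTS_CSV = """serial,ucs_domain,chassis,slot,service_profile,esxi_host
FCH0000AAAA,domain-a,1,3,B200-M5_03,esx-a-03.example.com
FCH0000BBBB,domain-b,2,5,B200-M5_13,esx-b-13.example.com
"""

# Hypothetical CMDB extract, also keyed by serial number
CMDB_CSV = """serial,rack,model
FCH0000AAAA,R12,B200-M5
FCH0000CCCC,R14,B200-M5
"""


def load_by_serial(text: str) -> dict:
    """Index CSV rows by their 'serial' column."""
    return {row["serial"]: row for row in csv.DictReader(io.StringIO(text))}


def correlate(hosts: dict, cmdb: dict) -> list:
    """Add CMDB rack and model to each host row; blank when no match exists."""
    merged = []
    for serial, host in hosts.items():
        extra = cmdb.get(serial, {})
        merged.append({**host,
                       "rack": extra.get("rack", ""),
                       "model": extra.get("model", "")})
    return merged


rows = correlate(load_by_serial(HOSTS_CSV), load_by_serial(CMDB_CSV))

# One-step lookup: where does a given ESXi host physically live?
by_host = {row["esxi_host"]: row for row in rows}
hit = by_host["esx-a-03.example.com"]
print(hit["ucs_domain"], hit["chassis"], hit["slot"], hit["rack"])
```

Rows with no CMDB match keep blank rack and model fields, which makes gaps in the CMDB easy to spot when filtering.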
Cisco UCS is a very well-designed and reliable product. It has been supported, enhanced and updated by Cisco over the years, and has gained and held a significant market share for good reasons. Personally, I have found it very rewarding to work with, especially after becoming more familiar with the tools available to manage it.