How to Successfully Manage 250 ESXi Hosts
Virtualization has become an integral part of the enterprise data center, from large multinational corporations to small businesses. VMware vSphere is the feature-rich suite of products that enables IT organizations to maximize their investment in computing equipment by running many virtual servers on hardware that, in the past, would have only supported a single operating system.
Managing a large virtualized environment requires unique training and knowledge, and presents the administrator with some additional and possibly unexpected challenges. This post will illuminate some of these challenges and help you successfully manage your large VMware environment.
The Best Naming Conventions for 250 ESXi Hosts
When you're working with a vCenter that has, say, 250 ESXi hosts in 15 different clusters, 4000 VMs, and 150 datastores, it can be quite confusing. This is much worse if there is no consistency to the names given to all these components. Even if you're not required to administer the operating system of the VMs, you may often be asked to locate them, move them, or modify the CPU or memory configuration of the VM. These tasks will be much easier if the groups that manage the VM operating systems (Windows, Linux, etc.) have adopted a standard naming convention.
An example might be something like: "W" for "Windows", "P" for "Production", "DB" for "Database", followed by a two-letter department code and a number, e.g. "WPDBHR01" would be a Human Resources Department ("HR") Production Database Server running Microsoft Windows. Most larger companies already have conventions like this in place, yet often still have legacy systems around from before the naming convention was implemented.
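A convention like this is easiest to enforce when it can be checked automatically. The following is a minimal Python sketch, assuming the exact layout of the "WPDBHR01" example (one-letter OS, one-letter environment, two-letter role, two-letter department, two-digit number). Only the codes "W", "P", "DB" and "HR" come from the example above; the other codes in the lookup tables, and the function name parse_vm_name, are hypothetical placeholders.

```python
import re
from typing import NamedTuple, Optional

# Lookup tables for each position in the name. Only "W", "P" and "DB"
# come from the article's example; the remaining codes are hypothetical.
OS_CODES = {"W": "Windows", "L": "Linux"}
ENV_CODES = {"P": "Production", "D": "Development", "T": "Test"}
ROLE_CODES = {"DB": "Database", "AP": "Application", "WB": "Web"}

# OS letter, environment letter, 2-letter role, 2-letter department, 2 digits.
NAME_PATTERN = re.compile(r"^([A-Z])([A-Z])([A-Z]{2})([A-Z]{2})(\d{2})$")


class VmName(NamedTuple):
    os: str
    environment: str
    role: str
    department: str
    number: int


def parse_vm_name(name: str) -> Optional[VmName]:
    """Return the decoded parts of a VM name, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name.upper())  # treat names as case-insensitive
    if match is None:
        return None
    os_code, env_code, role_code, dept, num = match.groups()
    if os_code not in OS_CODES or env_code not in ENV_CODES or role_code not in ROLE_CODES:
        return None
    return VmName(OS_CODES[os_code], ENV_CODES[env_code], ROLE_CODES[role_code], dept, int(num))


if __name__ == "__main__":
    # The second name stands in for a legacy system that predates the convention.
    for candidate in ("WPDBHR01", "oldmailserver"):
        print(candidate, "->", parse_vm_name(candidate))
```

Running a check like this against an exported VM list is one way to produce a list of the legacy systems that do not follow the convention.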
For vCenter or cluster names, the temptation may be to use simple names like "Production", "Development", or "Test". However, as environments grow, there may be a need for multiple clusters or vCenters, so it's useful to at least append a number like "01" so you can have "Prod01", "Prod02" and "Dev01", "Dev02", and so on.
Naming for ESXi hosts could be something like an environment letter ("P", "D" or "T") + "ESX" + number, yielding names such as "PESX01" or "TESX05". It might be desirable to add something more to the name, but keep in mind that often an ESXi host is more like a general purpose "engine" that runs VMs and doesn't care what cluster it is in or what type of VMs run on it. Even the "P", "D" or "T" designation may complicate operations in the future, such as in the case of a production ESXi host failing and needing to be replaced temporarily by a host taken from the test environment. The more generic the names are, the more dynamic the ESXi environment can be without confusion or the need to rename objects.
For naming datastores, a lot may depend on the type of datastores and what they are used for. Typically in large corporate data centers, the datastores are on large Storage Arrays connected to the ESXi hosts via a Fibre Channel, FCoE or iSCSI SAN. Because the Storage Arrays and SAN may be managed by another IT group, it will be important to have a consistent naming convention on both ends or there will be ongoing confusion.
An example of a datastore name that can also be used on the Storage Array "LUN" (the unit of block storage being presented to the hosts to create the datastore on) would be "vCenter" + "cluster" + "storage array" + number, yielding something like "PRVC01_Prod02_DellUnity1234_005". It's a bit long, but every bit of information in that name is likely to come in handy at some point. VMware vCenter certainly allows names this long and longer, and most modern storage arrays will support names like this. If not, then a useful strategy may be to maintain a spreadsheet that maps the datastore names to whatever device names the storage team provides, but it must be kept up-to-date or it will not be reliable.
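To keep both teams producing exactly the same string, the datastore name can be generated and decoded by a small helper rather than typed by hand. This is a sketch only, assuming underscores as the separator and a three-digit number as in the example name above; the function names are hypothetical.

```python
from typing import NamedTuple


class DatastoreName(NamedTuple):
    vcenter: str
    cluster: str
    array: str
    number: int


def build_datastore_name(vcenter: str, cluster: str, array: str, number: int) -> str:
    """Join the parts with underscores and zero-pad the number to three digits."""
    for part in (vcenter, cluster, array):
        # An underscore inside a part would make the name impossible to split again.
        if not part or "_" in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"{vcenter}_{cluster}_{array}_{number:03d}"


def parse_datastore_name(name: str) -> DatastoreName:
    """Split a name created by build_datastore_name back into its parts."""
    fields = name.split("_")
    if len(fields) != 4 or not fields[3].isdigit():
        raise ValueError(f"name does not follow the convention: {name!r}")
    return DatastoreName(fields[0], fields[1], fields[2], int(fields[3]))


if __name__ == "__main__":
    name = build_datastore_name("PRVC01", "Prod02", "DellUnity1234", 5)
    print(name)  # PRVC01_Prod02_DellUnity1234_005
    print(parse_datastore_name(name))
```

The parse function is the useful half in practice: given a datastore name from an alert, it tells you which vCenter, cluster and storage array to look at.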
In summary, if standard naming conventions are implemented and followed consistently, it will make everyone's job easier.
Why Code Consistency Matters: Versions & Patches
The phrase “code consistency” refers to the versions and patch levels of the various vSphere components. While typically it is best to be running the latest stable version of code from any vendor, this is often not practical. However, a VMware administrator should strive to at least be aware of the "End of Support Life" (EOSL) dates for the software components, and plan strategically to upgrade components reasonably ahead of these dates. In addition to this, it is highly recommended that the components be held at the same code level within certain realms of the environment.
For example, in a cluster of ESXi 6.7 hosts, it is generally best for stability and reliability to make sure they are all running exactly the same build number (e.g. ESXi 6.7 U3 P05, Build #17700523). VMware's website has various resources for keeping track of these version and build numbers, as well as a compatibility matrix that you can use to verify that the versions you are using together are in fact supported by VMware to work together. There are tools that can alert you if you have components that are not in compliance, but ultimately it is the customer's responsibility to use a supported configuration.
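A simple consistency check is to group hosts by cluster and flag any host whose build number differs from the most common one in that cluster. The Python sketch below assumes you already have a list of (cluster, host, build) records, for instance from an inventory export. The host and cluster names are made up, and the build number 10000001 is a placeholder, not a real ESXi build.

```python
from collections import Counter, defaultdict


def find_build_outliers(hosts):
    """Report, per cluster, the hosts that are not on the cluster's most common build.

    hosts: iterable of (cluster, host, build) tuples.
    Clusters where every host runs the same build are left out of the report.
    """
    by_cluster = defaultdict(list)
    for cluster, host, build in hosts:
        by_cluster[cluster].append((host, build))

    report = {}
    for cluster, members in by_cluster.items():
        counts = Counter(build for _, build in members)
        if len(counts) > 1:
            # On a tie, most_common() keeps the build that was seen first.
            majority_build, _ = counts.most_common(1)[0]
            report[cluster] = {
                "majority_build": majority_build,
                "outliers": sorted(h for h, b in members if b != majority_build),
            }
    return report


if __name__ == "__main__":
    # Hypothetical inventory; "10000001" is a placeholder build number.
    sample = [
        ("Prod01", "PESX01", "17700523"),
        ("Prod01", "PESX02", "17700523"),
        ("Prod01", "PESX03", "10000001"),
        ("Dev01", "DESX01", "17700523"),
    ]
    print(find_build_outliers(sample))
```

A report like this does not replace the vendor's compliance tools, but it is a quick way to spot a host that was missed during the last patch cycle.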
Also worth mentioning here is VMware Tools, which is an optional utility that is installed on the VM operating system, and provides device drivers as well as the ability for vCenter to interact with the VM for operations such as quiescing the VM operating system while taking snapshots. It is advised to keep up to date with the bug fixes and features in this code and keep it at an acceptably recent and stable level. This software often requires a reboot of the VM to install or upgrade, so scheduling this can be a challenge that requires collaboration and planning.
How to Monitor 250 ESXi Hosts
VMware vCenter has many built-in monitoring and alerting features and options. Unless you sit and stare at your vCenter GUI all day and night, though, you are likely to miss something that goes wrong (and yes, things do go wrong now and then), so it pays to have alerts come to you. Options include having emails sent to you or a group when there is an alert, or having alerts sent to a specialized monitoring system such as Microsoft System Center Operations Manager (SCOM) or Nagios, which in turn will notify the correct personnel. Alert thresholds can be adjusted to better meet your needs.
Like other things discussed in this post, most large companies have already adopted and implemented strategies and tools in the monitoring and alerting area. It is wise to take advantage of these and make the most of them. It's much less stressful to get a call, text or email at 10:00 p.m. letting you know that a datastore is getting too full so you can take immediate action, than to get a call from your manager at 2:00 a.m. asking you why 25 business-critical VM servers have gone offline.
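The "datastore is getting too full" case comes down to a threshold check on used capacity. The sketch below shows that logic in Python, assuming you have capacity and free-space figures for each datastore. The 80% and 90% thresholds and the datastore names are illustrative assumptions, not vCenter defaults.

```python
def datastores_over_threshold(datastores, warn_pct=80.0, crit_pct=90.0):
    """Return (severity, name, used_pct) for each datastore at or above a threshold.

    datastores: iterable of (name, capacity_gb, free_gb) tuples.
    warn_pct / crit_pct: illustrative thresholds; tune them to your environment.
    """
    alerts = []
    for name, capacity_gb, free_gb in datastores:
        if capacity_gb <= 0:
            continue  # skip records with no usable capacity figure
        used_pct = 100.0 * (capacity_gb - free_gb) / capacity_gb
        if used_pct >= crit_pct:
            severity = "critical"
        elif used_pct >= warn_pct:
            severity = "warning"
        else:
            continue
        alerts.append((severity, name, round(used_pct, 1)))
    return alerts


if __name__ == "__main__":
    # Hypothetical datastores: (name, capacity in GB, free space in GB).
    sample = [
        ("PRVC01_Prod02_DellUnity1234_005", 1000, 50),
        ("PRVC01_Prod02_DellUnity1234_006", 1000, 150),
        ("PRVC01_Prod02_DellUnity1234_007", 1000, 500),
    ]
    for alert in datastores_over_threshold(sample):
        print(alert)
```

Having a warning level below the critical level is what buys you the 10:00 p.m. notice instead of the 2:00 a.m. outage.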
3 Automation & Scripting Tools
There are a number of automation features built into vCenter, and add-ons, plugins, and third-party products that can streamline certain administrative tasks. One very useful built-in automation feature is Distributed Resource Scheduler (DRS) that provides load balancing and VM placement capabilities.
For example, say you have a cluster with three ESXi hosts and thirty VMs, and the hosts are being equally utilized in both CPU and memory. Suddenly a very resource-intensive query starts to run on one of the VMs, a database server, and the CPU utilization on that VM (and hence the ESXi host) climbs above a threshold. vCenter will automatically live-migrate (vMotion) one or more other VMs off of the over-utilized host to better balance the workload. This is an extremely valuable feature in keeping things running smoothly.
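To make the idea concrete, here is a toy Python sketch of threshold-based rebalancing. It is not VMware's DRS algorithm, which weighs many more factors; it is a simple greedy loop under the assumption that each VM contributes a fixed percentage to its host's CPU load. All host names, VM names and numbers are made up.

```python
def rebalance(hosts, threshold=80.0):
    """Greedily move small VMs off the busiest host until it is at or below threshold.

    hosts: dict mapping host name -> dict of VM name -> CPU load (percent of host).
    The dict is modified in place. Returns the list of (vm, source, destination) moves.
    """
    moves = []
    while True:
        load = {host: sum(vms.values()) for host, vms in hosts.items()}
        source = max(load, key=load.get)
        target = min(load, key=load.get)
        if load[source] <= threshold:
            return moves
        # Only consider moves that leave the target less loaded than the source
        # was, so every move strictly improves the balance and the loop ends.
        candidates = [
            (cost, vm)
            for vm, cost in hosts[source].items()
            if cost > 0 and load[target] + cost < load[source]
        ]
        if not candidates:
            return moves  # nothing can be moved without making things worse
        cost, vm = min(candidates)  # move the smallest suitable VM first
        hosts[target][vm] = hosts[source].pop(vm)
        moves.append((vm, source, target))


if __name__ == "__main__":
    # Hypothetical cluster: the database VM db01 has pushed PESX01 to 90%.
    cluster = {
        "PESX01": {"db01": 60.0, "app01": 15.0, "app02": 15.0},
        "PESX02": {"web01": 30.0, "web02": 20.0},
        "PESX03": {"web03": 25.0, "web04": 25.0},
    }
    print(rebalance(cluster))  # [('app01', 'PESX01', 'PESX02')]
```

Note that the busy database VM stays where it is; as in the scenario above, it is the other VMs on the host that get moved out of its way.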
DRS is also extremely valuable when the time comes to patch or upgrade ESXi hosts, because vCenter can (either manually at your instruction or automatically) move all the VMs off of a host, patch or upgrade it, then put it back in service before moving on to the next one. As mentioned in the Code Consistency section, it is best to keep ESXi hosts within a cluster at the same build level. VMware Update Manager (VUM), another integrated tool, accomplishes that task. For a cluster or for individual hosts, you can define and attach baselines for code components, run compliance checks, and execute either manual or automatic remediation if desired. If Automated DRS is enabled on a cluster, you can initiate remediation and vCenter will manage the entire process of upgrading each host one at a time.
Another valuable, and free, tool is VMware PowerCLI for Microsoft PowerShell. It is an installable module that enables you to log in to one or more vCenters and execute commands or complete scripts against them in a command line environment. For those who are already familiar with PowerShell or other programming/scripting languages, the learning curve is not very steep. Those not accustomed to using the command line or scripting may find that it is not difficult to learn, and there are many resources available to help.
PowerCLI can be used for tasks as simple as listing out all the VMs in your vCenter(s) along with pertinent information such as the VMs' operating systems, IP addresses, etc. You can easily direct the output to a comma delimited (CSV) file that can be opened with Microsoft Excel and manipulated there. There are various repositories of shared scripts on the Internet, and VMware even maintains a Community Forum where information about using PowerCLI is shared.
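The inventory-to-CSV workflow looks roughly like the sketch below. Note that this is plain Python rather than PowerCLI, and the VM records are hard-coded stand-ins for what an inventory query would return, so it only illustrates the reporting half of the task. The field names, VM names and IP addresses are hypothetical.

```python
import csv
import io

# Column order for the report; these field names are illustrative.
FIELDNAMES = ["name", "guest_os", "ip_address", "num_cpu", "memory_gb"]


def inventory_to_csv(vms, stream):
    """Write one CSV row per VM, sorted by name, to an open text stream.

    vms: iterable of dicts. Missing fields are written as empty cells and
    any extra keys are ignored.
    """
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    for vm in sorted(vms, key=lambda record: record["name"]):
        writer.writerow({field: vm.get(field, "") for field in FIELDNAMES})


if __name__ == "__main__":
    # Hypothetical records standing in for the result of an inventory query.
    sample = [
        {"name": "WPDBHR01", "guest_os": "Microsoft Windows Server 2019",
         "ip_address": "192.0.2.15", "num_cpu": 4, "memory_gb": 16},
        {"name": "LPWBFN02", "guest_os": "Linux", "num_cpu": 2, "memory_gb": 8},
    ]
    buffer = io.StringIO()
    inventory_to_csv(sample, buffer)
    print(buffer.getvalue())
```

To write a file that Excel can open, pass a file opened with open("inventory.csv", "w", newline="") instead of the in-memory buffer.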
Permissions and Auditing for 250 ESXi Hosts
VMware vCenter has the capability of integrating with Microsoft Active Directory (AD). This makes it possible for the access to vCenter for individual users to be controlled by the organization's IT Security group, or whatever group handles access controls such as this.
For example, a group called "vCenter Administrators" can be created in AD, and specific users added to that group. Then an existing vCenter administrator adds that group to the Global Permissions and assigns the "Administrator" Role to it. Once this is done, anyone added to the group in AD gets access, and anyone removed from the group loses access.
Specific roles can be defined in vCenter with or without certain permissions, and permissions can also be assigned to clusters, hosts, datastores, specific VMs, VM folders, just about any object or grouping in vCenter. However, because of this capability, the VMware administrator must remember that vCenter permissions applied to a "child" level object will always override the permissions applied at the parent level.
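The child-overrides-parent rule can be pictured as a walk up the inventory tree that stops at the first object carrying a permission for the user or group in question. The Python sketch below is a deliberately simplified model: it assumes a single parent per object and ignores propagation settings, user-versus-group precedence and Global Permissions. The object names are hypothetical.

```python
def effective_role(permissions, parents, obj, principal):
    """Return the role that applies to principal on obj, or None if there is none.

    permissions: dict mapping (object, principal) -> role name.
    parents: dict mapping each object to its parent (the root has no entry).
    The nearest object with a permission wins, so a child overrides its parent.
    """
    current = obj
    while current is not None:
        role = permissions.get((current, principal))
        if role is not None:
            return role
        current = parents.get(current)  # step up one level; None at the root
    return None


if __name__ == "__main__":
    # Hypothetical inventory path: vCenter -> datacenter -> VM folder -> VM.
    parents = {
        "WPDBHR01": "HR VMs",
        "HR VMs": "Datacenter01",
        "Datacenter01": "PRVC01",
    }
    permissions = {
        ("PRVC01", "vCenter Administrators"): "Administrator",
        ("HR VMs", "vCenter Administrators"): "Read-only",
    }
    group = "vCenter Administrators"
    print(effective_role(permissions, parents, "Datacenter01", group))  # Administrator
    print(effective_role(permissions, parents, "WPDBHR01", group))      # Read-only
```

The second result is the trap to watch for: a more restrictive role set on a folder quietly takes precedence over the Administrator role granted higher up.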
VMware vCenter has a variety of ways to see what events have taken place and who initiated or executed them. The browser-based GUI has a "Monitor" tab for just about every object (the vCenter itself, clusters, hosts, VMs, datastores, and so on). Under this tab, there are menu items that will display recent Tasks, Events and Triggered Alarms. These are all useful in troubleshooting, each in its own way. However, in a very large environment, browsing or searching any of these areas can be tedious and frustrating, especially if something has happened that you want to quickly find answers for. Fortunately, this information can be accessed in several other ways.
One option is to enable log events to be sent to a "syslog" server. Syslog is a service that has been around for a long time, and numerous free implementations of this are available. This makes it possible to use a variety of text search and manipulation tools to find what you are looking for more quickly.
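Once events are in plain text files, finding out who did what during a given window is a small scripting job. The Python sketch below filters lines by keyword and time range, assuming each line starts with an ISO 8601 timestamp followed by a space. The sample lines are invented for illustration and are not real ESXi or vCenter log output.

```python
from datetime import datetime, timezone


def search_log(lines, keyword, start, end):
    """Return the lines that contain keyword and fall between start and end.

    Each line is assumed to begin with an ISO 8601 timestamp and a space.
    Lines that do not start with a parsable timestamp are skipped.
    start and end must be timezone-aware datetimes.
    """
    needle = keyword.lower()
    hits = []
    for line in lines:
        timestamp_text, _, rest = line.partition(" ")
        try:
            # Older Python versions do not accept a trailing "Z", so rewrite it.
            when = datetime.fromisoformat(timestamp_text.replace("Z", "+00:00"))
        except ValueError:
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assume UTC if unspecified
        if start <= when <= end and needle in rest.lower():
            hits.append(line)
    return hits


if __name__ == "__main__":
    # Invented sample lines; real log formats will differ.
    sample = [
        "2021-03-04T01:58:10Z pesx01 Hostd: User jsmith powered off VM WPDBHR01",
        "2021-03-04T02:01:44Z pesx02 Vpxa: Datastore usage alarm triggered",
        "2021-03-04T09:15:00Z pesx01 Hostd: User adavis powered off VM WPAPFN02",
        "not a log line",
    ]
    window_start = datetime(2021, 3, 4, 1, 0, tzinfo=timezone.utc)
    window_end = datetime(2021, 3, 4, 3, 0, tzinfo=timezone.utc)
    for hit in search_log(sample, "powered off", window_start, window_end):
        print(hit)
```

The same job can of course be done with standard text tools such as grep; the point is that a syslog server turns an audit question into a text search.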
Another option is the relatively new VMware Skyline Collector Appliance and "Cloud Services", which involves the collector retrieving log data from the vCenter and ESXi hosts. This data is then uploaded to your VMware Cloud Services account, where it is analyzed to identify potential issues. The data is also immediately available to VMware Support engineers if a support case is opened for an issue. The Skyline service does require an active Production Support or Premier Services contract.
This aspect of vCenter is not as interesting or exciting as many other features, but it is worthwhile to spend some time and effort familiarizing yourself with the tools and options, and implementing a strategy that makes sense for your situation. When something triggers the need to audit activities or troubleshoot an issue, you will be glad you did.
Design and Best Practices for Large Environments
It was mentioned earlier that VMware maintains a compatibility matrix for interoperability of its own products and those of various other software vendors and hardware manufacturers. This interactive tool should be used as needed to make sure all the various components in use are supported, and will continue to be supported if something changes. For example, if you are planning to upgrade your ESXi hosts from version 6.5 to 6.7, it would be wise to make sure your host hardware is still supported by the newer 6.7 version.
Another useful interactive VMware website is the VMware Configuration Maximums website, which allows you to select the products you are interested in and see what the limitations for that product or feature are. An example of this would be the maximum number of CPUs you can assign to a single VM under ESXi 6.7.
How to Manage Hardware at Scale
So, what is all this VMware software actually running on? Even though many VMware administrators will never see or touch the actual hardware their ESXi hosts are running on, it is still important to know at least some basic information about the manufacturer, model, and capabilities of the hardware. While it's possible with some companies that there may be a separate group responsible for maintaining the health of the hardware, it's also quite possible that you may be responsible for that in addition to the VMware software.
Some of the common types of compute server hardware that are used as ESXi hosts are standalone rack-mounted "pizza box" servers; blade servers, which are smaller server units that slide into a rack-mounted "chassis" enclosure; and hyper-converged systems, which are designed with CPUs and memory, networking, storage, and management combined. You may well encounter an environment with a combination of these, which then requires a wider set of hardware knowledge.
Factors to consider when it comes to managing hardware like this efficiently are the tools that the vendor provides, how intuitive and well-designed these tools are, and whether they have some level of integration with vCenter. For example, if one of your ESXi hosts has a failed memory module, and that failure is recognized by the ESXi host in vCenter, then you will know about it and can take action based on your vCenter monitoring and Triggered Alarms. Even better, if the vendor has a method for the host to notify their support team of the problem (often termed a "call home" feature), they may know about it and have a support ticket open before you are even aware of it.
Managing a large VMware environment may be challenging, but it is also rewarding. VMware has had a role in the corporate data center long enough that most administrator teams should be "right-sized" for the environment they manage, so you will most likely be sharing the responsibilities with a group. Knowing the skills and capabilities of your co-workers and being able to collaborate to manage your VMware environment may be almost as important as having a high level of training and experience.