A graphics processing unit (GPU) performs rapid mathematical calculations to render graphics, and organizations often use GPUs to accelerate workloads, particularly AI and machine learning. Heavy processing tasks of this type often require multiple GPU chips, each with multiple cores, to do their work.
A GPU handles heavier mathematical workloads than a CPU by employing parallel processing, in which multiple processors work on different parts of the same task at the same time. It has its own RAM to store the data it processes.
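To make the idea of parallel processing concrete, the following minimal Python sketch splits one task (summing the squares of a list of numbers) into chunks and hands each chunk to a separate worker. It runs on the CPU and uses threads only to illustrate the divide-and-combine pattern; it is not GPU code, and in standard Python it won't run faster than the serial version. The function names and the choice of four workers are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work done by one worker on its own slice of the data."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Split data into roughly equal contiguous chunks."""
    size = max(1, (len(data) + parts - 1) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=4):
    """Give each worker one chunk, then combine the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


data = list(range(1_000))
serial_result = sum_of_squares(data)             # one worker, whole task
parallel_result = parallel_sum_of_squares(data)  # four workers, one chunk each
print(serial_result == parallel_result)  # True
```

A GPU applies the same pattern in hardware, spreading the chunks across far more cores than a CPU has.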
Implementing a GPU in your data center
You can implement a GPU in one of two ways. The aftermarket approach, which is the smaller-scale method, has you install the GPU subsystem as an upgrade to one or more existing servers in your data center. This approach is most popular with organizations that are early in the adoption cycle or still experimenting with GPUs. However, it creates significant additional power demands, which the existing servers’ power supplies might not have been sized to meet.
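Before you retrofit a server, a rough power budget can show whether its existing power supply is likely to cope. The sketch below is a back-of-the-envelope check only. The function name, the 20% reserve and every wattage figure are hypothetical values chosen for illustration, so substitute the numbers from your own hardware specifications.

```python
def power_headroom_watts(psu_capacity_w, current_draw_w,
                         gpu_board_power_w, gpu_count, reserve_percent=20):
    """Return the watts left over after adding GPUs (negative means a shortfall).

    All inputs are whole watts. reserve_percent is an assumed safety margin
    that keeps part of the power supply's capacity unused.
    """
    usable_w = psu_capacity_w * (100 - reserve_percent) // 100
    projected_w = current_draw_w + gpu_board_power_w * gpu_count
    return usable_w - projected_w


# Hypothetical server: 800 W supply, 450 W current draw, one 300 W GPU added.
retrofit = power_headroom_watts(800, 450, 300, 1)
print(retrofit)  # -110, so this supply would need an upgrade

# Hypothetical refreshed server: 1,600 W supply, two 300 W GPUs.
refresh = power_headroom_watts(1600, 450, 300, 2)
print(refresh)  # 230
```

A negative result suggests the retrofit needs a larger power supply, which is one reason servers bought with GPUs preinstalled can be simpler to deploy.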
The other approach is to include GPUs in your server refresh cycles. This enables you to buy or lease new servers that come with GPUs preinstalled along with suitable power supplies. Organizations that seek more extensive, permanent GPU deployments often prefer this approach.
Your approach to implementation depends on how you intend your infrastructure to consume GPU resources and what workloads you intend to run.
GPUs vs. CPUs
GPUs differ from other processing chips in a handful of ways.
The CPU, the GPU’s predecessor, uses logic circuitry to process commands sent to it by the OS and performs arithmetic, logic and I/O operations. It feeds data to specialized hardware and sends instructions to different parts of the compute system, functioning as the “brain” of the computer. The GPU was initially designed to complement the CPU by offloading complex graphics rendering tasks, leaving the CPU free to handle other compute workloads. One major difference between the two is that the GPU performs operations in parallel rather than serially.
The GPU and CPU are both critical for data analytics. The GPU’s ability to tackle heavy mathematical tasks such as matrix multiplication can speed up deep learning and data analysis, but it’s an ill fit for smaller tasks such as querying data in a database.
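Matrix multiplication suits a GPU because every cell of the result can be computed independently of the others. The plain Python sketch below, which assumes matrices stored as lists of rows, shows that structure. It is a CPU-only illustration of the arithmetic, not how a GPU library implements it, and the function name is an assumption.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    cols = len(b[0])
    # Each output cell depends only on one row of a and one column of b,
    # so no cell has to wait for any other cell to be computed.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul(a, b)
print(product)  # [[19, 22], [43, 50]]
```

Here the four output cells are computed one after another, but nothing in the math requires that order. A GPU can assign each cell to its own core, which is where the speedup for deep learning comes from.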
Rather than choosing between GPUs and CPUs for your data center, consider using them together. The chips were designed to complement one another, and combining them can greatly accelerate particularly heavy workloads, such as training machine learning models or performing specialized analytical tasks.
GPU virtualization
Virtualizing a GPU is not quite the same as virtualizing a CPU or RAM. GPU models differ from one another, which means designing, licensing and deploying virtual GPUs (vGPUs) requires a different approach.
You can run different models of GPU cards in the same VMware cluster, for example. However, all the GPU cards inside a single host must be the same model: hosts in the cluster can differ from one another, but each host can have only one model installed. You also need a license that enables drivers to access remote GPU functionality, and this license determines the features available to each vGPU. If your cluster contains multiple types of GPUs, you need an additional license to tie everything together.
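The one-model-per-host rule is easy to check against an inventory before you deploy. The following sketch assumes a hypothetical inventory format (a dictionary that maps each host name to the list of GPU models installed in it) with placeholder host and model names. It does not call any VMware or vendor API.

```python
def hosts_with_mixed_gpus(inventory):
    """Return the hosts that have more than one GPU model installed."""
    return sorted(
        host for host, cards in inventory.items() if len(set(cards)) > 1
    )


def cluster_gpu_models(inventory):
    """Return the set of distinct GPU models used across the cluster."""
    return {model for cards in inventory.values() for model in cards}


# Hypothetical inventory; host and model names are placeholders.
inventory = {
    "host-01": ["model-a", "model-a"],
    "host-02": ["model-b"],
    "host-03": ["model-a", "model-b"],  # breaks the one-model-per-host rule
}

invalid_hosts = hosts_with_mixed_gpus(inventory)
models = cluster_gpu_models(inventory)
print(invalid_hosts)    # ['host-03']
print(len(models) > 1)  # True: more than one GPU type in the cluster
```

In this example, host-03 would need to be standardized on one model, and the presence of two models across the cluster signals that the additional license applies.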
Before you deploy vGPUs in your infrastructure, also consider security, host hardware platforms, and power and cooling requirements.
Compare major GPU offerings
Nvidia and AMD offer the two most popular GPU products on the market. Nvidia’s GPUs handle a range of tasks in data centers, including training and running machine learning models. Organizations also use Nvidia GPUs to speed up calculations in supercomputing simulations. Nvidia has worked with its partner OmniSci to develop a platform that combines a GPU-accelerated database, rendering engine and visualization system for faster analytics results.
AMD’s GPUs, meanwhile, mainly target scientific computing workloads. Its GPU portfolio is split into two lines, one aimed at data center performance and the other at gaming. AMD offers an open software development platform called ROCm that enables developers to write and compile code for a variety of environments, and it supports common machine learning frameworks.
AMD has a slight advantage over Nvidia in terms of performance. However, Nvidia handles AI workloads better, includes more memory in its GPUs and has a more mature programming framework (CUDA).
AMD and Nvidia also take different approaches to delivering vGPUs. Nvidia’s approach requires you to install host drivers within the hypervisor, which then allocates vGPUs to guest VMs. AMD takes a fully hardware-based approach and allocates a portion of the GPU’s cores directly to each machine.