
Acceleration technologies that will boost HPC and AI efforts

  • Rob Farber

Experts believe we are entering the 5th epoch of distributed computing, in which the heterogeneous design of modern systems has been driven by numerous technology advances in accelerators, next-generation lithography manufacturing, chiplets, and packaging technology. This 5th and final article in the series discusses the impact that current and future acceleration technology will have on HPC and AI.

The most visible AI workloads at the moment are the now-ubiquitous Large Language Models (LLMs). Less visible, but foundational, are the accelerators used to ensure the security of our cloud and on-premises datacenters and those that perform more mundane activities such as data movement. This move to accelerators to reduce or eliminate bottlenecks for common operations is the day of reckoning foreseen by Gordon Moore (of Moore’s law): a time when we will need to build larger systems out of smaller functions, combining heterogeneous and customized solutions.

Software addresses exponential support issues and avoids vendor lock-in

Software is the key to utilizing these rapidly evolving accelerator technologies, many of which are understandably based on building blocks that accelerate AI-based workloads given their commercial viability.

The size and capabilities of these accelerators vary widely, from dedicated on-package accelerators focused on security to general-purpose accelerators like GPUs. Competition is our performance friend given the cornucopia of vendor-specific accelerators that are now available. It has also forced the HPC community to come together, driven by a common need to address the exponentially hard problem of application support for ubiquitous multiarchitecture and multivendor accelerator deployments in datacenters and the cloud. The breadth of HPC deployments and the diversity of workloads are simply too great. No single company, however large, can meet all customer needs, nor can bespoke software customizations performed by humans. Instead, community development efforts create software ecosystems that support platform portability. Extensive, multi-year efforts such as the DOE-funded Exascale Computing Project (ECP) and the oneAPI software ecosystem are two current practical solutions that support existing (and likely future) heterogeneous devices through standards-based libraries and languages. The efficacy of these efforts can be assessed by looking at what works (and does not work) for the leaders in relevant workload domains. This is the way to stay on top of the application performance curve and avoid vendor lock-in.
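To make the portability argument concrete, here is a minimal, illustrative sketch (not taken from the ECP or oneAPI codebases) of device-agnostic PyTorch: the same application code selects whichever accelerator backend is present, whether an NVIDIA GPU, an Intel GPU exposed through the torch.xpu backend, or the CPU, rather than hard-coding a single vendor. The oneAPI and E4S stacks pursue the same goal at the library and language level (for example, via SYCL), which this Python snippet does not attempt to show.

import torch

def pick_device() -> torch.device:
    # Prefer whatever accelerator backend this build of PyTorch exposes.
    if torch.cuda.is_available():                            # NVIDIA (or ROCm) GPUs
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():   # Intel GPUs (Intel Extension for PyTorch / torch.xpu)
        return torch.device("xpu")
    return torch.device("cpu")                               # portable fallback

device = pick_device()
x = torch.randn(1024, 1024, device=device)
y = x @ x                                                    # same code path on every vendor's hardware
print(y.device)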

Bigger is better at the moment in AI, as domain leaders are using both NVIDIA and Intel hardware to train trillion-parameter LLMs. These trillion-parameter efforts reflect a high-water mark for large AI models. The monumental Argonne National Laboratory ScienceGPT effort, for example, is backed by Intel and the US government. It also reflects the amazing power of exascale supercomputing, as this training is currently using a small subset of the Intel CPU Max and GPU Max Series-powered Aurora supercomputer nodes (testing started with 64 nodes and continues using only 256 of the more than 9,000 Aurora nodes that will eventually be installed). The ScienceGPT project combines text, code, specific scientific results, and papers into a model that will be used to speed scientific research.

HBM and reduced-precision arithmetic can make CPUs the preferred platform for both HPC and AI workloads

Such large runs make headlines, but in practice, massive investments are not necessary to train many LLMs.

It is important to recognize that AI workloads, particularly LLM workloads, tend to be memory-bandwidth limited. Results show that High Bandwidth Memory (HBM) can make CPUs the platform of choice for many AI and HPC workloads. HBM is not an “accelerator” device per se, but it can be an important workload accelerator because it helps keep the accelerators and processor cores busy and thus significantly speeds many workloads.[1] [2] [3] Similarly, hardware-accelerated reduced-precision arithmetic operations can increase both computational performance and effective memory bandwidth. Examples include the Intel® Advanced Matrix Extensions (Intel AMX) instructions in the latest 4th Gen Intel Xeon processors and the Intel Xᵉ Matrix Extensions (Intel XMX) on Intel Data Center GPU Max Series and Intel Data Center GPU Flex Series GPUs.
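As a concrete illustration, the sketch below shows the common way to exercise these reduced-precision paths from PyTorch: run a model under a bfloat16 autocast region on a 4th Gen Xeon so that matrix multiplies can dispatch to Intel AMX while moving half as many bytes per element as fp32. It assumes Intel Extension for PyTorch is installed; the two-layer model is a toy stand-in, not a benchmark.

import torch
import intel_extension_for_pytorch as ipex   # assumed installed; provides CPU optimizations

# Toy model standing in for a real network.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).eval()
model = ipex.optimize(model, dtype=torch.bfloat16)   # prepare weights for bf16 execution

x = torch.randn(64, 1024)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)          # matmuls can use AMX tiles; memory traffic per element is halved vs. fp32
print(y.dtype)            # torch.bfloat16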

Use cases and representative workload results show CPUs can be fast enough for many workloads, including LLMs. A single two-socket Intel Xeon Platinum 8480+ node, for example, can train a Bidirectional Encoder Representations from Transformers (BERT) language model in 20 minutes and outperform an NVIDIA A100 GPU on some fine-tuning workloads. In part 3 of “Tuning and Inference for Generative AI with 4th Generation Intel Xeon Processors,” published results showed that AWS customers can use Intel Xeon processors to tune small- to medium-sized LLMs for their specific use cases. The 7-billion-parameter Falcon-7B Large Language Model is used as an example. Similar results have been reported for other PyTorch transformer workloads.
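The sketch below indicates what such CPU-based tuning looks like with the Hugging Face Trainer. It is illustrative only: a tiny public model and a two-sentence dummy dataset stand in for Falcon-7B and a real corpus, and the bf16 and use_cpu arguments assume a recent transformers and PyTorch release.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "sshleifer/tiny-gpt2"                 # tiny stand-in for Falcon-7B
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token                    # GPT-2-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

texts = ["HBM keeps the compute units fed.", "Reduced precision raises effective bandwidth."]
enc = tok(texts, padding=True, return_tensors="pt")
train_data = [{"input_ids": ids, "attention_mask": mask, "labels": ids}
              for ids, mask in zip(enc["input_ids"], enc["attention_mask"])]

args = TrainingArguments(output_dir="tuned-model", per_device_train_batch_size=2,
                         num_train_epochs=1, bf16=True, use_cpu=True)
Trainer(model=model, args=args, train_dataset=train_data).train()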

Accelerating inference

Many are discovering that large-parameter inference workloads can also be challenging. This is where the large-memory and reduced-precision capabilities of CPUs can help cloud and on-premises AI users meet their desired latency goals, even when using models containing billions of parameters. Unified interfaces are important in supporting these workloads, as illustrated by the Hugging Face use of the Intel Neural Compressor. Of course, the benefits of these AI building blocks, along with HBM, can also speed traditional, non-AI HPC workloads.
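For example, the following sketch uses Intel Neural Compressor for post-training dynamic int8 quantization of a small Hugging Face model, which shrinks weight memory traffic and typically lowers CPU inference latency. The model name and save path are illustrative assumptions, and the call shown is the Neural Compressor 2.x fit() interface.

from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit
from transformers import AutoModelForSequenceClassification

# Small public model used purely for illustration.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english")

# Dynamic quantization needs no calibration data: weights become int8,
# activations are quantized on the fly at inference time.
q_model = fit(model=model, conf=PostTrainingQuantConfig(approach="dynamic"))
q_model.save("./quantized_model")    # assumed output directory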

New, power-efficient accelerators such as the Intel NPU (Neural Processing Unit) in the Intel Core Ultra (aka “Meteor Lake”) CPUs can help bring many of the benefits of these AI-assisted simulations to researchers’ desktop and laptop devices. Time will tell, but local processing offers many advantages, including lowering the latency of inference operations and enabling fat-client, thin-server Internet AI capabilities. Local processing can also provide better privacy and security.
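One plausible path to the NPU today is OpenVINO, which is not discussed above but exposes the device through a simple device-selection API. The sketch below is an assumption-laden illustration: the model file is hypothetical, and NPU availability depends on the driver and OpenVINO release installed on the system.

import openvino as ov

core = ov.Core()
print(core.available_devices)            # e.g. ['CPU', 'GPU', 'NPU'] on a Core Ultra system

model = core.read_model("model.xml")     # hypothetical OpenVINO IR file
compiled = core.compile_model(model, device_name="NPU")   # fall back to "CPU" if no NPU is present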

Specialized accelerators enrich general-purpose devices

Specialized accelerators such as the Intel Gaudi2 AI accelerator also provide AI-specific training and inference performance. One example is the use of eight of these AI accelerator cards to run inference workloads using the 176-billion-parameter Hugging Face BLOOM model, among others.
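A minimal sketch of what targeting Gaudi from PyTorch looks like follows. It assumes the Habana/Intel Gaudi PyTorch bridge is installed and uses a small public model as a stand-in, since the eight-card BLOOM-176B run mentioned above also requires model parallelism that this single-device snippet does not show.

import torch
import habana_frameworks.torch.core as htcore   # assumed installed; registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"               # small stand-in for BLOOM-176B
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("hpu")

inputs = tok("Deep learning is", return_tensors="pt").to("hpu")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))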

Intel is working to roll the capabilities of specialized AI accelerators such as Gaudi2 into general-purpose accelerators. For example, Intel announced that the Intel Data Center GPU Max Series and Gaudi AI chip road maps will converge, starting with a next-generation product code-named Falcon Shores.

Looking beyond CPUs and GPUs

All the accelerators discussed thus far utilize conventional von Neumann hardware and neural network models. Research is proceeding on remarkable new technologies such as neuromorphic, quantum, and other devices to understand how they might impact the future of HPC and AI.

Neuromorphic computing

New non-von Neumann approaches such as neuromorphic computing promise to deliver high-accuracy AI solutions while consuming orders of magnitude less power. Examples in the literature demonstrate the extraordinary efficacy of neuromorphic computing, which can match the accuracy of traditional neural networks on vision problems with orders of magnitude greater power efficiency than current CPUs and GPUs. The SpikeGPT project reflects a current effort to apply these spiking neural network models to large language models.
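To ground the terminology, here is a minimal, illustrative leaky integrate-and-fire (LIF) neuron, the basic element of the spiking networks that neuromorphic hardware executes; the parameters and random input current are arbitrary assumptions, not SpikeGPT's model.

import numpy as np

def lif_spikes(current, threshold=1.0, leak=0.9, v_reset=0.0):
    # Integrate the input current over time; emit a spike (1) whenever the
    # membrane potential crosses the threshold, then reset the potential.
    v, spikes = v_reset, []
    for i in current:
        v = leak * v + i             # leaky integration of the input
        if v >= threshold:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes

rng = np.random.default_rng(0)
print(lif_spikes(rng.uniform(0.0, 0.4, size=20)))   # sparse 0/1 spike train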

Neuromorphic hardware continues to advance. Intel’s Loihi project provides one example of a neuromorphic research processor that is being used to advance the state of the art in accelerated AI performance and power efficiency. Loihi supports a broad range of spiking neural networks and can run at sufficient scale, with the performance and features needed to deliver competitive results compared to state-of-the-art contemporary computing architectures. As AI-augmented science and commercial applications advance, such extraordinarily power-efficient devices become ever more attractive from a cost, performance, and global climate perspective, both for local and datacenter processing.

The recently announced Intel Hala Point neuromorphic system, which utilizes Intel Loihi 2 processors, is a concrete instantiation of this progress. Hala Point is Intel’s first large-scale neuromorphic system and is being used to demonstrate state-of-the-art computational efficiencies on mainstream AI workloads. Characterization shows Hala Point can support up to 20 quadrillion operations per second, or 20 petaops, with an efficiency exceeding 15 trillion 8-bit operations per second per watt (TOPS/W) when executing conventional deep neural networks. This rivals and exceeds levels achieved by architectures built on graphics processing units (GPUs) and central processing units (CPUs). Hala Point’s unique capabilities could enable future real-time continuous learning for AI applications such as scientific and engineering problem-solving, logistics, smart city infrastructure management, large language models (LLMs), AI agents, and more.

Quantum computing

Quantum computing promises a game-changing new computing capability. The paper “Local minima in quantum systems,” for example, discusses why finding local minima in quantum systems is easy for quantum computers yet hard for conventional computers.

This is a field where researchers continue to realize foundational milestone achievements. Intel Labs, for example, is involved in research collaborations to demonstrate practical solutions using quantum technology. In collaboration with industry and academic partners, a team successfully demonstrated the supervised training of very small 2-to-1 bit neural networks using non-linear activation functions on actual quantum hardware. Such milestones represent significant progress, but while the field of quantum computing is rapidly advancing, practical solutions remain tantalizingly out of reach.
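As a purely illustrative aside (not the collaboration's actual model), the sketch below builds the kind of tiny parameterized circuit one might use as a two-input, one-output "quantum neuron" and reads out the output qubit's probability of measuring 1; the gate choices and angles are assumptions.

import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def quantum_neuron(theta0, theta1):
    # Encode the two inputs as rotation angles, couple them to an output qubit,
    # and return the probability that the output qubit reads 1.
    qc = QuantumCircuit(3)
    qc.ry(theta0, 0)          # encode input 0
    qc.ry(theta1, 1)          # encode input 1
    qc.cx(0, 2)               # couple input qubits to the output qubit
    qc.cx(1, 2)
    qc.ry(np.pi / 4, 2)       # a "weight" rotation, fixed here for illustration
    probs = Statevector.from_instruction(qc).probabilities([2])
    return float(probs[1])

print(quantum_neuron(np.pi / 2, np.pi / 3))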

Summary

The 5th epoch of computing identified by industry experts and foretold by Gordon Moore long ago clearly provides many benefits for HPC and AI workloads, but only when the data security and accessibility infrastructure are able to support user needs. Accelerators clearly are the future, which makes it easy to predict that standards-based, community software development ecosystems like oneAPI and the ECP Extreme-scale Scientific Software Stack (E4S) will become the portable infrastructure through which the scientific computing community accesses accelerated capabilities. Otherwise, the combinatorial support problem becomes intractable unless one is willing to accept vendor lock-in. Such community-developed infrastructure is necessary given the breadth of new computing models and hardware that are in the works and approaching widespread use. [4] [5] [6] [7] [8] [9] [10]

Learn more about how AI-accelerated HPC will impact the future of supercomputing through the previous articles in this series. (Technology investment guidelines are provided in article 3.)

For workload-specific information, look to the leaders in your area(s) of interest to see how community software development efforts and the use of standards-based libraries and languages can meet current and future computational needs. The most general-purpose accelerated software ecosystems at this time are oneAPI and E4S.

  • The Argonne Leadership Computing Facility AI testbed is a good information resource about the capabilities of the next-generation AI accelerators.
  • Work on the current generation of Department of Energy exascale supercomputers provides information about the leading edge and explorations into what is possible in both HPC and AI.
  • For hands-on testing, download and start working with the E4S software and oneAPI ecosystem.
    • Many cloud providers offer access to new accelerated platforms. Look to your favorite cloud service providers.
    • HPC groups can contact E4S to gain access to the Frank cluster. This cluster is used for verification of the E4S software and can provide test access to recent accelerator hardware not covered under NDA.

Rob Farber is a global technology consultant and author with an extensive background in HPC and machine learning technology.


[1] https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html

[2] https://www.intel.com/content/www/us/en/products/docs/processors/max-series/overview.html

[3] https://www.datasciencecentral.com/internal-cpu-accelerators-and-hbm-enable-faster-and-smarter-hpc-and-ai-applications/

[4] https://arxiv.org/abs/2201.00967

[5] https://cs.lbl.gov/news-media/news/2023/amrex-a-performance-portable-framework-for-block-structured-adaptive-mesh-refinement-applications/

[6] https://www.exascaleproject.org/collaborative-community-impacts-high-performance-computing-programming-environments/

[7] https://www.exascaleproject.org/highlight/exaworks-provides-access-to-community-sustained-hardened-and-tested-components-to-create-award-winning-hpc-workflows/

[8] https://www.exascaleproject.org/e4s-deployments-boost-industrys-acceptance-and-use-of-accelerators/

[9] https://www.exascaleproject.org/harnessing-the-power-of-exascale-software-for-faster-and-more-accurate-warnings-of-dangerous-weather-conditions/

[10] https://journals.ametsoc.org/view/journals/bams/102/10/BAMS-D-21-0030.1.xml                                                                                                                                        

This article was produced as part of Intel’s editorial program, with the goal of highlighting cutting-edge science, research and innovation driven by the HPC and AI communities through advanced technology. The publisher of the content has final editing rights and determines what articles are published.