    Optimizing PyTorch Model Inference on AWS Graviton

    By Awais | December 11, 2025 | 11 Mins Read
    Running AI/ML models can be an extremely expensive endeavor. Many of our posts have focused on a wide variety of tips, tricks, and techniques for analyzing and optimizing the runtime performance of AI/ML workloads. Our argument has been twofold:

    1. Performance analysis and optimizations must be an integral process of every AI/ML development project, and,
    2. Reaching meaningful performance boosts and cost reduction does not require a high degree of specialization. Any AI/ML developer can do it. Every AI/ML developer should do it.

    In a previous post, we addressed the challenge of optimizing an ML inference workload on an Intel® Xeon® processor. We began by reviewing a number of scenarios in which a CPU might be the best choice for AI/ML inference, even in an era of multiple dedicated AI inference chips. We then introduced a toy image-classification PyTorch model and demonstrated a wide range of techniques for boosting its runtime performance on an Amazon EC2 c7i.xlarge instance, powered by 4th Generation Intel Xeon Scalable processors. In this post, we extend our discussion to AWS’s homegrown Arm-based Graviton CPUs. We will revisit many of the optimizations we discussed in our previous posts (some of which require adaptation to the Arm processor) and assess their impact on the same toy model. Given the profound differences between the Arm and Intel processors, the path to the best-performing configuration may look quite different.

    AWS Graviton

    AWS Graviton is a family of processors based on Arm Neoverse CPUs, custom designed and built by AWS for optimal price-performance and energy efficiency. Their dedicated engines for vector processing (NEON and SVE/SVE2) and matrix multiplication (MMLA), along with their support for Bfloat16 operations (as of Graviton3), make them a compelling candidate for running compute-intensive workloads such as AI/ML inference. To facilitate high-performance AI/ML on Graviton, the entire software stack has been optimized accordingly:

    • Low-Level Compute Kernels from the Arm Compute Library (ACL) are highly optimized to leverage the Graviton hardware accelerators (e.g., SVE and MMLA).
    • ML Middleware Libraries such as oneDNN and OpenBLAS route deep learning and linear algebra operations to the specialized ACL kernels.
    • AI/ML Frameworks like PyTorch and TensorFlow are compiled and configured to use these optimized backends.
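    Whether a particular PyTorch wheel was actually built with these backends can be checked from Python. The following is a quick sanity check, not an official diagnostic; the exact flags printed depend on the build:

```python
import torch

# The oneDNN (mkldnn) backend is the layer that dispatches deep learning
# operations to optimized kernels (ACL kernels on aarch64 builds of PyTorch).
print("oneDNN available:", torch.backends.mkldnn.is_available())

# The full build configuration string lists the compiled-in backends
# (look for USE_MKLDNN and, on Arm builds, ACL-related entries).
print(torch.__config__.show())
```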

    In this post we will use an Amazon EC2 c8g.xlarge instance, powered by four AWS Graviton4 vCPUs, and an AWS ARM64 PyTorch Deep Learning AMI (DLAMI).

    The intention of this post is to demonstrate tips for boosting performance on an AWS Graviton instance. Importantly, our intention is not to draw a comparison between AWS Graviton and alternative chips, nor to advocate for the use of one chip over another. The best choice of processor depends on many considerations beyond the scope of this post. One of the most important is the maximum runtime performance of your model on each chip; in other words, how much “bang” can we get for our buck? Thus, making an informed decision about the best processor is one of the motivations for optimizing runtime performance on each one.

    Another motivation for optimizing our model’s performance on multiple inference devices is to increase its portability. The playing field of AI/ML is extremely dynamic, and resilience to changing circumstances is crucial for success. It is not uncommon for compute instances of certain types to suddenly become unavailable or scarce. Conversely, an increase in the capacity of AWS Graviton instances could make them available at steep discounts, e.g., in the Amazon EC2 Spot Instance market, presenting cost-saving opportunities that you would not want to miss.

    Disclaimers

    The code blocks we will share, the optimization steps we will discuss, and the results we will reach are intended as an example of the benefits you may see from ML performance optimization on an AWS Graviton instance. They may differ considerably from the results you might see with your own model and runtime environment. Please do not rely on the accuracy or optimality of the contents of this post, and please do not interpret the mention of any library, framework, or platform as an endorsement of its use.

    Inference Optimization on AWS Graviton

    As in our previous post, we will demonstrate the optimization steps on a toy image classification model:

    import torch, torchvision
    import time
    
    
    def get_model(channels_last=False, compile=False):
        model = torchvision.models.resnet50()
    
        if channels_last:
        model = model.to(memory_format=torch.channels_last)
    
        model = model.eval()
    
        if compile:
            model = torch.compile(model)
    
        return model
    
    def get_input(batch_size, channels_last=False):
        batch = torch.randn(batch_size, 3, 224, 224)
        if channels_last:
            batch = batch.to(memory_format=torch.channels_last)
        return batch
    
    def get_inference_fn(model, enable_amp=False):
        def infer_fn(batch):
            with torch.inference_mode(), torch.amp.autocast(
                    'cpu',
                    dtype=torch.bfloat16,
                    enabled=enable_amp
            ):
                output = model(batch)
            return output
        return infer_fn
    
    def benchmark(infer_fn, batch):
        # warm-up
        for _ in range(20):
            _ = infer_fn(batch)
    
        iters = 100
    
        start = time.time()
        for _ in range(iters):
            _ = infer_fn(batch)
        end = time.time()
    
        return (end - start) / iters
    
    
    batch_size = 1
    model = get_model()
    batch = get_input(batch_size)
    infer_fn = get_inference_fn(model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The initial throughput is 12 samples per second (SPS).

    Upgrade to the Most Recent PyTorch Release

    Whereas the version of PyTorch in our DLAMI is 2.8, the latest version of PyTorch, at the time of this writing, is 2.9. Given the rapid pace of development in the field of AI/ML, it is highly recommended to use the most up-to-date library packages. As our first step, we upgrade to PyTorch 2.9 which includes key updates to its Arm backend.

    pip3 install -U torch torchvision --index-url https://download.pytorch.org/whl/cpu

    In the case of our model in its initial configuration, upgrading the PyTorch version does not have any effect. However, this step is crucial for getting the most out of the optimization techniques that we will assess.

    Batched Inference

    To amortize kernel-launch overheads and increase the utilization of the hardware accelerators, we group samples together and apply batched inference. The table below demonstrates how model throughput varies as a function of batch size:

    Inference Throughput for Varying Batch Sizes (by Author)

    Memory Optimizations

    We apply a number of techniques from our previous post for optimizing memory allocation and usage. These include the channels-last memory format, automatic mixed precision with the bfloat16 data type (supported as of Graviton3), the TCMalloc allocation library, and huge-page allocation. Please see our previous post for details. We also enable the fast math mode of the ACL GEMM kernels and caching of the kernel primitives, two optimizations that appear in the official guidelines for running PyTorch inference on Graviton.

    The command line instructions required to enable these optimizations are shown below:

    # install TCMalloc
    sudo apt-get install google-perftools
    
    # Program the use of TCMalloc
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4
    
    # Enable huge page memory allocation
    export THP_MEM_ALLOC_ENABLE=1
    
    # Enable the fast math mode of the GEMM kernels
    export DNNL_DEFAULT_FPMATH_MODE=BF16
    
    # Set LRU Cache capacity to cache the kernel primitives
    export LRU_CACHE_CAPACITY=1024

    The following table captures the impact of the memory optimizations, applied successively:

    ResNet-50 Memory Optimization Results (by Author)

    In the case of our toy model, the channels-last and bfloat16-mixed precision optimizations had the greatest impact. After applying all of the memory optimizations, the average throughput is 53.03 SPS.

    Model Compilation

    The support of PyTorch compilation for AWS Graviton is an area of focused effort for the AWS Graviton team. However, in the case of our toy model, it results in a slight reduction in throughput, from 53.03 SPS to 52.23 SPS.

    Multi-Worker Inference

    While typically applied in settings with many more than four vCPUs, we demonstrate the implementation of multi-worker inference by modifying our script to support core pinning:

    if __name__ == '__main__':
        # pin CPUs according to worker rank
        import os, psutil
        rank = int(os.environ.get('RANK','0'))
        world_size = int(os.environ.get('WORLD_SIZE','1'))
        cores = list(range(psutil.cpu_count(logical=True)))
        num_cores = len(cores)
        cores_per_process = num_cores // world_size
        start_index = rank * cores_per_process
        end_index = (rank + 1) * cores_per_process
        pid = os.getpid()
        p = psutil.Process(pid)
        p.cpu_affinity(cores[start_index:end_index])
    
        batch_size = 8
        model = get_model(channels_last=True)
        batch = get_input(batch_size, channels_last=True)
        infer_fn = get_inference_fn(model, enable_amp=True)
        avg_time = benchmark(infer_fn, batch)
        print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    We note that, unlike many other Amazon EC2 CPU instance types, each Graviton vCPU maps directly to a single physical CPU core. We use the torchrun utility to start four workers, each running on a single CPU core:

    export OMP_NUM_THREADS=1 # set one OpenMP thread per worker
    torchrun --nproc_per_node=4 main.py
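    The no-SMT property can be checked directly with psutil, which the pinning script above already depends on: on an instance without simultaneous multithreading the logical and physical core counts coincide, whereas on an SMT-enabled x86 instance the logical count is typically double the physical count.

```python
import psutil

logical = psutil.cpu_count(logical=True)
physical = psutil.cpu_count(logical=False)  # may be None in some containers
# On Graviton instances these two numbers are expected to match (no SMT).
print(f"logical cores: {logical}, physical cores: {physical}")
```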

    This results in a throughput of 55.15 SPS, a 4% improvement over our previous best result.

    INT8 Quantization for Arm

    Another area of active development and continuous improvement on Arm is INT8 quantization. INT8 quantization tools are typically heavily tied to the target architecture. In our previous post we demonstrated PyTorch 2 Export Quantization with the X86 backend through Inductor, using the TorchAO (0.12.1) library. Fortunately, recent versions of TorchAO include a dedicated quantizer for Arm. The updated quantization sequence is shown below. As in our previous post, we are interested only in the potential performance impact. In practice, INT8 quantization can have a significant impact on the quality of the model and may necessitate a more sophisticated quantization strategy.

    from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torchao.quantization.pt2e.quantizer.arm_inductor_quantizer as aiq
    
    def quantize_model(model):
        x = torch.randn(4, 3, 224, 224).contiguous(
                                memory_format=torch.channels_last)
        example_inputs = (x,)
        batch_dim = torch.export.Dim("batch")
        with torch.no_grad():
            exported_model = torch.export.export(
                model,
                example_inputs,
                dynamic_shapes=((batch_dim,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC),
                                )
            ).module()
        quantizer = aiq.ArmInductorQuantizer()
        quantizer.set_global(aiq.get_default_arm_inductor_quantization_config())
        prepared_model = prepare_pt2e(exported_model, quantizer)
        prepared_model(*example_inputs)
        converted_model = convert_pt2e(prepared_model)
        optimized_model = torch.compile(converted_model)
        return optimized_model
    
    
    batch_size = 8
    model = get_model(channels_last=True)
    model = quantize_model(model)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The resultant throughput is 56.77 SPS for a 7.1% improvement over the bfloat16 solution.

    AOT Compilation Using ONNX and OpenVINO

    In our previous post, we explored ahead-of-time (AOT) model compilation techniques using Open Neural Network Exchange (ONNX) and OpenVINO. Both libraries include dedicated support for running on AWS Graviton. The experiments in this section require the following library installations:

    pip install onnxruntime onnxscript openvino nncf

    The following code block demonstrates the model compilation and execution on Arm using ONNX:

    def export_to_onnx(model, onnx_path="resnet50.onnx"):
        dummy_input = torch.randn(4, 3, 224, 224)
        batch = torch.export.Dim("batch")
        torch.onnx.export(
            model,
            dummy_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            dynamic_shapes=((batch,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            ),
            dynamo=True
        )
        return onnx_path
    
    def onnx_infer_fn(onnx_path):
        import onnxruntime as ort
    
        # the session options must be configured before the session is
        # created, and passed to the InferenceSession constructor
        sess_options = ort.SessionOptions()
        sess_options.add_session_config_entry(
            "mlas.enable_gemm_fastmath_arm64_bfloat16", "1")
        sess = ort.InferenceSession(
            onnx_path,
            sess_options,
            providers=["CPUExecutionProvider"]
        )
        input_name = sess.get_inputs()[0].name
    
        def infer_fn(batch):
            result = sess.run(None, {input_name: batch})
            return result
        return infer_fn
    
    batch_size = 8
    model = get_model()
    onnx_path = export_to_onnx(model)
    batch = get_input(batch_size).numpy()
    infer_fn = onnx_infer_fn(onnx_path)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    It should be noted that ONNX Runtime supports a dedicated ACL execution provider for running on Arm, but this requires a custom ONNX Runtime build (as of the time of this writing), which is beyond the scope of this post.

    Alternatively, we can compile the model using OpenVINO. The code block below demonstrates its use, including an option for INT8 quantization using NNCF:

    import openvino as ov
    import nncf
    
    def openvino_infer_fn(compiled_model):
        def infer_fn(batch):
            result = compiled_model([batch])[0]
            return result
        return infer_fn
    
    class RandomDataset(torch.utils.data.Dataset):
        def __len__(self):
            return 10000
    
        def __getitem__(self, idx):
            return torch.randn(3, 224, 224)
    
    quantize_model = False
    batch_size = 8
    model = get_model()
    calibration_loader = torch.utils.data.DataLoader(RandomDataset())
    calibration_dataset = nncf.Dataset(calibration_loader)
    
    if quantize_model:
        # quantize PyTorch model
        model = nncf.quantize(model, calibration_dataset)
    
    ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
    ovm = ov.compile_model(ovm)
    batch = get_input(batch_size).numpy()
    infer_fn = openvino_infer_fn(ovm)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    In the case of our toy model, OpenVINO compilation results in an additional boost of the throughput to 63.48 SPS, but the NNCF quantization disappoints, resulting in just 55.18 SPS.

    Results

    The results of our experiments are summarized in the table below:

    ResNet50 Inference Optimization Results (by Author)

    As in our previous post, we reran our experiments on a second model, a Vision Transformer (ViT) from the timm library, to demonstrate how the impact of the runtime optimizations we discussed can vary based on the details of the model. The results are captured below:

    ViT Inference Optimization Results (by Author)

    Summary

    In this post, we reviewed a number of relatively simple optimization techniques and applied them to two toy PyTorch models. As the results demonstrated, the impact of each optimization step can vary greatly based on the details of the model, and the journey toward peak performance can take many different paths. The steps we presented in this post were just an appetizer; there are undoubtedly many more optimizations that can unlock even greater performance.

    Along the way, we noted the many AI/ML libraries that have introduced deep support for the Graviton architecture, and the ongoing community effort to optimize it further. The performance gains we achieved, combined with this apparent dedication, suggest that AWS Graviton is firmly in the “big leagues” when it comes to running compute-intensive AI/ML workloads.
