    Optimizing PyTorch Model Inference on AWS Graviton

    By Awais | December 11, 2025 | 11 Mins Read
    Running AI/ML models can be an extremely expensive endeavor. Many of our posts have focused on a wide variety of tips, tricks, and techniques for analyzing and optimizing the runtime performance of AI/ML workloads. Our argument has been twofold:

    1. Performance analysis and optimizations must be an integral process of every AI/ML development project, and,
    2. Reaching meaningful performance boosts and cost reduction does not require a high degree of specialization. Any AI/ML developer can do it. Every AI/ML developer should do it.

    In a previous post, we addressed the challenge of optimizing an ML inference workload on an Intel® Xeon® processor. We began by reviewing a number of scenarios in which a CPU might be the best choice for AI/ML inference, even in an era of multiple dedicated AI inference chips. We then introduced a toy image-classification PyTorch model and demonstrated a wide range of techniques for boosting its runtime performance on an Amazon EC2 c7i.xlarge instance, powered by 4th Generation Intel Xeon Scalable processors. In this post, we extend our discussion to AWS’s homegrown Arm-based Graviton CPUs. We will revisit many of the optimizations we discussed in our previous posts (some of which require adaptation to the Arm processor) and assess their impact on the same toy model. Given the profound differences between the Arm and Intel processors, the path to the best-performing configuration may look quite different.

    AWS Graviton

    AWS Graviton is a family of processors based on Arm Neoverse CPUs, custom designed and built by AWS for optimal price-performance and energy efficiency. Their dedicated engines for vector processing (NEON and SVE/SVE2) and matrix multiplication (MMLA), along with their support for Bfloat16 operations (as of Graviton3), make them a compelling candidate for running compute-intensive workloads such as AI/ML inference. To facilitate high-performance AI/ML on Graviton, the entire software stack has been optimized accordingly:

    • Low-Level Compute Kernels from the Arm Compute Library (ACL) are highly optimized to leverage the Graviton hardware accelerators (e.g., SVE and MMLA).
    • ML Middleware Libraries such as oneDNN and OpenBLAS route deep learning and linear algebra operations to the specialized ACL kernels.
    • AI/ML Frameworks like PyTorch and TensorFlow are compiled and configured to use these optimized backends.
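    Whether a particular PyTorch wheel was actually built with these backends can be checked from Python. The following is a quick sanity check, not an official diagnostic; the exact flags printed depend on the build:

```python
import torch

# The oneDNN (mkldnn) backend is the layer that dispatches deep learning
# operations to optimized kernels (ACL kernels on aarch64 builds of PyTorch).
print("oneDNN available:", torch.backends.mkldnn.is_available())

# The full build configuration string lists the compiled-in backends
# (look for USE_MKLDNN and, on Arm builds, ACL-related entries).
print(torch.__config__.show())
```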

    In this post we will use an Amazon EC2 c8g.xlarge instance, powered by four AWS Graviton4 vCPUs, and an AWS ARM64 PyTorch Deep Learning AMI (DLAMI).

    The intention of this post is to demonstrate tips for boosting performance on an AWS Graviton instance. Importantly, our intention is not to draw a comparison between AWS Graviton and alternative chips, nor to advocate for the use of one chip over another. The best choice of processor depends on many considerations beyond the scope of this post. One of the most important is the maximum runtime performance of your model on each chip; in other words, how much “bang” can we get for our buck? Thus, making an informed decision about the best processor is one of the motivations for optimizing runtime performance on each one.

    Another motivation for optimizing our model’s performance on multiple inference devices is to increase its portability. The playing field of AI/ML is extremely dynamic, and resilience to changing circumstances is crucial for success. It is not uncommon for compute instances of certain types to suddenly become unavailable or scarce. Conversely, an increase in the capacity of AWS Graviton instances could make them available at steep discounts, e.g., in the Amazon EC2 Spot Instance market, presenting cost-saving opportunities that you would not want to miss.

    Disclaimers

    The code blocks we will share, the optimization steps we will discuss, and the results we will reach are intended as an example of the benefits you may see from ML performance optimization on an AWS Graviton instance. They may differ considerably from the results you might see with your own model and runtime environment. Please do not rely on the accuracy or optimality of the contents of this post, and please do not interpret the mention of any library, framework, or platform as an endorsement of its use.

    Inference Optimization on AWS Graviton

    As in our previous post, we will demonstrate the optimization steps on a toy image classification model:

    import torch, torchvision
    import time
    
    
    def get_model(channels_last=False, compile=False):
        model = torchvision.models.resnet50()
    
        if channels_last:
        model = model.to(memory_format=torch.channels_last)
    
        model = model.eval()
    
        if compile:
            model = torch.compile(model)
    
        return model
    
    def get_input(batch_size, channels_last=False):
        batch = torch.randn(batch_size, 3, 224, 224)
        if channels_last:
            batch = batch.to(memory_format=torch.channels_last)
        return batch
    
    def get_inference_fn(model, enable_amp=False):
        def infer_fn(batch):
            with torch.inference_mode(), torch.amp.autocast(
                    'cpu',
                    dtype=torch.bfloat16,
                    enabled=enable_amp
            ):
                output = model(batch)
            return output
        return infer_fn
    
    def benchmark(infer_fn, batch):
        # warm-up
        for _ in range(20):
            _ = infer_fn(batch)
    
        iters = 100
    
        start = time.time()
        for _ in range(iters):
            _ = infer_fn(batch)
        end = time.time()
    
        return (end - start) / iters
    
    
    batch_size = 1
    model = get_model()
    batch = get_input(batch_size)
    infer_fn = get_inference_fn(model)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The initial throughput is 12 samples per second (SPS).

    Upgrade to the Most Recent PyTorch Release

    Whereas the version of PyTorch in our DLAMI is 2.8, the latest version of PyTorch, at the time of this writing, is 2.9. Given the rapid pace of development in the field of AI/ML, it is highly recommended to use the most up-to-date library packages. As our first step, we upgrade to PyTorch 2.9 which includes key updates to its Arm backend.

    pip3 install -U torch torchvision --index-url https://download.pytorch.org/whl/cpu

    In the case of our model in its initial configuration, upgrading the PyTorch version does not have any effect. However, this step is crucial for getting the most out of the optimization techniques that we will assess.

    Batched Inference

    To amortize kernel-launch overheads and increase the utilization of the hardware accelerators, we group samples together and apply batched inference. The table below demonstrates how model throughput varies as a function of batch size:

    Inference Throughput for Varying Batch Sizes (by Author)

    Memory Optimizations

    We apply a number of techniques from our previous post for optimizing memory allocation and usage. These include the channels-last memory format, automatic mixed precision with the bfloat16 data type (supported as of Graviton3), the TCMalloc allocation library, and huge-page allocation. Please see our previous post for details. We also enable the fast math mode of the ACL GEMM kernels and caching of the kernel primitives, two optimizations that appear in the official guidelines for running PyTorch inference on Graviton.

    The command line instructions required to enable these optimizations are shown below:

    # install TCMalloc
    sudo apt-get install google-perftools
    
    # Program the use of TCMalloc
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4
    
    # Enable huge page memory allocation
    export THP_MEM_ALLOC_ENABLE=1
    
    # Enable the fast math mode of the GEMM kernels
    export DNNL_DEFAULT_FPMATH_MODE=BF16
    
    # Set LRU Cache capacity to cache the kernel primitives
    export LRU_CACHE_CAPACITY=1024

    The following table captures the impact of the memory optimizations, applied successively:

    ResNet-50 Memory Optimization Results (by Author)

    In the case of our toy model, the channels-last and bfloat16-mixed precision optimizations had the greatest impact. After applying all of the memory optimizations, the average throughput is 53.03 SPS.

    Model Compilation

    The support of PyTorch compilation for AWS Graviton is an area of focused effort for the AWS Graviton team. However, in the case of our toy model, it results in a slight reduction in throughput, from 53.03 SPS to 52.23 SPS.

    Multi-Worker Inference

    While typically applied in settings with many more than four vCPUs, we demonstrate the implementation of multi-worker inference by modifying our script to support core pinning:

    if __name__ == '__main__':
        # pin CPUs according to worker rank
        import os, psutil
        rank = int(os.environ.get('RANK','0'))
        world_size = int(os.environ.get('WORLD_SIZE','1'))
        cores = list(range(psutil.cpu_count(logical=True)))
        num_cores = len(cores)
        cores_per_process = num_cores // world_size
        start_index = rank * cores_per_process
        end_index = (rank + 1) * cores_per_process
        pid = os.getpid()
        p = psutil.Process(pid)
        p.cpu_affinity(cores[start_index:end_index])
    
        batch_size = 8
        model = get_model(channels_last=True)
        batch = get_input(batch_size, channels_last=True)
        infer_fn = get_inference_fn(model, enable_amp=True)
        avg_time = benchmark(infer_fn, batch)
        print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    We note that, unlike many other Amazon EC2 CPU instance types, each Graviton vCPU maps directly to a single physical CPU core. We use the torchrun utility to start four workers, each running on a single CPU core:

    export OMP_NUM_THREADS=1 # set one OpenMP thread per worker
    torchrun --nproc_per_node=4 main.py
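    The no-SMT property can be checked directly with psutil, which the pinning script above already depends on: on an instance without simultaneous multithreading the logical and physical core counts coincide, whereas on an SMT-enabled x86 instance the logical count is typically double the physical count.

```python
import psutil

logical = psutil.cpu_count(logical=True)
physical = psutil.cpu_count(logical=False)  # may be None in some containers
# On Graviton instances these two numbers are expected to match (no SMT).
print(f"logical cores: {logical}, physical cores: {physical}")
```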

    This results in a throughput of 55.15 SPS, a 4% improvement over our previous best result.

    INT8 Quantization for Arm

    Another area of active development and continuous improvement on Arm is INT8 quantization. INT8 quantization tools are typically heavily tied to the target architecture. In our previous post we demonstrated PyTorch 2 Export Quantization with the X86 backend through Inductor, using the TorchAO (0.12.1) library. Fortunately, recent versions of TorchAO include a dedicated quantizer for Arm. The updated quantization sequence is shown below. As in our previous post, we are interested only in the potential performance impact. In practice, INT8 quantization can have a significant impact on the quality of the model and may necessitate a more sophisticated quantization strategy.

    from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torchao.quantization.pt2e.quantizer.arm_inductor_quantizer as aiq
    
    def quantize_model(model):
        x = torch.randn(4, 3, 224, 224).contiguous(
                                memory_format=torch.channels_last)
        example_inputs = (x,)
        batch_dim = torch.export.Dim("batch")
        with torch.no_grad():
            exported_model = torch.export.export(
                model,
                example_inputs,
                dynamic_shapes=((batch_dim,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC,
                                 torch.export.Dim.STATIC),
                                )
            ).module()
        quantizer = aiq.ArmInductorQuantizer()
        quantizer.set_global(aiq.get_default_arm_inductor_quantization_config())
        prepared_model = prepare_pt2e(exported_model, quantizer)
        prepared_model(*example_inputs)
        converted_model = convert_pt2e(prepared_model)
        optimized_model = torch.compile(converted_model)
        return optimized_model
    
    
    batch_size = 8
    model = get_model(channels_last=True)
    model = quantize_model(model)
    batch = get_input(batch_size, channels_last=True)
    infer_fn = get_inference_fn(model, enable_amp=True)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    The resultant throughput is 56.77 SPS for a 7.1% improvement over the bfloat16 solution.

    AOT Compilation Using ONNX and OpenVINO

    In our previous post, we explored ahead-of-time (AOT) model compilation techniques using Open Neural Network Exchange (ONNX) and OpenVINO. Both libraries include dedicated support for running on AWS Graviton. The experiments in this section require the following library installations:

    pip install onnxruntime onnxscript openvino nncf

    The following code block demonstrates the model compilation and execution on Arm using ONNX:

    def export_to_onnx(model, onnx_path="resnet50.onnx"):
        dummy_input = torch.randn(4, 3, 224, 224)
        batch = torch.export.Dim("batch")
        torch.onnx.export(
            model,
            dummy_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            dynamic_shapes=((batch,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            ),
            dynamo=True
        )
        return onnx_path
    
    def onnx_infer_fn(onnx_path):
        import onnxruntime as ort
    
        # the session options must be configured before the session is
        # created, and passed to the InferenceSession constructor
        sess_options = ort.SessionOptions()
        sess_options.add_session_config_entry(
            "mlas.enable_gemm_fastmath_arm64_bfloat16", "1")
        sess = ort.InferenceSession(
            onnx_path,
            sess_options,
            providers=["CPUExecutionProvider"]
        )
        input_name = sess.get_inputs()[0].name
    
        def infer_fn(batch):
            result = sess.run(None, {input_name: batch})
            return result
        return infer_fn
    
    batch_size = 8
    model = get_model()
    onnx_path = export_to_onnx(model)
    batch = get_input(batch_size).numpy()
    infer_fn = onnx_infer_fn(onnx_path)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    It should be noted that ONNX Runtime supports a dedicated ACL execution provider for running on Arm, but this requires a custom ONNX Runtime build (as of the time of this writing), which is beyond the scope of this post.

    Alternatively, we can compile the model using OpenVINO. The code block below demonstrates its use, including an option for INT8 quantization using NNCF:

    import openvino as ov
    import nncf
    
    def openvino_infer_fn(compiled_model):
        def infer_fn(batch):
            result = compiled_model([batch])[0]
            return result
        return infer_fn
    
    class RandomDataset(torch.utils.data.Dataset):
        def __len__(self):
            return 10000
    
        def __getitem__(self, idx):
            return torch.randn(3, 224, 224)
    
    quantize_model = False
    batch_size = 8
    model = get_model()
    calibration_loader = torch.utils.data.DataLoader(RandomDataset())
    calibration_dataset = nncf.Dataset(calibration_loader)
    
    if quantize_model:
        # quantize PyTorch model
        model = nncf.quantize(model, calibration_dataset)
    
    ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
    ovm = ov.compile_model(ovm)
    batch = get_input(batch_size).numpy()
    infer_fn = openvino_infer_fn(ovm)
    avg_time = benchmark(infer_fn, batch)
    print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")

    In the case of our toy model, OpenVINO compilation results in an additional boost of the throughput to 63.48 SPS, but the NNCF quantization disappoints, resulting in just 55.18 SPS.

    Results

    The results of our experiments are summarized in the table below:

    ResNet50 Inference Optimization Results (by Author)

    As in our previous post, we reran our experiments on a second model, a Vision Transformer (ViT) from the timm library, to demonstrate how the impact of the runtime optimizations we discussed can vary based on the details of the model. The results are captured below:

    ViT Inference Optimization Results (by Author)

    Summary

    In this post, we reviewed a number of relatively simple optimization techniques and applied them to two toy PyTorch models. As the results demonstrated, the impact of each optimization step can vary greatly based on the details of the model, and the journey toward peak performance can take many different paths. The steps we presented in this post were just an appetizer; there are undoubtedly many more optimizations that can unlock even greater performance.

    Along the way, we noted the many AI/ML libraries that have introduced deep support for the Graviton architecture, and the ongoing community effort to optimize it further. The performance gains we achieved, combined with this apparent dedication, suggest that AWS Graviton is firmly in the “big leagues” when it comes to running compute-intensive AI/ML workloads.
