Mastering YOLO and ResNet Optimization on HiSilicon NPUs
You want to get the best performance from your models on HiSilicon NPUs. This can be a difficult task. You may see performance bottlenecks or poor hardware use. The Ascend NPU architecture presents unique challenges. For example, it does not use massive thread-level parallelism to hide memory access delays.
Goal: This guide gives you a complete workflow for optimizing YOLO/ResNet. You will learn to prepare models, convert them with the Ascend Tensor Compiler, apply INT8 quantization, and deploy them using AscendCL for peak efficiency.
Key Takeaways
- Prepare your model for HiSilicon NPUs. Convert it to ONNX format. Prune it to make it smaller and faster.
- Use the Ascend Tensor Compiler (ATC) to convert your ONNX model. This creates an optimized '.om' file for the NPU.
- Apply INT8 quantization to boost performance. Use Post-Training Quantization (PTQ) or Quantization-Aware Training (QAT).
- Use the Ascend Profiler to find slow parts of your model. This helps you fix performance issues.
- Deploy your optimized model using AscendCL. Measure its latency and FPS to check its speed.
PREPARING AND OPTIMIZING YOLO/RESNET FOR CONVERSION
Your first step is preparing the model for the HiSilicon NPU. A well-prepared model converts smoothly and performs better. This preparation involves several key stages.
- Model Conversion: You will convert your trained model into an NPU-compatible format. The Ascend Tensor Compiler (ATC) helps with this process.
- Operator Adaptation: Your model might have custom operators. You need to adapt these for the NPU if they are not natively supported.
- Model Inference: You will deploy the converted model onto the NPU. This involves loading the model and feeding it data to get predictions.
- Performance Optimization: You can further improve performance with techniques like quantization and operator fusion.
Using ResNet as a YOLO Backbone
You can use a ResNet architecture as the backbone for your YOLO model. ResNet is excellent at extracting complex features from images, but this power comes at a cost: ResNet adds significant computational weight. That makes the next step, pruning, very important for optimizing YOLO/ResNet on resource-constrained devices.
Pruning for NPU Compatibility
Pruning removes unnecessary connections or neurons from your neural network. This process creates a smaller, faster model without a major loss in accuracy. A pruned model has fewer parameters and operations. This makes it ideal for the NPU, reducing memory usage and speeding up inference. Pruning is a critical technique for optimizing YOLO/ResNet.
Exporting Models to ONNX
You must export your model to the Open Neural Network Exchange (ONNX) format. ONNX is an intermediate format that the ATC tool understands. You can easily export a PyTorch model using a simple command.
Example: Exporting a YOLOv8n model to ONNX.
# This command creates 'yolov8n.onnx' from your PyTorch model
yolo export model=yolov8n.pt format=onnx
Note: You may encounter errors during export. Issues like Unsupported ONNX data type: INT64 or shape mismatches in FusedMatMul are common. You can often fix these by ensuring your input tensor size is correct or by converting data types before exporting.
Verifying the ONNX Graph
You should always verify the exported ONNX file. This check ensures the model structure is correct before you proceed to conversion. Several tools can help you with this final preparation step.
- Netron: This is a visual tool. You can upload your .onnx file to see the entire model graph. It lets you inspect each layer's properties, inputs, and outputs.
- ONNX Checker: This is a Python library. You can use the onnx.checker.check_model() function in a script. It programmatically confirms your model's structure is valid and will raise an error if it finds any problems.
MODEL CONVERSION WITH ATC
After preparing your ONNX model, your next task is to convert it into a format the HiSilicon NPU can understand. You will use the Ascend Tensor Compiler (ATC) for this critical step. ATC is a powerful tool within the CANN (Compute Architecture for Neural Networks) toolchain.
Its primary job is to transform your model into a highly optimized offline model.
- ATC is a core conversion tool in the Huawei CANN toolchain.
- You use it to adapt models from popular frameworks into an Ascend-compatible format.
- It helps you deploy trained AI models efficiently on Huawei Ascend hardware.
This conversion process creates a .om file, which is the final executable model you will deploy on the NPU.
Basic ATC Conversion
You can start with a basic command to convert your verified .onnx file. This command tells ATC the input model, the original framework, the desired output file name, and the target Ascend chip version.
💡 What is an Offline Model (.om)? An offline model has been pre-processed and optimized for a specific hardware target. Its compilation includes steps like operator fusion and memory optimization, which means the NPU can execute it with minimal setup time during inference.
Here is a fundamental atc command-line example:
atc --model=yolov8n.onnx \
--framework=5 \
--output=yolov8n \
--soc_version=Ascend310
Let's break down what each part of this command does:
| Flag | Description |
|---|---|
| --model | Specifies the path to your input .onnx file. |
| --framework=5 | Tells ATC the model is in ONNX format. (Other values are for Caffe, etc.) |
| --output | Defines the base name for your output .om file (no extension needed). |
| --soc_version | Specifies the target Ascend processor, like Ascend310 or Ascend710. |
Configuring Inputs
Your model needs to know the exact size and format of the data you will send it. You configure this during the ATC conversion using specific flags. This step is vital for performance and preventing runtime errors.
You use the --input_shape flag to define the dimensions of your input tensor. You can set a fixed batch size for consistent performance or a dynamic batch size for flexibility.
- Static Batch Size: "--input_shape=images:1,3,640,640" (batch of 1)
- Dynamic Batch Size: "--input_shape=images:-1,3,640,640" (variable batch size)
You also need to specify the data layout. Most computer vision models trained in PyTorch use the NCHW format. This layout organizes tensor data as (Number of samples, Channels, Height, Width). Getting this right is essential for computational efficiency, as it affects how the NPU accesses data in memory. You can use the --input_format flag to set this.
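The layout conversion itself is a one-line transpose on the host side. Here is a minimal NumPy sketch (the helper name to_nchw is illustrative, not part of any Ascend API):

```python
import numpy as np

# Sketch of pre-processing an HxWxC image into the NCHW layout the
# converted model expects (1x3x640x640 in this guide).
def to_nchw(image_hwc: np.ndarray) -> np.ndarray:
    """Convert an HxWxC uint8 image to a 1xCxHxW float32 tensor."""
    chw = np.transpose(image_hwc, (2, 0, 1)).astype(np.float32) / 255.0
    return np.expand_dims(chw, axis=0)  # add the batch dimension

dummy = np.zeros((640, 640, 3), dtype=np.uint8)  # stand-in for a real frame
print(to_nchw(dummy).shape)  # (1, 3, 640, 640)
```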
Here is the enhanced command with input configuration:
atc --model=yolov8n.onnx \
--framework=5 \
--output=yolov8n \
--input_format=NCHW \
--input_shape="images:1,3,640,640" \
--soc_version=Ascend310
Handling Unsupported Ops
Sometimes, your model may contain special layers or operations ("ops") that ATC does not natively support. When this happens, the conversion will fail. You have a powerful tool to solve this: the Tensor Boost Engine (TBE). TBE allows you to define and implement custom operators in a way the Ascend NPU can execute.
Developing a custom operator with TBE is an advanced process that involves several stages:
- DSL Module: You first write the operator's core mathematical logic. You use a domain-specific language to define the calculation steps and data flow.
- Scheduling Module: Next, you tell the hardware how to execute your logic efficiently. This involves planning how to segment data (tiling) to optimize memory access and performance.
- IR Module: TBE then generates an Intermediate Representation (IR) of your operator. This is a standardized format that the compiler can understand and begin to optimize.
- Compiler Transfer Module: The compiler takes the IR and applies further optimizations. It uses techniques like double buffering and smart memory allocation to prepare the operator for the specific NPU hardware.
- CodeGen Module: Finally, the CodeGen module produces a C-like code file. This file is compiled into an executable operator that the CANN framework can load and run directly on the NPU.
⚠️ Note: Creating custom operators requires a deep understanding of both the operator's function and the underlying NPU architecture. You should always check the official CANN Operator List first. Only create a custom operator if you cannot find a supported alternative.
ADVANCED OPTIMIZATION AND TUNING
You have converted your model. Now you can unlock the true power of the HiSilicon NPU. Advanced optimization techniques will push your model's performance to its peak. This stage focuses on reducing the model's precision and analyzing its runtime behavior to find and fix bottlenecks.
NPUs like HiSilicon's Da Vinci architecture are built to accelerate specific math operations. Your model must use these "NPU-friendly" operators for convolution, pooling, and activation to achieve maximum speed. Any operation not supported by the NPU falls back to the slower CPU, creating a performance bottleneck.

Quantization is a key technique for making your model more NPU-friendly. It reduces the precision of your model's numbers, for example from 32-bit floating point (FP32) to 8-bit integer (INT8). This change makes the model smaller and faster, and benchmarks show the impact clearly: a model running at 12 frames per second in FP32 can reach 30 frames per second with INT8 quantization.
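To make the FP32-to-INT8 mapping concrete, here is a hedged NumPy sketch of symmetric per-tensor quantization. It is illustrative only; on Ascend hardware the AMCT tooling performs this (and much more) for you.

```python
import numpy as np

# Symmetric per-tensor INT8 quantization: map the FP32 range onto [-127, 127].
def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(64, 64).astype(np.float32)  # stand-in weight tensor
q, scale = quantize_int8(weights)
error = np.abs(dequantize(q, scale) - weights).max()

# INT8 storage is 4x smaller than FP32; the max rounding error is
# bounded by half a quantization step (scale / 2).
print(f"compression: {weights.nbytes // q.nbytes}x, max error: {error:.5f}")
```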
Post-Training Quantization (PTQ)
Post-Training Quantization (PTQ) is a powerful technique for optimizing YOLO/ResNet. You apply it to an already trained FP32 model. The main tool you will use for this is the Ascend Model Compression Toolkit (AMCT). PTQ is popular because it is a fast and straightforward way to gain a significant performance boost without retraining your model.
However, PTQ can sometimes cause a drop in your model's accuracy. For a model like YOLOv8 Nano, you might see a slight decrease in inference accuracy. In some cases, especially with smaller models, this accuracy loss can be more significant. Static INT8 quantization can lead to a moderate accuracy drop of around 3-7% in absolute mAP50-95.
If the accuracy loss from PTQ is too high, you have another option: Quantization-Aware Training (QAT). QAT introduces the simulation of quantization during the training process itself. This allows the model to learn how to compensate for the precision loss, often resulting in better final accuracy.
Here is a comparison to help you decide which method to use:
| Feature | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
|---|---|---|
| Application Stage | Applied to a pre-trained model | Incorporated directly into the model training process |
| Retraining Required | No, you do not need to retrain | Yes, it requires longer training time to adapt to quantization |
| Complexity | Simpler and faster to implement | More complex, as it simulates quantization during training |
| Accuracy Impact | May result in a noticeable accuracy drop | Often achieves better accuracy by optimizing for quantized inference |
When should you choose one over the other?
- Choose PTQ when you cannot retrain your model or when a slight accuracy drop is acceptable for a large speed gain.
- Choose QAT when accuracy is your top priority and you have the resources for a longer, more complex training cycle.
Preparing a Calibration Dataset
To perform PTQ, you need a calibration dataset. This is a small, representative collection of input data (e.g., images) that the AMCT uses. It runs this data through your model to analyze the range of activation values. This information helps it calculate the optimal scaling factors for converting FP32 values to INT8 without losing too much information.
Creating a good calibration dataset is crucial for successful quantization.
- Quality over Quantity: You do not need a massive dataset. A set of around 1024 diverse and well-prepared images is often enough. Simply using more images is not always better. An unbalanced or redundant dataset can skew the calibration statistics. This can cause the quantized model to perform poorly during real-world inference.
- Representation is Key: Your calibration images should reflect the variety of data your model will see in production. Include images with different lighting conditions, object sizes, and backgrounds to ensure the calibration process is robust.
A well-chosen calibration set is a cornerstone of effective quantization when optimizing YOLO/ResNet.
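Under the hood, a calibrator accumulates activation statistics across these batches to derive scaling factors. A hedged NumPy sketch of the idea (calibration_scale is an illustrative helper, not part of AMCT; real tools use histogram-based methods, not just the max):

```python
import numpy as np

# Accumulate the maximum absolute activation value over all calibration
# batches, then derive a symmetric INT8 scale from it.
def calibration_scale(batches) -> float:
    max_abs = 0.0
    for batch in batches:  # each batch: a pre-processed NCHW float32 array
        max_abs = max(max_abs, float(np.abs(batch).max()))
    return max_abs / 127.0

# Random data stands in here; in practice you would feed ~1024 diverse,
# production-representative images through the model's layers.
calib = [np.random.randn(8, 3, 64, 64).astype(np.float32) for _ in range(4)]
scale = calibration_scale(calib)
print(f"derived INT8 scale: {scale:.5f}")
```

This is also why an unbalanced calibration set hurts: a few outlier images can inflate the observed range and waste INT8 resolution on values that rarely occur.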
Profiling with Ascend Profiler
After you convert and quantize your model, you need to measure its performance. The Ascend Profiler is the tool you will use to find any remaining performance bottlenecks. It gives you a detailed breakdown of how your model executes on the NPU.
You can use the profiler's data to answer critical questions:
- Which operators are taking the most time?
- Are any operators unexpectedly running on the CPU instead of the NPU?
- Is there inefficient data movement between the host and the device?
The Ascend Profiler has two key components for this analysis:
- Timeline Analysis: This view gives you a low-level, visual representation of your model's execution. Each colored block on the timeline shows the start time and duration of an operator. You can see exactly which operators run on the AICORE (the main NPU compute unit), the AICPU, or the HOSTCPU, which helps you pinpoint operators with long execution times. You can export this timeline data as a JSON file and open it in tools like chrome://tracing for a deeper look.
- Operator Performance Analysis: This component provides high-level statistics. It displays operator execution times in tables and charts, sorted by duration. You can quickly see which operator types (e.g., AICORE vs. AICPU) and which specific operators consume the most time. This is perfect for identifying whether a significant portion of your model's runtime is spent on a few slow operations.
By using these two views together, you can effectively diagnose performance issues. For example, if the profiler shows that a custom or unsupported operator is running on the HOSTCPU, you know that creating a custom TBE operator for it will likely provide a major speedup.
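Because the timeline export follows the Chrome trace event format, you can also summarize it yourself with a few lines of Python. A small sketch, assuming the usual trace schema with name/ph/dur fields (the events below are synthetic stand-ins for real profiler output):

```python
import json
from collections import defaultdict

# Synthetic Chrome-trace-style events standing in for a real profiler export;
# in practice: trace = json.load(open("timeline.json"))
trace = {
    "traceEvents": [
        {"name": "Conv2D", "ph": "X", "ts": 0,   "dur": 120},
        {"name": "Conv2D", "ph": "X", "ts": 200, "dur": 110},
        {"name": "Relu",   "ph": "X", "ts": 130, "dur": 15},
    ]
}

# Sum total duration per operator name; "X" marks complete (duration) events.
totals = defaultdict(int)
for event in trace["traceEvents"]:
    if event.get("ph") == "X":
        totals[event["name"]] += event["dur"]

# Print the slowest operators first
for name, dur in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(name, dur)
```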
DEPLOYMENT AND INFERENCE
You have optimized your model. Now you will deploy it on the HiSilicon NPU. This final stage is where you run your model to make real-world predictions. You will use the Ascend Compute Language (AscendCL) to communicate with the hardware and measure your model's final performance.
Initializing the Device with AscendCL
You first need to set up the NPU for inference. You will use AscendCL to do this. The Python Ascend Computing Language (pyACL) is a Python API library that makes this process simple. It lets you control the Ascend AI Processor directly from your Python code.
The standard workflow for running a model follows a clear sequence:
- Initialize pyACL: You start the pyACL library to prepare system resources.
- Allocate Resources: You set aside the runtime resources your application needs.
- Transfer Data: You move your input data, like images, into the device's memory.
- Process Data: You can perform last-minute image changes, like resizing.
- Execute Model: You load your .om model and run the inference process.
- Destroy Allocations: You free up the runtime resources after inference is complete.
- Deinitialize pyACL: You shut down the library to release all resources.
Writing Inference Code
Your inference script brings all these steps together. This code is the engine that loads your model, feeds it data, and retrieves the predictions.
💡 Using pyACL for Inference: The pyACL library is your main tool for this task. You use its functions to manage the device, handle memory, load your .om model, and execute it. It gives you full control over the entire inference pipeline within a Python environment.
Your script will load the pre-processed input data, send it to the NPU, trigger the model execution, and then process the output from the model.
Benchmarking Performance
After deployment, you must measure your model's performance. This tells you how fast your model runs. Two key metrics are essential for this evaluation:
- Latency: This is the time your model takes to process one input, measured in milliseconds (ms). Lower latency is better.
- Frames Per Second (FPS): This measures how many inputs your model can process in one second. Higher FPS is better.
These two metrics are directly related. For a real-time video application running at 30 FPS, your model's latency must be less than 33.3ms. Other important latency metrics can also give you deeper insights.
- Time to first token: The time it takes to get the very first piece of output.
- Total generation time: The end-to-end time from input to full output.
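A minimal benchmarking skeleton looks like this; run_inference is a stand-in for your actual pyACL execution call, and the sleep merely simulates NPU work:

```python
import time
import statistics

def run_inference():
    """Stand-in for the real pyACL model-execution call."""
    time.sleep(0.005)  # pretend inference takes ~5 ms per frame

def benchmark(runs: int = 50):
    for _ in range(5):          # warm-up iterations, excluded from stats
        run_inference()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)  # ms
    latency = statistics.median(samples)  # median is robust to outliers
    return latency, 1000.0 / latency      # FPS = 1000 / latency_ms

latency_ms, fps = benchmark()
print(f"latency {latency_ms:.1f} ms, {fps:.0f} FPS")
```

Using the median rather than the mean keeps a single slow outlier (e.g., a page fault or background task) from skewing your reported latency.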
Measuring these numbers helps you confirm that your optimization work was successful.
You have now mastered a complete workflow for NPU optimization. This guide walked you through four essential stages:
- Model Preparation
- ATC Conversion
- Quantization Tuning
- AscendCL Deployment
Following this structured process is the key to unlocking the full performance of your computer vision models on HiSilicon hardware.
Now, apply these techniques to your own projects! For more help, consult the official CANN documentation or community forums. Good luck!
FAQ
What is the main difference between PTQ and QAT?
You apply Post-Training Quantization (PTQ) to an already trained model. It is fast but can lower accuracy. You use Quantization-Aware Training (QAT) during the training process. This method takes more time but often keeps accuracy high.
Why is ONNX the preferred format for conversion?
You use ONNX as a universal format. The Ascend Tensor Compiler (ATC) understands ONNX files. This lets you easily convert models from frameworks like PyTorch into the .om format that the NPU can execute.
What should I do if my ATC conversion fails?
An ATC failure often means your model has an unsupported operator. First, check the error logs for clues. You may need to create a custom operator using the Tensor Boost Engine (TBE) to solve the issue.
How many images do I need for a calibration dataset?
You do not need thousands of images. A diverse set of about 1024 images is often enough. The key is quality, not quantity. Your calibration data must represent what your model will see in production.