Real-Time 4-Stream YOLOv8 Video Analytics on Edge AI Hardware

The convergence of artificial intelligence and edge computing is reshaping industrial automation. Running complex vision models directly on embedded hardware, without relying on cloud servers, enables real-time decision-making for applications like quality inspection, autonomous mobile robots, and smart surveillance. A recent demonstration using a development board powered by an octa-core processor and a 6 TOPS NPU shows how four 1080p video streams can be processed simultaneously with YOLOv8 variants, achieving practical frame rates and high detection accuracy.

Understanding YOLOv8 and Its Model Variants

YOLO (You Only Look Once) has evolved through multiple iterations, and YOLOv8 is a widely used generation for real-time object detection. It balances speed, accuracy, and ease of deployment across diverse hardware platforms. The model family comes in several task-specific variants, identified by a suffix, and several model scales, identified by a letter, making it adaptable to everything from low-power embedded devices to high-performance servers. In the tables here, "-det" labels the plain detection models; the official Ultralytics detection weights carry no suffix (for example, yolov8n.pt).

Suffix	Full Name	Task	Output	Typical Application
-det	Detection	Object Detection	Bounding boxes + class + confidence	Pedestrian detection, vehicle counting, safety gear monitoring
-seg	Segmentation	Instance Segmentation	Bounding boxes + class + pixel mask	Defect segmentation, autonomous driving scene parsing
-pose	Pose	Keypoint Detection	Bounding boxes + 17 human keypoints	Worker ergonomics analysis, action recognition, human-machine interaction
-cls	Classification	Image Classification	Class label for whole image	Product sorting, quality grading, anomaly screening
-obb	Oriented Bounding Boxes	Rotated Object Detection	Rotated bounding boxes + angle + class	Aerial image analysis, container orientation detection, warehouse pallet recognition

Model scale prefixes (n, s, m, l, x) indicate the size and computational requirements. For edge devices with limited resources, Nano and Small variants are preferred. The following table summarizes the trade-offs:

Prefix	Meaning	Characteristics	Suitable Scenarios
n	Nano	Smallest, fastest, lowest accuracy	Mobile devices, CPU-only inference, ultra-low-power embedded systems
s	Small	Balanced speed and accuracy	Real-time video analytics on edge devices, most common starting point
m	Medium	Optimal accuracy-speed trade-off	Applications requiring higher precision with still acceptable speed
l	Large	High accuracy, slower	Server-side inference where accuracy is prioritized over latency
x	X-Large	Highest accuracy, slowest	Research, benchmarking, offline analysis with extreme precision demands
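
As a small illustration of how the scale letter and the task suffix combine, the sketch below builds a weight file name following the Ultralytics convention, in which detection weights have no suffix and the other tasks append theirs. The helper name weight_name and the short task labels are hypothetical, chosen only for this example.

```python
SCALES = ("n", "s", "m", "l", "x")

# Task label -> suffix used in Ultralytics weight file names.
# Detection is the default task, so it has no suffix.
TASK_SUFFIX = {
    "det": "",
    "seg": "-seg",
    "pose": "-pose",
    "cls": "-cls",
    "obb": "-obb",
}


def weight_name(scale: str, task: str = "det") -> str:
    """Return a YOLOv8 weight file name such as 'yolov8s-seg.pt'."""
    if scale not in SCALES:
        raise ValueError(f"unknown scale {scale!r}, expected one of {SCALES}")
    if task not in TASK_SUFFIX:
        raise ValueError(f"unknown task {task!r}, expected one of {tuple(TASK_SUFFIX)}")
    return f"yolov8{scale}{TASK_SUFFIX[task]}.pt"


print(weight_name("n"))         # Nano detection model
print(weight_name("s", "seg"))  # Small segmentation model
```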

Hardware Setup and Multi-Stream Configuration

The test platform features a heterogeneous processor with four Cortex-A72 cores, four Cortex-A53 cores, and an integrated NPU delivering up to 6 TOPS. It supports multiple MIPI-CSI camera interfaces and hardware-accelerated video encoding. In the demonstrated setup, three MIPI-CSI inputs were used with a combination of camera modules to capture four AHD high-definition video streams. Each stream was encoded in H.264 and delivered as an RTSP stream at 1920×1080 resolution and 30 frames per second.

Performance Snapshot: After processing through the board, each individual stream was output at 1080p and 25 fps. When all four streams ran concurrently, aggregate throughput reached 60–70 fps, or roughly 15–17.5 fps per stream if the load is split evenly. CPU utilization approached 100% while NPU load stayed between 50% and 60%. This indicates that neural network inference was offloaded to the NPU effectively, and it suggests that the CPU-side stages, rather than the NPU, were the limiting factor.
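
To put the aggregate figure in per-stream terms, the short calculation below divides the reported 60–70 fps across the four streams and compares the result with the 30 fps source rate. It assumes the load is shared evenly between streams, which the demonstration does not state explicitly.

```python
def per_stream_fps(aggregate_fps: float, streams: int) -> float:
    """Average frame rate per stream, assuming an even split of the load."""
    return aggregate_fps / streams


STREAMS = 4      # concurrent video streams in the demonstration
SOURCE_FPS = 30  # frame rate of each incoming RTSP stream

for aggregate in (60, 70):
    fps = per_stream_fps(aggregate, STREAMS)
    share = fps / SOURCE_FPS
    print(f"{aggregate} fps aggregate -> {fps:.1f} fps per stream "
          f"({share:.0%} of the {SOURCE_FPS} fps source)")
```

Under that assumption each stream is analyzed at 15.0 to 17.5 fps, so roughly every second source frame passes through the detector.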

Software Optimizations for Maximum Throughput

A naive AI inference pipeline often follows these steps: capture a frame from CSI, crop/resize to the model input size, run the RKNN API, retrieve bounding boxes and confidence scores, scale coordinates back to the original resolution, and draw overlays. Executing this sequentially leads to low CPU utilization and choppy video output because the pipeline stalls on each frame.
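
The step of scaling coordinates back to the original resolution is easy to get wrong, so here is a minimal sketch of it. It assumes a 640×640 model input and aspect-preserving "letterbox" resizing with centered padding; neither detail is confirmed by the demonstration, and the function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst_w=640, dst_h=640):
    """Scale factor and padding used to fit the source frame into the model input."""
    scale = min(dst_w / src_w, dst_h / src_h)
    pad_x = (dst_w - src_w * scale) / 2  # horizontal padding on each side
    pad_y = (dst_h - src_h * scale) / 2  # vertical padding on each side
    return scale, pad_x, pad_y


def to_source_coords(box, src_w, src_h, dst_w=640, dst_h=640):
    """Map an (x1, y1, x2, y2) box from model-input pixels to source-frame pixels."""
    scale, pad_x, pad_y = letterbox_params(src_w, src_h, dst_w, dst_h)
    x1, y1, x2, y2 = box

    def clamp(value, upper):
        # Keep the result inside the source frame.
        return max(0.0, min(float(upper), value))

    return (
        clamp((x1 - pad_x) / scale, src_w),
        clamp((y1 - pad_y) / scale, src_h),
        clamp((x2 - pad_x) / scale, src_w),
        clamp((y2 - pad_y) / scale, src_h),
    )


# A box detected in the 640x640 input, mapped back to a 1920x1080 frame.
print(to_source_coords((320, 320, 480, 400), 1920, 1080))
```

For a 1920×1080 frame the scale factor is 1/3 and the vertical padding is 140 pixels, so the padding must be subtracted before dividing by the scale.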

To overcome this, a thread pool architecture was implemented. The pipeline was broken into stages that run in parallel across the available CPU cores. Additionally, the hardware RGA (Raster Graphics Acceleration) module was used for image cropping and scaling, offloading these pixel operations from the CPU. This allowed the CPU, GPU, NPU, and VPU to work concurrently, maximizing resource utilization.

Key Optimization Techniques:

Multi-threaded frame capture, pre-processing, inference, and post-processing
RGA-based image resize and color space conversion to reduce CPU load
Zero-copy buffer sharing between hardware blocks where possible
Batch processing of multiple streams to keep the NPU pipeline full
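
The staged, multi-threaded design described above can be sketched with one worker thread per stage connected by bounded queues. This is an illustration only, not the implementation used in the demonstration: the stage functions are placeholders that stand in for RGA preprocessing, NPU inference, and post-processing, and all names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage_worker(fn, inbox, outbox):
    """Apply fn to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(frames, stages, depth=4):
    """Run frames through the stages, one worker thread per stage."""
    # Bounded queues apply back-pressure so a fast stage cannot run far ahead.
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage_worker,
                         args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(stages)
    ]

    def feed():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(STOP)

    workers.append(threading.Thread(target=feed, daemon=True))
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Placeholder stages; real ones would call the hardware blocks.
def preprocess(frame):
    stream_id, index = frame
    return {"stream": stream_id, "index": index}


def infer(item):
    return {**item, "boxes": [(0, 0, 10, 10)]}  # fake single detection


def postprocess(item):
    return item["stream"], item["index"], len(item["boxes"])


# Interleave three frames from each of four streams.
frames = [(stream, index) for index in range(3) for stream in range(4)]
results = run_pipeline(frames, [preprocess, infer, postprocess])
print(len(results), results[0])
```

Because each stage has a single worker and the queues are first-in, first-out, frame order is preserved. In Python, threads only overlap usefully when the stages spend their time in calls that release the interpreter lock, such as hardware or I/O calls; a production pipeline on this class of board would more likely be written in C or C++.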

Real-World Application Scenarios

The ability to run multiple YOLOv8 variants on four video streams simultaneously opens up numerous industrial automation and smart city applications:

Intelligent Security & Surveillance

Detect abnormal behaviors like fighting or running in public areas. Combine person detection with face recognition to identify individuals against watchlists, all processed locally to maintain privacy.

Autonomous Mobile Robots (AMRs)

Use instance segmentation to identify obstacles and free paths. Pose estimation can monitor human workers nearby for safe human-robot collaboration. Oriented bounding boxes help detect pallets or shelves at any angle.

Industrial Quality Inspection

Deploy segmentation models to detect surface defects on manufactured parts. Classification models can sort products by grade. Multi-stream processing allows inspecting multiple production lines from a single edge node.

Traffic & Parking Management

Count vehicles, detect illegal parking, or recognize license plates using oriented bounding boxes for angled views. All processing happens at the edge, reducing bandwidth and latency.

Model Selection Guidelines for Edge Deployment

When deploying YOLOv8 on embedded hardware, choosing the right model scale and task type is critical. For a board with a 6 TOPS NPU, the Small (s) and Nano (n) variants offer the best balance of speed and accuracy. The following practical recommendations emerge from testing:

Model Variant	Task	Expected Performance on 6 TOPS NPU	Best Use Case
YOLOv8n-det	Detection	~30 fps per stream (1080p)	High-speed counting, basic presence detection
YOLOv8s-det	Detection	~25 fps per stream (1080p)	General object detection with better accuracy
YOLOv8s-seg	Segmentation	~20 fps per stream (1080p)	Defect detection, precise object outlining
YOLOv8s-pose	Pose	~22 fps per stream (1080p)	Worker safety monitoring, sports analytics
YOLOv8s-obb	Oriented BBox	~18 fps per stream (1080p)	Aerial imagery, rotated object detection

Note that when running four streams concurrently, total throughput is limited by shared resources: the CPU, the NPU, and memory bandwidth. The system demonstrated 60–70 fps aggregate across all streams, which is sufficient for most real-time industrial applications. Further optimizations like model quantization (INT8) can boost performance by 30–50% with minimal accuracy loss.
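
For readers unfamiliar with INT8 quantization, the sketch below shows the basic idea with symmetric per-tensor quantization: floating-point values are mapped to 8-bit integers through a single scale factor, then mapped back to measure the error. This is a generic illustration with made-up example weights, not the calibration procedure of any particular NPU toolkit.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0  # size of one integer step
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.0, 0.4]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error of each value is at most half a quantization step, which is why accuracy loss tends to be small when the value range is well covered by the 8-bit grid.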

Bottom Line: Edge AI platforms with dedicated NPUs are now capable of handling complex multi-stream video analytics that previously required a full PC or server. By carefully selecting model variants and applying hardware-specific optimizations, developers can build responsive, low-latency vision systems for industrial automation, smart cities, and beyond. The combination of YOLOv8’s versatility and efficient embedded hardware opens the door to scalable, cost-effective AI at the edge.

We specialize in providing all electrical automation control equipment and systems.
