Real-Time 4-Stream YOLOv8 Video Analytics on Edge AI Hardware
The convergence of artificial intelligence and edge computing is reshaping industrial automation. Running complex vision models directly on embedded hardware, without relying on cloud servers, enables real-time decision-making for applications like quality inspection, autonomous mobile robots, and smart surveillance. A recent demonstration using a development board powered by an octa-core processor and a 6 TOPS NPU shows how four 1080p video streams can be processed simultaneously with YOLOv8 variants, achieving practical frame rates and high detection accuracy.
Understanding YOLOv8 and Its Model Variants
YOLO (You Only Look Once) has evolved through multiple iterations, with YOLOv8 among the most widely deployed generations for real-time object detection. It balances speed, accuracy, and ease of deployment across diverse hardware platforms. The model family comes in several task-specific variants, each identified by a suffix, and in several model scales, making it adaptable to everything from low-power embedded devices to high-performance servers.
| Suffix | Full Name | Task | Output | Typical Application |
|---|---|---|---|---|
| -det | Detection | Object Detection | Bounding boxes + class + confidence | Pedestrian detection, vehicle counting, safety gear monitoring |
| -seg | Segmentation | Instance Segmentation | Bounding boxes + class + pixel mask | Defect segmentation, autonomous driving scene parsing |
| -pose | Pose | Keypoint Detection | Bounding boxes + 17 human keypoints | Worker ergonomics analysis, action recognition, human-machine interaction |
| -cls | Classification | Image Classification | Class label for whole image | Product sorting, quality grading, anomaly screening |
| -obb | Oriented Bounding Boxes | Rotated Object Detection | Rotated bounding boxes + angle + class | Aerial image analysis, container orientation detection, warehouse pallet recognition |
Model scale prefixes (n, s, m, l, x), which precede the task suffix as in YOLOv8s-seg, indicate the model size and computational cost. For edge devices with limited resources, the Nano and Small variants are preferred. The following table summarizes the trade-offs:
| Prefix | Meaning | Characteristics | Suitable Scenarios |
|---|---|---|---|
| n | Nano | Smallest, fastest, lowest accuracy | Mobile devices, CPU-only inference, ultra-low-power embedded systems |
| s | Small | Balanced speed and accuracy | Real-time video analytics on edge devices, most common starting point |
| m | Medium | Higher accuracy at moderate speed | Applications requiring higher precision with still acceptable speed |
| l | Large | High accuracy, slower | Server-side inference where accuracy is prioritized over latency |
| x | X-Large | Highest accuracy, slowest | Research, benchmarking, offline analysis with extreme precision demands |
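The naming scheme in the two tables above can be decoded mechanically. The following is a minimal sketch, not part of any official tooling: the function name `parse_variant` is hypothetical, and treating a name without a task suffix as detection is an assumption made for illustration.

```python
import re

# Scale letters and task suffixes exactly as listed in the two tables above.
SCALES = {"n": "Nano", "s": "Small", "m": "Medium", "l": "Large", "x": "X-Large"}
TASKS = {
    "det": "Object Detection",
    "seg": "Instance Segmentation",
    "pose": "Keypoint Detection",
    "cls": "Image Classification",
    "obb": "Rotated Object Detection",
}

# "yolov8" + one scale letter + optional "-<task>" suffix.
_NAME = re.compile(r"^yolov8([nsmlx])(?:-(det|seg|pose|cls|obb))?$", re.IGNORECASE)


def parse_variant(name):
    """Split a name such as 'YOLOv8s-seg' into (scale, task) descriptions."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized YOLOv8 variant name: {name!r}")
    scale = match.group(1).lower()
    # Assumption: a missing suffix is interpreted as plain detection.
    task = (match.group(2) or "det").lower()
    return SCALES[scale], TASKS[task]


if __name__ == "__main__":
    print(parse_variant("YOLOv8s-seg"))
```

A lookup like this is mainly useful for validating configuration files that name one model per camera stream.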
Hardware Setup and Multi-Stream Configuration
The test platform features a heterogeneous processor with four Cortex-A72 cores, four Cortex-A53 cores, and an integrated NPU delivering up to 6 TOPS. It supports multiple MIPI-CSI camera interfaces and hardware-accelerated video encoding. In the demonstrated setup, three MIPI-CSI inputs were used with a combination of camera modules to capture four AHD high-definition video streams. Each stream was encoded in H.264 and delivered as an RTSP stream at 1920×1080 resolution and 30 frames per second.
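Performance Snapshot: After processing on the board, each individual stream was output at 1080p and 25 fps. When all four streams ran concurrently, the total system throughput reached 60–70 fps, which works out to roughly 15 to 17.5 fps per stream if the load is shared evenly, so the 25 fps output rate should not be read as the per-stream inference rate under full load. CPU utilization approached 100% while the NPU load stayed between 50% and 60%, which suggests that the NPU still had headroom and that CPU-side work (capture, pre-processing, post-processing, and drawing) was the more likely bottleneck.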
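The relationship between the aggregate figure and the per-stream rate is simple arithmetic. The sketch below works it out along with the time budget per frame. It assumes the aggregate throughput is shared evenly across the four streams, which the source does not state, and the function names are illustrative.

```python
def per_stream_fps(aggregate_fps, num_streams):
    """Average frames per second each stream gets from a shared budget."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    return aggregate_fps / num_streams


def frame_budget_ms(fps):
    """Time available to process one frame at the given rate, in milliseconds."""
    return 1000.0 / fps


NUM_STREAMS = 4
for aggregate in (60, 70):  # reported aggregate range in fps
    fps = per_stream_fps(aggregate, NUM_STREAMS)
    print(f"{aggregate} fps aggregate -> {fps:.1f} fps per stream, "
          f"{frame_budget_ms(aggregate):.1f} ms of system time per frame")
```

At 60 to 70 fps aggregate, the whole pipeline has roughly 14 to 17 ms of wall-clock time per processed frame, which is why overlapping the stages matters more than speeding up any single one.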
Software Optimizations for Maximum Throughput
A naive AI inference pipeline often follows these steps: capture a frame from CSI, crop/resize to the model input size, run the RKNN API, retrieve bounding boxes and confidence scores, scale coordinates back to the original resolution, and draw overlays. Executing this sequentially leads to low CPU utilization and choppy video output because the pipeline stalls on each frame.
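The resize and coordinate-scaling steps in this pipeline are easy to get wrong, so a small worked sketch may help. It assumes an aspect-preserving "letterbox" resize to a 640×640 model input; the source does not specify the input size or the resize method, and both function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst_w=640, dst_h=640):
    """Return (scale, pad_x, pad_y) for an aspect-preserving resize with padding.

    The 640x640 default is an assumed model input size, not a value from the source.
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # Padding is split evenly between the two sides of each axis.
    pad_x = (dst_w - new_w) / 2
    pad_y = (dst_h - new_h) / 2
    return scale, pad_x, pad_y


def to_source_coords(box, scale, pad_x, pad_y, src_w, src_h):
    """Map an (x1, y1, x2, y2) box from model-input pixels back to source pixels."""
    def clamp(value, upper):
        return max(0.0, min(float(upper), value))

    x1, y1, x2, y2 = box
    return (
        clamp((x1 - pad_x) / scale, src_w),
        clamp((y1 - pad_y) / scale, src_h),
        clamp((x2 - pad_x) / scale, src_w),
        clamp((y2 - pad_y) / scale, src_h),
    )


if __name__ == "__main__":
    scale, pad_x, pad_y = letterbox_params(1920, 1080)
    print(scale, pad_x, pad_y)
    print(to_source_coords((320, 320, 480, 500), scale, pad_x, pad_y, 1920, 1080))
```

For a 1920×1080 frame the scale factor is one third and the image is padded by 140 pixels at the top and bottom, so the padding must be subtracted before dividing by the scale when mapping boxes back.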
To overcome this, a thread pool architecture was implemented. The pipeline was broken into stages that run in parallel across the available CPU cores. Additionally, the hardware RGA (Raster Graphics Acceleration) module was used for image cropping and scaling, offloading these pixel operations from the CPU. This allowed the CPU, GPU, RGA, NPU, and VPU to work concurrently, maximizing resource utilization.
Key Optimization Techniques:
- Multi-threaded frame capture, pre-processing, inference, and post-processing
- RGA-based image resize and color space conversion to reduce CPU load
- Zero-copy buffer sharing between hardware blocks where possible
- Batch processing of multiple streams to keep the NPU pipeline full
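The staged design described above can be sketched with standard-library threads and bounded queues. This is a simplified illustration rather than the demonstrated implementation: it uses one worker thread per stage instead of a full thread pool, and the three stage functions are placeholders standing in for the RGA resize, the NPU inference call, and box post-processing.

```python
import queue
import threading

_STOP = object()  # sentinel passed downstream to shut each stage down


def stage(worker, inbox, outbox):
    """Apply worker to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(worker(item))


def preprocess(frame):
    # Placeholder for the hardware resize / color conversion step.
    stream_id, index = frame
    return stream_id, index, "resized"


def infer(item):
    # Placeholder for the NPU inference call; returns one fake detection.
    stream_id, index, _ = item
    return stream_id, index, [("person", 0.9)]


def postprocess(item):
    # Placeholder for coordinate scaling and overlay drawing.
    stream_id, index, detections = item
    return stream_id, index, len(detections)


def run_pipeline(frames, queue_size=8):
    """Push frames through the three stages concurrently and collect results."""
    workers = [preprocess, infer, postprocess]
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(workers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(w, queues[i], queues[i + 1]))
        for i, w in enumerate(workers)
    ]

    def feed():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_STOP)

    threads.append(threading.Thread(target=feed))
    for thread in threads:
        thread.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Frames from four streams interleaved round-robin, as (stream_id, index).
    interleaved = [(s, i) for i in range(3) for s in range(4)]
    print(run_pipeline(interleaved))
```

The bounded queues provide back-pressure, so a slow stage throttles capture instead of letting frames pile up in memory. Interleaving frames from all four streams into one input queue is one simple way to keep the inference stage busy; in a real deployment the heavy work inside each stage runs in native libraries or dedicated hardware, so the threads mostly wait on those calls.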
Real-World Application Scenarios
The ability to run multiple YOLOv8 variants on four video streams simultaneously opens up numerous industrial automation and smart city applications:
Intelligent Security & Surveillance
Detect abnormal behaviors like fighting or running in public areas. Combine person detection with face recognition to identify individuals against watchlists, all processed locally to maintain privacy.
Autonomous Mobile Robots (AMRs)
Use instance segmentation to identify obstacles and free paths. Pose estimation can monitor human workers nearby for safe human-robot collaboration. Oriented bounding boxes help detect pallets or shelves at any angle.
Industrial Quality Inspection
Deploy segmentation models to detect surface defects on manufactured parts. Classification models can sort products by grade. Multi-stream processing allows inspecting multiple production lines from a single edge node.
Traffic & Parking Management
Count vehicles, detect illegal parking, or locate license plates in angled views using oriented bounding boxes as a first step toward plate recognition. All processing happens at the edge, reducing bandwidth and latency.
Model Selection Guidelines for Edge Deployment
When deploying YOLOv8 on embedded hardware, choosing the right model scale and task type is critical. For a board with a 6 TOPS NPU, the Small (s) and Nano (n) variants offer the best balance. The following practical recommendations emerge from testing:
| Model Variant | Task | Expected Performance on 6 TOPS NPU | Best Use Case |
|---|---|---|---|
| YOLOv8n-det | Detection | ~30 fps per stream (1080p) | High-speed counting, basic presence detection |
| YOLOv8s-det | Detection | ~25 fps per stream (1080p) | General object detection with better accuracy |
| YOLOv8s-seg | Segmentation | ~20 fps per stream (1080p) | Defect detection, precise object outlining |
| YOLOv8s-pose | Pose | ~22 fps per stream (1080p) | Worker safety monitoring, sports analytics |
| YOLOv8s-obb | Oriented BBox | ~18 fps per stream (1080p) | Aerial imagery, rotated object detection |
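Note that these per-stream figures do not all hold when four streams run concurrently, because the streams share the CPU, the NPU, and memory bandwidth. The system demonstrated 60–70 fps aggregate across all four streams, or roughly 15 to 17.5 fps per stream, which is sufficient for many real-time industrial applications. Given the measured loads (CPU near 100%, NPU at 50–60%), CPU-side work is a more likely limit than the NPU itself. Further optimizations such as INT8 model quantization are said to boost performance by 30–50% with minimal accuracy loss, although the actual gain depends on the model and on whether it is already quantized.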
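To make the idea of INT8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It only illustrates the principle: production toolchains typically use calibration data and may apply asymmetric or per-channel schemes, and nothing here is claimed to match what the NPU vendor's converter does.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, f"scale={scale:.6f}", f"max error={worst:.6f}")
```

Each value is stored in one byte instead of four, and the rounding error per value is bounded by half the scale factor, which is why accuracy loss is often small when the value range is well behaved.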
Bottom Line: Edge AI platforms with dedicated NPUs are now capable of handling complex multi-stream video analytics that previously required a full PC or server. By carefully selecting model variants and applying hardware-specific optimizations, developers can build responsive, low-latency vision systems for industrial automation, smart cities, and beyond. The combination of YOLOv8’s versatility and efficient embedded hardware opens the door to scalable, cost-effective AI at the edge.