The Jetson Nano is low-powered but equipped with an NVIDIA GPU. NVIDIA TensorRT can be used to optimize neural networks for the GPU, achieving enough performance to run inference in real time. In this post we will convert a TensorFlow MobileNetV2 SSD neural network to TensorRT, deploy it in a ROS2 node and provide object detection at 40 FPS from a 720p live video stream.

[Figures: ssd_mobilenet_1, ssd_mobilenet_2 — example detections]

In a following post we will look at the code in more detail, but all the source code and details of running the project can be found on GitHub. In the meantime, let's look at an overview of the solution.

Design Overview

The example is built using the Robot Operating System 2 (ROS2) to provide a modular structure, interprocess communication and a distributed parameter system. Video frames are captured at 1280x720 from the CSI camera with a GStreamer pipeline, and the raw NVMM video data is color converted from YUV to RGB using CUDA before being passed upstream. In a prior step a pre-trained TensorFlow model is converted to UFF format so that it can be imported into TensorRT. After that, inference takes place entirely on the GPU and uses GPU RAM. The output of inference (bounding boxes for the detected objects and associated confidence levels) is sent to an OpenGL display accelerated with CUDA interop. At each stage buffers are used to improve throughput.
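As a rough sketch of how the pieces hang together, the fragment below shows the ROS2 plumbing of a detection node that consumes camera frames and publishes bounding boxes. The topic names, the vision_msgs message type and the run_inference() stub are illustrative placeholders, not the actual code from the repository.

```cpp
// Hypothetical sketch of the detection node's ROS2 plumbing. Topic names and
// run_inference() stand in for the project's real TensorRT wrapper.
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>
#include <vision_msgs/msg/detection2_d_array.hpp>

class DetectionNode : public rclcpp::Node {
public:
  DetectionNode() : Node("ssd_mobilenet") {
    pub_ = create_publisher<vision_msgs::msg::Detection2DArray>("detections", 10);
    sub_ = create_subscription<sensor_msgs::msg::Image>(
        "camera/image", 10,
        [this](sensor_msgs::msg::Image::SharedPtr img) {
          pub_->publish(run_inference(*img));   // TensorRT inference on the GPU
        });
  }

private:
  vision_msgs::msg::Detection2DArray run_inference(const sensor_msgs::msg::Image &) {
    // Placeholder: the real node feeds the GPU image buffer to the TensorRT
    // engine and fills the array with the detected boxes and confidences.
    return vision_msgs::msg::Detection2DArray();
  }

  rclcpp::Publisher<vision_msgs::msg::Detection2DArray>::SharedPtr pub_;
  rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr sub_;
};

int main(int argc, char **argv) {
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<DetectionNode>());
  rclcpp::shutdown();
  return 0;
}
```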

NVIDIA Jetson Nano

The NVIDIA Jetson Nano is a low-powered embedded system aimed at accelerating machine learning applications. These can include robotics, autonomous systems and smart devices. The Nano is the most constrained of the Jetson series of devices and offers the weakest performance, but with careful design it can achieve real-time inference.

Specifications:

- GPU: 128-core NVIDIA Maxwell
- CPU: quad-core ARM Cortex-A57
- Memory: 4 GB LPDDR4, shared between the CPU and GPU
- Power: 5 W and 10 W modes

[Figure: jetson_nano — the NVIDIA Jetson Nano developer kit]

TensorRT

TensorRT is NVIDIA’s highly optimized neural network inference framework for NVIDIA GPUs. TensorRT speeds up the network by using FP16 and INT8 precision instead of the default FP32 and, on GPUs that have them, by using Tensor Cores instead of the regular CUDA cores. TensorFlow models can be deployed on the Jetson Nano in two ways: with TF-TRT, where TensorFlow runs the model and offloads supported subgraphs to TensorRT, or with standalone TensorRT, where the model is first converted (here to UFF) and then runs without TensorFlow at all.

We choose not to use TF-TRT because it requires the full TensorFlow runtime on the device, which consumes a large share of the Nano's limited memory, and because only part of the graph runs inside TensorRT, leaving performance on the table.

The TensorRT workflow is: freeze the trained TensorFlow graph, convert the frozen graph to UFF, parse the UFF file to build an optimized TensorRT engine (this is where FP16 is selected and layers are fused), serialize the engine to disk, then deserialize it at runtime and run inference.
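To make the import step concrete, here is a rough sketch of building an engine from the exported UFF file with the TensorRT C++ API that ships on the Nano (the UFF parser was removed in later TensorRT releases). The tensor names "Input" and "NMS" and the 300x300 input size follow the common SSD-MobileNetV2 UFF samples and may differ from the exact graph used in the project.

```cpp
// Illustrative engine build from a UFF file with the TensorRT UFF parser.
#include <NvInfer.h>
#include <NvUffParser.h>

nvinfer1::ICudaEngine *buildEngine(nvinfer1::ILogger &logger, const char *uffPath) {
  auto builder = nvinfer1::createInferBuilder(logger);
  auto network = builder->createNetwork();
  auto parser  = nvuffparser::createUffParser();

  parser->registerInput("Input", nvinfer1::Dims3(3, 300, 300),
                        nvuffparser::UffInputOrder::kNCHW);
  parser->registerOutput("NMS");
  parser->parse(uffPath, *network, nvinfer1::DataType::kFLOAT);

  builder->setMaxBatchSize(1);
  builder->setMaxWorkspaceSize(1 << 26);   // 64 MB of GPU scratch space
  builder->setFp16Mode(true);              // FP16 is supported by the Nano's GPU

  return builder->buildCudaEngine(*network);
}
```

Building the engine is slow on the Nano, so in practice the engine is serialized to disk once and simply deserialized at start-up.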

SSD Object Detection

[Figure: SSD]

Object detection is a computer vision technique that allows simultaneous identification and localization of objects in images. When applied to video streams, this identification and localization can be used to count objects in a scene and to determine and track their precise locations. While our visual cortex achieves this task effortlessly, it is computationally intensive, and any CPU will struggle to achieve a 30 FPS real-time inference rate. Fortunately the parallel structure of a GPU can help us attain real-time performance even on embedded systems.

Single-Shot Detectors (SSDs) are a type of neural network that use a set of predetermined regions to detect objects. A grid of anchor points is laid over the input image, and at each anchor point boxes of various dimensions are defined. For each box at each anchor point, the model outputs a prediction of whether or not an object exists within the region. Because there are multiple boxes at each anchor point and anchor points may be close together, SSDs produce detections that overlap. Post-processing (non-maximum suppression) is applied in order to prune away the redundant predictions and keep the best one for each object. This is a one-pass operation, which contrasts with the two-pass operation of R-CNN. The accuracy of two-pass models is generally better, but one-pass models win in terms of speed and are thus attractive in embedded systems.
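The following is a minimal CPU sketch of that non-maximum suppression step, shown for clarity only; in the deployed network this typically runs on the GPU (for example via TensorRT's NMS plugin). The Box struct and the 0.5 IoU threshold are illustrative.

```cpp
// Greedy non-maximum suppression: keep the highest-scoring box and drop any
// box that overlaps it by more than the IoU threshold.
#include <algorithm>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

static float iou(const Box &a, const Box &b) {
  float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
  float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
  float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
  float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
  float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
  return inter / (areaA + areaB - inter);
}

std::vector<Box> nms(std::vector<Box> boxes, float iouThreshold = 0.5f) {
  std::sort(boxes.begin(), boxes.end(),
            [](const Box &a, const Box &b) { return a.score > b.score; });
  std::vector<Box> kept;
  for (const Box &candidate : boxes) {
    bool overlaps = false;
    for (const Box &k : kept)
      if (iou(candidate, k) > iouThreshold) { overlaps = true; break; }
    if (!overlaps) kept.push_back(candidate);   // strongest box per object survives
  }
  return kept;
}
```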

The SSD has two components: a backbone network (here MobileNetV2) that extracts feature maps from the input image, and the SSD detection head, a set of convolutional layers on top of those feature maps that predict the box offsets and class confidences at each anchor.

Camera Streaming

MIPI CSI cameras are compact sensors whose frames are acquired directly by the Jetson’s hardware CSI/ISP interface. We will use this interface with the Raspberry Pi Camera Module. By default, the CSI camera stream is created with a 1280x720 resolution.
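For reference, a typical capture pipeline for the Pi camera on the Nano looks like the string below. It is an illustrative starting point only: the project keeps the NV12 frames in NVMM memory and converts them with CUDA rather than letting nvvidconv/videoconvert do the work.

```cpp
// Illustrative GStreamer pipeline for the Raspberry Pi camera (IMX219) on the Nano.
const char *pipeline =
    "nvarguscamerasrc ! "
    "video/x-raw(memory:NVMM), width=1280, height=720, framerate=60/1, format=NV12 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! "
    "videoconvert ! video/x-raw, format=BGR ! "
    "appsink";
```

The same string can be passed to cv::VideoCapture with the CAP_GSTREAMER backend (if OpenCV was built with GStreamer support) as a quick way to verify the sensor works.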

Image Manipulation with CUDA

GStreamer acquires images from the low-level sensor in NV12 (YUV) color encoding. The NV12 image format is commonly found as the native format of machine vision and other video cameras. It is a format where color information is stored at a lower resolution than the intensity data. In the NV12 case the intensity (Y) data is stored as 8-bit samples, and the color (Cb, Cr) information as a 2x2 subsampled image; this is otherwise known as 4:2:0. The object detector DNN expects images in RGB format, so we need to convert them, using CUDA to accelerate the process.
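As an illustration of what that conversion involves, here is a minimal CUDA kernel that maps NV12 to interleaved RGB using the full-range BT.601 coefficients. The kernel and its launch configuration are a sketch under those assumptions, not the project's actual implementation.

```cpp
// Sketch of an NV12 -> RGB CUDA kernel (full-range BT.601 coefficients).
// y_plane:  width x height bytes of luma.
// uv_plane: interleaved U,V pairs at quarter resolution (4:2:0).
#include <cuda_runtime.h>
#include <cstdint>

__global__ void nv12_to_rgb(const uint8_t *y_plane, const uint8_t *uv_plane,
                            uint8_t *rgb, int width, int height) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;

  float Y = y_plane[y * width + x];
  int uv = (y / 2) * width + (x / 2) * 2;     // each U,V pair covers a 2x2 block
  float U = uv_plane[uv]     - 128.0f;
  float V = uv_plane[uv + 1] - 128.0f;

  float r = Y + 1.402f * V;
  float g = Y - 0.344f * U - 0.714f * V;
  float b = Y + 1.772f * U;

  int out = (y * width + x) * 3;
  rgb[out + 0] = (uint8_t)fminf(fmaxf(r, 0.0f), 255.0f);
  rgb[out + 1] = (uint8_t)fminf(fmaxf(g, 0.0f), 255.0f);
  rgb[out + 2] = (uint8_t)fminf(fmaxf(b, 0.0f), 255.0f);
}

// Launch with one thread per output pixel, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   nv12_to_rgb<<<grid, block>>>(d_y, d_uv, d_rgb, width, height);
```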