“Perception is the process of building an internal representation of the environment”

[Action in Perception, Alva Noë 2004]

In autonomous systems, perception is the ability of a computer system to use sensory inputs (perceive) to interpret its environment (comprehend) in order to make decisions (reason). This fundamental prerequisite of any intelligent cyber-physical system is what enables the system to act autonomously.

Building reliable vision capabilities has been a major obstacle in the development of autonomous mobile robots (e.g. self-driving cars). But by fusing the data from several sensors, developers are now able to sense an autonomous agent's environment better than human eyesight can. The key is diversity (multiple sensor modalities) and redundancy (multiple inputs for verification).

Sensor modalities

The four primary sensor types used on autonomous mobile agents are camera, lidar, radar and force/torque, each with its own modality [pixels, point clouds, range-angle, time series], as sketched in the example following this list.

  1. Camera. The human visual cortex has been trained over millennia to subconsciously recognise objects and the relationships between them. From the computer's perspective, the task is to make sense of an array of RGB(D) values, which no human can do on a conscious level. But recent progress, mostly (but not exclusively) fueled by deep neural networks, has led to computers outperforming humans in some tasks (e.g. medical diagnostic imaging).

  2. Radar (Radio Detection and Ranging) can supplement vision in situations of low visibility (darkness, fog) and improve the results of object detection. It transmits radio waves in pulses and uses the reflections to measure range and angle, and, via the Doppler effect, velocity directly. It provides no information on the type of object or its 3D structure.

  3. Lidar (Light Detection and Ranging) provides a 3D view of the environment, using pulsed lasers to provide shape and depth information. It is unaffected by darkness but is compromised by the scattering caused by atmospheric humidity (mist/fog). By emitting laser pulses at high rates, lidar sensors capture detailed 3D point clouds that represent the structure of the environment.

  4. FT (Force/Torque) sensors are electronic devices designed to detect and measure linear and rotational forces exerted upon them as a function of time [time-series modality]. They can be compared to the mechanoreceptors in skin that give humans the sense of touch. As contact sensors, they are specifically designed to interact with the physical objects in their environment.
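
To make these modalities concrete, the sketch below shows how each stream might look as a raw data array in Python. The shapes, sample counts and field orders are illustrative assumptions, not any particular sensor's output format.

```python
import numpy as np

# Camera: an H x W x 3 array of RGB intensities (plus an optional depth channel for RGB-D).
rgb_frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# Lidar: an unordered point cloud, one row per return (x, y, z, intensity).
point_cloud = np.zeros((120000, 4), dtype=np.float32)

# Radar: a sparse list of detections, each with range, angle and radial (Doppler) velocity.
radar_detections = np.zeros((64, 3), dtype=np.float32)

# Force/torque: a time series of 6-axis readings (Fx, Fy, Fz, Tx, Ty, Tz), e.g. one second at 1 kHz.
ft_samples = np.zeros((1000, 6), dtype=np.float32)
```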

When the modalities are combined they provide a perception stream that can be used to build a model (representation) of the environment. This can include the structure and three-dimensional shape of surrounding objects as well as estimates of their speed and distance. Additionally, an inertial measurement unit (IMU) can be used to estimate the agent's own location, velocity and acceleration.
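
As a rough illustration of how an IMU feeds into that estimate, the sketch below integrates acceleration samples into velocity and position (dead reckoning). It assumes the readings are already gravity-compensated and expressed in the world frame, ignores noise, and uses made-up sample data; in practice drift grows quickly without fusing other sensors.

```python
import numpy as np

def dead_reckon(accel, dt=0.01, v0=None, p0=None):
    """Integrate world-frame acceleration samples into velocity and position estimates."""
    v0 = np.zeros(3) if v0 is None else v0
    p0 = np.zeros(3) if p0 is None else p0
    velocity = v0 + np.cumsum(accel * dt, axis=0)     # first integration: acceleration -> velocity
    position = p0 + np.cumsum(velocity * dt, axis=0)  # second integration: velocity -> position
    return velocity, position

# One second of fake accelerometer data at 100 Hz: constant 0.5 m/s^2 forward.
accel = np.tile([0.5, 0.0, 0.0], (100, 1))
vel, pos = dead_reckon(accel)
print(vel[-1], pos[-1])   # roughly 0.5 m/s and 0.25 m travelled after one second
```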

Designing an autonomous system involves choosing a combination of sensors, which is a trade-off between performance, cost and reliability. Lidar costs an order of magnitude more than camera and radar, has a limited range, and relies on moving parts that compromise reliability. Cameras and radar are solid-state devices and can offer excellent reliability, but they do not produce 3D data directly. Creating 3D point clouds from cameras requires developing stereo-vision software, which adds cost and can impact reliability through software bugs.
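
To give a flavour of what that stereo-vision software involves, here is a minimal sketch using OpenCV's semi-global block matching to turn a rectified stereo pair into a 3D point cloud. The image file names, matcher settings and reprojection matrix Q are placeholders; a real system needs careful calibration and rectification first.

```python
import cv2
import numpy as np

# Rectified left/right grayscale images (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching produces a per-pixel disparity map.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # SGBM returns fixed-point values

# Q is the 4x4 reprojection matrix from stereo calibration (illustrative values only).
Q = np.float32([[1, 0, 0, -640],        # -cx
                [0, 1, 0, -360],        # -cy
                [0, 0, 0,  700],        # focal length in pixels
                [0, 0, -1 / 0.54, 0]])  # -1 / baseline (assuming a 0.54 m baseline)
points_3d = cv2.reprojectImageTo3D(disparity, Q)  # H x W x 3 array of (x, y, z) in metres
```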

Environment representation

“Solving a problem simply means representing it so as to make the solution transparent.” Herbert A. Simon, The Sciences of the Artificial

A fundamental problem in machine perception is learning correct representations of the world and estimating its current state. Among the numerous approaches to environment representation for mobile robots and autonomous vehicles, the most influential is occupancy grid mapping. This 2D mapping technique is still used on many mobile platforms because of its efficiency, probabilistic framework and fast implementation, but recent advances in software tools such as ROS and PCL, together with methods like OctoMap, have been contributing to the rise of 3D environment representations.
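
A minimal sketch of the idea behind occupancy grid mapping: each cell stores the log-odds of being occupied, and every observation nudges that value up or down. Real implementations ray-trace each beam from the sensor pose through the grid; the grid size and update constants below are illustrative.

```python
import numpy as np

# Log-odds occupancy grid: 0.0 means "unknown" (probability 0.5).
grid = np.zeros((200, 200), dtype=np.float32)   # e.g. 200 x 200 cells at 0.1 m resolution
L_OCC, L_FREE = 0.85, -0.4                      # illustrative log-odds increments

def update_cell(grid, i, j, hit):
    """Bayesian update of one cell given a beam that ended here (hit) or passed through (free)."""
    grid[i, j] += L_OCC if hit else L_FREE

def occupancy_probability(grid):
    """Convert log-odds back to a per-cell probability of occupancy."""
    return 1.0 - 1.0 / (1.0 + np.exp(grid))

update_cell(grid, 100, 120, hit=True)    # a lidar beam ended in this cell
update_cell(grid, 100, 110, hit=False)   # the same beam passed through this one
print(occupancy_probability(grid)[100, 120])  # > 0.5 after a single "hit"
```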

Processing methods

Visual perception based on pixels can largely be broken down into four tasks.

Sensor fusion and state estimation

The camera, radar and lidar sensors can provide a flow of rich data about the environment, but additional processing is required to make sense of it all. In humans this task falls to the visual cortex; in autonomous agents we provide the functionality through a process called sensor fusion. Usually implemented as a software algorithm, sensor fusion feeds the sensor inputs into a high-performance computer system, often an on-board System-on-Chip, which combines the relevant data to provide the diversity and redundancy required to make decisions. A well-known sensor fusion algorithm is the Kalman filter, which minimizes demands on system resources such as memory and processing power. Other methods like particle filters and graphical models are becoming increasingly popular as the performance of on-board computing improves.

Fusion algorithms take a series of observations made over time by one or more sensors and produce estimates of the unknown system state that are more accurate than those based on any single measurement alone. They do this by estimating a joint probability distribution over the state variables at each time step, but the details are beyond the scope of this post.
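
For a concrete feel of how such an estimator works, below is a bare-bones linear Kalman filter tracking a 1D position and velocity from noisy position measurements. The constant-velocity model is standard; the noise covariances and measurement values are made up for the example.

```python
import numpy as np

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity state transition
H = np.array([[1.0, 0.0]])              # we only observe position
Q = np.eye(2) * 1e-3                    # process noise covariance (illustrative)
R = np.array([[0.25]])                  # measurement noise covariance (illustrative)

x = np.zeros((2, 1))                    # state: [position, velocity]
P = np.eye(2)                           # state covariance

def kalman_step(x, P, z):
    # Predict the next state from the motion model.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update the prediction with the measurement z.
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

for z in [1.1, 1.9, 3.2, 3.9]:          # noisy position readings of a target moving ~1 unit per step
    x, P = kalman_step(x, P, np.array([[z]]))
print(x.ravel())                        # fused estimate of position and velocity
```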

Traditionally these tasks belonged to classical computer vision (think OpenCV) and were solved using methods like feature detection, optical flow and support vector machines. More recently, the highest-performing methods have mostly used deep neural networks to learn representations.
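
As a taste of the classical approach, the snippet below detects corner features in one frame and tracks them into the next with pyramidal Lucas-Kanade optical flow using OpenCV; the frame file names and parameter values are placeholders.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder file names).
prev_frame = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

# Classical feature detection: Shi-Tomasi corners.
corners = cv2.goodFeaturesToTrack(prev_frame, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Classical motion estimation: pyramidal Lucas-Kanade optical flow.
tracked, status, _err = cv2.calcOpticalFlowPyrLK(prev_frame, next_frame, corners, None)

# Keep only the features that were tracked successfully and measure how far they moved.
ok = status.ravel() == 1
flow = tracked[ok] - corners[ok]
print(f"tracked {ok.sum()} features, mean motion {np.linalg.norm(flow, axis=2).mean():.2f} px")
```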

System development

It is expensive and impractical for every machine perception engineer to have a physical autonomous system available for development, so alternatives must be found. In practice, most development relies on a combination of robotic environment simulation (e.g. with Gazebo or Unity) and replaying real data, either from open benchmark datasets (KITTI, Cityscapes, Oxford) or from custom datasets collected by the development team. In a follow-up post we will replay the popular KITTI dataset using ROS 2 and go on to visualize it with RViz.
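
As a small preview of working with that data, KITTI's raw Velodyne scans are stored as flat binary files of float32 values, four per point (x, y, z, reflectance), so a single scan can be loaded with a couple of lines of NumPy. The file path below is a placeholder for a scan from the dataset.

```python
import numpy as np

def load_kitti_scan(path):
    """Load one KITTI Velodyne scan as an N x 4 array of (x, y, z, reflectance)."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

scan = load_kitti_scan("0000000000.bin")   # placeholder path to a raw KITTI .bin file
print(scan.shape)                          # typically around (120000, 4) points per scan
```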