In this blog article, we will review the paper SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving by Zhenpei Yang et al., which was presented at CVPR 2020.
I will quickly go over the outline of this article. First, I will introduce you to the motivation and goal behind this work. Next, I will provide a brief insight into the state of the art and introduce you to concepts which are necessary to grasp the proposed method. Afterwards, I will give a detailed explanation of the approach proposed by the authors and we will have a look at the experiments and the corresponding results. Lastly, I will present the authors' conclusion, subsequent work, and my personal opinion on the paper. Images that do not cite a source were either provided in the paper or created by me.
An important requirement for the evaluation of autonomous driving systems is the availability of high-quality sensor data. This data should be representative of real-world conditions. However, the chart below shows that the number of annotated frames in datasets has only increased by a factor of 10 over the last 8 years, implying that the collection of such high-quality data is resource-intensive.
The goal of the authors is to leverage the real-world LIDAR and camera data provided by the Waymo Open Dataset (WOD) to simulate traffic scenarios and synthesize realistic novel views.
Specifically, the system should generate realistic sensor data of previously unseen trajectories, which is useful for the training of downstream modules. This is depicted in the figure below.
The authors propose a two-staged approach to generate novel data. First, they reconstruct the environment using a texture-mapped surfel representation, which is simple and computationally efficient, and render the scene for novel poses. Next, they close the domain gap between the synthetic and real data using a Generative Adversarial Network (GAN).
The authors provide three main contributions: the texture-enhanced surfel map representation used for scene reconstruction, the SurfelGAN model that closes the domain gap between surfel renderings and real images, and the Dual-Camera-Pose Dataset used for evaluating the realism of the synthesized images.
Nowadays, simulated environments are used to evaluate autonomous driving systems. They are usually built on top of game engines such as Unity or Unreal Engine. One example is CARLA, an open-source simulation engine that models the 3D environment and is used to train and test self-driving vehicles. However, these simulation engines suffer from two main problems. First, significant manual effort is required to create the environments, so they cannot be scaled easily. Second, there is still a large domain gap between the simulated and real environments.
Another recent publication in the field of novel view synthesis is Augmented Autonomous Driving Simulation (AADS), proposed by Li et al. It aims to generate novel views based on image data alone, whereas the work discussed here by Yang et al. reconstructs the 3D environment using both LIDAR and camera data. Two problems arise from relying solely on images. First, one has no freedom to synthesize novel views that could not be easily captured in the real world. Second, one needs to store the images and query the nearest views when synthesizing novel views.
Another approach in the domain of novel view generation is Neural Radiance Fields (NeRF), proposed by Mildenhall and his colleagues. It synthesizes novel views by optimizing an underlying volumetric scene function using a sparse set of input images. However, the images need to be taken from various angles around the object, which is not always possible in autonomous driving. Also, the method has only been shown to work on rather small objects, which raises the question of whether it is applicable to the domain of autonomous driving.
It is also possible to synthesize data by using methods from image translation based on GANs. Wang and his colleagues performed video-to-video synthesis on a dataset called Cityscapes. They convert videos of semantic segmentation masks into realistic videos by using flow maps, image and video discriminators, as well as a spatio-temporally progressive training paradigm, which is shown in the image below. However, this approach requires accurate per-pixel semantic annotations of scenes, which are often not feasible to obtain. Contrary to the paper discussed in this article, the network used for video-to-video synthesis only uses paired data for training, which is harder to attain.
There are two important concepts which are needed to understand the paper in its full detail. In the following, I will explain them as concisely and simply as possible.
The first concept I would like to introduce is the surfel, which is short for surface element. Multiple surfels can approximate any 3D surface shape. Due to their fixed size, they are compact as well as easy to reconstruct, texture, and compress. However, they lack detailed texture information. A surfel, which is usually a disk, is parametrized by its radius, its center coordinate, and its normal (the vector perpendicular to the surface).
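To make the parametrization concrete, here is a minimal sketch of a surfel as a data structure (the class and field names are mine, not from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    """A surfel: a small disk in 3D approximating a patch of a surface."""
    center: np.ndarray  # 3D coordinate of the disk center
    normal: np.ndarray  # unit vector perpendicular to the surface
    radius: float       # fixed disk radius

    def area(self) -> float:
        # area of the disk covered by this surfel
        return np.pi * self.radius ** 2

s = Surfel(center=np.zeros(3), normal=np.array([0.0, 0.0, 1.0]), radius=0.1)
print(round(s.area(), 4))  # 0.0314
```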
The next concept is a variant of the classic GAN called CycleGAN. It translates an image from a source domain 𝑋 to a target domain 𝑌 using unpaired data. The translation should be cycle-consistent, meaning that the mappings 𝐺:𝑋⟼𝑌 and 𝐹:𝑌⟼𝑋 should be inverses of each other. This preserves the original image after translation and subsequent reverse translation and is achieved by including a cycle-consistency loss during training. As a result, the image 𝑥̂ attained after translation and subsequent reverse translation is close to the original image 𝑥. Discriminators 𝐷𝑋 and 𝐷𝑌 try to distinguish between synthesized and real images from their respective domains. The generators use the discriminators' feedback to improve their ability to synthesize high-quality fake images of a specific domain.
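The cycle-consistency idea can be illustrated with a runnable toy example. Here the "generators" are simple linear maps that happen to be exact inverses, so the loss is near zero; in CycleGAN both would be deep networks trained jointly with the adversarial losses. The matrices below are illustrative assumptions, not part of the paper:

```python
import numpy as np

# Toy stand-ins for the two generators: A maps domain X to Y, A_inv maps back.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
A_inv = np.linalg.inv(A)

def G(x):  # G: X -> Y
    return x @ A.T

def F(y):  # F: Y -> X
    return y @ A_inv.T

def cycle_consistency_loss(x, y):
    # L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1
    return np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean()

x = np.random.randn(4, 2)
y = np.random.randn(4, 2)
print(cycle_consistency_loss(x, y))  # ~0, since F is the exact inverse of G
```

With imperfect generators the loss is positive, and minimizing it pushes 𝐹∘𝐺 and 𝐺∘𝐹 toward the identity.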
As mentioned in the introduction, the proposed approach can be divided into two stages. Therefore, I will first explain the process regarding the scene reconstruction using texture-enhanced surfels. After that, I will detail the image synthesis using the SurfelGAN.
The first stage of the method is the surfel scene reconstruction. It aims to reconstruct the perceived environment using surfels. After that, each surfel is discretized and several colors are assigned to each grid element of each surfel.
The scene is discretized into a 3D grid using volume elements (voxels) of fixed size 𝑣. For each voxel, the surfel parameters, such as the mean coordinate and the surfel normal, are estimated from the LIDAR points in that voxel. Specifically, the radius is chosen as √3·𝑣 and the surfel normal is estimated using Principal Component Analysis (PCA). Traditional surfels suffer from a tradeoff between geometric consistency and rich texture detail: a larger surfel radius improves the geometry of the reconstruction, but the texture lacks detail since each surfel has only one color. Therefore, the authors discretize each surfel disk into a 𝑘×𝑘 grid centered on its point centroid and assign an independent color to each grid element - so-called texture-enhanced surfels. These achieve both geometric consistency and rich texture detail by allowing a larger surfel radius while permitting multiple colors per surfel. Moreover, the color of a surfel can differ depending on the viewing pose and illumination. Therefore, a codebook of such grids from 𝑛 distances is stored for each surfel. The distances are chosen uniformly, meaning that a grid is stored for when the vehicle is e.g. 10 m, 20 m, 30 m, and so on, away from the respective surfel. For all experiments, the authors chose 𝑣 = 0.2 m, 𝑘 = 5, and 𝑛 = 10.
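As a sketch of the per-voxel estimation step, the following assumes a set of LIDAR points that fall into a single voxel and estimates the surfel center and normal with PCA, where the normal is the direction of smallest variance. The function name and the synthetic points are mine; only the voxel size and radius follow the paper:

```python
import numpy as np

V = 0.2                   # voxel size v = 0.2 m, as in the paper
RADIUS = np.sqrt(3) * V   # surfel radius sqrt(3) * v

def estimate_surfel(points):
    """Estimate one surfel (center, normal, radius) from the LIDAR points
    in one voxel. PCA: the normal is the eigenvector of the covariance
    matrix with the smallest eigenvalue."""
    center = points.mean(axis=0)
    cov = np.cov((points - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]                  # direction of smallest variance
    return center, normal, RADIUS

# Synthetic points sampled near the z = 0 plane: the normal should point along z.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(0, V, 50),
                       rng.uniform(0, V, 50),
                       rng.normal(0, 1e-4, 50)])
center, normal, r = estimate_surfel(pts)
print(abs(normal[2]))  # close to 1: the normal points along z
```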
Dynamic objects such as vehicles or pedestrians are also reconstructed using these texture-enhanced surfels. First, I will explain the process of reconstructing vehicles and then how the reconstruction of pedestrians differs. Each vehicle in the scene is reconstructed using a separate model. For this, the authors make use of the 3D bounding box annotations from the Waymo Open Dataset to accumulate LIDAR points from multiple scans. In case the bounding box annotations are not consistent over multiple frames, they refine the point cloud registration using the Iterative Closest Point (ICP) algorithm. ICP is a well-known algorithm used to align two point clouds; here, the authors utilize it to produce a dense point cloud for the respective vehicle. I have linked a video by Cyrill Stachniss regarding point cloud registration using ICP, which I found very helpful to get a basic understanding. Lastly, they reconstruct each vehicle using the enhanced surfel representation. The method leverages the symmetry of vehicles to improve the reconstructions. Pedestrians are deformable, dynamic objects, and therefore one needs to construct a separate surfel model for each LIDAR scan. The reconstructions of vehicles and pedestrians can be placed in any location, meaning that objects can be perturbed from their original positions to synthesize novel scenes. However, these reconstructions might have imperfect geometry and texturing.
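The paper does not spell out its exact ICP variant, so here is a generic point-to-point ICP step as a sketch of the underlying idea: match each source point to its nearest destination point, then solve for the best rigid transform in closed form (Kabsch/SVD). The demo point clouds are mine:

```python
import numpy as np

def icp_step(src, dst):
    """One point-to-point ICP iteration: nearest-neighbour matching,
    then closed-form rigid alignment (Kabsch/SVD)."""
    # brute-force nearest neighbours
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    matched = dst[d2.argmin(axis=1)]
    # closed-form rigid alignment
    mu_s, mu_d = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return src @ R.T + t

# Demo: a grid of points and a translated copy of it.
g = np.linspace(0, 1, 5)
dst = np.array(np.meshgrid(g, g, g)).reshape(3, -1).T
src = dst + np.array([0.05, -0.03, 0.02])
src = icp_step(src, dst)
print(np.abs(src - dst).max())  # ~0: one step recovers the pure translation
```

In practice several iterations are run, re-matching after every transform update, until the alignment converges.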
The figure below shows the results attained by different scene reconstruction strategies. The top row shows the reconstruction using basic surfels, the second row uses the texture-enhanced surfels, whereas the third row contains the real images. Compared to the first row, the second row contains fewer artifacts and yields smoother coloring.
For the image synthesis using the SurfelGAN, two kinds of training data are available - paired and unpaired data. The Waymo Open Dataset contains sequences of camera and LIDAR data, from which surfel renderings paired with their corresponding real camera images can be derived; surfel renderings and camera images without such a correspondence serve as unpaired data.
The SurfelGAN is a generative model that converts surfel renderings into realistic-looking images. The architecture consists of two symmetric generators 𝐺𝑆↦𝐼 and 𝐺𝐼↦𝑆 as well as two discriminators 𝐷𝑆 and 𝐷𝐼. One generator maps from the surfel domain 𝑆 to the image domain 𝐼 and the other performs the inverse mapping. Each discriminator is specialized in one domain.
There are three different variants of the SurfelGAN. SurfelGAN-S only utilizes a supervised term which aligns the generated images to the real images. The supervised term consists of the L1 loss for the reconstruction and the cross-entropy loss for the semantic and instance maps. This variant only makes use of the paired data. SurfelGAN-SA can be attained by using an additional adversarial term in the form of the hinged Wasserstein loss. SurfelGAN-SAC also adds a cycle-consistency term in the form of an L1 loss for renderings and images. Both variants make use of paired and unpaired data. In the image below, I summarized the information provided for the SurfelGAN variants and loss functions.
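Based only on the description above, the three generator objectives can be sketched as follows. The function names and the weights w_adv and w_cyc are hypothetical placeholders; the paper's exact weighting is not restated here:

```python
import numpy as np

def l1(a, b):
    # mean absolute error, used for the supervised and cycle terms
    return np.abs(a - b).mean()

def hinge_g(fake_scores):
    # generator side of the hinged Wasserstein loss:
    # push the discriminator's scores on fakes upward
    return -fake_scores.mean()

def hinge_d(real_scores, fake_scores):
    # discriminator side of the hinged Wasserstein loss
    return (np.maximum(0, 1 - real_scores).mean()
            + np.maximum(0, 1 + fake_scores).mean())

def surfelgan_s(fake_img, real_img):
    # S: supervised term only (paired data)
    return l1(fake_img, real_img)

def surfelgan_sa(fake_img, real_img, fake_scores, w_adv=1.0):
    # SA: supervised + adversarial term
    return surfelgan_s(fake_img, real_img) + w_adv * hinge_g(fake_scores)

def surfelgan_sac(fake_img, real_img, fake_scores, cycled, orig,
                  w_adv=1.0, w_cyc=1.0):
    # SAC: supervised + adversarial + cycle-consistency term
    return (surfelgan_sa(fake_img, real_img, fake_scores, w_adv)
            + w_cyc * l1(cycled, orig))

out = surfelgan_sac(np.zeros((2, 2)), np.ones((2, 2)),
                    np.array([0.5]), np.zeros((2, 2)), np.zeros((2, 2)))
print(out)  # 0.5 = 1.0 (L1) - 0.5 (adversarial) + 0.0 (cycle)
```

The cross-entropy terms for the semantic and instance maps are omitted here for brevity.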
The figure below shows the surfel rendering, the results attained by the three SurfelGAN variants, as well as the real image. Every SurfelGAN variant improves on the plain surfel rendering. The images generated by SurfelGAN-SAC are very similar to the real images.
One criterion used to evaluate the SurfelGAN is vehicle detector realism. For this, the generated camera data is fed to an off-the-shelf vehicle object detector. The evaluation metrics include the average precision (AP) and its variants as well as the recall at 100 (Rec). If the metrics attained using synthesized images are close to the ones obtained by feeding real images to the detector, one can conclude that the synthesized images are close to real ones in terms of realism.
Also, WOD-EVAL is the only dataset that contains ground truth camera images; thus, these images are used as an input for the detector to provide an upper bound of the performance. The metrics of the images generated by SurfelGAN-SAC are close to those of the real images, implying that in the eyes of the detector the generated images are similar to the real ones [see Table 1].
The figure below shows synthesized novel views. The right column contains the original view which is reconstructed using surfels. In the left column, a random transformation is applied to attain views. Then, these are fed to the SurfelGAN to synthesize images, as seen in the middle column.
The authors conclude that the SurfelGAN model generalizes well because the relative improvement over the baseline is very similar between WOD-TRAIN Novel View and WOD-EVAL Novel View. Moreover, the surfel renderings have a quality bias with respect to the viewing direction, meaning that the original poses should not be perturbed too much to attain high-quality novel data [see Table 2]. For this, the deviation 𝑑 is introduced, which is a weighted sum of the translational and rotational differences of poses. If 𝑑 is increased, the performance values of the vehicle detector decrease. The deviation is computed as 𝑑 = ‖𝑡 − 𝑡′‖ + 𝜆𝑅 · ∠(𝑅, 𝑅′), where ∠(𝑅, 𝑅′) denotes the angle between the two rotations.
Here, 𝑡 and 𝑅 are the translation and rotation of the novel view in WOD-EVAL-NV, whereas 𝑡′ and 𝑅′ are the translation and rotation of the closest pose in WOD-EVAL. For all experiments, the authors use a fixed weighting 𝜆𝑅.
Another criterion used to evaluate the SurfelGAN is the pixel-wise realism of images, which can be measured by the well-known L1 distance. For this, the Dual-Camera-Pose Dataset is used. It contains scenarios in which two vehicles observe the same scene at the same time. The scene is reconstructed from the source camera using surfels. After that, a rotation and a translation are applied to arrive at the position of the target camera. Using the surfel renderings of that view, images are synthesized by the SurfelGAN.
These images can be matched to real images, and one can report the L1 distance on the pixels covered by the surfel rendering. Table 3 shows that every SurfelGAN variant improves on the plain surfel renderings. Moreover, SurfelGAN-S outperforms the other variants because it directly optimizes for the L1 distance.
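The reported metric can be sketched as an L1 distance restricted to the pixels the surfel rendering actually covers. This is a minimal sketch assuming a boolean coverage mask; the names are mine:

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Mean absolute pixel error restricted to the pixels covered by the
    surfel rendering (mask == True)."""
    return np.abs(pred[mask] - target[mask]).mean()

pred = np.zeros((4, 4))
target = np.ones((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[:2] = True  # only the top half is covered by surfels
print(masked_l1(pred, target, mask))  # 1.0
```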
Furthermore, the authors explore whether the generated images from perturbed views are a helpful form of data augmentation. For this, they train a vehicle object detector from scratch using different combinations of data. WOD-TRAIN Novel View only inherits 3D bounding boxes from WOD-TRAIN; it does not contain the 2D camera bounding boxes needed to train the detector. Therefore, these are approximated by projecting all the surfels inside a 3D bounding box into the 2D novel view and taking the axis-aligned bounding box of the projections. Table 4 shows that using plain surfel renderings to augment the data already boosts the detector performance. However, the improvement is even larger if images generated by the SurfelGAN are used for data augmentation. One can conclude that the generated images from perturbed views are a helpful form of data augmentation.
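The 2D box approximation can be sketched as follows, assuming a simple pinhole camera with hypothetical intrinsics K and surfel points already expressed in camera coordinates:

```python
import numpy as np

def approx_2d_box(points_3d, K):
    """Approximate a 2D bounding box: project the 3D surfel points inside
    a 3D box with pinhole intrinsics K, then take the axis-aligned extent
    of the projections."""
    uvw = points_3d @ K.T                # homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]        # perspective division
    return uv.min(axis=0), uv.max(axis=0)  # (u_min, v_min), (u_max, v_max)

# Hypothetical intrinsics and three surfel points in camera coordinates.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[1.0, 0.5, 10.0],
                [-1.0, -0.5, 10.0],
                [0.0, 0.0, 12.0]])
lo, hi = approx_2d_box(pts, K)
print(lo, hi)  # [270. 215.] [370. 265.]
```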
There are several limitations identified by the authors. First, the SurfelGAN is not able to fix broken geometry if the initial reconstructed surfel map contains very large errors. This can be seen in the first row, where the surfel reconstruction of the silver car is not optimal. Consequently, the synthesized result also includes an unrealistic-looking car. Second, the model hallucinates when surfel cues are lacking. This can be seen in the second row, where the model synthesized buildings at a place which initially did not include surfels.
The authors conclude that the experimental results demonstrate a high level of realism in the synthesized sensor data and that the generated images can be used for data augmentation, as shown by training a vehicle object detector from scratch. As a next step, they plan to enhance the camera simulation by improving the dynamic object modeling process as well as to investigate temporally consistent video generation.
In my opinion, the paper presents novel work with the potential to have an impact on the status quo. The approach achieves promising results using only a combination of rather common methods. Moreover, the paper provides a good overview of current state-of-the-art methods in novel view synthesis as well as in autonomous driving system testing.
However, there are some points I want to address. The paper does not provide any quantitative or qualitative comparisons to other approaches, such as AADS. The authors mentioned that the experimental setups were too different for a comparison. Moreover, there is no qualitative analysis on the Dual-Camera-Pose Dataset; it is hard to judge the performance and quality of the generated images based only on the reported L1 distances of each SurfelGAN variant. Also, there is no information regarding the time needed for the surfel scene reconstruction. The authors note, however, that it takes several minutes to reconstruct one sequence of the WOD. Additionally, the explanations are not sufficient to recreate the proposed approach, considering that the code is not publicly available. Therefore, it takes more time to understand the paper in its full depth.
Mayuran Surendran is part of the AI Engineering team of Design AI, a start-up focusing on agile AI development and use case identification through Design Thinking. He is pursuing a M.Sc. in Robotics, Cognition and Intelligence at Technical University of Munich. Within Design AI, he is mainly focusing on Reinforcement Learning.