Q: Detector timing?
A: The timing reported in the paper is wrong. The total runtime of PIXOR is 35 ms on an NVIDIA TITAN Xp GPU, which consists of 1 ms for data voxelization, 31 ms for the network forward pass, and 3 ms for oriented-NMS. Note that both voxelization and oriented-NMS are implemented on the GPU for efficiency.

Q: How many residual layers in Res_block_5 in Figure 2?
A: There's a typo in the text. It should be 3.

Q: log(dx), log(dy) in regression targets?
A: This is a typo as well. They should be dx and dy.

Q: Network optimization details on KITTI?
A: We train the network using stochastic gradient descent with momentum for 35 epochs on 4 NVIDIA 1080Ti GPUs, with 4 frames per GPU. The initial learning rate is 0.01, and we divide it by 10 after 20 epochs and again after 30 epochs. Training takes less than 4 hours.
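
As an illustration only, the step schedule described above can be sketched in plain Python. The function name `learning_rate`, the 0-based epoch indexing, and the reading of "after 20 epochs" as "from epoch index 20 onward" are assumptions made for this sketch, not details confirmed by the answer.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(20, 30), factor=10.0):
    """Return the learning rate for a 0-based epoch index.

    Assumption: the rate is divided by `factor` once for every
    milestone that has already been completed.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** passed)


# Per-GPU batch of 4 frames on 4 GPUs; the total of 16 frames per step
# assumes synchronous data-parallel training (not stated in the answer).
frames_per_step = 4 * 4

# Print the rate at the first epoch and at each change point.
for epoch in (0, 19, 20, 29, 30, 34):
    print(epoch, learning_rate(epoch))
```

Under these assumptions, epochs 0 to 19 run at the initial rate, epochs 20 to 29 at one tenth of it, and epochs 30 to 34 at one hundredth.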

Q: How do you evaluate on KITTI without a proper 2D detection box?
A: We manually set the 2D box height according to the distance from the BEV detection to the ego-car. Specifically, if the distance is larger than 60 meters, we set the 2D box height to 10 pixels; if the distance is larger than 30 meters and smaller than 60 meters, we set it to 30 pixels; if the distance is smaller than 30 meters, we set it to 50 pixels.
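
The distance rule above amounts to a three-way lookup, sketched below. The function name `box_height_px` is hypothetical, and the handling of distances of exactly 30 or 60 meters (assigned here to the nearer range) is an assumption, since the answer does not specify those boundary cases.

```python
def box_height_px(distance_m):
    """Map a BEV detection's distance to the ego-car (meters)
    to a 2D box height (pixels).

    Assumption: distances of exactly 30 m or 60 m fall into the
    nearer range; the original rule leaves these cases unspecified.
    """
    if distance_m > 60:
        return 10
    if distance_m > 30:
        return 30
    return 50


# Example: one detection from each of the three distance ranges.
for d in (10.0, 45.0, 70.0):
    print(d, box_height_px(d))
```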