|
# ScanNet Dataset |
|
|
|
MMDetection3D supports LiDAR-based detection and segmentation on the ScanNet dataset. This page provides specific tutorials about its usage.
|
|
|
## Dataset preparation |
|
|
|
For the overall process, please refer to the [README](https://github.com/open-mmlab/mmdetection3d/blob/dev-1.x/data/scannet/README.md) page for ScanNet. |
|
|
|
### Export ScanNet point cloud data |
|
|
|
By exporting ScanNet data, we load the raw point cloud data and generate the relevant annotations, including semantic labels, instance labels and ground truth bounding boxes.
|
|
|
```shell
python batch_load_scannet_data.py
```
|
|
|
The directory structure before data preparation should be as below:
|
|
|
```
mmdetection3d
├── mmdet3d
├── tools
├── configs
├── data
│   ├── scannet
│   │   ├── meta_data
│   │   ├── scans
│   │   │   ├── scenexxxx_xx
│   │   ├── batch_load_scannet_data.py
│   │   ├── load_scannet_data.py
│   │   ├── scannet_utils.py
│   │   ├── README.md
```
|
|
|
Under the folder `scans` there are 1201 train and 312 validation folders in total, in which the raw point cloud data and relevant annotations are saved. For instance, under the folder `scene0001_01` the files are as below:
|
|
|
- `scene0001_01_vh_clean_2.ply`: Mesh file storing the coordinates and colors of each vertex. The mesh's vertices are taken as the raw point cloud data.

- `scene0001_01.aggregation.json`: Aggregation file including object IDs, segment IDs and labels.

- `scene0001_01_vh_clean_2.0.010000.segs.json`: Segmentation file including segment IDs and the vertices belonging to each segment (see the sketch after this list).

- `scene0001_01.txt`: Meta file including the axis-alignment matrix, etc.

- `scene0001_01_vh_clean_2.labels.ply`: Annotation file containing the category of each vertex.
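
For a quick look at what the aggregation and segmentation files contain, here is a minimal sketch that loads them with plain `json`. The field names (`segGroups`, `objectId`, `label`, `segments`, `segIndices`) follow the standard ScanNet v2 format; the scene name is just an example.

```python
import json

# Peek into the aggregation file: each entry in 'segGroups' describes one
# object with its id, semantic label and the segment ids it covers.
with open('data/scannet/scans/scene0001_01/scene0001_01.aggregation.json') as f:
    agg = json.load(f)
for obj in agg['segGroups'][:3]:
    print(obj['objectId'], obj['label'], obj['segments'][:5])

# Peek into the segmentation file: 'segIndices' maps every mesh vertex to
# the id of the segment containing it.
with open('data/scannet/scans/scene0001_01/'
          'scene0001_01_vh_clean_2.0.010000.segs.json') as f:
    segs = json.load(f)
print(len(segs['segIndices']))  # equals the number of mesh vertices
```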
|
|
|
The procedure of exporting ScanNet data by running `python batch_load_scannet_data.py` mainly includes the following 3 steps: |
|
|
|
- Export the original files to point cloud, instance label, semantic label and bounding box files.
|
- Downsample raw point cloud and filter invalid classes. |
|
- Save point cloud data and relevant annotation files. |
|
|
|
The core function `export` in `load_scannet_data.py` is as follows:
|
|
|
```python
def export(mesh_file,
           agg_file,
           seg_file,
           meta_file,
           label_map_file,
           output_file=None,
           test_mode=False):

    # label map file: ./data/scannet/meta_data/scannetv2-labels.combined.tsv
    # it contains the various label standards, e.g. 'nyu40id'
    label_map = scannet_utils.read_label_mapping(
        label_map_file, label_from='raw_category', label_to='nyu40id')
    # load raw point cloud data, 6-dim features: XYZRGB
    mesh_vertices = scannet_utils.read_mesh_vertices_rgb(mesh_file)

    # Load the scene axis alignment matrix: a 4x4 transformation matrix
    # transforming raw points in the sensor coordinate system to a coordinate
    # system which is axis-aligned with the length/width of the room
    lines = open(meta_file).readlines()
    # test set data doesn't have align_matrix
    axis_align_matrix = np.eye(4)
    for line in lines:
        if 'axisAlignment' in line:
            axis_align_matrix = [
                float(x)
                for x in line.rstrip().strip('axisAlignment = ').split(' ')
            ]
            break
    axis_align_matrix = np.array(axis_align_matrix).reshape((4, 4))

    # perform global alignment of mesh vertices
    pts = np.ones((mesh_vertices.shape[0], 4))
    # raw point cloud in homogeneous coordinates, each row: [x, y, z, 1]
    pts[:, 0:3] = mesh_vertices[:, 0:3]
    # transform raw mesh vertices to aligned mesh vertices
    pts = np.dot(pts, axis_align_matrix.transpose())  # Nx4
    aligned_mesh_vertices = np.concatenate([pts[:, 0:3], mesh_vertices[:, 3:]],
                                           axis=1)

    # Load semantic and instance labels
    if not test_mode:
        # each object has one semantic label and consists of several segments
        object_id_to_segs, label_to_segs = read_aggregation(agg_file)
        # many points may belong to the same segment
        seg_to_verts, num_verts = read_segmentation(seg_file)
        label_ids = np.zeros(shape=(num_verts), dtype=np.uint32)
        object_id_to_label_id = {}
        for label, segs in label_to_segs.items():
            label_id = label_map[label]
            for seg in segs:
                verts = seg_to_verts[seg]
                # each point has one semantic label
                label_ids[verts] = label_id
        instance_ids = np.zeros(
            shape=(num_verts), dtype=np.uint32)  # 0: unannotated
        for object_id, segs in object_id_to_segs.items():
            for seg in segs:
                verts = seg_to_verts[seg]
                # object_id is 1-indexed, i.e. 1, 2, 3, ..., NUM_INSTANCES
                # each point belongs to one object
                instance_ids[verts] = object_id
                if object_id not in object_id_to_label_id:
                    object_id_to_label_id[object_id] = label_ids[verts][0]
        # bbox format is [x, y, z, x_size, y_size, z_size, label_id]
        # [x, y, z] is the gravity center of the bbox and
        # [x_size, y_size, z_size] is the axis-aligned size;
        # label_id is the semantic label id in the 'nyu40id' standard
        # Note: since the 3D bbox is axis-aligned, the yaw is 0.
        unaligned_bboxes = extract_bbox(mesh_vertices, object_id_to_segs,
                                        object_id_to_label_id, instance_ids)
        aligned_bboxes = extract_bbox(aligned_mesh_vertices, object_id_to_segs,
                                      object_id_to_label_id, instance_ids)
        ...

    return mesh_vertices, label_ids, instance_ids, unaligned_bboxes, \
        aligned_bboxes, object_id_to_label_id, axis_align_matrix
```
|
|
|
After exporting each scan, the raw point cloud can be downsampled, e.g. to 50000 points, if the number of points is too large (the raw point cloud is not downsampled if it is also used for the 3D semantic segmentation task). In addition, invalid semantic labels outside the `nyu40id` standard or the optional `DONOT CARE` classes are filtered. Finally, the point cloud data, semantic labels, instance labels and ground truth bounding boxes are saved as `.npy` files.
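
The following is a minimal sketch of this downsampling and filtering step. The `OBJ_CLASS_IDS` subset and the helper function are illustrative stand-ins for the logic in `batch_load_scannet_data.py`, not the script itself:

```python
import numpy as np

# nyu40 class ids kept for the detection task (assumed subset mirroring
# batch_load_scannet_data.py)
OBJ_CLASS_IDS = np.array(
    [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 24, 28, 33, 34, 36, 39])


def downsample_and_filter(points, label_ids, instance_ids, bboxes,
                          num_points=50000):
    # randomly downsample the point cloud if it is too large
    if points.shape[0] > num_points:
        choices = np.random.choice(
            points.shape[0], num_points, replace=False)
        points = points[choices]
        label_ids = label_ids[choices]
        instance_ids = instance_ids[choices]
    # keep only boxes whose semantic label id is in the valid subset
    bboxes = bboxes[np.isin(bboxes[:, -1], OBJ_CLASS_IDS)]
    return points, label_ids, instance_ids, bboxes
```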
|
|
|
### Export ScanNet RGB data (optional) |
|
|
|
By exporting ScanNet RGB data, for each scene we load a set of RGB images with corresponding 4x4 pose matrices and a single 4x4 camera intrinsic matrix. Note that this step is optional and can be skipped if you do not plan to use multi-view detection.
|
|
|
```shell
python extract_posed_images.py
```
|
|
|
Each of the 1201 train, 312 validation and 100 test scenes contains a single `.sens` file. For instance, for scene `0001_01` we have `data/scannet/scans/scene0001_01/0001_01.sens`. For this scene, all images and poses are extracted to `data/scannet/posed_images/scene0001_01`. Specifically, there will be 300 image files `xxxxx.jpg`, 300 camera pose files `xxxxx.txt` and a single `intrinsic.txt` file. Typically, a single scene contains several thousand images; by default, we extract only 300 of them, which occupies less than 100 GB in total. To extract more images, use the `--max-images-per-scene` parameter.
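
A minimal sketch of consuming the extracted files is shown below; the frame name `00000.txt` assumes the default zero-padded naming of the extracted frames:

```python
import numpy as np

scene_dir = 'data/scannet/posed_images/scene0001_01'
# 4x4 camera intrinsic matrix, shared by all frames of the scene
intrinsic = np.loadtxt(f'{scene_dir}/intrinsic.txt')
# 4x4 camera-to-world pose of one frame
pose = np.loadtxt(f'{scene_dir}/00000.txt')
assert intrinsic.shape == (4, 4) and pose.shape == (4, 4)
```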
|
|
|
### Create dataset |
|
|
|
```shell
python tools/create_data.py scannet --root-path ./data/scannet \
--out-dir ./data/scannet --extra-tag scannet
```
|
|
|
The exported point cloud files, semantic label files and instance label files are further saved in `.bin` format. Meanwhile, `.pkl` info files are generated for the train and validation splits. The core function `process_single_scene` for collecting data infos is as follows:
|
|
|
```python
def process_single_scene(sample_idx):

    # save the point cloud, instance labels and semantic labels in .bin
    # files respectively, recording info['pts_path'],
    # info['pts_instance_mask_path'] and info['pts_semantic_mask_path']
    ...

    # get annotations
    if has_label:
        annotations = {}
        # box is of shape [k, 6 + class]
        aligned_box_label = self.get_aligned_box_label(sample_idx)
        unaligned_box_label = self.get_unaligned_box_label(sample_idx)
        annotations['gt_num'] = aligned_box_label.shape[0]
        if annotations['gt_num'] != 0:
            aligned_box = aligned_box_label[:, :-1]  # k, 6
            unaligned_box = unaligned_box_label[:, :-1]
            classes = aligned_box_label[:, -1]  # k
            annotations['name'] = np.array([
                self.label2cat[self.cat_ids2class[classes[i]]]
                for i in range(annotations['gt_num'])
            ])
            # default names are given to aligned bboxes for compatibility;
            # we also save the unaligned bbox info with marked names
            annotations['location'] = aligned_box[:, :3]
            annotations['dimensions'] = aligned_box[:, 3:6]
            annotations['gt_boxes_upright_depth'] = aligned_box
            annotations['unaligned_location'] = unaligned_box[:, :3]
            annotations['unaligned_dimensions'] = unaligned_box[:, 3:6]
            annotations[
                'unaligned_gt_boxes_upright_depth'] = unaligned_box
            annotations['index'] = np.arange(
                annotations['gt_num'], dtype=np.int32)
            annotations['class'] = np.array([
                self.cat_ids2class[classes[i]]
                for i in range(annotations['gt_num'])
            ])
        axis_align_matrix = self.get_axis_align_matrix(sample_idx)
        annotations['axis_align_matrix'] = axis_align_matrix  # 4x4
        info['annos'] = annotations
    return info
```
|
|
|
The directory structure after processing should be as below:
|
|
|
```
scannet
├── meta_data
├── batch_load_scannet_data.py
├── load_scannet_data.py
├── scannet_utils.py
├── README.md
├── scans
├── scans_test
├── scannet_instance_data
├── points
│   ├── xxxxx.bin
├── instance_mask
│   ├── xxxxx.bin
├── semantic_mask
│   ├── xxxxx.bin
├── seg_info
│   ├── train_label_weight.npy
│   ├── train_resampled_scene_idxs.npy
│   ├── val_label_weight.npy
│   ├── val_resampled_scene_idxs.npy
├── posed_images
│   ├── scenexxxx_xx
│   │   ├── xxxxxx.txt
│   │   ├── xxxxxx.jpg
│   │   ├── intrinsic.txt
├── scannet_infos_train.pkl
├── scannet_infos_val.pkl
├── scannet_infos_test.pkl
```
|
|
|
- `points/xxxxx.bin`: The axis-unaligned point cloud data after downsampling. Since the ScanNet 3D detection task takes axis-aligned point clouds as input, while the ScanNet 3D semantic segmentation task takes the unaligned points, we store the unaligned points and their axis-alignment transformation matrix. Note: the points will be axis-aligned in the pre-processing transform [`GlobalAlignment`](https://github.com/open-mmlab/mmdetection3d/blob/9f0b01caf6aefed861ef4c3eb197c09362d26b32/mmdet3d/datasets/pipelines/transforms_3d.py#L423) of the 3D detection task (see the sketch after this list).

- `instance_mask/xxxxx.bin`: The instance label for each point, value range: \[0, NUM_INSTANCES\], where 0 means unannotated.

- `semantic_mask/xxxxx.bin`: The semantic label for each point, value range: \[1, 40\], i.e. the `nyu40id` standard. Note: the `nyu40id` IDs will be mapped to train IDs in the train pipeline by `PointSegClassMapping`.

- `seg_info`: The generated infos to support semantic segmentation model training.

  - `train_label_weight.npy`: Weighting factor for each semantic class. Since the number of points in different classes varies greatly, it's a common practice to use label re-weighting to achieve better performance.

  - `train_resampled_scene_idxs.npy`: Re-sampling index for each scene. Different rooms will be sampled multiple times according to their number of points to balance training data.

- `posed_images/scenexxxx_xx`: The set of `.jpg` images with `.txt` 4x4 poses and the single `.txt` file with the camera intrinsic matrix.

- `scannet_infos_train.pkl`: The train data infos; the detailed info of each scan is as follows:

  - info\['lidar_points'\]: A dict containing all information related to the lidar points.

    - info\['lidar_points'\]\['lidar_path'\]: The filename of the lidar point cloud data.

    - info\['lidar_points'\]\['num_pts_feats'\]: The feature dimension of each point.

    - info\['lidar_points'\]\['axis_align_matrix'\]: The transformation matrix used to align the axes.

  - info\['pts_semantic_mask_path'\]: The filename of the semantic mask annotation.

  - info\['pts_instance_mask_path'\]: The filename of the instance mask annotation.

  - info\['instances'\]: A list of dicts containing all annotations, where each dict holds the annotation information of a single instance. For the i-th instance:

    - info\['instances'\]\[i\]\['bbox_3d'\]: List of 6 numbers representing the axis-aligned 3D bounding box of the instance in the depth coordinate system, in (x, y, z, l, w, h) order.

    - info\['instances'\]\[i\]\['bbox_label_3d'\]: The label of the 3D bounding box.

- `scannet_infos_val.pkl`: The val data infos, which share the same format as `scannet_infos_train.pkl`.

- `scannet_infos_test.pkl`: The test data infos, which almost share the same format as `scannet_infos_train.pkl` except that they lack annotations.
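
As a reference for what `GlobalAlignment` effectively does, the following is a minimal sketch of applying `info['lidar_points']['axis_align_matrix']` to the unaligned points (the helper name `align_points` is ours, not part of the codebase):

```python
import numpy as np


def align_points(points, axis_align_matrix):
    """Apply the stored 4x4 axis-alignment matrix to (N, 3+C) points."""
    aligned = points.copy()
    # lift xyz to homogeneous coordinates: each row becomes [x, y, z, 1]
    homo = np.concatenate(
        [points[:, :3],
         np.ones((points.shape[0], 1), dtype=points.dtype)],
        axis=1)
    aligned[:, :3] = (homo @ axis_align_matrix.T)[:, :3]
    return aligned
```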
|
|
|
## Training pipeline |
|
|
|
A typical training pipeline of ScanNet for 3D detection is as follows. |
|
|
|
```python
train_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='DEPTH',
        shift_height=True,
        load_dim=6,
        use_dim=[0, 1, 2]),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=True,
        with_label_3d=True,
        with_mask_3d=True,
        with_seg_3d=True),
    dict(type='GlobalAlignment', rotation_axis=2),
    dict(type='PointSegClassMapping'),
    dict(type='PointSample', num_points=40000),
    dict(
        type='RandomFlip3D',
        sync_2d=False,
        flip_ratio_bev_horizontal=0.5,
        flip_ratio_bev_vertical=0.5),
    dict(
        type='GlobalRotScaleTrans',
        rot_range=[-0.087266, 0.087266],
        scale_ratio_range=[1.0, 1.0],
        shift_height=True),
    dict(
        type='Pack3DDetInputs',
        keys=[
            'points', 'gt_bboxes_3d', 'gt_labels_3d', 'pts_semantic_mask',
            'pts_instance_mask'
        ])
]
```
|
|
|
- `GlobalAlignment`: The input point cloud is axis-aligned using the axis-alignment matrix.

- `PointSegClassMapping`: Only the valid category IDs will be mapped to class label IDs like \[0, 18) during training.

- Data augmentation:

  - `PointSample`: downsample the input point cloud.

  - `RandomFlip3D`: randomly flip the input point cloud horizontally or vertically.

  - `GlobalRotScaleTrans`: rotate the input point cloud, usually in the range of \[-5, 5\] degrees for ScanNet (note that `rot_range` in the config above is given in radians; see the check after this list); then scale it, usually by a factor of 1.0 for ScanNet (i.e. no scaling); finally translate it, usually by 0 for ScanNet (i.e. no translation).
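
As a quick sanity check on the units, the `rot_range` of `[-0.087266, 0.087266]` in the config above is ±5 degrees expressed in radians:

```python
import numpy as np

# +/-0.087266 rad in the detection config corresponds to +/-5 degrees
assert np.isclose(np.deg2rad(5.0), 0.087266, atol=1e-6)
```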
|
|
|
A typical training pipeline of ScanNet for 3D semantic segmentation is as below: |
|
|
|
```python
train_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5]),
    dict(
        type='LoadAnnotations3D',
        with_bbox_3d=False,
        with_label_3d=False,
        with_mask_3d=False,
        with_seg_3d=True),
    dict(type='PointSegClassMapping'),
    dict(
        type='IndoorPatchPointSample',
        num_points=num_points,
        block_size=1.5,
        ignore_index=len(class_names),
        use_normalized_coord=False,
        enlarge_size=0.2,
        min_unique_num=None),
    dict(type='NormalizePointsColor', color_mean=None),
    dict(type='Pack3DDetInputs', keys=['points', 'pts_semantic_mask'])
]
```
|
|
|
- `PointSegClassMapping`: Only the valid category IDs are mapped to class label IDs in \[0, 20) during training; other category IDs are converted to `ignore_index`, which equals `20` (see the sketch after this list).

- `IndoorPatchPointSample`: Crop a patch containing a fixed number of points from the input point cloud. `block_size` indicates the size of the cropped block, typically `1.5` for ScanNet.

- `NormalizePointsColor`: Normalize the RGB color values of the input point cloud by dividing by `255`.
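
A minimal sketch of the ID mapping done by `PointSegClassMapping` is shown below; `VALID_CLASS_IDS` is our assumption of the 20 `nyu40id` categories evaluated on the ScanNet segmentation benchmark:

```python
import numpy as np

# the 20 nyu40 ids evaluated on the ScanNet benchmark (assumed here)
VALID_CLASS_IDS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 24, 28, 33,
                   34, 36, 39)
ignore_index = len(VALID_CLASS_IDS)  # 20

# lookup table from nyu40id (0..40) to train ids in [0, 20); others -> 20
lookup = np.full(41, ignore_index, dtype=np.int64)
for train_id, class_id in enumerate(VALID_CLASS_IDS):
    lookup[class_id] = train_id

semantic_mask = np.array([1, 2, 13, 39, 0])  # example nyu40id labels
print(lookup[semantic_mask])  # [ 0  1 20 19 20]
```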
|
|
|
## Metrics |
|
|
|
- **Object Detection**: Typically, mean Average Precision (mAP) is used for evaluation on ScanNet, e.g. `[email protected]` and `[email protected]`. In detail, a generic function that computes precision and recall for multiple classes in 3D object detection is called. Please refer to [indoor_eval](https://github.com/open-mmlab/mmdetection3d/blob/dev-1.x/mmdet3d/evaluation/functional/indoor_eval.py) for more details.
|
|
|
**Note**: As introduced in the section `Export ScanNet point cloud data`, all ground truth 3D bounding boxes are axis-aligned, i.e. the yaw is zero. So the yaw target of the network-predicted 3D bounding boxes is also zero, and axis-aligned 3D Non-Maximum Suppression (NMS), which disregards rotation, is adopted during post-processing; a sketch is given below.
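
Below is a minimal NumPy sketch of such axis-aligned 3D NMS over boxes in corner format `[x1, y1, z1, x2, y2, z2]`; it illustrates the idea rather than reproducing the exact implementation in the codebase:

```python
import numpy as np


def aligned_3d_nms(boxes, scores, iou_thr=0.25):
    """Greedy NMS for axis-aligned 3D boxes given as (N, 6) corners."""
    x1, y1, z1, x2, y2, z2 = boxes.T
    volume = (x2 - x1) * (y2 - y1) * (z2 - z1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # overlap extents along each axis, clipped at zero
        ix = np.maximum(
            0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        iy = np.maximum(
            0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        iz = np.maximum(
            0.0, np.minimum(z2[i], z2[rest]) - np.maximum(z1[i], z1[rest]))
        inter = ix * iy * iz
        iou = inter / (volume[i] + volume[rest] - inter)
        # drop boxes overlapping the kept box beyond the threshold
        order = rest[iou <= iou_thr]
    return keep
```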
|
|
|
- **Semantic Segmentation**: Typically, mean Intersection over Union (mIoU) is used for evaluation on ScanNet. In detail, we first compute the IoU for each class and then average them to get mIoU; please refer to [seg_eval](https://github.com/open-mmlab/mmdetection3d/blob/dev-1.x/mmdet3d/evaluation/functional/seg_eval.py) for the full implementation and the minimal sketch below for the core idea.
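
The following is a minimal sketch of a confusion-matrix-based mIoU computation; the actual `seg_eval` additionally reports per-class IoU and accuracies:

```python
import numpy as np


def fast_hist(preds, labels, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix."""
    valid = (labels >= 0) & (labels < num_classes)  # drop ignored points
    return np.bincount(
        num_classes * labels[valid].astype(int) + preds[valid].astype(int),
        minlength=num_classes**2).reshape(num_classes, num_classes)


def mean_iou(hist):
    # per-class IoU = TP / (TP + FN + FP), then average over classes
    iou = np.diag(hist) / (hist.sum(1) + hist.sum(0) - np.diag(hist))
    return np.nanmean(iou)
```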
|
|
|
## Testing and Making a Submission |
|
|
|
By default, our codebase evaluates semantic segmentation results on the validation set. |
|
If you would like to test the model performance on the online benchmark, add the `--format-only` flag in the evaluation script and change `ann_file=data_root + 'scannet_infos_val.pkl'` to `ann_file=data_root + 'scannet_infos_test.pkl'` in the ScanNet dataset's [config](https://github.com/open-mmlab/mmdetection3d/blob/main/configs/_base_/datasets/scannet-seg.py#L126). Remember to specify `txt_prefix` as the directory to save the testing results; a sketch of the config change follows.
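
For reference, a hedged sketch of that config change is shown below; the exact key layout depends on the config version in use:

```python
# sketch: point the dataset's annotation file at the test split
data_root = 'data/scannet/'
ann_file = data_root + 'scannet_infos_test.pkl'  # was 'scannet_infos_val.pkl'
```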
|
|
|
Taking PointNet++ (SSG) on ScanNet as an example, the following command can be used to run inference on the test set:
|
|
|
```shell
./tools/dist_test.sh configs/pointnet2/pointnet2_ssg_16x2_cosine_200e_scannet-seg.py \
    work_dirs/pointnet2_ssg/latest.pth --format-only \
    --eval-options txt_prefix=work_dirs/pointnet2_ssg/test_submission
```
|
|
|
After generating the results, you can compress the folder and upload it to the [ScanNet evaluation server](http://kaldir.vc.in.tum.de/scannet_benchmark/semantic_label_3d).
|
|