UHow committed on
Commit 24fca51 · verified · 1 Parent(s): c780d4b

Update README.md

Files changed (1): README.md +196 -3
README.md CHANGED
---
license: cc-by-nc-4.0
datasets:
- UHow/DanceTrack-L
- UHow/SportsMOT-L
- UHow/MOT17-L
---
# LG-MOT
### **Multi-Granularity Language-Guided Training for Multi-Object Tracking**

<p align="center">
<img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT">
</p>

#### [Yuhao Li](), [Jiale Cao](https://jialecao001.github.io/), [Muzammal Naseer](https://muzammal-naseer.com/), [Yu Zhu](), [Jinqiu Sun](https://teacher.nwpu.edu.cn/m/2009010133), [Yanning Zhang](https://jszy.nwpu.edu.cn/1999000059) and [Fahad Khan](https://sites.google.com/view/fahadkhans/home)

#### **Northwestern Polytechnical University, Mohamed bin Zayed University of AI, Tianjin University, Khalifa University, Linköping University**

<!-- [![Website](https://img.shields.io/badge/Project-Website-87CEEB)](site_url) -->
[![paper](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2406.04844.pdf)

## Latest
- `2025/08/15`: We released our annotated DanceTrack-L, MOT17-L, and SportsMOT-L.
- `2025/05/22`: Our work has been accepted by IEEE TCSVT.
- `2024/06/14`: We released our trained models.
- `2024/06/12`: We released our code.
- `2024/06/11`: We released our technical report on [arXiv](https://arxiv.org/abs/2406.04844.pdf). Our code and models are coming soon!

<br>
<details>
<summary>
<font size="+1">Abstract</font>
</summary>
Most existing multi-object tracking methods typically learn visual tracking features via maximizing dissimilarities of different instances and minimizing similarities of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely based on visual information is challenging, especially in the case of environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby aiding in improving the robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are utilized to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score), compared to the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability.
</details>

## Intro

- **LG-MOT** first annotates the training and validation sets of commonly used MOT datasets, including [MOT17](https://motchallenge.net/data/MOT17/), [DanceTrack](https://dancetrack.github.io/) and [SportsMOT](https://github.com/MCG-NJU/SportsMOT?tab=readme-ov-file), with language descriptions at both the scene and instance levels.

| Dataset | Videos (Scenes) | Annotated Scenes | Tracks (Instances) | Annotated Instances | Annotated Boxes | Frames |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| MOT17-L | 7 | 7 | 796 | 796 | 614,103 | 110,407 |
| DanceTrack-L | 65 | 65 | 682 | 682 | 576,078 | 67,304 |
| SportsMOT-L | 90 | 90 | 1,280 | 1,280 | 608,152 | 55,544 |
| Total | 162 | 162 | 2,758 | 2,758 | 1,798,333 | 233,255 |

- **LG-MOT** is a new multi-object tracking framework that leverages language information at different levels of granularity during training to enhance object association. During training, our **ISG** module aligns each node embedding $\phi(b_i^k)$ with the instance-level description embedding $\varphi_i$, while our **SPG** module aligns the edge embeddings $\hat{E}_{(u,v)}$ with the scene-level description embedding $\varphi_s$ to guide correlation estimation after message passing (a plausible form of these alignment objectives is sketched at the end of this section). Our approach does not require language descriptions during inference.

<div align=center>
<img src="source/LG-MOT_overview.png" width="90%"/>
</div>

- **LG-MOT** improves IDF1 by 1.2% over the baseline SUSHI in the intra-domain setting, and by a significant 11.2% in the cross-domain setting.

<div align=center>
<img src="source/performance.png" width="50%"/>
</div>
<!-- <img src="source/performance.png" width="50%"/> -->
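
For intuition, one plausible instantiation of these alignment objectives (a sketch consistent with the KL-loss flags used in the training command below, not necessarily the paper's exact formulation) is

$$
\mathcal{L}_{\text{ISG}} = \frac{1}{N}\sum_{i=1}^{N} D_{\mathrm{KL}}\!\left(\sigma(\varphi_i)\,\big\|\,\sigma\big(\phi(b_i^k)\big)\right), \qquad
\mathcal{L}_{\text{SPG}} = \frac{1}{|E|}\sum_{(u,v)\in E} D_{\mathrm{KL}}\!\left(\sigma(\varphi_s)\,\big\|\,\sigma\big(\hat{E}_{(u,v)}\big)\right),
$$

where $\sigma(\cdot)$ denotes a softmax normalization and the language embeddings $\varphi_i$, $\varphi_s$ act as fixed targets; both terms would be added to the standard tracking losses during training only.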

## Visualization
<div align=center>
<img src="source/visualization.png" width="90%"/>
</div>

## Performance on Benchmarks
### MOT17 Challenge - Test Set
| Dataset | IDF1 | HOTA | MOTA | ID Sw. | model |
|:---:|:---:|:---:|:---:|:---:|:---:|
| MOT17 | 81.7 | 65.6 | 81.0 | 1161 | [checkpoint.pth](https://drive.google.com/file/d/1LVFQ-i9cuilGX82dIlArXmm1TuvcxNvx/view?usp=drive_link) |

### DanceTrack - Test Set
| Dataset | IDF1 | HOTA | MOTA | AssA | DetA | model |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| DanceTrack | 60.5 | 61.8 | 89.0 | 80.0 | 47.8 | [checkpoint.pth](https://drive.google.com/file/d/1yD-k-rXsqhu8c9mxjCYG41OuHYeDSpWa/view?usp=drive_link) |

### SportsMOT - Test Set
| Dataset | IDF1 | HOTA | MOTA | ID Sw. | model |
|:---:|:---:|:---:|:---:|:---:|:---:|
| SportsMOT | 77.1 | 75.0 | 91.0 | 2847 | [checkpoint.pth](https://drive.google.com/file/d/1e-xfa2yeASdWLjWKGSzrn46_hkl6L8vO/view?usp=drive_link) |

## Setup

1. Clone and enter this repository:
```
git clone https://github.com/WesLee88524/LG-MOT.git
cd LG-MOT
```

2. Create an [Anaconda environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) for this project:
```
conda env create -f environment.yml
conda activate LG-MOT
```

3. Clone [fast-reid](https://github.com/JDAI-CV/fast-reid) (the latest version should be compatible, but we use [this](https://github.com/JDAI-CV/fast-reid/tree/afe432b8c0ecd309db7921b7292b2c69813d0991) version) and install its dependencies. The `fast-reid` repo should be inside the `LG-MOT` root:
```
LG-MOT
├── src
├── fast-reid
└── ...
```
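For example, the following commands (a sketch: the dependency-install step follows fast-reid's own install instructions, and its requirements path may differ across versions) pin fast-reid to the commit linked above:
```
# run from the LG-MOT root so that fast-reid ends up inside it
git clone https://github.com/JDAI-CV/fast-reid.git
cd fast-reid
git checkout afe432b8c0ecd309db7921b7292b2c69813d0991
pip install -r docs/requirements.txt   # fast-reid dependencies (path per its install guide)
cd ..
```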

4. Download the [re-identification model we use from fast-reid](https://drive.google.com/file/d/1MovixfOLwnnXet05JLIGPy-Ag67fdW-2/view?usp=share_link) and move it inside `LG-MOT/fastreid-models/model_weights/`.
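For instance (the checkpoint filename below is a placeholder; keep whatever name the downloaded file has):
```
# run from the LG-MOT root
mkdir -p fastreid-models/model_weights
mv /path/to/downloaded_reid_checkpoint.pth fastreid-models/model_weights/
```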

5. Download the [MOT17](https://motchallenge.net/data/MOT17/), [SPORTSMOT](https://github.com/MCG-NJU/SportsMOT?tab=readme-ov-file) and [DanceTrack](https://dancetrack.github.io/) datasets. In addition, prepare seqmaps to run evaluation (for details see [TrackEval](https://github.com/JonathonLuiten/TrackEval)). We provide an [example seqmap](https://drive.google.com/drive/folders/1LYRYPuNWIWWz-HXSjuqyX8Fz9fDunTRI?usp=sharing). Overall, the expected folder structure is:

```
DATA_PATH
├── DANCETRACK
│   └── ...
├── SPORTSMOT
│   └── ...
└── MOT17
    ├── seqmaps
    │   ├── seqmap_file_name_matching_split_name.txt
    │   └── ...
    ├── train
    │   ├── MOT17-02
    │   │   ├── det
    │   │   │   └── det.txt
    │   │   ├── gt
    │   │   │   └── gt.txt
    │   │   ├── img1
    │   │   │   └── ...
    │   │   └── seqinfo.ini
    │   └── ...
    └── test
        └── ...
```
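The seqmap files referenced above are plain-text lists of sequence names. Assuming the standard TrackEval seqmap format (a `name` header followed by one sequence per line) and the MOT17 train sequences, a minimal seqmap named after a split might look like:
```
# File name and contents are illustrative: list the sequence folders that actually
# exist in your split, or simply use the provided example seqmap.
cat > ${DATA_PATH}/MOT17/seqmaps/MOT17-train-all.txt << 'EOF'
name
MOT17-02
MOT17-04
MOT17-05
MOT17-09
MOT17-10
MOT17-11
MOT17-13
EOF
```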

<!-- 6. (OPTIONAL) Download [our detections](https://drive.google.com/drive/folders/1bxw1Hz77LCCW3cWizhg_q03vk4aXUriz?usp=share_link) and [pretrained models](https://drive.google.com/drive/folders/1cU7LeTAeKxS-nvxqrUdUstNpR-Wpp5NV?usp=share_link) and arrange them according to the expected file structure (See step 5). -->

## Training
You can launch training from the command line. An example command for MOT17 training:

```
RUN=example_mot17_training
REID_ARCH='fastreid_msmt_BOT_R50_ibn'

DATA_PATH=YOUR_DATA_PATH

python scripts/main.py --experiment_mode train --cuda --train_splits MOT17-train-all --val_splits MOT17-train-all --run_id ${RUN}_${REID_ARCH} --interpolate_motion --linear_center_only --det_file byte065 --data_path ${DATA_PATH} --reid_embeddings_dir reid_${REID_ARCH} --node_embeddings_dir node_${REID_ARCH} --zero_nodes --reid_arch $REID_ARCH --edge_level_embed --save_cp --inst_KL_loss 1 --KL_loss 1 --num_epoch 200 --mpn_use_prompt_edge 1 1 1 1 --use_instance_prompt
```

## Testing
You can test a trained model from the command line. An example command for testing on the MOT17 test set:
```
RUN=example_mot17_test
REID_ARCH='fastreid_msmt_BOT_R50_ibn'

DATA_PATH=your_data_path
PRETRAINED_MODEL_PATH=your_pretrained_model_path

python scripts/main.py --experiment_mode test --cuda --test_splits MOT17-test-all --run_id ${RUN}_${REID_ARCH} --interpolate_motion --linear_center_only --det_file byte065 --data_path ${DATA_PATH} --reid_embeddings_dir reid_${REID_ARCH} --node_embeddings_dir node_${REID_ARCH} --zero_nodes --reid_arch $REID_ARCH --edge_level_embed --save_cp --hicl_model_path ${PRETRAINED_MODEL_PATH} --inst_KL_loss 1 --KL_loss 1 --mpn_use_prompt_edge 1 1 1 1
```

## Citation
If you use our work, please consider citing us:
```BibTeX
@ARTICLE{11010839,
  author={Li, Yuhao and Cao, Jiale and Naseer, Muzammal and Zhu, Yu and Sun, Jinqiu and Zhang, Yanning and Khan, Fahad Shahbaz},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={Multi-Granularity Language-Guided Training for Multi-Object Tracking},
  year={2025},
  volume={},
  number={},
  pages={1-1},
  keywords={Visualization;Training;Feature extraction;Standards;Object detection;Accuracy;Representation learning;Current transformers;Circuits and systems;Target tracking;Multi-object tracking;Language-guided features;Cross-domain generalizability},
  doi={10.1109/TCSVT.2025.3572810}}
```

## License
This project is released under the Apache license. See [LICENSE](LICENSE) for additional details.

## Acknowledgement
The code is mainly based on [SUSHI](https://github.com/dvl-tum/SUSHI). Thanks for their wonderful work.