---
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: video-classification
tags:
- robotics
---

<img src="image/tuc.png" alt="drawing" width="200"/>

University of Technology Chemnitz, Germany<br>
Department Robotics and Human Machine Interaction<br>
Author: Robert Schulz

<h1>Action Recognition</h1>

<h2>Table of Contents</h2>

- [1 Overview](#1-overview)
- [2 Pretrained Models](#2-pretrained-models)
  - [2.1 TUC-AR Dataset](#21-tuc-ar-dataset)
  - [2.2 UCF101 Dataset](#22-ucf101-dataset)


## 1 Overview

This repository provides a PyTorch model with one pretrained checkpoint per dataset (see [2 Pretrained Models](#2-pretrained-models)). The model consists of a multi-stage 3D CNN feature extraction module followed by a classification head. It achieves state-of-the-art results on the UCF101 dataset.

![](image/model_architecture.png)
_**Figure 1** Model architecture_

## 2 Pretrained Models
### 2.1 TUC-AR Dataset
[Dataset Homepage](https://huggingface.co/datasets/SchulzR97/TUC-AR)

**Short Description**

- RGB and depth input recorded with an Intel RealSense D435 depth camera
- 7 subjects
- 3 perspectives per sequence
- 11,031 sequences (train 8,893 / val 2,138)
- 6 (+1) action categories (6 actions plus the "None" class)

**Input**

| Dimension | Fixed   | Value | Parameter       | Description                               |
|-----------|---------|-------|-----------------|-------------------------------------------|
| 0         | no      | ?     | Batch Size      | Number of samples that will be propagated through the network (number of sequences) |
| 1         | yes     | 30    | Sequence Length | Number of frames in one sequence          |
| 2         | yes     | 4     | Input Channels  | Number of channels of one frame (RGB+D=4) |
| 3         | yes     | 400   | Width           | Width of one frame                        |
| 4         | yes     | 400   | Height          | Height of one frame                       |
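
Because every dimension except the batch size is fixed, it can help to validate a tensor's shape before passing it to the model. The helper below is a minimal sketch in plain Python; `TUC_AR_INPUT` and `check_input_shape` are hypothetical names introduced for illustration and are not part of the released code.

```python
# Expected per-sample shape from the input table:
# (sequence length, channels, width, height). The batch dimension is free.
TUC_AR_INPUT = (30, 4, 400, 400)


def check_input_shape(shape, expected=TUC_AR_INPUT):
    """Return the batch size if `shape` matches `expected`, else raise ValueError."""
    shape = tuple(shape)
    if len(shape) != len(expected) + 1:
        raise ValueError(f"expected {len(expected) + 1} dimensions, got {len(shape)}")
    if shape[1:] != tuple(expected):
        raise ValueError(f"expected (batch, {', '.join(map(str, expected))}), got {shape}")
    return shape[0]
```

With a PyTorch tensor `x`, the check would be called as `check_input_shape(x.shape)`. Passing a different `expected` tuple adapts it to another checkpoint.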


**Output**

| Dimension | Fixed   | Value | Parameter       | Description                               |
|-----------|---------|-------|-----------------|-------------------------------------------|
| 0         | no      | ?     | Batch Size      | Number of samples that will be propagated through the network (number of sequences) |
| 1         | yes     | 10    | Number of action classes | Number of action classes<br>0 - None<br>1 - Waving<br>2 - Pointing<br>3 - Clapping<br>4 - Follow<br>5 - Walking<br>6 - Stop |
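
The output holds one score per class for each sequence, so a prediction can be read off by taking the index with the highest score. The following is a minimal sketch, assuming the highest score marks the predicted class (this holds for both raw logits and probabilities). `TUC_AR_LABELS` and `decode_prediction` are hypothetical names; the labels are copied from the table, and any index without a listed name is reported generically.

```python
# Class names as listed in the output table (index -> label).
TUC_AR_LABELS = ["None", "Waving", "Pointing", "Clapping", "Follow", "Walking", "Stop"]


def decode_prediction(scores, labels=TUC_AR_LABELS):
    """Return (index, label) of the highest-scoring class for one sample."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    # Indices without a listed name are reported generically.
    label = labels[index] if index < len(labels) else f"class {index}"
    return index, label
```

For a batched PyTorch output `y`, one sample could be decoded with `decode_prediction(y[0].tolist())`.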

**Usage**

```python
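import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
model_path = hf_hub_download(repo_id='SchulzR97/TUC-AR-C3D', filename='tuc-ar.pth')
# Assumes the file stores the whole pickled model rather than a state dict;
# PyTorch >= 2.6 then needs weights_only=False. Only do this for trusted files.
model = torch.load(model_path, map_location='cpu', weights_only=False)
model.eval()  # switch to inference mode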
```

### 2.2 UCF101 Dataset
[Dataset Homepage](https://www.crcv.ucf.edu/data/UCF101.php)

**Input**

| Dimension | Fixed   | Value | Parameter       | Description                               |
|-----------|---------|-------|-----------------|-------------------------------------------|
| 0         | no      | ?     | Batch Size      | Number of samples that will be propagated through the network (number of sequences) |
| 1         | yes     | 60    | Sequence Length | Number of frames in one sequence          |
| 2         | yes     | 3     | Input Channels  | Number of channels of one frame (RGB=3)   |
| 3         | yes     | 400   | Width           | Width of one frame                        |
| 4         | yes     | 400   | Height          | Height of one frame                       |
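
UCF101 clips vary in length, while the model expects exactly 60 frames per sequence. How frames were selected during training is not stated here; uniform temporal sampling is one common option, sketched below. `sample_frame_indices` is a hypothetical helper, not part of the released code.

```python
def sample_frame_indices(num_frames, seq_len=60):
    """Pick `seq_len` evenly spaced frame indices from a clip of `num_frames` frames."""
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    if seq_len == 1:
        return [0]
    # Spread indices from the first to the last frame; short clips repeat frames.
    step = (num_frames - 1) / (seq_len - 1)
    return [round(i * step) for i in range(seq_len)]
```

The returned indices can then be used to select frames from a decoded video before resizing and stacking them into the input tensor.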


**Output**

| Dimension | Fixed   | Value | Parameter       | Description                               |
|-----------|---------|-------|-----------------|-------------------------------------------|
| 0         | no      | ?     | Batch Size      | Number of samples that will be propagated through the network (number of sequences) |
| 1         | yes     | 101   | Number of action classes | Number of action classes |

**Usage**

```python
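import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
model_path = hf_hub_download(repo_id='SchulzR97/TUC-AR-C3D', filename='ucf101.pth')
# Assumes the file stores the whole pickled model rather than a state dict;
# PyTorch >= 2.6 then needs weights_only=False. Only do this for trusted files.
model = torch.load(model_path, map_location='cpu', weights_only=False)
model.eval()  # switch to inference mode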
```