
Model Card for videomae-large-finetuned-UCF-Crime-dataset

This model is a fine-tuned version of MCG-NJU/videomae-large, adapted for video classification on the UCF Crime dataset. Given a video clip, it classifies the activity as normal or as one of 13 anomalous event types, such as burglary, vandalism, or fighting.

Test the model at https://www.opear.org/demo

Model Details

Model Description

  • Developed by: Paulo Briceño
  • Model type: VideoMAE for Video Classification
  • Labels:
    • Abuse
    • Arrest
    • Arson
    • Assault
    • Burglary
    • Explosion
    • Fighting
    • Normal Videos
    • Road Accidents
    • Robbery
    • Shooting
    • Shoplifting
    • Stealing
    • Vandalism
  • Finetuned from model: MCG-NJU/videomae-large

Uses

Direct Use

The model can directly classify input videos into one of the 14 labels mentioned above. It is intended for anomaly detection tasks, especially in scenarios where automated video surveillance is required.
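
For a quick smoke test, the snippet below runs a single clip end to end using the model's image processor. This is a minimal sketch: it assumes the repository ships a preprocessor config (if it does not, fall back to the manual preprocessing in the Colab example under "How to Get Started"), and the random frames are stand-ins for a real decoded video.

import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

model_id = "OPear/videomae-large-finetuned-UCF-Crime"
processor = VideoMAEImageProcessor.from_pretrained(model_id)  # assumes a preprocessor config exists in the repo
model = VideoMAEForVideoClassification.from_pretrained(model_id).eval()

# 16 dummy RGB frames stand in for a real 16-frame clip
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

inputs = processor(frames, return_tensors="pt")  # pixel_values: [1, 16, 3, 224, 224]
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])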

Downstream Use

This model can be integrated into real-time surveillance systems, used in forensic investigations, or applied in research to evaluate and improve crime detection algorithms.
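
As one illustration of real-time use, the sketch below keeps a rolling 16-frame window from a camera stream and classifies it whenever the window is full. It is a hypothetical outline, not production code: it reuses model, device, and reverse_mapping from the Colab example under "How to Get Started", and camera index 0 is a placeholder for an actual surveillance feed.

import collections
import cv2
import numpy as np
import torch

buffer = collections.deque(maxlen=16)  # rolling window of the last 16 frames
cap = cv2.VideoCapture(0)              # placeholder source; e.g. an RTSP URL in practice

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), (224, 224))
    buffer.append(frame)
    if len(buffer) == 16:
        clip = torch.tensor(np.stack(buffer), dtype=torch.float32)
        clip = clip.permute(0, 3, 1, 2).unsqueeze(0) / 255.0  # [1, 16, 3, 224, 224]
        with torch.no_grad():
            pred = model(clip.to(device)).logits.argmax(-1).item()
        print(reverse_mapping[pred])

cap.release()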

Out-of-Scope Use

  • Inputs that deviate significantly from the kinds of videos in the UCF Crime dataset; performance on such footage is untested.
  • Surveillance deployments that lack proper ethical and legal safeguards.

Bias, Risks, and Limitations

  • Biases: The model may inherit biases from the UCF Crime dataset, which could reflect cultural or situational assumptions specific to the dataset.
  • Limitations: Performance may degrade for scenarios or activities outside the scope of the training dataset.

Recommendations

Users should carefully evaluate the model's outputs and cross-verify results before making critical decisions, and should test performance in real-world scenarios to ensure reliability. One simple safeguard is sketched below.
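
A concrete safeguard is a confidence gate: act on a prediction only when its softmax probability clears a threshold, and route everything else to a human reviewer. The sketch continues from the inference loop under "How to Get Started" (it reuses probs and reverse_mapping from there), and the 0.7 threshold is an illustrative placeholder, not a tuned value.

# probs: softmax output from the inference loop below; 0.7 is illustrative only
confidence, label_id = probs.max(dim=-1)
if confidence.item() < 0.7:
    print("Low confidence - flag clip for human review")
else:
    print(f"Predicted: {reverse_mapping[label_id.item()]}")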

How to Get Started with the Model

Google Colab

Upload test videos to the sample_data folder, then run the script below.

import os
import torch
import cv2
import numpy as np
from torch.utils.data import Dataset, DataLoader
from transformers import VideoMAEForVideoClassification

# Define video directory
video_folder = "sample_data"

# Define class mapping
class_mapping = {
    "Abuse": 0, "Arrest": 1, "Arson": 2, "Assault": 3, "Burglary": 4,
    "Explosion": 5, "Fighting": 6, "Normal Videos": 7, "Road Accidents": 8,
    "Robbery": 9, "Shooting": 10, "Shoplifting": 11, "Stealing": 12, "Vandalism": 13
}
reverse_mapping = {v: k for k, v in class_mapping.items()}

# Load VideoMAE model
model_name = "OPear/videomae-large-finetuned-UCF-Crime"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = VideoMAEForVideoClassification.from_pretrained(
    model_name,
    label2id=class_mapping,
    id2label=reverse_mapping,
    ignore_mismatched_sizes=True,
).to(device)
model.eval()

# Video processing function
def load_video_frames(video_path, num_frames=16, size=(224, 224)):
    """
    Load video frames from a given path and resize them to (224, 224).
    Converts video into a tensor of shape [num_frames, 3, height, width].
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in range(total_frames):
        ret, frame = cap.read()
        if not ret:
            break
        if i in frame_indices:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = cv2.resize(frame, size)
            frames.append(frame)
    
    cap.release()

    if not frames:  # Guard: the video could not be decoded at all
        raise ValueError(f"No frames could be read from {video_path}")

    if len(frames) < num_frames:  # Pad with the last frame if the clip is short
        frames.extend([frames[-1]] * (num_frames - len(frames)))

    frames = np.stack(frames, axis=0)  # Shape: [num_frames, height, width, 3]
    frames = torch.tensor(frames, dtype=torch.float32).permute(0, 3, 1, 2) / 255.0  # Scale pixel values to [0, 1]

    return frames  # Shape: [num_frames, 3, height, width]

# Custom Dataset
class VideoDataset(Dataset):
    def __init__(self, video_folder):
        self.video_files = [os.path.join(video_folder, f) for f in os.listdir(video_folder) if f.endswith(".mp4")]
    
    def __len__(self):
        return len(self.video_files)
    
    def __getitem__(self, idx):
        video_path = self.video_files[idx]
        video_tensor = load_video_frames(video_path)
        return {"video": video_tensor, "filename": os.path.basename(video_path)}

# Load dataset
test_dataset = VideoDataset(video_folder)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

# Run inference
with torch.no_grad():
    for idx, sample in enumerate(test_loader):
        # With batch_size=1 the DataLoader already yields [1, num_frames, 3, H, W],
        # which is the pixel_values shape VideoMAE expects
        video_tensor = sample["video"].to(device)

        # Forward pass
        outputs = model(video_tensor)

        # Get predictions
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_label = torch.argmax(probs, dim=-1).item()
        
        filename = sample["filename"][0]

        print(f"Video {idx}: {filename} - Predicted label = {reverse_mapping[predicted_label]}")

Training Details

Training Data

The model was fine-tuned on the UCF Crime dataset, which contains labeled videos of anomalous and normal events.

Training Procedure

  • Preprocessing: Videos were preprocessed to 224x224 resolution with 16 frames sampled per video clip.
  • Hyperparameters:
    • Batch size: 4
    • Epochs: 4
    • Learning rate: Linear warmup to a peak of 5e-5
    • Optimizer: AdamW
    • Mixed precision: fp16

Speeds, Sizes, Times

  • Number of parameters: ~304M
  • Best model checkpoint: checkpoint-1112
  • Evaluation accuracy: 92.96%
  • Evaluation loss: 0.15

Evaluation

Testing Data, Factors & Metrics

  • Testing Dataset: UCF Crime dataset (Test split)
  • Metrics Used: Accuracy, Evaluation Loss

Results

  • Best Accuracy: 92.96% (on the validation split after 4 epochs)

Summary

The model achieves strong performance on UCF Crime video classification, reaching 92.96% evaluation accuracy.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA V100 GPUs
  • Hours used: ~50 hours
  • Cloud Provider: Google Colab
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 8
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • training_steps: 13320
  • mixed_precision_training: Native AMP
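
For orientation, these settings map roughly onto transformers TrainingArguments as sketched below. This is a reconstruction from the reported values, not the exact training script; output_dir is a placeholder and dataset wiring is omitted.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="videomae-large-finetuned-UCF-Crime",  # placeholder
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,   # effective train batch size: 8
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_steps=13320,                 # matches the reported training_steps
    seed=42,
    fp16=True,                       # Native AMP
)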

Framework versions

  • Transformers 4.46.3
  • PyTorch 1.13.0+cu117
  • Datasets 3.1.0
  • Tokenizers 0.20.3

Model Architecture and Objective

This model is based on the VideoMAE architecture, which leverages masked autoencoders for efficient video feature learning. It uses:

  • Hidden size: 1024
  • Number of layers: 24
  • Attention heads: 16
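
These values can be sanity-checked against the hosted configuration (a quick check; assumes network access to the Hub):

from transformers import VideoMAEConfig

cfg = VideoMAEConfig.from_pretrained("OPear/videomae-large-finetuned-UCF-Crime")
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)  # expected: 1024 24 16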

Compute Infrastructure

  • Hardware: NVIDIA GPUs
  • Software: Transformers 4.46.3, PyTorch 1.13.0+cu117

Citation

BibTeX:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
@inproceedings{Sultani_2018_CVPR,
  author={Sultani, Waqas and Chen, Chen and Shah, Mubarak},
  title={Real-World Anomaly Detection in Surveillance Videos},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month={June},
  year={2018}
}

APA: Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems (NeurIPS 2022). Model weights: https://huggingface.co/MCG-NJU/videomae-large

Sultani, W., Chen, C., & Shah, M. (2018). Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
