HDPE: A Foundational Perception Model with Hyper-Dimensional Positional Encoding
Research Paper (Coming Soon) | Live Demo API (Powered by this Model)
Overview: A New Foundation for Perception in Autonomous Driving
This repository contains the pre-trained weights for a novel autonomous driving perception model, the core of our Interfuser-HDPE system. This is not a standard Interfuser model; it incorporates fundamental innovations in its architecture and learning framework to achieve a more robust, accurate, and geometrically-aware understanding of driving scenes from camera-only inputs.
The innovations baked into these weights make this model a powerful foundation for building complete self-driving systems. It is designed to output rich perception data (object detection grids and waypoints) that can be consumed by downstream modules like trackers and controllers.
Key Innovations in This Model
The weights in this repository are the result of training a model with the following scientific contributions:
1. Hyper-Dimensional Positional Encoding (HDPE) - (Core Contribution)
- What it is: We replace the standard Sinusoidal Positional Encoding with HDPE, a novel, first-principles approach inspired by the geometric properties of n-dimensional spaces.
- Why it matters: HDPE generates an interpretable spatial prior that biases the model's attention towards the center of the image (the road ahead). This leads to more stable and contextually-aware feature extraction and has been shown to improve performance significantly, especially in multi-camera fusion scenarios (an illustrative sketch follows below).
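To make the center-bias idea concrete, here is a minimal, hypothetical sketch of a positional prior that decays with distance from the image center and is added to the patch tokens. It is not the actual HDPE formulation (which is derived from n-dimensional geometry and will be detailed in the upcoming paper); the class name, grid size, and Gaussian decay are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CenterBiasedPositionalPrior(nn.Module):
    """Illustrative sketch only: a 2D positional prior that is strongest at the
    image center and fades towards the edges. The grid size and decay scale are
    assumptions, not the published HDPE definition."""

    def __init__(self, embed_dim: int, grid_h: int = 28, grid_w: int = 28, sigma: float = 0.5):
        super().__init__()
        # Normalised patch coordinates in [-1, 1]
        ys = torch.linspace(-1.0, 1.0, grid_h)
        xs = torch.linspace(-1.0, 1.0, grid_w)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        # Radial distance from the image center (the road ahead)
        dist = torch.sqrt(xx ** 2 + yy ** 2)
        # Gaussian prior: peaks at the center, decays towards the borders
        prior = torch.exp(-dist ** 2 / (2 * sigma ** 2))   # (grid_h, grid_w)
        self.register_buffer("prior", prior.flatten())      # (grid_h * grid_w,)
        # Project the scalar prior into the token embedding space
        self.proj = nn.Linear(1, embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, embed_dim)
        bias = self.proj(self.prior.unsqueeze(-1))           # (num_patches, embed_dim)
        return tokens + bias.unsqueeze(0)
```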
2. Advanced Multi-Task Loss Framework
- What it is: This model was trained using a specialized combination of Focal Loss and Enhanced-IoU (EIoU) Loss.
- Why it matters: This framework is purpose-built to tackle the primary challenges in perception: Focal Loss addresses the severe class imbalance in object detection, while EIoU Loss ensures highly accurate bounding box regression by optimizing for geometric overlap (a reference sketch of both terms follows below).
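For reference, the two loss terms can be sketched as follows. These are textbook implementations of Focal Loss and EIoU loss; the exact hyperparameters and relative weighting used to train these weights may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; alpha/gamma are the common defaults, not necessarily
    the values used for this model."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for boxes given as (x1, y1, x2, y2): penalises 1 - IoU plus
    center-distance, width, and height gaps relative to the enclosing box."""
    # Intersection and union
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Center-distance, width, and height penalties
    center_dist = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
                   + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    w_dist = ((pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])) ** 2
    h_dist = ((pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])) ** 2
    loss = 1 - iou + center_dist / c2 + w_dist / (cw ** 2 + eps) + h_dist / (ch ** 2 + eps)
    return loss.mean()

# Combined objective (the relative weighting below is an assumption)
# total_loss = focal_loss(cls_logits, cls_targets) + eiou_loss(pred_boxes, gt_boxes)
```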
3. High-Resolution, Camera-Only Architecture
- What it is: This model is vision-based (camera-only) and uses a ResNet-50 backbone with a smaller patch size (`patch_size=8`) for high-resolution analysis (see the token-count comparison below).
- Why it matters: It demonstrates that strong perception performance can be achieved without costly sensors like LiDAR, aligning with modern, cost-effective approaches to autonomous driving.
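As a quick back-of-the-envelope comparison (assuming a 224x224 input, which may not match this model's actual resolution), halving the patch size from 16 to 8 quadruples the number of patch tokens the transformer attends over:

```python
# Illustrative only: token count at patch_size=8 vs. the more common 16
h = w = 224
tokens_p8  = (h // 8)  * (w // 8)    # 28 x 28 = 784 tokens
tokens_p16 = (h // 16) * (w // 16)   # 14 x 14 = 196 tokens
print(tokens_p8, tokens_p16)         # 784 196 -> ~4x finer spatial grid
```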
Model Architecture vs. Baseline
| Component | Original Interfuser (Baseline) | Interfuser-HDPE (This Model) |
|---|---|---|
| Positional Encoding | Sinusoidal PE | Hyper-Dimensional PE (HDPE) |
| Perception Backbone | ResNet-26, LiDAR | Camera-Only, ResNet-50 |
| Training Objective | Standard BCE + L1 Loss | Focal Loss + EIoU Loss |
| Model Outputs | Waypoints, Traffic Grid, States | Same (optimized for higher accuracy) |
How to Use These Weights
These weights are intended to be loaded into a model class that incorporates our architectural changes, primarily the `HyperDimensionalPositionalEncoding` module.
```python
import torch
from huggingface_hub import hf_hub_download

# You need to provide the model class definition, let's call it InterfuserHDPE
from your_model_definition_file import InterfuserHDPE

# Download the pre-trained model weights
model_path = hf_hub_download(
    repo_id="BaseerAI/Interfuser-Baseer-v1",
    filename="interfuser_hdpe_v1.pth"
)

# Instantiate your model architecture
# The config must match the architecture these weights were trained on
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = InterfuserHDPE(**model_config).to(device)

# Load the state dictionary
state_dict = torch.load(model_path, map_location=device)
model.load_state_dict(state_dict)
model.eval()

# Now the model is ready for inference
with torch.no_grad():
    # The model expects a dictionary of sensor data
    # (e.g., {'rgb': camera_tensor, ...})
    perception_outputs = model(input_data)
```
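The `model_config` and `input_data` objects above depend on your `InterfuserHDPE` definition. The snippet below shows one hypothetical shape for them; the key names, dimensions, and camera layout are assumptions, not the repository's actual interface.

```python
# Hypothetical example only -- adapt keys and shapes to your InterfuserHDPE definition
model_config = {
    "backbone": "resnet50",   # camera-only perception backbone
    "patch_size": 8,          # high-resolution patch embedding
    "embed_dim": 256,         # transformer width (assumed)
}

# A single RGB frame batched as (B, C, H, W); multi-camera setups would add
# more entries to this dictionary (key names are assumptions)
input_data = {
    "rgb": torch.randn(1, 3, 224, 224).to(device),
}
```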
Performance Highlights
When integrated into a full driving stack (like our Baseer Self-Driving API), this perception model is the foundation for:
- Significantly Improved Detection Accuracy: Achieves higher mAP on the PDM-Lite-CARLA dataset.
- Superior Driving Score: Leads to a higher overall Driving Score with fewer infractions compared to baseline models.
- Proven Scalability: Performance demonstrably improves when scaling from single-camera to multi-camera inputs, showcasing the robustness of the HDPE-based architecture.
(Detailed metrics and ablation studies will be available in our upcoming research paper.)
Integration with a Full System
This model provides the core perception outputs. To build a complete autonomous agent, you need to combine it with:
- A Temporal Tracker: To maintain object identity across frames.
- A Decision-Making Controller: To translate perception outputs into vehicle commands.
An example of such a complete system, including our custom-built Hierarchical, Memory-Enhanced Controller, can be found in our Live Demo API Space.
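As a rough illustration of how the pieces fit together, an agent loop might look like the sketch below; the tracker and controller interfaces are placeholders, not the components shipped with the Live Demo API.

```python
# Hypothetical glue code: perception -> tracking -> control.
# The tracker/controller objects and output handling are placeholders.
def drive_step(model, tracker, controller, sensor_frame, device):
    with torch.no_grad():
        outputs = model({"rgb": sensor_frame.to(device)})  # perception (this repo)
    tracks = tracker.update(outputs)           # maintain object identity across frames
    command = controller.act(outputs, tracks)  # e.g. steering / throttle / brake
    return command
```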
Citation
If you use the HDPE concept or this model in your research, please cite our upcoming paper. For now, you can cite this model repository:
```bibtex
@misc{interfuser-hdpe-2024,
  title={HDPE: Hyper-Dimensional Positional Encoding for End-to-End Self-Driving Systems},
  author={Altawil, Adam},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/BaseerAI/Interfuser-Baseer-v1}}
}
```
Development
Lead Researcher: Adam Altawil
Project Type: Graduation Project - AI & Autonomous Driving
Contact: [Your Contact Information]
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing & Support
For questions, contributions, and support:
- Try the Live Demo: Baseer Server Space
- Contact: [Your Contact Information]
- Issues: Create an issue in this repository