---
license: mit
datasets:
- CreitinGameplays/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70Bmistral
language:
- en
base_model:
- mistralai/Mistral-Nemo-Instruct-2407
pipeline_tag: text-generation
library_name: transformers
---

## Mistral Nemo 12B R1
![mistralthink](https://autumn.revolt.chat/attachments/zIqa-Q6gKlwm7BbOvKvFFRLHDdy5OOy30KcU5iFle1/image.png)

Took **96 hours** to finetune on **2x Nvidia RTX A6000** with the following settings:
- Batch size: 3
- Gradient accumulation steps: 1
- Epochs: 1
- Learning rate: 1e-4
- Warmup ratio: 0.1

Run the model:
```python
import torch
from transformers import pipeline

model_id = "CreitinGameplays/Mistral-Nemo-12B-R1-v0.1"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "How many r's are in strawberry?"}
]

outputs = pipe(
    messages,
    temperature=0.3,
    top_p=0.8,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=2048
)

print(outputs[0]["generated_text"][-1])
```

Recommended system prompt:
```
You are an AI focused on providing systematic, well-reasoned responses. Response Structure: - Format: <think>{reasoning}</think>{answer} - Process: Think first, then answer.
```

Response example:
```
<s>[INST]You are an AI focused on providing systematic, well-reasoned responses. Response Structure: - Format: <think>{reasoning}</think>{answer} - Process: Think first, then answer.

what is a nvidia tesla t4[/INST]<think>Okay, so I need to figure out what a NVIDIA Tesla T4 is. I remember that NVIDIA makes GPUs for various purposes, but I'm not exactly sure about the specifics of the Tesla models. Let me start by recalling what I know about GPUs in general. They're used for graphics rendering, right? But I also heard that they can be used for other tasks like machine learning and deep learning. 

I think the Tesla series from NVIDIA is specifically designed for these non-gaming applications. So, the Tesla T4 must be one of those. I've heard terms like Tensor cores before, which I believe are specialized hardware units inside some GPUs for accelerating certain operations. Maybe the T4 has Tensor cores?

Let me try to break it down. The name "Tesla" suggests it's part of their data center or professional line. The number "4" probably indicates it's the fourth generation or something similar. I should check if there are previous versions like T1, T2, etc., to see where it stands in the lineup.

I remember that the Tesla V100 was a high-end model with a lot of memory and performance. The T4 might be a step below that or perhaps a different type altogether. Maybe it's more affordable or targeted at specific use cases.

I think the T4 is known for its support for mixed precision training, which allows using both floating-point and half-precision numbers together. This could make training neural networks faster and more efficient. It might also have features like NVLink for connecting multiple GPUs, which would be useful for distributed computing.

Looking at specs, I believe the T4 has 320 Tensor cores. Each Tensor core can perform matrix multiplications much faster than regular CUDA cores. So, it's optimized for machine learning tasks. The amount of memory is probably around 16 GB, which is less than the V100's 16 GB or 32 GB, but still sufficient for many applications.

I also recall that the T4 supports NVIDIA's GPU virtualization technology, which allows multiple VMs to share a single GPU. That would be great for cloud providers who want to maximize resource utilization.

In terms of performance, the T4 is likely slower than the V100 but faster than older models like the P400. It might be used in applications where cost is a factor, but performance isn't as critical as in high-end research or enterprise settings.

I should also consider the form factors. I think the T4 comes in PCIe cards, maybe with different sizes depending on the application—like full-height or low-profile. It might also be available in server boards or as part of pre-built systems.

Another thing is power consumption. The T4 is probably more energy-efficient than earlier models, which is important for data centers looking to reduce costs and environmental impact.

I wonder how it compares to AMD's offerings. I know AMD has GPUs like the Radeon Instinct, but I'm not sure how they stack up against the Tesla T4 in terms of performance and features.

In summary, the Tesla T4 is a professional-grade GPU designed for data centers and machine learning tasks. It features Tensor cores for acceleration, supports mixed precision, and has enough memory and connectivity options to handle modern workloads efficiently.
</think>

The NVIDIA Tesla T4 is a professional-grade GPU designed for data centers and machine learning tasks. Key features include:

1. **Tensor Cores**: Equipped with 320 Tensor cores, each optimizing matrix multiplications for accelerated machine learning tasks.

2. **Mixed Precision Training**: Supports simultaneous use of floating-point and half-precision numbers, enhancing efficiency in neural network training.

3. **Memory and Connectivity**: Offers 16 GB of memory and NVLink support for connecting multiple GPUs, ideal for distributed computing.

4. **GPU Virtualization**: Enables multiple virtual machines to share a single GPU, maximizing resource utilization in cloud environments.

5. **Performance and Power**: Slower than the V100 but faster than older models, with improved energy efficiency suitable for cost-conscious data centers.

6. **Form Factors**: Available in PCIe cards with varying sizes, including full-height and low-profile options, along with server board and pre-built system configurations.

7. **Comparison with AMD**: While the T4 competes with AMD's Radeon Instinct GPUs, it offers unique features tailored to data center and machine learning needs.

In conclusion, the Tesla T4 is a versatile solution for data centers seeking balance between performance, efficiency, and affordability. Its features make it suitable for a range of applications, from cloud services to machine learning projects.</s>
```

**Note**: The model was mainly finetuned on English dataset, meaning the model may not perform well in other languages.