---
title: VizAttn
emoji: 🐈
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---
# ViT
- GitHub source repo ⭐: [VitCiFar](https://github.com/Muthukamalan/VitCiFar)
As we all know, the Transformer architecture has taken the world by storm.
In this repo, I practised implementing it for vision, from scratch. Keep in mind that Transformers are data-hungry, so don't compare them head-to-head with CNNs (it's not an apples-to-apples comparison here).
#### Model
<div align='center'><img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/vit.png" width=500 height=300></div>
**Patches**
```python
import torch.nn as nn

# Non-overlapping patches: kernel_size == stride == patch_size splits the
# image into patch_size x patch_size tiles, each projected to emb_dim.
nn.Conv2d(
    in_chans,
    emb_dim,
    kernel_size=patch_size,
    stride=patch_size,
)
```
<div align='center'>
<img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/patches.png" width=500 height=300 style="display:inline-block; margin-right: 10px;" alt="patchs">
<img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/embedding.png" width=500 height=300 style="display:inline-block;">
</div>
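A quick sketch of the shapes this produces, assuming CIFAR-10-sized inputs and hypothetical values `in_chans=3`, `emb_dim=384`, `patch_size=4`:
```python
import torch
import torch.nn as nn

patchify = nn.Conv2d(3, 384, kernel_size=4, stride=4)
x = torch.randn(2, 3, 32, 32)                 # [batch, channels, H, W]
patches = patchify(x)                         # [2, 384, 8, 8] -> an 8x8 grid of patches
tokens = patches.flatten(2).transpose(1, 2)   # [2, 64, 384]: one embedding per patch
print(tokens.shape)
```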
> [!NOTE] CAUSAL MASK
> Unlike with words (autoregressive language models), we don't use a causal mask here: every patch attends to every other patch.
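A minimal sketch of what that means in PyTorch (hypothetical sizes, chosen to match the `[batch, 65, 384]` shapes used below): self-attention over patches is bidirectional, so `attn_mask` is simply left as `None`.
```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=384, num_heads=6, batch_first=True)
x = torch.randn(2, 65, 384)                    # [batch, 1 cls token + 64 patches, emb_dim]
out, weights = attn(x, x, x, attn_mask=None)   # no causal mask: all patches see each other
print(out.shape, weights.shape)                # [2, 65, 384], [2, 65, 65]
```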
<p align="center">
<img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/attention-part.png" alt="Attention Visualization" />
</p>
At the final projection layer, there are two options (see the sketch after the snippet below):
- Pool (e.g. mean) over all token outputs, then project them to the prediction layer.
- Prepend one learnable token before the transformer blocks, then pick that token's output and pass it to the projection layer (like `BERT` does); this is what ViT chooses.
```python
# Transformer Encoder
xformer_out = self.enc(out)  # [batch, 65, 384]
if self.is_cls_token:
    token_out = xformer_out[:, 0]  # [batch, 384]
else:
    token_out = xformer_out.mean(1)
# MLP Head
projection_out = self.mlp_head(token_out)  # [batch, 10]
```
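For context, a minimal self-contained sketch (hypothetical layer sizes, not the repo's exact code) of how the extra token is prepended before the encoder and then picked out, matching the snippet above:
```python
import torch
import torch.nn as nn

class TinyViTHead(nn.Module):
    """Prepend a learnable CLS token, encode, then project its output."""
    def __init__(self, emb_dim=384, num_patches=64, num_classes=10, is_cls_token=True):
        super().__init__()
        self.is_cls_token = is_cls_token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, emb_dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, emb_dim))
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=6, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=7)
        self.mlp_head = nn.Linear(emb_dim, num_classes)

    def forward(self, patch_tokens):                                 # [batch, 64, 384]
        cls = self.cls_token.expand(patch_tokens.size(0), -1, -1)
        out = torch.cat([cls, patch_tokens], dim=1) + self.pos_emb   # [batch, 65, 384]
        xformer_out = self.enc(out)                                  # [batch, 65, 384]
        token_out = xformer_out[:, 0] if self.is_cls_token else xformer_out.mean(1)
        return self.mlp_head(token_out)                              # [batch, 10]

print(TinyViTHead()(torch.randn(2, 64, 384)).shape)                  # torch.Size([2, 10])
```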
#### Grad-CAM
[Xplain AI](https://github.com/jacobgil/pytorch-grad-cam)
- `register_forward_hook`: the hook is executed during the forward pass of the model, as in the sketch below.
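A small sketch (hypothetical model and layer, not tied to any specific library) of how such a hook captures activations during the forward pass:
```python
import torch
import torch.nn as nn

activations = {}

def save_activation(module, inputs, output):
    # Runs automatically inside model(x); stash the layer's output for later use.
    activations["feat"] = output.detach()

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 16, 3, padding=1))
handle = model[2].register_forward_hook(save_activation)

_ = model(torch.randn(1, 3, 32, 32))
print(activations["feat"].shape)   # torch.Size([1, 16, 32, 32])
handle.remove()                    # detach the hook when done
```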