---
title: VizAttn
emoji: 🐈
colorFrom: red
colorTo: green
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---


# ViT
- GitHub source repo ⭐: [VitCiFar](https://github.com/Muthukamalan/VitCiFar)

As we all know, the Transformer architecture has taken the world by storm.

In this repo, I practised implementing it (from scratch) for vision. Transformers are data hungry, so don't compare them directly with CNNs (it's not an apples-to-apples comparison).


#### Model
<div align='center'><img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/vit.png" width=500 height=300></div>


**Patches**
```python
# Patch embedding: a conv with kernel_size == stride == patch_size splits the
# image into non-overlapping patches and projects each one to emb_dim in a single op.
nn.Conv2d(
    in_chans,
    emb_dim,
    kernel_size=patch_size,
    stride=patch_size,
)
```
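
For a CIFAR-sized input this yields an 8×8 grid of patch embeddings that is then flattened into a token sequence. A minimal shape check (patch_size=4 and emb_dim=384 are assumptions here, chosen to match the `[batch, 65, 384]` shape in the forward pass below):

```python
import torch
import torch.nn as nn

# Assumed values: CIFAR-10 images (3x32x32), patch_size=4, emb_dim=384
patch_size, emb_dim, in_chans = 4, 384, 3
to_patches = nn.Conv2d(in_chans, emb_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(2, 3, 32, 32)                 # [batch, channels, H, W]
patches = to_patches(x)                       # [2, 384, 8, 8]: one embedding per patch
tokens = patches.flatten(2).transpose(1, 2)   # [2, 64, 384]: 64 patch tokens per image
print(tokens.shape)
```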
<div align='center'>
    <img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/patches.png" width=500 height=300 style="display:inline-block; margin-right: 10px;" alt="patchs">
    <img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/embedding.png" width=500 height=300 style="display:inline-block;">
</div>


> [!NOTE] CAUSAL MASK
> Unlike with words, we don't use a causal mask here.
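
A sketch of what that means in code (not the repo's exact implementation): the attention call simply passes no mask, so every patch token attends to every other token and to the class token. The shapes below (6 heads × 64 head dim = 384) are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative only: q, k, v are per-head projections of the 65 tokens (64 patches + CLS).
q = k = v = torch.randn(2, 6, 65, 64)   # [batch, heads, tokens, head_dim] (shapes assumed)

# No attn_mask and is_causal=False: full bidirectional attention over all patches.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)
print(out.shape)  # [2, 6, 65, 64]
```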


<!-- <div align='center'><img src="assets/attention-part.png" width=300 height=500 style="display:inline-block; margin-right: 10px;"></div> -->
<p align="center">
  <img src="https://raw.githubusercontent.com/Muthukamalan/VitCiFar/main/assets/attention-part.png" alt="Attention Visualization" />
</p>


At the final projection layer, there are two options:
- mean-pool (combine) all the output tokens and pass the result to the projection layer, or
- prepend one extra learnable token before the transformer blocks, then pick that token's output and pass it to the projection layer (like `BERT` does)  << ViT chooses this

```python
# Transformer encoder over the token sequence (64 patch tokens + 1 CLS token)
xformer_out = self.enc(out)              # [batch, 65, 384]

if self.is_cls_token:
    token_out = xformer_out[:, 0]        # [batch, 384] -> take the CLS token
else:
    token_out = xformer_out.mean(1)      # [batch, 384] -> mean-pool all tokens

# MLP head projects to class logits
projection_out = self.mlp_head(token_out)  # [batch, 10]
```
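
The `out` fed into `self.enc` is the patch-token sequence with the class token prepended and positional embeddings added. A rough sketch of that step, using hypothetical names `cls_token` and `pos_emb` (not necessarily the repo's):

```python
import torch
import torch.nn as nn

batch, num_patches, emb_dim = 2, 64, 384
tokens = torch.randn(batch, num_patches, emb_dim)                    # flattened patch embeddings

cls_token = nn.Parameter(torch.zeros(1, 1, emb_dim))                 # learnable [CLS] token
pos_emb   = nn.Parameter(torch.zeros(1, num_patches + 1, emb_dim))   # learned positions

out = torch.cat([cls_token.expand(batch, -1, -1), tokens], dim=1)    # [2, 65, 384]
out = out + pos_emb                                                   # add positional information
```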


#### Context Grad-CAM 
[Xplain AI](https://github.com/jacobgil/pytorch-grad-cam)

- `register_forward_hook`: the hook is executed during the model's forward pass, which lets us capture intermediate activations (e.g. attention maps) for visualization.
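
A minimal sketch of how such a hook can be attached; the layer and key names here are illustrative, not the repo's actual module names:

```python
import torch
import torch.nn as nn

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Called automatically during the forward pass; stash the output.
        activations[name] = output.detach()
    return hook

layer = nn.Linear(384, 384)                  # stand-in for e.g. an attention sub-layer
handle = layer.register_forward_hook(save_activation("attn_block"))

_ = layer(torch.randn(2, 65, 384))           # forward pass triggers the hook
print(activations["attn_block"].shape)       # [2, 65, 384]

handle.remove()                              # detach the hook when done
```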