kyrylokumar committed on
Commit 8398dfc · verified · 1 Parent(s): 3a1b2c0

Added extra files

Files changed (2):
  1. README.md +101 -68
  2. q3.ipynb +19 -2
README.md CHANGED
@@ -1,69 +1,102 @@
- ## Part 1
-
- Normal model
- Memory usage of model alone = 510.342192
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 838.783488
- 100%|██████████████████████████████████████▊| 489/491 [00:25<00:00, 18.97it/s]
- Loss = 26.38488006591797
- Time taken: 25.795103549957275
-
- Full model quant
- Memory usage of model alone = 294.250369
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 1465.776128
- 100%|██████████████████████████████████████▊| 489/491 [00:21<00:00, 22.39it/s]
- Loss = 26.954803466796875
- Time taken: 21.855380058288574
-
- Full model without lm_head
- Memory usage of model alone = 255.602736
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 1269.30176
- 100%|██████████████████████████████████████▊| 489/491 [00:21<00:00, 22.68it/s]
- Loss = 26.41402816772461
- Time taken: 21.578929662704468
-
- Only LM head
- Memory usage of model alone = 548.989825
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 1036.319744
- 100%|██████████████████████████████████████▊| 489/491 [00:20<00:00, 23.39it/s]
- Loss = 26.924053192138672
- Time taken: 20.919220209121704
-
- Last 4 attention layers
- Memory usage of model alone = 425.42904
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 983.949824
- 100%|██████████████████████████████████████▊| 489/491 [00:20<00:00, 23.40it/s]
- Loss = 26.39584732055664
- Time taken: 20.912957668304443
-
- Only q,k,v
- Memory usage of model alone = 425.425968
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 989.827584
- 100%|██████████████████████████████████████▊| 489/491 [00:21<00:00, 23.11it/s]
- Loss = 26.396583557128906
- Time taken: 21.17274236679077
-
-
- ## Part 2:
- 4 bit model
- Memory usage of model alone = 134.060568
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 308.803072
- 100%|██████████████████████████████████████▊| 489/491 [00:16<00:00, 29.78it/s]
- Loss = 31.296875
- Time taken: 16.42749333381653
-
- `low_cpu_mem_usage` was None, now set to True since model is quantized.
- 8 bit model
- Memory usage of model alone = 176.527896
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 494.142976
- 100%|██████████████████████████████████████▊| 489/491 [00:29<00:00, 16.70it/s]
- Loss = 26.5625
- Time taken: 29.27569341659546
-
- `low_cpu_mem_usage` was None, now set to True since model is quantized.
- 4 bit nf4 model
- Memory usage of model alone = 134.060568
- 0%| | 0/491 [00:00<?, ?it/s]Memory usage at forward pass = 494.85824
- 100%|██████████████████████████████████████▊| 489/491 [00:15<00:00, 30.64it/s]
- Loss = 28.375
- Time taken: 15.961309671401978
+ # Quantizing gpt2: Analysis of Time and Memory Trends
+
+ This document outlines the quantization techniques applied to the gpt2 model and analyzes their impact on memory usage, loss, and execution time, with a focus on explaining the observed trends.
+
+ ## Part 1 - Manual Quantization
+
+ ### Baseline Model
+
+ * **Model Memory Usage:** 510.34 MB
+ * **Memory Usage (Forward Pass):** 838.78 MB
+ * **Loss:** 26.38
+ * **Time Taken:** 25.80 seconds
+
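+ For context, a minimal sketch of how numbers like these can be collected (assuming a CUDA device and batches of token-id tensors; the names are illustrative, not the original script):
+
+ ```python
+ import time
+ import torch
+
+ def measure(model, dataloader, device="cuda"):
+     model.to(device).eval()
+     torch.cuda.reset_peak_memory_stats()
+     # Weights-only footprint, in MB (bytes / 1e6), before any forward pass.
+     print(f"Memory usage of model alone = {torch.cuda.memory_allocated() / 1e6}")
+
+     start, total_loss = time.time(), 0.0
+     with torch.no_grad():
+         for batch in dataloader:
+             batch = batch.to(device)
+             # HF causal-LM models return a loss when labels are provided.
+             total_loss += model(batch, labels=batch).loss.item()
+     # Peak allocation observed during the forward passes, in MB.
+     print(f"Memory usage at forward pass = {torch.cuda.max_memory_allocated() / 1e6}")
+     print(f"Loss = {total_loss}; time taken: {time.time() - start}")
+ ```
+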
+ ### Full Model Quantization
+
+ * **Model Memory Usage:** 294.25 MB (significantly reduced by the lower-precision weights)
+ * **Memory Usage (Forward Pass):** 1465.78 MB (increased by the overhead of quantization/dequantization operations)
+ * **Loss:** 26.95 (slight increase over baseline, expected with reduced precision)
+ * **Time Taken:** 21.86 seconds (faster than baseline despite the dequantization overhead)
+
+ **Comment:** While the stored model shrinks, forward-pass memory grows, suggesting overhead in managing quantized weights. The time reduction might come from faster memory access to the smaller weights or from kernels that favor the quantized layout.
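+
+ The mechanism behind this trade-off, as a sketch (symmetric per-tensor int8; not necessarily the exact scheme used in these runs):
+
+ ```python
+ import torch
+
+ def quantize_tensor(w: torch.Tensor):
+     # Store int8 values plus one float scale: ~4x smaller than fp32 at rest.
+     scale = w.abs().max() / 127
+     q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
+     return q, scale
+
+ def dequantize_tensor(q: torch.Tensor, scale: torch.Tensor):
+     # The float copy created here is the forward-pass overhead discussed
+     # above: weights are small at rest, but each use rebuilds an fp32 tensor.
+     return q.to(torch.float32) * scale
+ ```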
+
+ ### Full Model without lm_head Quantization
+
+ * **Model Memory Usage:** 255.60 MB (lower even than full quantization; see the lm_head note below)
+ * **Memory Usage (Forward Pass):** 1269.30 MB (lower than full-model quantization but still above baseline)
+ * **Loss:** 26.41 (close to baseline, suggesting lm_head is the quantization-sensitive part)
+ * **Time Taken:** 21.58 seconds (slightly faster than full quantization, consistent with the smaller footprint)
+
+ **Comment:** Excluding lm_head from quantization balances memory usage and accuracy. The forward pass still shows overhead, but less than the fully quantized model, indicating lm_head contributes to it.
+
+ ### Only lm_head Quantization
+
+ * **Model Memory Usage:** 548.99 MB (larger even than the unquantized baseline; gpt2 ties lm_head to the token embedding, so quantizing lm_head likely materializes a separate int8 copy of that ~38.6M-parameter matrix instead of sharing it, which matches the 548.99 - 510.34 ≈ 38.6 MB gap)
+ * **Memory Usage (Forward Pass):** 1036.32 MB (lower than full quantization, so the transformer blocks contribute more of the overhead)
+ * **Loss:** 26.92 (similar to full quantization, confirming lm_head's sensitivity to quantization)
+ * **Time Taken:** 20.92 seconds (among the fastest manual variants)
+
+ **Comment:** Quantizing only lm_head gives fast inference but a clear loss penalty, and the stored model is larger than every other variant. The benefit of quantization clearly depends on which part of the model is targeted.
+
+ ### Last 4 Attention Layers Quantization
+
+ * **Model Memory Usage:** 425.43 MB
+ * **Memory Usage (Forward Pass):** 983.95 MB (further reduction in forward-pass memory)
+ * **Loss:** 26.40 (minimal impact, suggesting these layers tolerate quantization well)
+ * **Time Taken:** 20.91 seconds (fastest manual variant, consistent with the lower forward-pass memory)
+
+ **Comment:** Targeting specific layers, such as the last attention layers, yields solid memory and time improvements with minimal accuracy loss, showing that selective quantization is effective.
+
+ ### Only q, k, v Matrices Quantization
+
+ * **Model Memory Usage:** 425.43 MB (essentially the same as the last-4-attention-layers variant)
+ * **Memory Usage (Forward Pass):** 989.83 MB (slightly higher than the last-4-attention-layers variant, possibly due to different access patterns)
+ * **Loss:** 26.40 (similar as well, confirming these matrices can be quantized safely)
+ * **Time Taken:** 21.17 seconds (slightly slower than the last-4-attention-layers variant)
+
+ **Comment:** Quantizing only the q, k, v matrices gives benefits comparable to quantizing the last attention layers; the small differences in time and memory likely come from implementation details and data-access patterns.
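+
+ Selective variants like these reduce to filtering module names before quantizing. A sketch building on `quantize_tensor`/`dequantize_tensor` from the earlier block (the hook-based just-in-time dequantization is one possible implementation, not necessarily the original):
+
+ ```python
+ import torch
+
+ @torch.no_grad()
+ def quantize_matching(model, name_filter):
+     for name, module in model.named_modules():
+         if name_filter(name) and hasattr(module, "weight"):
+             q, scale = quantize_tensor(module.weight.data)
+             del module.weight                      # drop the fp32 parameter
+             module.register_buffer("weight_q", q)  # int8 at rest
+             module.weight_scale = scale
+             # Rebuild a float weight just-in-time for each forward call.
+             module.register_forward_pre_hook(
+                 lambda mod, args: setattr(
+                     mod, "weight", dequantize_tensor(mod.weight_q, mod.weight_scale)
+                 )
+             )
+
+ # gpt2 fuses q, k, v into a single c_attn projection per block, so e.g.:
+ # quantize_matching(model, lambda n: "c_attn" in n)
+ ```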
+
+ ## Part 2 - Quantization using bitsandbytes (bnb)
+
+ ### 4-bit Quantization
+
+ * **Model Memory Usage:** 134.06 MB (far smaller than any manual variant, thanks to 4-bit precision)
+ * **Memory Usage (Forward Pass):** 308.80 MB (also far smaller than every manual variant, pointing to more efficient handling of quantized weights)
+ * **Loss:** 31.30 (the largest increase, showing the memory/accuracy trade-off at 4-bit)
+ * **Time Taken:** 16.43 seconds (faster than every manual variant; only the NF4 run below is faster)
+ * **Note:** `low_cpu_mem_usage` set to True.
+
+ **Comment:** bnb's 4-bit quantization offers substantial memory and time savings at the cost of noticeably higher loss, making it suitable where speed and memory are prioritized over accuracy.
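+
+ These runs use the standard `transformers` integration with bitsandbytes; loading gpt2 in 4-bit looks roughly like this (the config below uses bnb defaults and is a sketch, not the original notebook cell):
+
+ ```python
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # FP4 data type by default
+ model = AutoModelForCausalLM.from_pretrained(
+     "gpt2",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ ```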
+
+ ### 8-bit Quantization
+
+ * **Model Memory Usage:** 176.53 MB (larger than 4-bit but still far smaller than the manual variants)
+ * **Memory Usage (Forward Pass):** 494.14 MB (larger than 4-bit, smaller than manual quantization)
+ * **Loss:** 26.56 (close to baseline; 8-bit preserves accuracy well)
+ * **Time Taken:** 29.28 seconds (the slowest run overall, slower even than the fp32 baseline, likely because bnb's 8-bit matmul does extra work per operation such as outlier handling)
+ * **Note:** `low_cpu_mem_usage` set to True.
+
+ **Comment:** 8-bit quantization with bnb keeps the loss essentially at baseline while cutting the stored model to about a third of the baseline size, but it pays for that accuracy with the slowest inference of all the runs.
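+
+ The 8-bit run only changes the config flag (same caveats as the 4-bit sketch above):
+
+ ```python
+ bnb_config = BitsAndBytesConfig(load_in_8bit=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "gpt2", quantization_config=bnb_config, device_map="auto"
+ )
+ ```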
+
+ ### 4-bit NF4 Quantization
+
+ * **Model Memory Usage:** 134.06 MB (same as plain 4-bit, as expected)
+ * **Memory Usage (Forward Pass):** 494.86 MB (higher than plain 4-bit; NF4 might introduce different runtime overhead)
+ * **Loss:** 28.38 (lower than plain 4-bit, showing NF4's better accuracy)
+ * **Time Taken:** 15.96 seconds (fastest of all runs, suggesting NF4 is computationally efficient)
+
+ **Comment:** NF4 (NormalFloat4) in bnb gives better accuracy than standard 4-bit quantization while keeping similar speed benefits, making it a good option when both efficiency and reasonable accuracy are needed.
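+
+ NF4 is selected via `bnb_4bit_quant_type`; the compute dtype shown here is an assumption, not taken from the original run:
+
+ ```python
+ import torch
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",             # NormalFloat4 instead of FP4
+     bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 matmuls
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     "gpt2", quantization_config=bnb_config, device_map="auto"
+ )
+ ```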
+
+ ## General Comments on Time and Memory Trends
+
+ * **Quantization Precision:** Lower precision (e.g., 4-bit) shrinks the stored model but can increase loss.
+ * **Quantization Overhead:** Quantization and dequantization operations can raise the forward-pass memory footprint even while the stored model shrinks, as the manual runs above show.
+ * **Implementation Efficiency:** Libraries like bnb ship kernels optimized for quantized operations; the 4-bit runs here are the fastest of all, though the slower 8-bit run shows the speedup is not universal.
+ * **Selective Quantization:** Targeting specific parts of the model (certain layers or matrices) can balance memory reduction, speed improvement, and accuracy preservation.
+ * **Data Movement:** A smaller memory footprint can speed up inference by reducing data movement between memory and processing units.
+
+ By weighing these factors, one can choose a quantization strategy that fits the application's trade-offs between memory, speed, and accuracy.
+
+ ## Part 3 - Quantization using llama.cpp
+
+ * The PyTorch model (`pytorch_model.bin`) is converted to a quantized gguf file (`gpt2.ggml`) using llama.cpp.
+ * The quantized model is uploaded to Hugging Face: [gpt2-quantized-gguf](https://huggingface.co/kyrylokumar/gpt2-quantzed-gguf)
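+
+ The upload step (from `q3.ipynb`) uses `huggingface_hub`; a minimal sketch of the call, assuming the gguf file sits in the current directory:
+
+ ```python
+ from huggingface_hub import create_repo, upload_folder
+
+ repo_id = "kyrylokumar/gpt2-quantzed-gguf"
+ create_repo(repo_id, exist_ok=True)               # no-op if the repo exists
+ upload_folder(repo_id=repo_id, folder_path="./")  # pushes the gguf file
+ ```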
q3.ipynb CHANGED
@@ -255,9 +255,26 @@
  },
  {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 35,
  "metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/kyrylo/Sem-7/Anlp/Grokking/Minimal/lib/python3.8/site-packages/huggingface_hub/hf_api.py:9628: UserWarning: Warnings while validating metadata in README.md:\n",
+ "- empty or missing yaml metadata in repo card\n",
+ " warnings.warn(f\"Warnings while validating metadata in README.md:\\n{message}\")\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Directory './' pushed to: kyrylokumar/gpt2-quantzed-gguf\n"
+ ]
+ }
+ ],
  "source": [
  "from huggingface_hub import create_repo, upload_folder\n",
  "\n",