---
license: mit
language:
- en
base_model:
- microsoft/phi-4-gguf
pipeline_tag: text-generation
tags:
- phi4
- gguf-connector
---

# GGUF quantized and bug-fixed version of **phi4**

### review
- bug fixed for: "ResponseError: llama runner process has terminated: GGML_ASSERT(hparams.n_swa > 0) failed"
- the fix sets the `general.architecture` metadata (previously unset) to `llama`; everything works right away (see the sketch below)
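
the fix amounts to rewriting one key in the GGUF header. the snippet below is a minimal sketch for checking what a file declares, using the `gguf` package (`pip install gguf`); the file name is a placeholder and the field decoding follows gguf-py's reader layout, so treat it as illustrative rather than authoritative:

```python
# minimal sketch: inspect the metadata key behind the crash above.
# requires `pip install gguf`; the file name is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("phi-4-q4_0.gguf")  # hypothetical quant file name
field = reader.get_field("general.architecture")
if field is None:
    print("no architecture declared -- runners such as ollama will refuse to load it")
else:
    # string values live in parts[]; data[] indexes the value bytes
    print("architecture:", bytes(field.parts[field.data[-1]]).decode("utf-8"))
```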

### run the model
use any gguf connector to interact with the gguf file(s), e.g., [connector](https://pypi.org/project/gguf-connector/)
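
alternatively, any llama.cpp-based runner can load the file; below is a minimal sketch using llama-cpp-python (just one common option, not the connector above), with a placeholder file name standing in for whichever quant you downloaded:

```python
# minimal sketch with llama-cpp-python (`pip install llama-cpp-python`);
# "phi-4-q4_0.gguf" is a placeholder -- point it at your downloaded quant
from llama_cpp import Llama

llm = Llama(model_path="phi-4-q4_0.gguf", n_ctx=16384)  # phi-4 supports a 16K context
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "explain gguf quantization in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```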

### reference
- base model: microsoft/[phi-4](https://huggingface.co/microsoft/phi-4)
- bug fixed following the guide written by [unsloth](https://unsloth.ai/blog/phi4)
- tool used for quantization: [cutter](https://pypi.org/project/gguf-cutter)

### citation
[Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905)

### appendices: model summary and quality (written by microsoft)

#### model summary

| **Attribute**           | **Details**                                                                   |
|-------------------------|-------------------------------------------------------------------------------|
| **Developers**          | Microsoft Research                                                            |
| **Description**         | `phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.<br><br>`phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures                |
| **Architecture**        | 14B parameters, dense decoder-only Transformer model                          |
| **Inputs**              | Text, best suited for prompts in the chat format                              |
| **Context length**      | 16K tokens                                                                    |
| **GPUs**                | 1920 H100-80G                                                                 |
| **Training time**       | 21 days                                                                       |
| **Training data**       | 9.8T tokens                                                                   |
| **Outputs**             | Generated text in response to input                                           |
| **Dates**               | October 2024 – November 2024                                                  |
| **Status**              | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data                                                                               |
| **Release date**        | December 12, 2024                                                             |
| **License**             | MIT                                                                         |

#### model quality

to understand the capabilities, microsoft compared `phi-4` with a set of models over OpenAI’s SimpleEval benchmark; the table below gives a high-level overview of model quality on representative benchmarks, with higher numbers indicating better performance:

| **Category**                 | **Benchmark** | **phi-4** (14B) | **phi-3** (14B) | **Qwen 2.5** (14B instruct) | **GPT-4o-mini** | **Llama-3.3** (70B instruct) | **Qwen 2.5** (72B instruct) | **GPT-4o** |
|------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------|
| Popular Aggregated Benchmark | MMLU          | 84.8      | 77.9            | 79.9                 | 81.8                 | 86.3               | 85.3              | **88.1**            |
| Science                      | GPQA          | **56.1**      | 31.2            | 42.9                 | 40.9                 | 49.1               | 49.0              | 50.6            |
| Math                         | MGSM<br>MATH  | 80.6<br>**80.4** | 53.5<br>44.6 | 79.6<br>75.6 | 86.5<br>73.0 | 89.1<br>66.3* | 87.3<br>80.0              | **90.4**<br>74.6            |
| Code Generation              | HumanEval     | 82.6      | 67.8            | 72.1                 | 86.2                 | 78.9*               | 80.4              | **90.6**            |
| Factual Knowledge            | SimpleQA      | 3.0       | 7.6            | 5.4                 | 9.9                  | 20.9               | 10.2              | **39.4**             |
| Reasoning                    | DROP          | 75.5      | 68.3            | 85.5                 | 79.3                 | **90.2**               | 76.7              | 80.9            |

\* these scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following.