text1 = """<h1 id="how-many-data-points-is-a-prompt-worth">一条提示抵得上多少条数据?</h1>
<img class='center' style='height: 5em; float: right;' src='https://raw.githubusercontent.com/TevenLeScao/transformer-xl/master/pytorch/assets/avatar_logo_joint.png' alt='avatar'>
<h4>Published April 6, 2021</h4>
<h4>Teven Le Scao, researcher at Hugging Face • <a href="https://twitter.com/Fluke_Ellington">@Fluke_Ellington</a></h4>
<p>The current dominant approach in NLP is to fine-tune the classification head of a pretrained language model separately for each specific task. As language models grow ever larger, a variety of alternatives have emerged to challenge the classification-head approach widely used with <a href="https://arxiv.org/abs/1810.04805">BERT</a>, <a href="https://arxiv.org/abs/1905.03197">UniLM</a>, and <a href="https://openai.com/research/language-unsupervised">GPT</a>. In particular, GPT-3 popularized prompting, in which a natural-language input steers the pretrained language model to solve the task on its own, with no additional classification head.</p>
<p>Prompts are interesting because they let the user give information to the model in a way that is very different from conventional ML supervision. In our NAACL 2021 <a href="https://arxiv.org/abs/2103.08493">paper</a> with <a href="http://rush-nlp.com/">Sasha Rush</a>, we investigate prompt-based fine-tuning as an alternative to the current standard of supervised fine-tuning; our experiments show that prompting often outperforms the standard approach, making it a promising method. In analyzing this advantage, we argue that the prompt brings extra information to the model, which led us to quantify that advantage in data points, that is: <strong>how many data points is a prompt worth?</strong> </p>

<h2 id="prompting">提示法</h2>
<p>为了使预训练语言模型能够完成特定任务,当前的主流方法是用随机初始化的线性分类头替换原模型的最后一层:词预测层。然后使用有监督的任务数据通过反向传播来训练修改后的模型,主要学习这个新分类头的权重,同时也可以更新模型其他层的权重。我们将这种方法称为<em>分类头</em>法。</p>
<p>一种与之相竞争的方法是*提示法*:这类方法主要尝试使用原语言模型来预测目标类相应的单词来“回答”分类问题,而不是像传统方法那样“预测”类标签。这使得我们可以直接使用语言模型本身来执行分类任务。在这里,*提示*就是精心设计的、用于生成所需的答案文本的输入文本。</p>
<p id="footnote1back">这听起来可能很抽象,但这其实恰恰就是人类在实际生活中进行文本推理时所使用的非常自然的方法:例如,学校练习往往以一个文本输入(例如,一篇关于火星的文章)加上一个问题(&quot;火星上有生命吗?&quot;)的形式呈现,并期望你提供一个自然语言的答案(&quot;否&quot;<a href="#footnote1"><sup>1</sup></a>),该答案其实就可以映射到分类任务的某个类别(这里,&quot;否&quot;对应<code>假</code>,&quot;是&quot;对应<code>真</code>,本例就是个二分类问题)。在这种范式中,就像做语法练习一样,我们把特定于任务的数据输入给模型,而模型就像学生一样,需要以固定的方式进行填空。提示法希望能显式利用语言模型中包含的预训练信息,而不是仅以将其隐含表征馈送给线性分类头的方式隐式利用这些信息。</p>
<p>以下是 <a href="https://arxiv.org/abs/1905.00537">SuperGLUE</a> 中的 <a href="https://arxiv.org/abs/1905.10044">BoolQ</a> 任务的示例,其题型为判断题,每条数据包括一个文本 <span style="color: #0c593d">passage</span> 及其对应的问题 <span style="color: #031154">question</span> ,其答案为布尔值,要么为真,要么为假。每条数据可以和 <span style="color: #910713">**模板(pattern)**</span> 一起组装成一个文本序列,该序列只有一个需预测的 <span style="color: #ba9004"><strong>掩码词</strong></span>。预测出该掩码词后,预测词会被一个预设的 <em>言语器(verbalizer)</em> 转换为类,也就是说<em>言语器</em>负责输出词与类别之间的映射:比较该词被映射为<em>是</em>和<em>否</em>的概率,如果<em>是</em>的概率高,则最终预测为<code>真</code>,反之则为<code>假</code>。
</p>
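<p>To make this concrete, here is a minimal sketch of this setup with the <code>transformers</code> library. The pattern wording and the verbalizer words below are illustrative choices, not the exact ones used in the paper:</p>
<pre><code># Minimal sketch: prompted BoolQ-style classification with a masked LM.
# Pattern wording and verbalizer words are illustrative, not the paper's.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

passage = "In formal competitions, however, check is rarely announced."
question = "do you always have to say check in chess?"

# Pattern: assemble passage and question into one sequence with a single mask.
text = f"{passage} Question: {question} Answer: {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

# Verbalizer: compare the predicted probabilities of " no" and " yes".
no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
with torch.no_grad():
    logits = model(**inputs).logits
prediction = bool(logits[0, mask_pos, [no_id, yes_id]].argmax())  # True / False
</code></pre>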
"""

text2 = """<h2 id="fine-tuning">微调</h2>
<p>这样,我们就把通用语言模型转变成了针对特定任务的分类器。这种基于提示的语言模型分类器的用法很多: </p>
<ul>
<li>The preserved language modeling functionality from the pre-trained model allows them to perform without additional data, as opposed to linear classifier <em>heads</em> that are initialized from scratch and always start at random performance. A variety of papers have used this for <a href="https://arxiv.org/abs/1912.10165">zero-shot classification.</a>  </li>
<li>In order to incorporate supervised task data, they can use backpropagation with the usual language modeling cross-entropy loss objective: the verbalizer token associated with the correct class then serves as the correct token prediction. This is a component of <a href="https://arxiv.org/abs/2001.07676">PET</a>, and is the objective used by <a href="https://arxiv.org/abs/1910.10683">T5</a> - although T5 uses prefixes to indicate the task rather than describing it with a natural-language prompt.  </li>
<li>They can also use <em>priming</em>, where the sequence that needs to be filled in is prefixed with a list of correctly-filled examples. No backpropagation is used, and the weights of the language model are never modified: instead, it can attend to correct examples at inference time. This is the method used by <a href="https://arxiv.org/abs/2005.14165">GPT-3</a>.  </li>
<li>Finally, PET uses prompt models to pseudo-label unlabeled data that is then fed to a linear head model.  </li>
</ul>
<p>In this paper, our goal is to present the fairest comparison possible with head models, so we fine-tune with backpropagation.</p>
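<p>As a sketch of that objective (illustrative code, not the paper's exact implementation), a single fine-tuning step for the prompt model is ordinary language modeling cross-entropy at the masked position, with the correct class's verbalizer token as the target:</p>
<pre><code># Sketch: one backpropagation step for prompt-based fine-tuning.
# The target is simply the verbalizer token of the gold class.
import torch
import torch.nn.functional as F

def prompt_fine_tuning_step(model, inputs, mask_pos, verbalizer_ids, label):
    # verbalizer_ids: e.g. [no_id, yes_id]; label: 0 (False) or 1 (True).
    logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary
    target = torch.tensor([verbalizer_ids[label]])
    loss = F.cross_entropy(logits.unsqueeze(0), target)
    loss.backward()  # gradients flow through the whole language model
    return loss.item()
</code></pre>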
<h2 id="how-many-data-points-is-a-prompt-worth-">How many data points is a prompt worth?</h2>
<p>As we have seen, both heads and prompting can be used in a task-specific supervised setting. The core difference is that the prompted model is given a specific sentence that roughly describes the task in addition to supervised examples. In some sense, this sentence is supervision, as it tells the model about the task, but it is qualitatively a very different form of supervision than is standard in ML. How should we think about this supervision? How do we quantify how “zero-shot” this setup really is?  </p>
<p>We do this by comparing the <em>head</em> and <em>prompt</em> setups on the SuperGLUE tasks and MNLI. For each task, we extract subsets of the dataset of growing size, and repeat fine-tuning on <a href="https://arxiv.org/abs/1907.11692"><code>RoBERTa-large</code></a> with both methods on every subset, keeping everything else the same. For fairness, we tune the hyperparameters on the head baseline until they&#39;ve attained the level of performance of the BERT++ baseline from the SuperGLUE leaderboard, and keep them the same for the <em>prompt</em> model. </p>
<p id="footnote2back">The curves of final performance (on each task&#39;s metric) vs dataset size are plotted below for each task <a href="#footnote2"><sup>2</sup></a>. They allow us to contrast the amount of data required to attain a certain level of performance with both setups on a given task. We call this difference the <em>data advantage</em> of a training setup over the other at that level of performance. We call the range of performance that has been attained by both models the <em>comparison window</em>. By integrating over it we get the <em>average data advantage</em> of a method over the other on the task. Graphically, that is simply the area between the curves, divided by the height of the comparison window. <a href="#footnote3"><sup>3</sup></a></p>
"""

text3 = """<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
  border-collapse: collapse;
}
.styled-table {
    margin-left: auto;
    margin-right: auto;
    border-collapse: collapse;
    font-size: 1em;
    font-family: sans-serif;
    min-width: 400px;
    box-shadow: 0 0 20px rgba(0, 0, 0, 0.15);
}
.styled-table thead tr {
    background-color: #ffebcd;
    color: #000000;
    text-align: left;
}
.styled-table th,
.styled-table td {
    padding: 6px 8px;
    font-size: 13px;
}
.styled-table tbody tr {
    border-bottom: 1px solid #dddddd;
}

.styled-table tbody tr:nth-of-type(even) {
    background-color: #f3f3f3;
}

.styled-table tbody tr:last-of-type {
    border-bottom: 2px solid #29004a;
}


</style>
</head>
<body>
<p id="footnote4back">Here&#39;s a recapitulative table of the average data advantage of the prompt model over the head model per task, with error bounds obtained by a bootstrapping approach where we hold out one of the 4 head runs and 4 prompt runs (16 combinations total for every data size), and compute the standard deviation of those outcomes. Results are very different from task to task; they even vary for the same task on different dataset, for example for MNLI and RTE, both entailment tasks. However, on every task but WiC <a href="#footnote4"><sup>4</sup></a>, the prompt method has a significant edge. <strong>The additional information provided by the prompt is consistently equivalent to hundreds of data points</strong>.  </p>
<table id="footnote5back" class="styled-table">
<thead>
<tr>
<th></th>
<th><a href="https://arxiv.org/abs/1704.05426">MNLI</a></th>
<th><a href="https://arxiv.org/abs/1905.10044">BoolQ</a></th>
<th><a href="https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601">CB</a></th>
<th><a href="https://people.ict.usc.edu/~gordon/publications/AAAI-SPRING11A.PDF">COPA</a></th>
<th><a href="https://www.aclweb.org/anthology/N18-1023/">MultiRC</a><sup><a href="#footnote5">5</a></sup></th>
<th><a href="https://link.springer.com/chapter/10.1007/978-94-024-0881-2_42">RTE</a></th>
<th><a href="https://arxiv.org/abs/1808.09121">WiC</a></th>
<th><a href="https://arxiv.org/abs/1808.09121">WSC</a></th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt vs Head</td>
<td>3506±536</td>
<td>752±46</td>
<td>90±2</td>
<td>288±242</td>
<td>384±378</td>
<td>282±34</td>
<td>-424±74</td>
<td>281±137</td>
</tr>
</tbody>
</table>
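<p>The error bounds above come from the hold-one-out procedure described before the table; a sketch of that computation (with a placeholder advantage function and random placeholder data) looks like this:</p>
<pre><code># Sketch: hold-one-out error bounds over the 4 head and 4 prompt runs.
import itertools
import numpy as np

head_runs = np.random.rand(4, 10)    # placeholder: 4 runs x 10 data sizes
prompt_runs = np.random.rand(4, 10)  # placeholder: 4 runs x 10 data sizes

def advantage(head_subset, prompt_subset):
    # Placeholder for the average-data-advantage computation sketched above.
    return prompt_subset.mean() - head_subset.mean()

advantages = [
    advantage(np.delete(head_runs, i, axis=0), np.delete(prompt_runs, j, axis=0))
    for i, j in itertools.product(range(4), range(4))
]
error_bound = np.std(advantages)  # standard deviation over the 16 outcomes
</code></pre>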
<h2 id="patterns-and-verbalizers">Patterns and verbalizers</h2>
<h4 id="control-verbalizers">Control verbalizers</h4>
<p>Prompting has for now mostly been used as a tool for zero-shot classification, which is a natural use case. However, zero-shot is usually tricky and requires perfectly aligning the prompt and verbalizer. We have already shown that prompting could be applied more generally, including in the full-data regime. In order to contrast the zero-shot and adaptive natures of prompts, we consider a <em>null verbalizer</em>, a control with a verbalizer that is completely decorrelated from the task. For tasks that only require filling in one token (thus excluding the more free-form COPA and WSC), we replace the verbalizers, for example, &quot;yes&quot;, &quot;no&quot;, &quot;maybe&quot;, &quot;right&quot; or &quot;wrong&quot;,  with random first names. This makes the model unusable without training data, much like a head model. We plot the corresponding curves and perform the same advantage analysis below:</p>
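<p>Building such a control only changes the word-to-class mapping; the pattern and the training procedure stay the same (the names below are arbitrary placeholders):</p>
<pre><code># Sketch: a null verbalizer keeps the pattern but decorrelates the target
# words from the task, e.g. by replacing them with random first names.
informative_verbalizer = {False: " no", True: " yes"}
null_verbalizer = {False: " Mary", True: " John"}  # arbitrary placeholder names
</code></pre>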
</body>
</html>
"""

text4 = """<table id="footnote6back" class="styled-table">
<thead>
<tr>
<th></th>
<th>MNLI</th>
<th>BoolQ</th>
<th>CB</th>
<th>MultiRC<a href="#footnote5"><sup>6</sup></a></th>
<th>RTE</th>
<th>WiC</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt vs Head</td>
<td>3506±536</td>
<td>752±46</td>
<td>90±2</td>
<td>384±378</td>
<td>282±34</td>
<td>-424±74</td>
</tr>
<tr>
<td>Prompt vs Null</td>
<td>150±252</td>
<td>299±81</td>
<td>78±2</td>
<td>74±56</td>
<td>404±68</td>
<td>-354±166</td>
</tr>
<tr>
<td>Null vs Head</td>
<td>3355±612</td>
<td>453±90</td>
<td>12±1</td>
<td>309±320</td>
<td>-122±62</td>
<td>-70±160</td>
</tr>
</tbody>
</table>
<p>Results are noisier than in the straight prompt-vs-head comparison; however, we find that even with a null verbalizer, the language model is still able to adapt to the task: it generally catches up with the properly-prompted model after only a few data points, and performs on par with or better than the head model. This shows that the inductive bias of the prompt patterns is beneficial even without an informative verbalizer.  </p>
<h4 id="influence-of-the-pattern-choice">Influence of the pattern choice</h4>
<p>Another choice that can make or break zero-shot classification is that of the pattern, and we investigate whether that still holds in our setting. In all of our experiments, we have re-used the pattern choices from PET - two or three quite different formulations per task - and repeated all of our prompt experiments with every pattern available on the task. We plot the median, maximum and minimum performance over the 4 runs for each pattern below; they show that the choice of prompt does not generally have a significant influence, with only the few-shot settings of BoolQ and WiC seeing a pattern consistently above the others.  </p>
"""

text5 = """<h2 id="mot-de-la-fin">Mot de la fin</h2>
<p>In this work, we investigate alternative methods of fine-tuning based on natural-language prompts, which aim to use the language modeling ability of pretrained models explicitly through word predictions, instead of implicitly through linear classifiers built on the model&#39;s internal representations. We isolate the problem of fine-tuning prompt-based classifier language models with backpropagation, and find that they generally outperform standard fine-tuned linear classifiers. We estimate this advantage in terms of data points to measure the additional information provided by the human via the prompt, and find that <strong>writing a prompt is consistently worth hundreds of data points</strong>. Furthermore, this advantage holds even with non-informative target tokens and is fairly robust to the choice of prompt. </p>
<p>For practitioners, we believe that prompt-based fine-tuning should become a standard tool: especially for small- and medium-size task-specific datasets, designing a prompt yourself is a small effort for a sizable data advantage. For researchers, we believe that a lot of questions remain unexplored in this space: Why is the same prompt worth 3500 MNLI data points but only 282 RTE data points? How are prompts related to standard ML supervision? Do they react differently to adversarial or out-of-domain examples, since they have some zero-shot behaviour?</p>
<p id="footnote1"><sup><a href="#footnote1back">1</a></sup>: Or at least not that we know of.</p>
<p id="footnote2"><sup><a href="#footnote2back">2</a></sup>: A sharp-eyed reader will have noticed that all those curves are monotonous. We&#39;ve performed 4 runs for every experiment (i.e. every data size of every task for head and prompt models). For clarity, and because fine-tuning can sometimes fail for both methods, resulting in negative outliers, we report for every data size the maximum performance that has been attained at this data size or smaller, which we call the <em>accumulated maximum</em> aggregate. This does not have a big impact on the reported data advantage besides reducing variance, and the graphical interpretation would still hold even with non-monotonous curves. </p>
<p id="footnote3"><sup><a href="#footnote2back">3</a></sup>: We treat each metric linearly to calculate advantage; alternatively, we could re-parameterize the y axis for each task. This choice does not have a consistent effect for or against prompting. For example, emphasizing gains close to convergence increases prompting advantage on CB and MNLI but decreases it on COPA or BoolQ. </p>
<p id="footnote4"><sup><a href="#footnote4back">4</a></sup>: where, interestingly, PET had already found prompting to be ineffective</p>
<p id="footnote5"><sup><a href="#footnote5back">5</a> <a href="#footnote6back">6</a></sup>: The comparison window of MultiRC is too small as the head baseline fails to learn beyond majority class; we use the full region for a lower-bound result.</p>
"""

initial_passage = "In informal games, it is customary to announce 'check' when making a move that puts the opponent's king in check. In formal competitions, however, check is rarely announced."

initial_question = "do you always have to say check in chess?"