|
In the next frame we have `Dropout`, which renormalizes the weights after zeroing some of the elements. This pushes the absolute max value above 64K, and we get an overflow (`inf`).
|
As you can see, it's the previous frames that we need to look into, since that's where the numbers started getting too large for fp16.
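For reference, the largest finite fp16 value is 65504, just under 64K. Here is a stdlib-only sketch of that limit, using `struct`'s half-precision format to emulate fp16 storage; the saturate-to-`inf` behavior in the `except` branch is our assumption, mirroring what happens on GPU hardware:

```python
import math
import struct

def as_fp16(x: float) -> float:
    """Round-trip x through IEEE-754 half precision (what fp16 stores)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # CPython raises here for values beyond the fp16 range;
        # on GPU hardware such values overflow to infinity instead.
        return math.copysign(math.inf, x)

print(as_fp16(65504.0))  # largest finite fp16 number: prints 65504.0
print(as_fp16(70000.0))  # beyond the fp16 range: prints inf
```

Any intermediate activation that crosses this 64K boundary becomes `inf`, and the `inf` then propagates through every subsequent computation.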
|
Let's match the report to the code from models/t5/modeling_t5.py: |
|
```python
class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states
```
|
Now it's easy to see the `dropout` call, and all the previous calls as well.
|
Since the detection happens in a forward hook, these reports are printed immediately after each `forward` returns.
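A hypothetical minimal sketch of that mechanism (not the library's actual implementation): register a forward hook on every submodule, and have it inspect each output the moment the corresponding `forward` returns. For portability this example forces the overflow in fp32 rather than fp16, but the detection logic is the same:

```python
import torch
from torch import nn

overflow_reports = []

def detect_overflow(module, inputs, output):
    # Fires immediately after module.forward returns, so the first
    # offending module is reported before any later ones.
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        overflow_reports.append(module.__class__.__name__)

model = nn.Sequential(nn.Linear(4, 4, bias=False), nn.ReLU())
for submodule in model.modules():
    submodule.register_forward_hook(detect_overflow)

# Force an overflow: 1e30 * 1e10 = 1e40, which exceeds fp32's max (~3.4e38).
with torch.no_grad():
    model[0].weight.fill_(1e30)
model(torch.full((1, 4), 1e10))

print(overflow_reports)  # the Linear is the first module reported
```

Because hooks fire in the order the forwards complete, reading the reports top to bottom walks you from the first module that produced a non-finite value through everything downstream of it.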
|
Going back to the full report: to act on it and fix the problem, we need to go a few frames up to where the numbers started to grow, and most likely switch to fp32 mode there, so that the numbers don't overflow when multiplied or summed up.
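To illustrate why the fp32 upcast helps, here is a stdlib-only sketch that contrasts the two modes, again emulating fp16 storage via `struct` (the saturate-to-`inf` fallback is our assumption, matching GPU behavior). Both operands fit comfortably in fp16, but their product does not:

```python
import math
import struct

def as_fp16(x: float) -> float:
    # Emulate fp16 storage; out-of-range values overflow to inf as on GPU.
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

a, b = 300.0, 400.0  # both representable in fp16

# Multiplying in fp16: the product 120000 exceeds ~64K and overflows.
fp16_product = as_fp16(as_fp16(a) * as_fp16(b))
print(fp16_product)  # inf

# Upcasting to fp32 (Python floats here) first: the product fits easily.
fp32_product = float(a) * float(b)
print(fp32_product)  # 120000.0
```

This is why the fix is applied a few frames up: once an intermediate value has overflowed to `inf`, no later operation can recover it, so the upcast must happen before the first overflowing multiplication or summation.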