autoprogrammer committed
Commit 94724ad · verified · 1 Parent(s): 19b1d66

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +9 -195
  2. __init__.py +9 -0
  3. config.json +2 -2
  4. configuration_custom.py +27 -0
  5. modeling_custom.py +207 -0
README.md CHANGED
@@ -1,199 +1,13 @@
- ---
- library_name: transformers
- tags: []
- ---
-
- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
+ # DenseBackwardOLMoE
+
+ A custom OLMoE model that replaces the original MoE module with DenseBackwardOlmoeSparseMoeBlock to enable a dense backward pass.
+
+ ## Usage
+
+ ```python
+ from transformers import AutoConfig, AutoModelForCausalLM
+
+ # Load the model with trust_remote_code=True
+ config = AutoConfig.from_pretrained("autoprogrammer/olmoe_densebackward", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("autoprogrammer/olmoe_densebackward", config=config, trust_remote_code=True)
+ ```
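After loading, a short generation run is a reasonable smoke test. A minimal sketch, assuming the tokenizer of the allenai/OLMoE-1B-7B-0924 base model is compatible with this checkpoint (the README does not say which tokenizer ships with the repo):

```python
from transformers import AutoTokenizer

# Assumption: the base model's tokenizer works with this custom checkpoint.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
inputs = tokenizer("Mixture-of-experts models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```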
 
__init__.py ADDED
@@ -0,0 +1,9 @@
+ # Export the custom configuration and model classes
+ from .configuration_custom import DenseBackwardOLMoEConfig
+ from .modeling_custom import DenseBackwardOLMoEForCausalLM, DenseBackwardOlmoeSparseMoeBlock
+
+ __all__ = [
+     "DenseBackwardOLMoEConfig",
+     "DenseBackwardOLMoEForCausalLM",
+     "DenseBackwardOlmoeSparseMoeBlock"
+ ]
```
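The relative imports assume these files are used as a package. A hypothetical local-use sketch, taking the directory name `my_custom_olmoe` from the path comments inside the other two files:

```python
# Assumed local layout (taken from the "# my_custom_olmoe/..." path comments):
#   my_custom_olmoe/__init__.py
#   my_custom_olmoe/configuration_custom.py
#   my_custom_olmoe/modeling_custom.py
from my_custom_olmoe import DenseBackwardOLMoEConfig, DenseBackwardOLMoEForCausalLM

config = DenseBackwardOLMoEConfig()
model = DenseBackwardOLMoEForCausalLM(config)
```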
config.json CHANGED
@@ -1,7 +1,7 @@
  {
- "_name_or_path": "./saved_densebackwardolmoe",
+ "_name_or_path": "allenai/OLMoE-1B-7B-0924",
  "architectures": [
- "OlmoeForCausalLM"
+ "DenseBackwardOLMoEForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
configuration_custom.py ADDED
@@ -0,0 +1,27 @@
+ # my_custom_olmoe/configuration_custom.py
+
+ # Note: depending on your transformers version, the import path of the official OLMoE config may need adjusting.
+ from transformers.models.olmoe.configuration_olmoe import OlmoeConfig
+
+ class DenseBackwardOLMoEConfig(OlmoeConfig):
+     model_type = "DenseBackward_olmoe"  # Override model_type so the custom class can be identified later.
+
+     # auto_map lets the AutoClass API resolve these custom classes with trust_remote_code=True.
+     auto_map = {
+         "AutoConfig": "configuration_custom.DenseBackwardOLMoEConfig",
+         "AutoModelForCausalLM": "modeling_custom.DenseBackwardOLMoEForCausalLM"
+     }
+
+     def __init__(self, model_marker="DenseBackward_olmoe_marker", **kwargs):
+         super().__init__(**kwargs)
+         self.model_marker = model_marker
+         # Note: these two assignments run after super().__init__, so they override
+         # any intermediate_size / torch_dtype values passed in via kwargs.
+         self.intermediate_size = 1024
+         self.torch_dtype = "bfloat16"
+
+ # Smoke test
+ def main():
+     config = DenseBackwardOLMoEConfig(model_marker="DenseBackward_olmoe_marker",
+                                       torch_dtype="bfloat16")
+     print(config)
+
+ if __name__ == "__main__":
+     main()
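For local development without `trust_remote_code`, the custom classes can also be registered with the Auto API directly. A minimal sketch, assuming a transformers version that provides the `register` hooks on `AutoConfig` and `AutoModelForCausalLM`:

```python
from transformers import AutoConfig, AutoModelForCausalLM

from configuration_custom import DenseBackwardOLMoEConfig
from modeling_custom import DenseBackwardOLMoEForCausalLM

# Map the custom model_type string to the config class,
# then map the config class to the model class.
AutoConfig.register("DenseBackward_olmoe", DenseBackwardOLMoEConfig)
AutoModelForCausalLM.register(DenseBackwardOLMoEConfig, DenseBackwardOLMoEForCausalLM)
```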
modeling_custom.py ADDED
@@ -0,0 +1,207 @@
+ # my_custom_olmoe/modeling_custom.py
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ # Import the official implementation (adjust the import path to your transformers version).
+ from transformers.models.olmoe.modeling_olmoe import OlmoeForCausalLM, OlmoeSparseMoeBlock, OlmoeMLP
+ # When vendored as a package, this may need to be a relative import (.configuration_custom).
+ from configuration_custom import DenseBackwardOLMoEConfig
+
+
+ class DenseBackwardOlmoeSparseMoeBlock(OlmoeSparseMoeBlock):
+     """
+     Subclass of the official OlmoeSparseMoeBlock that implements a dense backward pass:
+     the forward output is identical to the official sparse computation,
+     but during backpropagation a straight-through estimator routes gradients
+     through a dense estimate, obtained by running every expert on every token
+     and weighting the results with the full routing distribution.
+
+     Input:
+         hidden_states: Tensor of shape (batch_size, sequence_length, hidden_dim)
+     Output:
+         final_output: Tensor of shape (batch_size, sequence_length, hidden_dim)
+         router_logits: Tensor of shape (batch_size * sequence_length, num_experts)
+     """
+     def forward(self, hidden_states: torch.Tensor):
+         """
+         Implementation outline:
+         1. Flatten the input to (B*seq_len, hidden_dim), compute router_logits via
+            self.gate, and softmax them into the full routing distribution.
+         2. Take the top-k of the routing weights to get routing_weights_topk and
+            selected_experts; normalize the top-k probabilities if configured.
+         3. Sparse part: compute each token's output only for its top-k experts and
+            accumulate into sparse_output (the original computation, while also
+            recording the activated experts' actual outputs).
+         4. Dense estimate: compute every expert's output for every token
+            (all_expert_outputs), then call estimate_dense_output per token to
+            obtain dense_outputs.
+         5. Straight-through trick: the forward value comes from sparse_output,
+            but the gradient flows through dense_outputs.
+         6. Reshape to (batch_size, sequence_length, hidden_dim) and return
+            final_output together with router_logits.
+         """
+         batch_size, seq_length, hidden_dim = hidden_states.shape
+         flat_hidden = hidden_states.view(-1, hidden_dim)  # (B*seq_len, hidden_dim)
+
+         # Router logits and full routing distribution over all experts.
+         router_logits = self.gate(flat_hidden)  # (B*seq_len, num_experts)
+         routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)  # (B*seq_len, num_experts)
+
+         # Top-k selection.
+         routing_weights_topk, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
+         if self.norm_topk_prob:
+             routing_weights_topk = routing_weights_topk / routing_weights_topk.sum(dim=-1, keepdim=True)
+         routing_weights_topk = routing_weights_topk.to(flat_hidden.dtype)
+
+         # ---------- Sparse computation ----------
+         sparse_output = torch.zeros((flat_hidden.size(0), hidden_dim), dtype=flat_hidden.dtype, device=flat_hidden.device)
+         # Record each token's actual outputs for its activated experts.
+         activated_outputs = [{} for _ in range(flat_hidden.size(0))]
+         # One-hot encode the top-k experts: (B*seq_len, top_k, num_experts) -> (num_experts, top_k, B*seq_len)
+         expert_mask = F.one_hot(selected_experts, num_classes=self.num_experts)
+         expert_mask = expert_mask.permute(2, 1, 0)
+
+         for expert_idx in range(self.num_experts):
+             expert_layer = self.experts[expert_idx]
+             idx, top_x = torch.where(expert_mask[expert_idx])
+             if top_x.numel() > 0:
+                 current_state = flat_hidden[top_x]                       # (n, hidden_dim)
+                 current_output = expert_layer(current_state)             # (n, hidden_dim)
+                 weight = routing_weights_topk[top_x, idx].unsqueeze(-1)  # (n, 1)
+                 weighted_output = current_output * weight
+                 sparse_output.index_add_(0, top_x, weighted_output.to(flat_hidden.dtype))
+                 # Save this token's actual output for this expert.
+                 for pos, token_idx in enumerate(top_x.tolist()):
+                     activated_outputs[token_idx][expert_idx] = current_output[pos]
+         # ---------- End of sparse computation ----------
+
+         # ---------- Dense estimation ----------
+         # Every expert applied to every token: (B*seq_len, num_experts, hidden_dim)
+         all_expert_outputs = torch.stack([expert(flat_hidden) for expert in self.experts], dim=1)
+         # Per-token list of activated experts (length B*seq_len).
+         all_routing = selected_experts.tolist()
+
+         dense_outputs = []
+         for i in range(flat_hidden.size(0)):
+             dense_est = self.estimate_dense_output(
+                 token_idx=i,
+                 activated=all_routing[i],                # experts activated for this token, e.g. [a, b]
+                 gate_prob=routing_weights[i],            # this token's full routing distribution, (num_experts,)
+                 activated_outputs=activated_outputs[i],  # this token's actual activated-expert outputs
+                 all_routing=all_routing,                 # activated-expert lists for the whole batch
+                 all_expert_outputs=all_expert_outputs    # (B*seq_len, num_experts, hidden_dim)
+             )
+             dense_outputs.append(dense_est.unsqueeze(0))
+         # Cast back to the activations' dtype (gate_prob is float32, which promotes the weighted sum).
+         dense_outputs = torch.cat(dense_outputs, dim=0).to(flat_hidden.dtype)  # (B*seq_len, hidden_dim)
+         # ---------- End of dense estimation ----------
+
+         # Straight-through estimator: the forward value is the sparse result,
+         # but backpropagation sees only the dense estimate.
+         final_flat = sparse_output.detach() + (dense_outputs - dense_outputs.detach())
+         final_output = final_flat.view(batch_size, seq_length, hidden_dim)
+         return final_output, router_logits
+
+     def estimate_dense_output(self, token_idx, activated, gate_prob, activated_outputs, all_routing, all_expert_outputs):
+         """
+         Estimate the dense output for one token from information in the mini-batch.
+         Args:
+             token_idx: index of the current token (scalar)
+             activated: experts activated for this token, e.g. [1, 3]
+             gate_prob: this token's routing distribution, shape (num_experts,)
+             activated_outputs: dict mapping each activated expert to its actual output, each (hidden_dim,)
+             all_routing: list of activated-expert lists, one per token (length N)
+             all_expert_outputs: Tensor of shape (N, num_experts, hidden_dim)
+         Returns:
+             estimated_dense: Tensor of shape (hidden_dim,)
+         """
+         num_experts = gate_prob.size(0)
+         dense_parts = {}
+         # Activated experts: use their actual outputs directly.
+         for idx in activated:
+             dense_parts[idx] = activated_outputs[idx]
+         # Non-activated experts: estimate from other tokens in the mini-batch that
+         # did activate expert i and share at least one activated expert with this token.
+         non_activated = [i for i in range(num_experts) if i not in activated]
+         for i in non_activated:
+             indices = []
+             for idx, r_dec in enumerate(all_routing):
+                 if (i in r_dec) and (len(set(r_dec) & set(activated)) > 0):
+                     indices.append(idx)
+             if indices:
+                 selected_outputs = all_expert_outputs[indices, i, :]  # (n, hidden_dim)
+                 estimated = selected_outputs.mean(dim=0)
+             else:
+                 # Fall back to the mean over all tokens for this expert.
+                 estimated = all_expert_outputs[:, i, :].mean(dim=0)
+             dense_parts[i] = estimated
+         # Weighted sum of the per-expert outputs by gate_prob.
+         estimated_dense = 0
+         for i in range(num_experts):
+             estimated_dense += gate_prob[i] * dense_parts[i]
+         return estimated_dense
+
+
+ class DenseBackwardOLMoEForCausalLM(OlmoeForCausalLM):
+     """
+     Custom OLMoE causal LM that swaps the original MoE module for
+     DenseBackwardOlmoeSparseMoeBlock to enable the dense backward pass.
+
+     Config class: DenseBackwardOLMoEConfig
+     """
+     config_class = DenseBackwardOLMoEConfig
+     base_model_prefix = "olmoe"
+
+     def __init__(self, config):
+         # Initialize via the parent class first.
+         super().__init__(config)
+
+         # Rather than reassigning self, load the pretrained model and copy its state in.
+         pretrained_model = OlmoeForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924", torch_dtype=torch.bfloat16)
+
+         # Copy the pretrained model's state into this model.
+         self.config = pretrained_model.config
+         self.model = pretrained_model.model
+         self.vocab_size = pretrained_model.vocab_size
+         self.router_aux_loss_coef = pretrained_model.router_aux_loss_coef
+         self.num_experts = pretrained_model.num_experts
+         self.lm_head = pretrained_model.lm_head
+
+         # Walk every decoder layer and replace each OlmoeSparseMoeBlock with the
+         # DenseBackward version. This assumes the official model keeps its decoder
+         # layers in self.model.layers and that each layer's mlp is the MoE block.
+         for layer in self.model.layers:
+             if hasattr(layer.mlp, "gate"):
+                 orig_block = layer.mlp
+                 # Create the new block, then share the original block's submodules.
+                 new_block = DenseBackwardOlmoeSparseMoeBlock(config)
+                 new_block.gate = orig_block.gate
+                 new_block.experts = orig_block.experts
+                 new_block.num_experts = orig_block.num_experts
+                 new_block.top_k = orig_block.top_k
+                 new_block.norm_topk_prob = orig_block.norm_topk_prob
+                 layer.mlp = new_block
+                 print(type(layer.mlp))
+         # Sample one weight before post_init() to verify it is not re-initialized...
+         test_param = self.model.layers[0].mlp.experts[0].up_proj.weight.data[0, 0].item()
+         print(f"sample weight (before): {test_param}")
+         self.post_init()
+         # ...and check it again afterwards.
+         test_param_after = self.model.layers[0].mlp.experts[0].up_proj.weight.data[0, 0].item()
+         print(f"sample weight (after): {test_param_after}")
+
+ def main():
+     config = DenseBackwardOLMoEConfig(  # official model parameters
+         model_marker="DenseBackward_olmoe_marker",
+         torch_dtype="bfloat16"
+     )
+     # Instantiate the custom model and inspect the replaced modules.
+     model = DenseBackwardOLMoEForCausalLM(config)
+     print(type(model))
+     print(type(model.model))
+     print(type(model.model.layers[0]))
+     print(type(model.model.layers[0].mlp))
+     print(type(model.model.layers[0].mlp.experts))
+
+ if __name__ == "__main__":
+     main()
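The heart of the block is the straight-through line `final_flat = sparse_output.detach() + (dense_outputs - dense_outputs.detach())`. A minimal, self-contained sketch of that trick, using toy tensors rather than the model's real outputs:

```python
import torch

# Pretend sparse MoE output (no gradient path) and dense estimate (with gradient path).
sparse = torch.tensor([1.0, 2.0])
dense = torch.tensor([1.1, 1.9], requires_grad=True)

# Forward value equals `sparse` exactly; backward flows only through `dense`.
out = sparse.detach() + (dense - dense.detach())
print(out)             # values [1., 2.], carrying a grad_fn from the dense path
out.sum().backward()
print(dense.grad)      # tensor([1., 1.]): gradients arrive via the dense estimate
```

Note that the dense estimate runs every expert on every token and then loops over tokens in Python (one estimate_dense_output call each), so this block is substantially more expensive per step than the official sparse block; that cost is the price of the dense gradient signal.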