Invalid shape `[1, 8192]`
Great work!
I think some of the weights have invalid sizes? I get the following error when loading the model with transformers:

```
ValueError: Trying to set a tensor of shape torch.Size([1, 8192]) in "weight" (which has shape torch.Size([8192])), this looks incorrect.
```
I resized them and put the result on tmfi-us/Progenitor-V5-Final-LLaMa-70B.
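The fix itself is simple; here is a minimal sketch of the kind of reshape I applied (the directory path is an assumption, not my exact script):

```python
# Sketch: flatten any [1, 8192] tensor back to [8192] in every safetensors shard.
import os
from safetensors.torch import load_file, save_file

model_dir = "Progenitor-V5-Final-LLaMa-70B"  # local model directory (assumed)

for fname in sorted(os.listdir(model_dir)):
    if not fname.endswith(".safetensors"):
        continue
    path = os.path.join(model_dir, fname)
    tensors = load_file(path)
    fixed = {
        key: t.view(8192) if list(t.shape) == [1, 8192] else t
        for key, t in tensors.items()
    }
    save_file(fixed, path, metadata={"format": "pt"})
```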
Oh wow, thanks for this. I have no clue what the reason is, but despite the error I was still able to use and even quant this model?! Testing the fixed version against this one also gave me very different outputs. If anyone understands this, I would love to hear about it!
@Tarek07 hmm, what quant tool did you use?
I tried the following tools against this one, and all of them failed due to the shape error I mentioned:
- lm-evaluation-harness
- vllm
- sglang
- AutoAWQ
- transformers (using `AutoModelForCausalLM`; a minimal repro follows this list)
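For reference, here is a minimal sketch of how I hit the transformers error (the repo ID is the original upload from this thread; `torch_dtype="auto"` is just my own choice):

```python
# Minimal repro sketch: loading the original upload raises the shape ValueError.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Tarek07/Progenitor-V5-Final-LLaMa-70B",  # original (pre-fix) upload
    torch_dtype="auto",
)
# -> ValueError: Trying to set a tensor of shape torch.Size([1, 8192]) in "weight" ...
```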
The following are lm-evaluation-harness results that I ran against V3.3 and V5 (the fixed one). Despite what you mentioned about the outputs being different, the evaluation results seem promising?
V3.3
Model | Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|---|
Progenitor-V3.3 | leaderboard | N/A | |||||||
Progenitor-V3.3 | - leaderboard_bbh | N/A | |||||||
Progenitor-V3.3 | - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.9520 | ± | 0.0135 |
Progenitor-V3.3 | - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.6364 | ± | 0.0353 |
Progenitor-V3.3 | - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.7080 | ± | 0.0288 |
Progenitor-V3.3 | - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.7240 | ± | 0.0283 |
Progenitor-V3.3 | - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.7880 | ± | 0.0259 |
Progenitor-V3.3 | - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.4480 | ± | 0.0315 |
Progenitor-V3.3 | - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.6080 | ± | 0.0309 |
Progenitor-V3.3 | - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.6080 | ± | 0.0309 |
Progenitor-V3.3 | - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.6200 | ± | 0.0308 |
Progenitor-V3.3 | - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.9440 | ± | 0.0146 |
Progenitor-V3.3 | - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.8640 | ± | 0.0217 |
Progenitor-V3.3 | - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.6440 | ± | 0.0303 |
Progenitor-V3.3 | - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.6440 | ± | 0.0303 |
Progenitor-V3.3 | - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.7192 | ± | 0.0373 |
Progenitor-V3.3 | - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.8400 | ± | 0.0232 |
Progenitor-V3.3 | - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.8760 | ± | 0.0209 |
Progenitor-V3.3 | - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.7120 | ± | 0.0287 |
Progenitor-V3.3 | - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.6966 | ± | 0.0346 |
Progenitor-V3.3 | - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.9400 | ± | 0.0151 |
Progenitor-V3.3 | - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 1.0000 | ± | 0 |
Progenitor-V3.3 | - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.2960 | ± | 0.0289 |
Progenitor-V3.3 | - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.3240 | ± | 0.0297 |
Progenitor-V3.3 | - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3560 | ± | 0.0303 |
Progenitor-V3.3 | - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.5960 | ± | 0.0311 |
Progenitor-V3.3 | - leaderboard_gpqa | N/A | |||||||
Progenitor-V3.3 | - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.3687 | ± | 0.0344 |
Progenitor-V3.3 | - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.4634 | ± | 0.0214 |
Progenitor-V3.3 | - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.4397 | ± | 0.0235 |
Progenitor-V3.3 | - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.8885 | ± | N/A |
Progenitor-V3.3 | | | none | 0 | inst_level_strict_acc | ↑ | 0.8705 | ± | N/A |
Progenitor-V3.3 | | | none | 0 | prompt_level_loose_acc | ↑ | 0.8373 | ± | 0.0159 |
Progenitor-V3.3 | | | none | 0 | prompt_level_strict_acc | ↑ | 0.8133 | ± | 0.0168 |
Progenitor-V3.3 | - leaderboard_math_hard | N/A | |||||||
Progenitor-V3.3 | - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.1954 | ± | 0.0227 |
Progenitor-V3.3 | - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.1951 | ± | 0.0359 |
Progenitor-V3.3 | - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.0303 | ± | 0.0150 |
Progenitor-V3.3 | - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0393 | ± | 0.0116 |
Progenitor-V3.3 | - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.1818 | ± | 0.0312 |
Progenitor-V3.3 | - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.2746 | ± | 0.0322 |
Progenitor-V3.3 | - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0222 | ± | 0.0127 |
Progenitor-V3.3 | - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.5505 | ± | 0.0045 |
Progenitor-V3.3 | - leaderboard_musr | N/A | |||||||
Progenitor-V3.3 | - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5800 | ± | 0.0313 |
Progenitor-V3.3 | - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.2773 | ± | 0.0280 |
Progenitor-V3.3 | - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.5400 | ± | 0.0316 |
V5
Model | Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|---|
Progenitor-V5 | leaderboard | N/A | |||||||
Progenitor-V5 | - leaderboard_bbh | N/A | |||||||
Progenitor-V5 | - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | ↑ | 0.9520 | ± | 0.0135 |
Progenitor-V5 | - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | ↑ | 0.6471 | ± | 0.0350 |
Progenitor-V5 | - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | ↑ | 0.7040 | ± | 0.0289 |
Progenitor-V5 | - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | ↑ | 0.7200 | ± | 0.0285 |
Progenitor-V5 | - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | ↑ | 0.7880 | ± | 0.0259 |
Progenitor-V5 | - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | ↑ | 0.4560 | ± | 0.0316 |
Progenitor-V5 | - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | ↑ | 0.6120 | ± | 0.0309 |
Progenitor-V5 | - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.6200 | ± | 0.0308 |
Progenitor-V5 | - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.6240 | ± | 0.0307 |
Progenitor-V5 | - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.9480 | ± | 0.0141 |
Progenitor-V5 | - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | ↑ | 0.8680 | ± | 0.0215 |
Progenitor-V5 | - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | ↑ | 0.6400 | ± | 0.0304 |
Progenitor-V5 | - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | ↑ | 0.6400 | ± | 0.0304 |
Progenitor-V5 | - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | ↑ | 0.7260 | ± | 0.0370 |
Progenitor-V5 | - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | ↑ | 0.8480 | ± | 0.0228 |
Progenitor-V5 | - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | ↑ | 0.8720 | ± | 0.0212 |
Progenitor-V5 | - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | ↑ | 0.6960 | ± | 0.0292 |
Progenitor-V5 | - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | ↑ | 0.7022 | ± | 0.0344 |
Progenitor-V5 | - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | ↑ | 0.9440 | ± | 0.0146 |
Progenitor-V5 | - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | ↑ | 1.0000 | ± | 0 |
Progenitor-V5 | - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | ↑ | 0.3080 | ± | 0.0293 |
Progenitor-V5 | - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | ↑ | 0.3360 | ± | 0.0299 |
Progenitor-V5 | - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | ↑ | 0.3520 | ± | 0.0303 |
Progenitor-V5 | - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | ↑ | 0.6160 | ± | 0.0308 |
Progenitor-V5 | - leaderboard_gpqa | N/A | |||||||
Progenitor-V5 | - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | ↑ | 0.3990 | ± | 0.0349 |
Progenitor-V5 | - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | ↑ | 0.4670 | ± | 0.0214 |
Progenitor-V5 | - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | ↑ | 0.4375 | ± | 0.0235 |
Progenitor-V5 | - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.8981 | ± | N/A |
Progenitor-V5 | | | none | 0 | inst_level_strict_acc | ↑ | 0.8729 | ± | N/A |
Progenitor-V5 | | | none | 0 | prompt_level_loose_acc | ↑ | 0.8503 | ± | 0.0154 |
Progenitor-V5 | | | none | 0 | prompt_level_strict_acc | ↑ | 0.8170 | ± | 0.0166 |
Progenitor-V5 | - leaderboard_math_hard | N/A | |||||||
Progenitor-V5 | - leaderboard_math_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.2020 | ± | 0.0229 |
Progenitor-V5 | - leaderboard_math_counting_and_prob_hard | 2 | none | 4 | exact_match | ↑ | 0.1626 | ± | 0.0334 |
Progenitor-V5 | - leaderboard_math_geometry_hard | 2 | none | 4 | exact_match | ↑ | 0.0303 | ± | 0.0150 |
Progenitor-V5 | - leaderboard_math_intermediate_algebra_hard | 2 | none | 4 | exact_match | ↑ | 0.0429 | ± | 0.0121 |
Progenitor-V5 | - leaderboard_math_num_theory_hard | 2 | none | 4 | exact_match | ↑ | 0.1883 | ± | 0.0316 |
Progenitor-V5 | - leaderboard_math_prealgebra_hard | 2 | none | 4 | exact_match | ↑ | 0.2591 | ± | 0.0316 |
Progenitor-V5 | - leaderboard_math_precalculus_hard | 2 | none | 4 | exact_match | ↑ | 0.0222 | ± | 0.0127 |
Progenitor-V5 | - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | ↑ | 0.5513 | ± | 0.0045 |
Progenitor-V5 | - leaderboard_musr | N/A | |||||||
Progenitor-V5 | - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | ↑ | 0.5600 | ± | 0.0315 |
Progenitor-V5 | - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | ↑ | 0.2852 | ± | 0.0283 |
Progenitor-V5 | - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | ↑ | 0.5440 | ± | 0.0316 |
@y-ryan So for the quant I used 'koboldcpp_tools_19dec': first the `convert_hf_to_gguf.py` script, then `quantize_gguf.exe` to Q8. (This is what I run most models with locally.)
Then, as for deployment, I used Friendli and the HF Inference Endpoints on Google Cloud, both of which worked and gave me good outputs. (I deployed your fixed version in the same way to run some test prompts I use.)
I can certainly see the .safetensors files show: `model.layers.0.input_layernorm.weight [1, 8192]`
when it should be: `model.layers.0.input_layernorm.weight [8192]`.
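(If anyone else wants to check their copy, here is a quick sketch for inspecting the shapes in a shard; the filename is hypothetical:)

```python
# Sketch: print layernorm weight shapes from one safetensors shard.
from safetensors import safe_open

with safe_open("model-00001-of-00030.safetensors", framework="pt") as f:  # filename hypothetical
    for key in f.keys():
        if "layernorm" in key:
            print(key, f.get_tensor(key).shape)
```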
I just have no idea why it works, and works well (from the test prompts at least)?
The GGUF Q8 quant also seemed to work well (I only did one RP to test).
**On some further testing with my GGUF quant, there are glaring logic issues. It is truly broken.**
@Tarek07 Checking out the old version of `convert_hf_to_gguf.py` from koboldcpp, I found the following:

```python
def prepare_tensors(self):
    ...
    for name, data_torch in chain(self.generate_extra_tensors(), self.get_tensors()):
        ...
        for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
            data = data_torch.squeeze().numpy()
            ...
```
The above code effectively converts `[1, 8192]` to `[8192]` by using `torch.squeeze()`.
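A one-line illustration of that conversion:

```python
import torch

t = torch.zeros(1, 8192)
print(t.squeeze().shape)  # torch.Size([8192]): the size-1 leading dim is dropped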
This code has since been updated (by this commit) to no longer use `torch.squeeze()`:

```diff
- data = data_torch.squeeze().numpy()
+ # TODO: why do we squeeze here?
+ # data = data_torch.squeeze().numpy()
+ data = data_torch.numpy()
```
If my assumption is correct, running the latest code will break. I used `torch.Tensor.view()` to reshape from `[1, 8192]` to `[8192]`. While I believe `torch.Tensor.view()` and `torch.squeeze()` will have basically the same effect in this case, I will try reshaping again using `torch.squeeze()` and re-upload the model to see if it makes any difference.
Aha! Good catch.
It is broken. On the GGUF quant I tried another, more complex character card, and it had glaring logic issues that other Progenitor models didn't have. So I am convinced something broke. It's weird, because when I deployed the models on the endpoints the outputs were stellar? Unless the endpoints somehow fix the error too? Regardless, I will redo this model just in case. I really appreciate you letting me know about the issue; I would never have seen it!
@Tarek07
I ran the following code:
```python
import os
import torch
from safetensors.torch import load_file

MODEL_DIR = "/root/.cache/huggingface/hub/models--Tarek07--Progenitor-V5-Final-LLaMa-70B/snapshots/2364d825a8ea4414d3a661b9cf2fa0f29af71756"

def check_shard(shard_path):
    data = load_file(shard_path)
    for key, tensor in data.items():
        if list(tensor.shape) == [1, 8192]:
            print(f"  Checking {key} in {os.path.basename(shard_path)} with size {tensor.shape}")
            viewed_tensor = tensor.view(8192)
            squeezed_tensor = tensor.squeeze()
            print(f"  Equality: {torch.equal(viewed_tensor, squeezed_tensor)}")

def main():
    # Look for .safetensors files in MODEL_DIR
    for filename in sorted(os.listdir(MODEL_DIR)):
        if filename.endswith(".safetensors"):
            shard_path = os.path.join(MODEL_DIR, filename)
            print(f"Processing: {shard_path}")
            check_shard(shard_path)

if __name__ == "__main__":
    main()
```
and confirmed that `tensor.view()` and `tensor.squeeze()` indeed returned identical tensors. Thus I'm not re-uploading the `tensor.squeeze()` version to my repo.
Glad my debugging was helpful, and thanks again for the great work! Please feel free to let me know any time if you need other help. I will be around to test the redone version when it's ready.
I will update the README on my repo to let others know that you will be redoing the model and that we are all looking forward to it!