grimjim posted an update 2 days ago
This recent paper points to an explanation for the unreasonable effectiveness of Frankenmerges: "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" (2502.05171)

Specifically, the duplication of layers in Frankenmerges serves a purpose similar to the recurrence in the paper's recurrent-depth architecture. Successful Frankenmerges that operate without additional fine-tuning are able to recover, or "heal," from any damage caused by the abrupt transitions between layer blocks. Replicated layer blocks that remain operational can then provide functional benefits grounded in latent reasoning. Frankenmerges can also produce hybrid reasoning by splicing together the latent reasoning of different models.
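
To make the connection concrete, here is a toy sketch (mine, not the paper's code) of what recurrent depth means: the same core block is reapplied to the latent state, so a duplicated layer block in a Frankenmerge behaves roughly like one extra pass through such a loop. The class, module, and parameter names below are illustrative assumptions, not taken from the paper.

```python
# Toy PyTorch sketch of the recurrent-depth idea (illustrative only):
# a shared "core" block is applied to the latent state a variable number
# of times, so extra loops buy extra latent reasoning at test time.
import torch
import torch.nn as nn

class RecurrentDepthToy(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.prelude = nn.Linear(d_model, d_model)   # map input into the latent space
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.coda = nn.Linear(d_model, d_model)      # map the latent state back out

    def forward(self, x, num_recurrences=4):
        h = self.prelude(x)
        for _ in range(num_recurrences):             # more recurrences = more test-time compute
            h = self.core(h)
        return self.coda(h)

x = torch.randn(1, 8, 256)
model = RecurrentDepthToy()
out_shallow = model(x, num_recurrences=2)
out_deep = model(x, num_recurrences=8)               # same weights, deeper effective stack
```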

Back in April 2024, I was able to duplicate a few layers in the Llama 3 8B model, turning it into a 9B model, without significantly harming benchmarks despite any transition damage.
grimjim/llama-3-experiment-v1-9B
My informal experimentation suggested that latent reasoning circuits could occupy contiguous stacks of 2-4 layers, though the result was highly sensitive to the choice of transition location between layers.
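
For reference, a minimal sketch of this kind of layer duplication, assuming a Hugging Face Llama-style checkpoint; the layer indices below are illustrative, not the exact split used in llama-3-experiment-v1-9B:

```python
# Minimal sketch: splice an independent copy of a contiguous block of
# decoder layers back into a Llama-style model. Indices are illustrative.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

layers = model.model.layers          # ModuleList of 32 decoder layers
start, end = 12, 16                  # hypothetical block to duplicate

# Keep layers 0..end-1, insert a copy of layers start..end-1,
# then continue with the original layers end onward.
duplicated = [copy.deepcopy(layers[i]) for i in range(start, end)]
new_layers = list(layers[:end]) + duplicated + list(layers[end:])

model.model.layers = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)
model.save_pretrained("llama-3-layer-duplicated")
```

The abrupt transition happens where the copied block rejoins the original stack, which is why the choice of start and end points matters so much.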

One way that has worked for me in combining models is finding a common thread, like a perspective that they can all use. I was able to combine a DeepSeek and a GPT-4o into something completely new. Check out the Codette space and you'll see what I mean. If you have questions on the code, I'm happy to share it.