This AI Paper Introduces TinyLoRA, A 13-Parameter Fine-Tuning Method That Reaches 91.8 Percent GSM8K on Qwen2.5-7B
Researchers from FAIR at Meta, Cornell University, and Carnegie Mellon University have demonstrated that large language models (LLMs) can learn to reason using a remarkably small number of trained parameters. The research team introduces TinyLoRA, a parameterization that can scale down to a single trainable parameter under extreme sharing settings. Using this method on a Qwen2.5-7B-Instruct backbone, the team achieved 91.8% accuracy on the GSM8K benchmark with only 13 trainable parameters, totaling just 26 bytes in bf16.
Overcoming the Constraints of Standard LoRA
Standard Low-Rank Adaptation (LoRA) adapts a frozen linear layer W ∈ R^(d×k) using trainable matrices A ∈ R^(d×r) and B ∈ R^(r×k). The trainable parameter count in standard LoRA therefore still scales with layer width and rank, which leaves a nontrivial lower bound even at rank 1. For a model like Llama3-8B, this minimum update size is approximately 3 million parameters.
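That ~3M floor can be checked with a back-of-the-envelope count. The sketch below assumes Llama3-8B-like dimensions (hidden size 4096, grouped-query key/value dimension 1024, MLP dimension 14336, 32 layers, adapters on all seven linear projections per layer); a rank-r LoRA on a d×k layer trains d·r + r·k parameters.

```python
# Rank-1 LoRA parameter count for a Llama3-8B-like architecture.
# Dimensions are assumptions based on the published Llama3-8B config.

def lora_params(d: int, k: int, r: int = 1) -> int:
    """Trainable parameters of a rank-r LoRA adapter on a d x k layer."""
    return d * r + r * k

HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32

per_layer = (
    lora_params(HIDDEN, HIDDEN)    # q_proj
    + lora_params(HIDDEN, KV)      # k_proj (grouped-query attention)
    + lora_params(HIDDEN, KV)      # v_proj
    + lora_params(HIDDEN, HIDDEN)  # o_proj
    + lora_params(HIDDEN, MLP)     # gate_proj
    + lora_params(HIDDEN, MLP)     # up_proj
    + lora_params(MLP, HIDDEN)     # down_proj
)

total = per_layer * LAYERS
print(total)  # 2621440 -- in line with the ~3M floor cited above
```

Even at rank 1, every adapted layer contributes d + k parameters, so the floor is set by layer widths rather than by the rank.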
TinyLoRA circumvents this by building upon LoRA-XS, which utilizes the truncated Singular Value Decomposition (SVD) of the frozen weights. While LoRA-XS typically requires at least one parameter per adapted module, TinyLoRA replaces the trainable matrix with a low-dimensional trainable vector v ∈ R^u projected through a fixed random tensor P ∈ R^(u×r×r).
The update rule is defined as:

$$W' = W + U\Sigma\left(\sum_{i=1}^{u} v_{i}P_{i}\right)V^{\top}$$

By applying a weight-tying factor n_tie, the total trainable parameter count scales as O(nmu/n_tie), allowing updates to scale down to a single parameter when all modules across all layers share the same vector.
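The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the layer sizes are arbitrary, r = 2 matches the paper's preferred frozen rank, and u = 13 matches the headline configuration. Only v is trainable; U, Σ, V come from the frozen weight's truncated SVD and P is a fixed random tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, u = 64, 48, 2, 13  # illustrative sizes; r=2 is the paper's frozen rank

W = rng.standard_normal((d, k))  # frozen pretrained weight

# Frozen rank-r truncated SVD of W.
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r]          # d x r
S = np.diag(s[:r])         # r x r
V = Vt[:r, :].T            # k x r

# Fixed random projection tensor P and the only trainable parameters, v.
P = rng.standard_normal((u, r, r))
v = rng.standard_normal(u)

# W' = W + U Sigma (sum_i v_i P_i) V^T
M = np.einsum("i,ijk->jk", v, P)  # r x r mixing matrix built from v
W_prime = W + U @ S @ M @ V.T
```

Note how the adapted-module shape (d×k) never touches the trainable parameter count: v stays u-dimensional regardless of layer width, and sharing the same v across modules (weight tying) is what drives the count toward a single parameter.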
Reinforcement Learning: The Catalyst for Tiny Updates
A core finding of the research is that Reinforcement Learning (RL) is fundamentally more efficient than Supervised Finetuning (SFT) at extremely low parameter counts. The research team reports that models trained via SFT require updates 100 to 1,000 times larger to reach the same performance as those trained with RL.
This gap is attributed to the 'information density' of the training signal. SFT forces a model to absorb many bits of information, including stylistic noise and irrelevant structure from human demonstrations, because its objective treats all tokens as equally informative. In contrast, RL (specifically Group Relative Policy Optimization, or GRPO) provides a sparser but cleaner signal: because rewards are binary (e.g., exact match for a math answer), reward-relevant features correlate with the signal while irrelevant variations cancel out through resampling.
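A minimal sketch of the group-relative signal at the heart of GRPO: each sampled completion's reward is normalized against its own group's mean and standard deviation. The reward values and group size here are hypothetical, and this shows only the advantage computation, not the full policy-gradient update.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center each reward on its group mean and
    scale by the group standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: 8 completions sampled for one math prompt,
# rewarded 1.0 for an exact-match final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advs = group_relative_advantages(rewards)
```

Correct completions receive positive advantage and incorrect ones negative, while variations that do not change the reward (phrasing, formatting) average out across the group — the 'cleaner signal' the paper credits for RL's efficiency at tiny parameter counts.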
Optimization Guidelines for Devs
The research team isolated several strategies to maximize the efficiency of tiny updates:
- Optimal frozen rank (r): Analysis showed that a frozen SVD rank of r = 2 was optimal. Higher ranks introduced too many degrees of freedom, complicating the optimization of the small trainable vector.
- Tiling vs. structured sharing: The research team compared 'structured' sharing (modules of the same type share parameters) with 'tiling' (nearby modules of similar depth share parameters). Surprisingly, tiling was more effective, showing no inherent benefit to forcing parameter sharing exclusively between specific projections like Query or Key modules.
- Precision: In bit-constrained regimes, storing parameters in fp32 proved most performant bit-for-bit, even when accounting for its larger footprint compared to bf16 or fp16.
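The precision finding is easiest to see as a budget trade-off. The sketch below uses an arbitrary 52-byte budget for illustration: half-precision formats fit twice as many parameters, yet the paper reports that the fp32 variant still wins per bit.

```python
# Bit-for-bit accounting for a fixed update budget (52 bytes is an
# arbitrary illustrative choice, equal to 13 fp32 parameters).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}

def params_for_budget(budget_bytes: int, dtype: str) -> int:
    """How many parameters of a given dtype fit in the byte budget."""
    return budget_bytes // BYTES_PER_PARAM[dtype]

budget = 52
counts = {dt: params_for_budget(budget, dt) for dt in BYTES_PER_PARAM}
print(counts)  # {'fp32': 13, 'bf16': 26, 'fp16': 26}
```

In other words, the comparison in the paper is between fewer high-precision values and more low-precision ones under the same byte footprint, and the fewer-but-wider fp32 parameters came out ahead.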
Benchmark Performance
The research team reports that Qwen2.5 models often needed around 10x fewer updated parameters than Llama-3 to reach similar performance in their setup.
| Model | Parameters Trained | GSM8K Pass@1 |
| --- | --- | --- |
| Qwen2.5-7B-Instruct (Base) | 0 | 88.2% |
| Qwen2.5-7B-Instruct | 1 | 82.0% |
| Qwen2.5-7B-Instruct | 13 | 91.8% |
| Qwen2.5-7B-Instruct | 196 | 92.2% |
| Qwen2.5-7B-Instruct (Full FT) | ~7.6 Billion | 91.7% |
On harder benchmarks like MATH500 and AIME24, 196-parameter updates for Qwen2.5-7B-Instruct retained 87% of the absolute performance improvement of full finetuning across six difficult math benchmarks.
Key Takeaways
- Extreme parameter efficiency: It is possible to train a Qwen2.5-7B-Instruct model to achieve 91.8% accuracy on the GSM8K math benchmark using only 13 parameters (26 total bytes in bf16).
- The RL advantage: Reinforcement Learning (RL) is fundamentally more efficient than Supervised Finetuning (SFT) in low-capacity regimes; SFT requires 100–1000x larger updates to reach the same performance level as RL.
- TinyLoRA framework: The research team developed TinyLoRA, a new parameterization that uses weight tying and random projections to scale low-rank adapters down to a single trainable parameter.
- Optimizing the 'micro-update': For these tiny updates, fp32 precision is more bit-efficient than half-precision formats, and 'tiling' (sharing parameters by model depth) outperforms structured sharing by module type.
- Scaling trends: As models grow larger, they become more 'programmable' with fewer absolute parameters, suggesting that trillion-scale models could potentially be tuned for complex tasks using just a handful of bytes.
Check out the Paper.
