ai-after-attention

Attention is all you need


KV Caching


Past tokens are still needed


$$\vec{E} \rightarrow \vec{Q}, \vec{K}, \vec{V}$$


step++


KV Cache

Step  Input            Normal K/V         Cached K/V   Computed K/V
1     Time             Time               -            Time
2     Time flies       Time, flies        Time         flies
3     Time flies fast  Time, flies, fast  Time, flies  fast
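The table above can be sketched as a toy single-head decode loop (NumPy, made-up random weights): at each step only the new token's K/V are computed and appended to the cache, while the query attends over everything cached so far.

```python
import numpy as np

d = 4  # toy embedding/head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention for a single query vector
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# decode token by token, reusing cached K/V instead of recomputing them
K_cache, V_cache = [], []
for step, e in enumerate(rng.standard_normal((3, d)), start=1):  # embeddings for "Time flies fast"
    q = Wq @ e
    K_cache.append(Wk @ e)  # only the NEW token's K/V get computed
    V_cache.append(Wv @ e)
    out = attend(q, np.array(K_cache), np.array(V_cache))
    print(f"step {step}: cache holds {len(K_cache)} K/V pairs")
```

Without the cache, step 3 would redo the K/V projections for "Time" and "flies"; with it, each step is one projection plus one attention pass over the cache.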

Quantization

  • smaller
  • faster
  • ideally almost as good


Example

70B Parameter Model

Data type  RAM [GB]
FP32       280
FP16       140
FP8/INT8   70
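The table follows directly from bytes per parameter; a quick back-of-the-envelope check (weights only, ignoring activations and the KV cache):

```python
# RAM for the weights alone: parameters × bytes per parameter
params = 70e9  # 70B parameter model

for dtype, nbytes in [("FP32", 4), ("FP16", 2), ("FP8/INT8", 1)]:
    print(f"{dtype}: {params * nbytes / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, FP8/INT8: 70 GB
```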

2020 GPT3

Larger models make increasingly efficient use of in-context information.

[...] reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising...


Scaling


Ethics

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective...



2024 DeepSeek V3

  • FP8 Quantization-Aware-Training
  • Mixture of Experts
  • Latent Attention
  • Rotary Position Embedding


"Experts"


2025 Test-Time Scaling

  • GPT3: more parameters ⟹ lower loss
  • 💡: more (inference) time ⟹ lower loss
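One concrete form of "more time ⟹ lower loss" is majority voting over repeated samples (self-consistency). A toy calculation, assuming each independent sample is correct with probability 0.6, shows accuracy climbing as we spend more inference-time compute:

```python
import math

def majority_correct(p: float, n: int) -> float:
    # probability that a strict majority of n independent samples is correct
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in [1, 5, 25, 101]:
    print(f"{n:3d} samples: {majority_correct(0.6, n):.3f}")
```

The per-sample accuracy of 0.6 and the independence assumption are illustrative only; the point is the monotone gain from extra samples, with no extra parameters.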


2024 o1 R1

  • Training data (Supervised Fine-Tuning)
  • Chain of Thought Prompting

Their dependence on human-annotated reasoning traces hinders scalability and introduces cognitive biases.


Reinforcement Learning (Alpha Go)

Elo: AlphaGo Lee ≈ 3800 < AlphaGo Zero ≈ 5200


Reinforcement Learning

$$\text{Reward}_{\text{rule}} = \text{Reward}_{\text{acc}} + \text{Reward}_{\text{format}}$$

$$\text{Reward}_{\text{acc}} \rightarrow \text{Compiler, Leetcode}$$

$$\text{Reward}_{\text{format}} \rightarrow \text{LLM Judge}$$
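A minimal sketch of this rule-based reward split: the accuracy part is a verifiable check (here a simple answer match standing in for a compiler run or Leetcode-style tests), and the format part checks the required output template (here a regex standing in for an LLM judge). The tag names and reward values are illustrative assumptions, not the exact scheme.

```python
import re

def reward_acc(completion: str, expected: str) -> float:
    # accuracy reward: verifiable check; exact answer match as a stand-in
    # for a compiler run or hidden test cases
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return 1.0 if m and m.group(1).strip() == expected else 0.0

def reward_format(completion: str) -> float:
    # format reward: did the model follow the think-then-answer template?
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.S)
    return 0.5 if ok else 0.0

def reward_rule(completion: str, expected: str) -> float:
    return reward_acc(completion, expected) + reward_format(completion)

print(reward_rule("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.5
```

Because both signals are cheap and automatic, the model can be trained with RL at scale, with no human-annotated reasoning traces in the loop.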

Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways.


How to reason?


Reasoning? - R1-Zero