Calibration

An LLM is well calibrated if its predicted confidence in an answer generally matches the probability that the answer is correct. The pre-trained GPT-4 base model is highly calibrated; post-training actually reduces its calibration.

Calibration is a widely studied property in the literature on uncertainty quantification: a model is calibrated if it assigns meaningful probabilities to its predictions. Concretely, if a well-calibrated model predicts that 1,000 sentences are toxic each with probability 0.7, then we expect around 700 of them to be toxic.

page 30 of [2]
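
To make this concrete, here is a minimal sketch (the function name and interface are illustrative, not from [2]) of how calibration can be checked empirically: bin predictions by their predicted confidence and compare each bin's average confidence to its observed accuracy.

```python
# Minimal sketch: checking calibration empirically.
# A well-calibrated model's average confidence in each bin should match the
# observed accuracy in that bin (e.g. ~0.7 confidence -> ~70% correct).
import numpy as np

def calibration_by_bin(confidences, correct, n_bins=10):
    """Group predictions by predicted confidence and compare the average
    confidence in each bin to the observed accuracy in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)   # 1.0 if prediction was right
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    report = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.sum() == 0:
            continue
        report.append({
            "bin": (round(lo, 2), round(hi, 2)),
            "avg_confidence": confidences[mask].mean(),
            "accuracy": correct[mask].mean(),    # close to avg_confidence => calibrated
            "count": int(mask.sum()),
        })
    return report
```

For the toxicity example above, the bin around 0.7 would contain the 1,000 sentences, and a well-calibrated model would show an accuracy near 0.7 in that bin (roughly 700 of them actually toxic).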

For example, P("Positive" | "Input: nothing Sentiment:") should be 0.5: there is no actual input text, so neither label is supported and a calibrated model should split probability evenly between Positive and Negative. Yet GPT-4 might assign a higher probability, such as 0.9, which would mean that the LLM's output distribution is biased toward the label Positive.

To address this, we need to calibrate the output distribution so that a probability of 0.5 is assigned to both Positive and Negative.
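
One simple way to do this, sketched below under the assumption that we can read off the model's label probabilities, is to estimate the label bias on a content-free input (e.g. "N/A" or an empty input) and divide it out, in the spirit of contextual calibration. All numbers are illustrative.

```python
# Minimal sketch of rescaling a biased label distribution: estimate the
# model's prior bias on a content-free input, divide it out, renormalize.
import numpy as np

def calibrate(p_label, p_content_free):
    """p_label: model probabilities for each label on the real input.
    p_content_free: probabilities on a content-free input (e.g. "N/A"),
    which reveal the model's prior bias toward certain labels."""
    p_label = np.asarray(p_label, dtype=float)
    p_content_free = np.asarray(p_content_free, dtype=float)
    adjusted = p_label / p_content_free        # remove the label bias
    return adjusted / adjusted.sum()           # renormalize to a distribution

# Biased model: assigns 0.9 to "Positive" even when there is no input text.
p_content_free = [0.9, 0.1]                    # ["Positive", "Negative"]
print(calibrate([0.9, 0.1], p_content_free))   # -> [0.5, 0.5]
```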

Why is calibration important?

As we integrate LLMs into broader systems and into society, we need to know how uncertain they are about their outputs. (see page 30 of [2])

When machine learning models are integrated into broader systems, it is critical for these models to be simultaneously accurate (i.e. frequently correct) and able to express their uncertainty (so that their errors can be appropriately anticipated and accommodated). Calibration and appropriate expression of model uncertainty is especially critical for systems to be viable for deployment in high-stakes settings, including those where models inform decision-making (e.g. resume screening), which we increasingly see for language technology as its scope broadens. For example, if a model is uncertain in its prediction, a system designer could intervene by having a human perform the task instead to avoid a potential error (i.e. selective classification).
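
As a toy illustration of the selective-classification idea mentioned in the quote, the sketch below (the predict interface and threshold are assumptions) only accepts the model's answer when its confidence clears a threshold and otherwise defers to a human.

```python
# Minimal sketch of selective classification: act on a prediction only when
# the model's confidence clears a threshold, otherwise defer to a human.
def classify_or_defer(predict, x, threshold=0.9):
    label, confidence = predict(x)   # assumed to return (label, probability)
    if confidence >= threshold:
        return label                 # accept the model's answer
    return "DEFER_TO_HUMAN"          # route the example to a person instead

print(classify_or_defer(lambda x: ("reject", 0.72), "some resume text"))
# -> "DEFER_TO_HUMAN"  (0.72 < 0.9, so the system abstains)
```

This only works as intended if the model's confidence is meaningful in the first place, which is exactly what calibration buys us.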

From page 10 of the GPT-4 technical report, we know that the post-trained GPT-4 model can be biased and confidently wrong in its predictions because of its relatively poor calibration.

In-context vs in-weights learning

  • in-context learning – The ability to generalize rapidly from a few examples of a new concept the model has never been trained on, without any gradient updates to the model (see the sketch after this list).
  • in-weights learning – The standard supervised-learning setting: slow (requiring many examples) and dependent on gradient updates to the model's weights.
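
For a concrete picture of the difference, here is an illustrative few-shot prompt (the task and examples are made up): the "learning" happens entirely inside the model's context window, with no parameter updates.

```python
# Illustrative in-context learning prompt: the task is specified purely by
# examples inside the prompt, and the model is expected to continue the
# pattern. No gradient updates to the weights are involved.
few_shot_prompt = """Input: I loved this movie. Sentiment: Positive
Input: The plot was a complete mess. Sentiment: Negative
Input: The acting was wonderful. Sentiment:"""

# In-weights learning would instead fine-tune the model on many labeled
# (input, sentiment) pairs via gradient descent before it could do this task.
```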

In the case of transformer language models, the capacity for in-context learning is emergent [3]. In-context learning emerges in a transformer only when it is trained on data that exhibits both burstiness and a large enough set of rarely occurring classes. At the same time, architecture matters: recurrent models like LSTMs and vanilla RNNs (matched on parameter count) were unable to exhibit in-context learning when trained on the same data distribution. However, transformers trained on the wrong data distributions also failed to exhibit in-context learning. Thus, attention is not all you need – architecture and data are both key to the emergence of in-context learning.

Open Source LLMs

Explanation of running open source LLMs at 4-bit, 8-bit, and 16-bit precision

Link to Reddit post

“Traditionally a LLM would be trained at 16 bit and ship that way. So a 7B model would require maybe 20gigs of video ram to run. A 13B would require maybe 40GB of vram, etc. This kept them running only on server rated hardware.

Then 8bit became something widely supported. This allowed a 7B model to run on 10GB of vram, and a 13B on 20GB. This became a lot more viable to run on consumer hardware.

Then someone got it to work on 4bits without much quality loss. This meant that 7, 13 and 30 could all run on consumer video cards and a 65B could even run on a consumer dual video card setup.

However, training these models still happened at 16 and 8 bits, which meant you needed high end hardware for training. This got solved a little with LoRA training, where you only train maybe 2% of the model, but it was still at 8bit so consumer hardware was sort of limited to LoRA training 7B and 13B. Because to do LoRA training you still have to load the model into video ram, freeze it, then train on that 2% part.

Now we’re getting into high quality 4bit LoRA training. This means we can more easily train all the way up to 65B on home, consumer level hardware because we can natively load the models in at 4bit and train them that way. So we can now train a 30B on a single Nvidia 3090 video card or a 65B on a dual 3090 setup, which isn’t that expensive or hard to setup.”
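
As rough arithmetic behind those numbers: a 7B-parameter model needs about 14 GB just for the weights at 16-bit (2 bytes per parameter), about 7 GB at 8-bit, and about 3.5 GB at 4-bit, plus overhead for activations and the KV cache. Below is a minimal sketch of 4-bit loading plus LoRA fine-tuning (QLoRA-style), assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the model name and LoRA hyperparameters are illustrative.

```python
# Minimal QLoRA-style sketch: load the frozen base model in 4-bit and train
# small LoRA adapters on top of it (assumes transformers, peft, bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in 16 bits
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative 7B base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # train small adapters on attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # base stays frozen; only adapters train
model.print_trainable_parameters()          # typically ~1-2% of total parameters
```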

Base Model vs SFT vs RLHF

Base model – the bulk of the compute is spent here; training takes months. The model simply predicts the next token. Its output has high entropy, meaning the probability distribution over the next token is relatively diffuse. Ex: Llama v1, v2

SFT – the base model fine-tuned with a supervised objective on a curated set of prompt–response demonstrations, still predicting the next token.

RLHF – train a reward model, then use it to assign rewards to the SFT model's outputs. Use RL to reinforce (increase the probability of) tokens in completions that receive higher rewards. Ex: ChatGPT, Claude.
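
As a toy illustration of the "reinforce completions with higher rewards" step (not a real PPO implementation, which adds clipping and a KL penalty; all tensors are made up), the sketch below shows a REINFORCE-style update where completions scored above the mean reward get their token probabilities pushed up:

```python
# Toy REINFORCE-style sketch of the RLHF update: higher-reward completions
# have their log-probabilities increased, lower-reward ones decreased.
import torch

# log-probabilities the SFT policy assigned to 3 sampled completions
completion_logprobs = torch.tensor([-12.0, -9.5, -15.2], requires_grad=True)

# scalar scores from the reward model for each completion
rewards = torch.tensor([0.8, 2.1, -0.5])

# Center rewards (a simple baseline) and minimize the negative
# reward-weighted log-probability.
advantages = rewards - rewards.mean()
loss = -(advantages * completion_logprobs).sum()
loss.backward()

print(completion_logprobs.grad)
# Negative gradient => gradient descent pushes that completion's
# probability up; positive gradient pushes it down.
```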

Resources