Taking a token's temperature

Have you ever wondered why an AI assistant can give completely different answers to the same question when you repeat it? The answer relates to tokens, which we learned about in a previous blog post, and to the way they are generated.

I explained tokens here. And I went a bit deeper into the statistical nature of how they are generated here.

In this post I'm going to talk about how the generation of tokens can be manipulated to give you responses ranging from completely factual to wildly creative.

Temperature

A large language model has a setting called temperature that determines how rigidly or loosely it applies its statistical biases when generating the next token.

In physics, temperature is a measure of how much particles move around and how likely they are to jump to different energy states. The analogy carried over to AI because it describes similar behaviour in how tokens are selected, i.e. at a higher temperature tokens have more “energy” allowing them to jump to less probable choices.

If the temperature is set low, tokens with higher probabilities are favoured; a medium temperature allows for more invention in the response; and a high temperature lets the assistant know it can be as creative as it wants to be.
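Under the hood, temperature works by dividing the model's raw scores (logits) by the temperature value before they are converted into probabilities. Here's a minimal Python sketch with made-up logits, just to show the effect:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores (logits) into probabilities, scaled by temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature  # divide by temperature
    exps = np.exp(scaled - scaled.max())                    # subtract max for numerical stability
    return exps / exps.sum()

# Made-up logits for five candidate next tokens
logits = [4.0, 3.2, 2.8, 2.5, 0.5]

print(softmax_with_temperature(logits, 0.2))  # sharply peaked: the top token dominates
print(softmax_with_temperature(logits, 1.5))  # flatter: unlikely tokens get a real chance
```

At a low temperature the distribution collapses onto the most likely token; at a high temperature it flattens out, which is what lets those less probable, more “energetic” choices through.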

When dinner plans turn dark

Let's look at a simple example with a response from an AI assistant that starts with the words “Tonight I will eat…”

A very low temperature, somewhere around zero, might result in “Tonight I will eat chicken”. It’s statistically highly likely, safe, logical and entirely boring. Think of zero as basically autocomplete. That isn’t strictly true, but it’s a useful lie to help you remember.

A slightly higher temperature of 0.3, say, could give you “Tonight I will eat outside”. Still quite likely, but it has added a contextual variation, shifting focus from what you will eat to where. It’s entirely conventional, if less boring.

0.5 might get you something like “Tonight I will eat some street food”. This is a more creative response, adding a specific scenario while showing the same kind of contextual variation as 0.3.

0.7 could be “Tonight I will eat with my friends”, shifting focus to the social aspect of eating. Less predictable, but still natural.

1.0 might be “Tonight I will eat my friends”, and we have reached the point where, while still grammatically correct, the response is now highly disturbing, as the assistant has suddenly turned into Hannibal Lecter.

Many of the more popular AI assistants, like ChatGPT and Claude, don’t let you control temperature directly. Instead, they infer a suitable temperature based on the conversation you are having. If you have programming questions, the temperature will be much lower than if you are asking for a short story about a bear that has found illicit materials in the woods. You can also give indications in your prompts that help the assistant, with phrases like “keep it short and to the point” or “be as creative as you like”.

You can, however, control the temperature when using their APIs, or use HuggingFace models as an alternative when playing around with LLMs.
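For example, here's a minimal sketch using the Hugging Face transformers library (the model name is just an example; any causal language model will do):

```python
# Minimal sketch: sampling with an explicit temperature via Hugging Face transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Tonight I will eat", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # sample from the distribution instead of always taking the top token
    temperature=0.7,     # lower = safer and more predictable, higher = more creative
    max_new_tokens=10,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The assistant APIs expose the same idea, typically as a temperature parameter on the request.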

Adding toppings

Temperature isn’t the only way to control token generation. There is also “Top-p sampling”.

Top-p sampling (also called nucleus sampling) keeps only the smallest set of most likely tokens whose combined probability reaches the threshold p, discarding the rest before they are considered by the assistant. This creates a cutoff that completely removes unlikely tokens. Top-p values range from low (a stricter cut that discards more tokens) to high (a looser cut that allows for more variety).

Top-p acts as a quality control for token generation, helping to keep the output sensible even when the temperature is high, because the more nonsensical options never make it past the filter.
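To make the cutoff concrete, here's a minimal sketch of top-p filtering over a made-up probability distribution (illustrative only, not a production sampler):

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose combined probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # how many tokens are needed to reach p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()             # renormalise the survivors

probs = [0.55, 0.20, 0.12, 0.08, 0.05]  # made-up probabilities for five tokens
print(top_p_filter(probs, 0.8))   # keeps the head of the distribution, zeroes out the tail
```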

There’s also “top-k”. Like top-p, top-k is another quality control filter, but it is based on count rather than probability. Tokens are sorted in order of probability and only the top k are kept. The k means count, or maybe constant, because they both begin with a c, or a k, or something. You know what? It doesn’t matter.

While top-p creates a dynamic cutoff based on cumulative probability, top-k uses a fixed number of tokens regardless of their probabilities, which makes it a very simple approach when choosing the next token.
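A top-k filter is even simpler to sketch (again with made-up probabilities):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalise."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]   # indices of the k most likely tokens
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = [0.55, 0.20, 0.12, 0.08, 0.05]  # made-up probabilities for five tokens
print(top_k_filter(probs, 3))   # everything outside the top 3 is discarded
```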

Top-k, top-p and temperature can be combined, as shown in the sketch after this list:

  • Apply top-k first to remove clearly unlikely tokens (k=40 is common, sorting and keeping only the top 40 tokens, but other values of k can be used)
  • Use top-p to further filter the remaining tokens based on their probabilities
  • Apply temperature to control randomness in final selection
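Putting the three steps together in that order might look something like this minimal sketch (real samplers differ in detail, and some libraries apply the steps in a different order, but the idea is the same):

```python
import numpy as np

def sample_next_token(logits, k=40, p=0.9, temperature=0.8):
    """Sample one token index: top-k first, then top-p, then temperature."""
    logits = np.asarray(logits, dtype=float)

    # 1. Top-k: discard everything outside the k highest-scoring tokens.
    top_k = np.argsort(logits)[::-1][:k]
    masked = np.full_like(logits, -np.inf)
    masked[top_k] = logits[top_k]

    # 2. Top-p: keep the smallest set of tokens whose probability mass reaches p.
    probs = np.exp(masked - masked[top_k].max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    nucleus = order[: np.searchsorted(np.cumsum(probs[order]), p) + 1]
    final_mask = np.full_like(logits, -np.inf)
    final_mask[nucleus] = logits[nucleus]

    # 3. Temperature: rescale what survived, convert to probabilities, sample.
    scaled = np.exp((final_mask - final_mask[nucleus].max()) / temperature)
    scaled /= scaled.sum()
    return np.random.choice(len(logits), p=scaled)

# Made-up logits for five candidate tokens
logits = [4.0, 3.2, 2.8, 2.5, 0.5]
print(sample_next_token(logits, k=4, p=0.9, temperature=0.8))
```

In practice you rarely write this yourself; libraries such as Hugging Face transformers expose top_k, top_p and temperature as arguments to generate.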

Tonight I will eat… a month's supply of ice-cream

Temperature, top-p, and top-k are the heat, seasoning, and portion control of AI token generation. Turn up the heat for hotter responses, use top-p to keep the flavours balanced, and top-k to make sure you don’t accidentally tip the whole spice rack in.