Context window

The context window of a large language model (LLM) refers to the maximum number of tokens it can process at a time to generate a response. Imagine looking through a window into a garden. Just as the window limits your view to only a portion of the garden, the context window limits the amount of text the LLM can consider at one time.
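
As a rough illustration, the sketch below counts a prompt's tokens with the tiktoken library and checks whether the prompt, plus room for a response, fits inside a hypothetical 8,192-token window; real limits vary by model.

```python
import tiktoken  # open-source tokenizer library used with several OpenAI models

CONTEXT_WINDOW = 8192  # hypothetical limit; actual windows vary by model

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, max_output_tokens: int = 512) -> bool:
    """Check that the prompt plus the expected response fits in the window."""
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("Describe the garden outside the window."))  # True
```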

Fine-tuning

Fine-tuning is the process of feeding a task-specific dataset to a pre-trained model and adjusting its parameters through backpropagation to improve the performance of general models for specific use cases.
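
As a minimal sketch, assuming a small pre-trained model and a couple of hypothetical task-specific examples, a single fine-tuning step might look like this (a real run would iterate over a much larger dataset for many steps):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just a small, freely available stand-in for a pre-trained model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical task-specific examples; a real dataset would be far larger.
texts = [
    "Q: How do I reset my password? A: Go to Settings > Security.",
    "Q: Where can I find my invoice? A: Invoices are listed under Billing.",
]
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
# For simplicity, the loss here also covers padding positions.
outputs = model(**batch, labels=batch["input_ids"])  # causal-LM loss
outputs.loss.backward()  # backpropagation adjusts the pre-trained weights
optimizer.step()         # one gradient step; real fine-tuning runs many
```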

Large language model (LLM)

LLMs are transformer-based models trained on huge amounts of text. They are good at recognizing patterns in written language and generating plausible response tokens, one at a time. In simple terms, they take in a string and output a string.
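
In code, that string-in, string-out behavior looks like the sketch below, which uses the OpenAI Python client as one example; the model name is just a placeholder for any chat model.

```python
from openai import OpenAI  # one example client; any LLM API behaves similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# String in...
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute any chat model
    messages=[{"role": "user", "content": "Define an LLM in one sentence."}],
)

# ...string out.
print(response.choices[0].message.content)
```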

Latency

The latency of an LLM is the time it takes for the model to generate a full response to an input.
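
A simple way to measure this is to time a non-streaming request end to end. The sketch below uses the same hypothetical OpenAI client and placeholder model name as the LLM example above.

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about a garden."}],
)
latency = time.perf_counter() - start  # time until the full response arrived
print(f"Latency: {latency:.2f}s")
```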

Multimodal

Multimodal models are architectures that can consume, and in some cases produce, a variety of modalities, such as image, video, audio, and text. In most cases, multimodal models convert non-text inputs into embeddings that are passed to the underlying LLM, which then generates the text output.
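
As an illustration, here is a hedged sketch of a multimodal request that mixes an image with text, again using the OpenAI Python client; the model name and image URL are placeholders.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [  # one request mixing two modalities
            {"type": "text", "text": "What is shown in this picture?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/garden.png"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)  # the model answers in text
```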

Pre-training

Pre-training is the foundation of LLMs. During development, LLMs are pre-trained on large amounts of data to build their language understanding and generation capabilities. Pre-training is similar to the phase that human babies go through when they begin to understand the structure of language before they learn to speak.

Reinforcement learning from human feedback (RLHF)

RLHF uses human feedback to fine-tune LLMs to follow human instructions better. This technique requires humans to rate the outputs generated by an LLM; those ratings are typically used to train a reward model, which in turn guides further fine-tuning of the LLM.
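
The sketch below illustrates only the reward-modeling step, with made-up scores: a pairwise (Bradley-Terry) loss nudges the reward model to score human-preferred responses above rejected ones.

```python
import torch
import torch.nn.functional as F

# Made-up scalar scores a reward model assigns to two candidate responses
# for the same prompts; human raters preferred the "chosen" responses.
reward_chosen = torch.tensor([1.7, 0.3])
reward_rejected = torch.tensor([0.9, -0.2])

# Pairwise (Bradley-Terry) loss: push preferred scores above rejected ones.
# The trained reward model later guides the RL updates to the LLM itself.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())
```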

Retrieval-augmented generation (RAG)

RAG is a technique used to improve the prediction quality of an LLM by creating an external datastore that it can use to build richer prompts. In most cases, the external datastore is created by chunking source documents and storing the chunks in a vector database. At query time, the system retrieves the chunks most relevant to the user's input and adds them to the prompt so the LLM can generate a grounded output.
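
The toy sketch below walks through that loop with a stand-in bag-of-words embedding (a real system would use a learned embedding model and a vector database): chunk, index, retrieve the best match, and build a richer prompt.

```python
import re

import numpy as np

# Stand-in embedding; real systems use a learned embedding model.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for word in re.findall(r"[a-z']+", text.lower()):
        vec[hash(word) % 256] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

# The "external datastore": document chunks and their vectors.
chunks = [
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Premium plans include priority phone support.",
]
index = np.stack([embed(c) for c in chunks])

# Retrieve the most relevant chunk and use it to build a richer prompt.
query = "What is the refund policy?"
scores = index @ embed(query)  # cosine similarity (vectors are unit-norm)
best_chunk = chunks[int(scores.argmax())]
prompt = f"Context: {best_chunk}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```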

Temperature

Temperature is a parameter that controls the randomness of a model’s generated output. Temperature typically ranges from 0 to 1 (some APIs accept values up to 2), with higher values increasing the randomness of the output. Generally, lower values around 0.2 are a good starting point for LLMs, as higher temperatures can lead to nonsensical outputs.
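
Mechanically, temperature divides the model's logits before the softmax, which sharpens or flattens the resulting token distribution. The sketch below shows the effect with made-up logits for four tokens.

```python
import numpy as np

def token_distribution(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into token probabilities at a given temperature."""
    scaled = logits / max(temperature, 1e-6)  # guard against division by zero
    exps = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up scores for four tokens
print(token_distribution(logits, 0.2))    # low temperature: sharply peaked
print(token_distribution(logits, 1.0))    # higher temperature: flatter, more random
```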

Time to first token (TTFT)

TTFT is the time it takes an LLM to emit the first token of its response after receiving a complete input string; it includes the time the model spends processing the prompt.
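
One way to measure TTFT is to stream a response and record when the first content chunk arrives. This sketch again uses the OpenAI Python client with a placeholder model name.

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Explain TTFT briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break  # stop after the first content token arrives
```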

Token

Tokens are the smallest units of data processed by LLMs; individual tokens combine to form larger strings. Depending on the tokenizer, text that you perceive as a single word could be generated or processed as two tokens. For example, the word "that's" is often split into two tokens: "that" and "'s".
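
You can inspect this with a tokenizer library such as tiktoken; the exact split depends on the tokenizer, but contractions are commonly divided as described above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

tokens = enc.encode("that's")
print(tokens)                             # the token IDs
print([enc.decode([t]) for t in tokens])  # the substring each token covers
```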