Context Window

The context window is the amount of information an LLM can consider at once when generating output: the number of tokens it can use as history while producing each new token. One analogy is a real window: you can only describe what you currently see through it, and if I ask you what you see, you can only report the current view.
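A minimal sketch of the idea: before calling a model, the chat history is trimmed so it fits inside a fixed token budget. Counting whitespace-separated words stands in for a real tokenizer here, and the function name is illustrative, not from any library.

```python
# Illustrative only: "tokens" are whitespace-separated words; a real
# system would count tokens with the model's actual tokenizer.

def trim_to_context_window(history, max_tokens):
    """Keep only the most recent messages whose total token count fits."""
    kept, used = [], 0
    for message in reversed(history):  # walk backwards from the newest message
        n = len(message.split())
        if used + n > max_tokens:
            break
        kept.append(message)
        used += n
    return list(reversed(kept))

history = ["hello there", "how are you today", "tell me a story"]
print(trim_to_context_window(history, 8))
# the oldest message is dropped because it no longer fits the window
```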

LLM (Large Language Model)

Large language models are transformer-based models trained on huge amounts of text. That training gives them a grasp of language heuristics, lets them pick up the patterns in your text, and makes them good at generating tokens.

In simpler terms, they take in a string and output a string.

Latency

Latency in an LLM is the time the model takes to generate a complete response for the user.
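A simple way to measure this is to time a full request from send to final response. The `generate_response` function below is a stand-in for a real model call, added only to make the sketch runnable.

```python
import time

def generate_response(prompt):
    # Stand-in for an LLM call; a real request to a model would go here.
    time.sleep(0.05)
    return "response to " + prompt

start = time.perf_counter()
reply = generate_response("hi")
latency = time.perf_counter() - start
print(f"latency: {latency:.3f}s")
```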

Multimodal Models

Multimodal models are architectures that can consume and produce a variety of modalities such as images, video, and audio. In most cases they convert these inputs into embeddings, which the underlying LLM then processes before producing text output.

Tokens

Tokens are the smallest units of data an LLM processes to form a larger string. Depending on the tokenizer, what you perceive as a single word may have been generated from two or more tokens.

e.g. the word "that's" is converted into two tokens: "that" and "'s".
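A toy tokenizer makes the split above concrete. This regex-based splitter is purely illustrative; real tokenizers such as BPE learn their splits from data, but often break contractions in a similar way.

```python
import re

# Toy tokenizer for illustration only: splits off apostrophe-led
# contractions, words, and punctuation as separate tokens.
def toy_tokenize(text):
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

print(toy_tokenize("that's"))  # → ['that', "'s"]
```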

Temperature

Temperature is a parameter that controls the randomness of the model's output. It typically ranges from 0 to 1 (some APIs allow values up to 2). Lower values around 0.2 are generally a good starting point: they keep the output focused, while high values make the model much more likely to produce garbage.
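Under the hood, temperature divides the model's logits before the softmax that turns them into token probabilities. The sketch below shows how a low temperature sharpens the distribution toward the top token while a high one flattens it; the logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically
    # stable softmax. Lower temperature -> sharper distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.2))  # nearly all mass on token 0
print(softmax_with_temperature(logits, 1.0))  # probability spread out more
```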

Finetuning

Finetuning is the process of feeding a task-specific dataset to a pre-trained model and adjusting its parameters through backpropagation. This makes a general model work better for your specific use case.
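The parameter-adjustment step can be illustrated in miniature: one gradient descent update on a single weight for a squared-error loss. Real finetuning does this over billions of parameters with a framework like PyTorch; the numbers here are made up.

```python
weight = 0.5          # a "pre-trained" parameter
x, target = 2.0, 3.0  # one task-specific training example

prediction = weight * x
loss = (prediction - target) ** 2
grad = 2 * (prediction - target) * x  # d(loss)/d(weight)

learning_rate = 0.1
weight -= learning_rate * grad  # the backpropagation update
print(weight)  # moved toward a value that fits the example better
```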

TTFT (Time to first token)

This is the time taken for an LLM to emit its first token after receiving the complete input, which includes the time spent processing the prompt.
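With a streaming response, TTFT is simply the elapsed time until the first token arrives. The generator below fakes a streaming model with a prompt-processing delay; the sleep durations are arbitrary.

```python
import time

def stream_tokens():
    # Stand-in for a streaming LLM response.
    time.sleep(0.05)       # prompt processing ("prefill") before any output
    for token in ["Hello", ",", " world"]:
        yield token
        time.sleep(0.01)   # per-token generation time

start = time.perf_counter()
first_token = next(stream_tokens())
ttft = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s")
```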


Pre-training is the foundation of LLMs. By training on large amounts of text data, these models become good at understanding and generating language. Pre-training is like the phase after you are born when you absorb the structure of a language; only with that grounding do you start speaking.

RAG (Retrieval augmented generation)

RAG is a technique that improves prediction quality by using an external datastore to build a richer prompt. It generally works by chunking the documents you want to use and storing them in a vector database, so your LLM system can retrieve the most relevant chunks to use when answering the question.
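The chunk-then-retrieve flow can be sketched without any external services. Real systems embed each chunk with a model and search a vector database; here chunks are scored by word overlap with the query just to keep the example self-contained, and the document text is invented.

```python
# Minimal RAG retrieval sketch: word overlap stands in for vector
# similarity, and a plain list stands in for the vector database.

def chunk(document, size):
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks, query):
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))

doc = ("The context window limits history. RAG retrieves relevant chunks. "
       "Temperature controls randomness.")
chunks = chunk(doc, 5)
best = retrieve(chunks, "what does RAG retrieve")
# the retrieved chunk is stitched into a richer prompt for the LLM
prompt = f"Answer using this context:\n{best}\n\nQuestion: what does RAG retrieve"
print(best)
```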

RLHF (Reinforcement learning from human feedback)

It’s a technique that uses human feedback to finetune LLMs to follow human instructions: humans rate outputs generated by the model, and those ratings are then used as training data for the LLM.