Understanding how ChatGPT AI works
Here’s the short, plain-English version of how ChatGPT works.
What it is
- A large language model (LLM): a program that predicts the next word (token) given the previous ones.
- Built on the Transformer architecture, which uses self-attention to decide which parts of the input matter for predicting the next token.
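For the curious, here is a minimal, illustrative sketch of the self-attention idea above. It uses NumPy and made-up toy vectors; a real Transformer uses learned query/key/value projections, many attention heads, and vocabularies of tens of thousands of tokens.

```python
import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention over a sequence of token vectors X.

    Each position attends to every position and mixes their vectors, so the
    output for one token can depend on the whole context. Real Transformers
    add learned query/key/value projections and multiple heads.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # how much each token "looks at" the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights per token
    return weights @ X                               # weighted mix of the context vectors

# Toy example: 3 "tokens", each a 4-dimensional vector (random stand-ins for embeddings).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))
contextualized = self_attention(tokens)
print(contextualized.shape)  # (3, 4): one context-aware vector per token
```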
How it learns (training)
- Pretraining (self-supervised): it reads a huge amount of text and learns to model $P(\text{next token} \mid \text{previous tokens})$. No labels are needed, just the text itself. (A toy version of this objective is sketched after this list.)
- Instruction tuning: further trained on examples of questions paired with good answers so it follows instructions.
- Feedback alignment (e.g., RLHF/DPO): humans rank outputs; the model is adjusted to prefer helpful, honest, harmless replies.
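As a rough illustration of the pretraining objective, the snippet below computes the next-token (cross-entropy) loss for a single made-up prediction over a tiny vocabulary. The vocabulary and probabilities are invented for illustration; real pretraining does this over trillions of tokens and adjusts billions of weights to push the loss down.

```python
import math

# Made-up "model output": probabilities for the next token given the context
# "the cat sat on the". A real model produces this distribution with a
# Transformer over a vocabulary of tens of thousands of tokens.
vocab_probs = {"mat": 0.55, "sofa": 0.25, "dog": 0.15, "moon": 0.05}

actual_next_token = "mat"  # what the training text actually says comes next

# Cross-entropy loss for this one prediction: -log P(actual next token | context).
# Training nudges the weights to make this smaller, i.e. to put more probability
# on the token that really appears in the text.
loss = -math.log(vocab_probs[actual_next_token])
print(f"loss = {loss:.3f}")  # smaller when the model puts more probability on "mat"
```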
How it answers (inference)
- Your prompt + chat history are converted to tokens (numbers).
- The model runs attention over the whole context to build context-aware representations of the tokens, then assigns a probability to every possible next token.
- A decoding strategy (e.g., temperature, top-p nucleus sampling) chooses the next token. Repeat → a full answer. (A minimal sampling sketch follows this list.)
- It does not search the web by default; it generates from patterns it learned. (Some setups add tools like browsing or retrieval, but that’s optional.)
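Here is a minimal sketch of that decoding step for a single token, assuming we already have the model's raw scores (logits) over a tiny, made-up vocabulary. It shows how temperature rescales the distribution and how top-p (nucleus) sampling keeps only the most likely tokens before sampling; in a real system this loop repeats until the model emits a stop token.

```python
import numpy as np

def sample_next_token(logits, vocab, temperature=0.8, top_p=0.9, rng=None):
    """Pick one next token from raw model scores using temperature + top-p sampling."""
    if rng is None:
        rng = np.random.default_rng()

    # Temperature: <1 sharpens the distribution (more predictable),
    # >1 flattens it (more varied/creative).
    z = (logits - logits.max()) / temperature
    probs = np.exp(z)
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set of tokens whose probability adds up to top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]

    kept_probs = probs[keep] / probs[keep].sum()
    return vocab[rng.choice(keep, p=kept_probs)]

# Made-up vocabulary and logits for the context "The capital of France is".
vocab = ["Paris", "London", "pizza", "the"]
logits = np.array([4.0, 1.5, 0.2, 0.8])
print(sample_next_token(logits, vocab, temperature=0.7, top_p=0.9,
                        rng=np.random.default_rng(1)))
```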
Why it’s good (and where it fails)
- Strengths: fluent language, pattern recognition, combining info across long contexts.
- Limits: can hallucinate (sound confident but be wrong), lacks real-time awareness, and doesn’t “understand” like humans—it’s pattern prediction.
How to get better answers
- Be specific: give context, constraints, examples, and the desired format.
- Use step-by-step prompts for reasoning.
- Ask it to cite sources or use tools when accuracy matters.
- Adjust temperature: lower for factual/consistent, higher for creative.
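If you are calling a model through an API, temperature is usually just a request parameter. The sketch below assumes the OpenAI Python SDK and a placeholder model name; other providers expose the same knob under a similar name.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A specific prompt (context + constraints + format) with a low temperature tends
# to give consistent answers; raise the temperature for brainstorming or creative writing.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model you have access to
    messages=[
        {"role": "user",
         "content": "Summarize the Transformer architecture in 3 bullet points "
                    "for a non-technical reader."},
    ],
    temperature=0.2,  # lower = more consistent; higher (e.g. 1.0+) = more varied
)
print(response.choices[0].message.content)
```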
That’s the core: predict-the-next-token, scaled up with lots of data, Transformers, and alignment so it behaves helpfully in a chat.