NLP Interview Questions Collection
Last updated on October 3, 2025 12:07 AM
Common NLP Interview Questions
Tokenization
Q1. Explain what a token, a tokenizer, and tokenization are.
So, a token is basically the smallest piece of text the model works with. It could be a whole word, a single character, or sometimes just a subword piece. For example, the phrase “natural language processing” might get split into “natural,” “language,” and “processing.” The model doesn’t really see the raw text—it only sees the tokens mapped into numerical IDs.
And a tokenizer is the tool that does this splitting. Its job is to take raw text and break it down into a sequence of tokens, then turn them into IDs. Different tokenizers split text differently—some do it by whitespace, some by characters, and more advanced ones like BPE or WordPiece learn from data how to create the best subword units.
And tokenization refers to the entire process of splitting text into tokens. It’s not about a single token or the tool itself, but the overall pipeline step where natural language gets turned into a sequence of tokens the model can understand. In other words, it’s the act of converting raw text into model-friendly units.
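To make the pipeline concrete, here is a minimal sketch of text → tokens → IDs, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (any comparable tokenizer would behave similarly):

```python
# A minimal sketch of tokenization, assuming the Hugging Face `transformers`
# library is installed and the `bert-base-uncased` checkpoint can be downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "natural language processing"
tokens = tokenizer.tokenize(text)              # split raw text into tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map each token to its vocabulary ID

print(tokens)  # e.g. ['natural', 'language', 'processing']
print(ids)     # the numerical IDs the model actually sees
```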
Q2. What are the common tokenization methods and algorithms in NLP?
I usually group tokenization into three types. Word-level works well for English because spaces split words. It is simple and fast, but many word forms become different tokens, and out-of-vocabulary (OOV) words can appear. Character-level is common for Chinese or Japanese. It has almost no OOV and works for any script, but sequences get longer and each token carries little meaning. Subword-level is the most common today. It keeps frequent words whole and splits rare words into pieces; the units are learned from data (like BPE, WordPiece, Unigram).
Now the algorithms. Older ones are rule-based or statistical, like the Moses tokenizer. It uses clear rules and regular expressions to split words, punctuation, and contractions, and it can normalize some symbols; it is stable and easy to control. Then came ML methods like CRF (Conditional Random Fields), which treats tokenization as sequence labeling and predicts word boundaries (often with B/M/E/S or BIO tags), sometimes with a BiLSTM for context. In modern large models, subword algorithms dominate: BPE (merge frequent adjacent symbols; greedy longest match), WordPiece (add subwords that maximize corpus likelihood; longest-match encoding), Unigram (start with a big vocab, use EM to learn probs, prune; pick the most probable split), the SentencePiece toolkit (train/encode on raw text), and byte-level BPE (BBPE), which operates on UTF-8 bytes and has no OOV.
Q3. Explain the BPE and BBPE tokenization algorithms, and their differences. Also explain Unicode and UTF-8.
Let’s start with BPE (Byte Pair Encoding). The method is simple. At the beginning, the vocabulary only has single characters, like a, b, c…. In a large corpus, we count which character pairs appear most often, then merge the most frequent pair into a new token. We repeat this until the vocab reaches the target size. When we tokenize, we apply these merge rules and split text into the longest subwords. For example, “unhappiness” might be split into “un” + “happiness”, and “happiness” could further split into “happy” + “ness”.
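Here is a toy sketch of the BPE training loop (count adjacent pairs, merge the most frequent) on a hypothetical word-frequency dictionary; real implementations add caching and many other optimizations:

```python
# Toy BPE training sketch: repeatedly merge the most frequent adjacent symbol pair.
# The corpus here is a hypothetical word-frequency dictionary; words start as
# tuples of single characters.
from collections import Counter

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def count_pairs(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # merge the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

num_merges = 10
for _ in range(num_merges):
    pairs = count_pairs(corpus)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    print("merged:", best)
```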
BBPE (Byte-level BPE) works the same way, but its base unit is not a character, it’s a UTF-8 byte. Text is first encoded into bytes, with each Unicode character taking 1–4 bytes. The initial vocab has 256 byte tokens, because one byte has 8 bits and can represent $2^8 = 256$ values. A byte is also the smallest unit computers use to store and process data. BBPE merges based on byte pairs, not characters. The big advantage is that it has no OOV problem, since any symbol can be represented with bytes. But the drawback is readability: tokens may not match what humans see as words. For example, “apple” could be split into “ap” + “pl” + “e”, or even smaller byte pieces.
The difference between BPE and BBPE is: BPE is character-based, more human-readable, but can fail if a character was not in training. BBPE is byte-based, always safe, but tokens can look strange.
To understand why BBPE uses bytes, let’s look at Unicode and UTF-8. Unicode is a character set. It assigns each symbol a unique code point, for example the letter “A” is U+0041. UTF-8 is an encoding that stores these code points as 1–4 bytes. This system works for English, German, Arabic, and many other scripts, all in one unified standard.
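A quick illustration in plain Python of the difference between a Unicode code point and its UTF-8 bytes:

```python
# Unicode assigns each character a code point; UTF-8 stores it as 1-4 bytes.
for ch in ["A", "é", "中", "😀"]:
    code_point = ord(ch)             # Unicode code point as an integer
    utf8_bytes = ch.encode("utf-8")  # how the character is stored on disk / seen by BBPE
    print(f"{ch!r}: U+{code_point:04X}, UTF-8 bytes = {list(utf8_bytes)}")
# 'A' needs 1 byte, 'é' needs 2, '中' needs 3, '😀' needs 4; this is why a
# byte-level vocabulary of size 256 can cover every script with no OOV.
```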
This is important because smaller NLP models often used different tokenization per language. For example, English by spaces, Chinese by characters. But large models need one unified way. Unicode gives a universal set of characters, and UTF-8 makes sure they can all be stored in bytes. BBPE takes advantage of this, using bytes so one model can handle almost any language consistently.
Q4. What is the WordPiece algorithm, the Unigram algorithm, and the SentencePiece framework?
Let me start with WordPiece. It was first used in Google’s BERT. The idea is similar to BPE, but it focuses more on probability. At the beginning, the vocab only has all the single characters. Then the model looks through the corpus and checks which pairs of subwords appear together most often and give the biggest improvement in likelihood. In other words, it always merges the pair that has the strongest connection. This process repeats again and again until the vocab grows to the target size. When tokenizing, WordPiece uses the longest match strategy. That means at each position, it tries to match the longest possible subword from the vocab. If there are many choices, it picks the longest one. For example, the word playing can be split into play and ##ing (“hash hash ing”). The two hash marks show that this token is not a new word but a continuation.
WordPiece and BPE are quite similar, since both start with characters and keep merging frequent pieces. The difference is that BPE only looks at frequency, while WordPiece uses a probability model and chooses the pair that gives the biggest likelihood gain. So WordPiece is like a more statistical version of BPE.
Unigram works in a different way. Instead of building from small to large, it starts with a very large candidate vocab that includes many possible subwords. Then it uses a probabilistic model to estimate how likely each subword is, and step by step it removes those with the least impact on the total likelihood. This process is trained with EM, which keeps updating the probabilities and pruning the vocab until it reaches the target size. When tokenizing, Unigram does not use greedy matching. Instead, it searches for the segmentation with the highest probability overall. A common method here is the Viterbi algorithm. Viterbi is a dynamic programming method that checks all possible paths and finds the one with the best total probability. This makes Unigram more flexible than WordPiece.
Finally, SentencePiece. This is not a new algorithm, but a tokenizer framework. Its key feature is that it works directly on raw text, without needing pre-tokenization. It even treats spaces as normal symbols, and it uses the special marker “▁” (“underline”) to show the start of a word. SentencePiece can train with either BPE or Unigram, so it is more like a unified platform. It also handles normalization and special tokens. Because it does not rely on language-specific rules, it works very well across many languages, and that is why modern large models, like T5 or mBART, often use SentencePiece.
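As a small usage sketch, assuming the sentencepiece Python package and a sufficiently large local corpus.txt (the parameters are illustrative):

```python
# Train a SentencePiece model directly on raw text and use it for tokenization.
# Assumes `pip install sentencepiece` and a plain-text file corpus.txt.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw text, no pre-tokenization needed
    model_prefix="demo_sp",    # writes demo_sp.model and demo_sp.vocab
    vocab_size=8000,
    model_type="unigram",      # could also be "bpe"
)

sp = spm.SentencePieceProcessor(model_file="demo_sp.model")
pieces = sp.encode("I like natural language processing.", out_type=str)
print(pieces)             # subword pieces; "▁" marks the start of a word
print(sp.decode(pieces))  # lossless round-trip back to the original text
```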
Token Embedding
Q5. What is a token embedding, an embedding matrix, the lookup mechanism, and how do they relate to tokenization?
A token embedding is a way to turn text into numbers. Since computers can’t understand words directly, each token is mapped into a token vector. For example, the vector for apple may be close to banana, but far from car. This helps the model learn meaning from distance in vector space.
The embedding matrix is basically a big table. The number of rows equals the vocabulary size, and the number of columns is the vector dimension. If we have 50,000 tokens and each vector has 300 dimensions, the embedding matrix is 50,000 by 300. At the start, this matrix can be initialized randomly, or it can use pre-trained embeddings like Word2Vec, GloVe, or FastText. During training, the values are updated through backpropagation so that vectors capture semantic information useful for the task.
The lookup mechanism is how we connect tokenization and embedding. After tokenization, each token is mapped to an ID. The embedding layer then performs a lookup: it takes the ID and retrieves the corresponding row from the embedding matrix. That row is the embedding vector for the token. This process is called embedding lookup, and it’s basically a fast table lookup.
So in short, tokenization splits text and assigns IDs, and embedding converts those IDs into vectors. Together, they turn raw text into numerical input that neural networks can process.
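A minimal sketch of the lookup step, assuming PyTorch; the vocabulary size and dimension below are just the numbers from the example above:

```python
# Embedding lookup: token IDs index rows of the embedding matrix.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 300
embedding = nn.Embedding(vocab_size, embed_dim)  # the 50,000 x 300 embedding matrix

token_ids = torch.tensor([[12, 845, 3021]])      # tokenizer output: shape (batch, seq_len)
vectors = embedding(token_ids)                   # row lookup: shape (1, 3, 300)
print(vectors.shape)
```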
Q6. What are common token embedding methods? What are static and dynamic embeddings?
Let me start with token embedding. It’s a lookup table that maps each token ID to a vector. In most models, this is the first step: turn an ID into a vector so the model can work in vector space.
We usually talk about two styles: static and dynamic (contextual).
For static embeddings, classic tools are Word2Vec, GloVe, and FastText. Word2Vec learns from which words show up together—either predict the neighbors or predict the center word. GloVe learns from a big co-occurrence matrix and breaks it down to get vectors. FastText splits a word into subword n-grams and sums them, so prefixes and suffixes help; this also reduces OOV (out-of-vocabulary). After training, each token has one fixed vector everywhere. It’s fast and simple—just do a lookup. The downside is sense mixing: bank in “river bank” and “credit bank” gets the same vector.
For dynamic (contextual) embeddings, the vector depends on the sentence. You still do the first lookup, then feed those vectors into a language model—ELMo (bi-LSTM), GPT (one-way Transformer), or BERT (two-way Transformer with masked language modeling). The hidden states you get are the context-aware embeddings. Same token, different sentence, different vector. This helps tell senses apart, but it costs more compute. And because modern models use subword tokenization (BPE, WordPiece, Unigram), it’s more precise to call these token embeddings, not “word” embeddings.
Picking one: if compute is tight, data is small, or you just need quick features, go static. If you want stronger meaning and clear sense separation—or you plan to fine-tune a downstream model—go dynamic (BERT/GPT-style).
Q7. What is Word2Vec? What are the CBOW and Skip-gram models?
Let me start with one-hot. It uses a very long vector to represent a word: one position is 1, all others are 0. The problems are clear. First, it’s huge and sparse, so storage and compute are heavy. Second, no similarity: different words are almost orthogonal, so cat is not closer to dog than to table. Third, no generalization: a new word has no vector at all.
word2vec fixes this by learning dense, low-dimensional vectors from data. We slide a context window over a large corpus. Words that co-occur often are pulled closer; words that rarely co-occur are pushed farther. After training we get an embedding matrix. Each row is a word (or token). At inference we just look up the row to get the vector—small, fast, and it captures meaning.
There are two training set-ups. CBOW means “use neighbors to predict the center word.” In There is an apple on the table, with window size 2, the context for apple is is, an, on, the. We look up these context embeddings, average or sum them, and predict the target word in the middle. CBOW is fast and stable for frequent words. In practice we add subsampling to drop very frequent words, and we speed up the output with negative sampling or hierarchical softmax. Negative sampling updates scores for a few true contexts and a few sampled “fake” contexts, not the whole vocab. Hierarchical softmax predicts a path in a binary tree instead of a full softmax over all words.
Skip-gram flips the direction: “use the center word to predict the neighbors.” With apple as center, it predicts is, an, on, the one by one. This makes many pairs per center word, so it learns rare words better, but it takes more steps. It uses the same tricks—negative sampling or hierarchical softmax, plus subsampling for very frequent words. After training we usually take the input-side embeddings as the final word vectors; the output-side vectors can work too, but people use the input side more often.
Which to choose? If you want speed or the corpus is small, go CBOW. If you have lots of data and care about rare words, go Skip-gram. Either way, the core idea is the same: learn vectors from co-occurrence so that meaning lives in a compact, useful space.
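As a small training sketch, assuming gensim 4.x and a toy tokenized corpus (real training uses millions of sentences):

```python
# Train CBOW and Skip-gram on a toy corpus with gensim (assumed installed).
from gensim.models import Word2Vec

sentences = [
    ["there", "is", "an", "apple", "on", "the", "table"],
    ["i", "eat", "an", "apple", "every", "day"],
    ["the", "cat", "sits", "on", "the", "table"],
]

# sg=0 -> CBOW (predict center from context); sg=1 -> Skip-gram (predict context from center)
cbow = Word2Vec(sentences, vector_size=50, window=2, sg=0, negative=5, min_count=1, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1, epochs=50)

print(cbow.wv["apple"][:5])               # the learned input-side vector for "apple"
print(skipgram.wv.most_similar("apple"))  # nearest neighbors in the toy vector space
```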
Q8. What are the GloVe and FastText methods?
Let me start with GloVe (Global Vectors). The idea is to learn static word vectors from global co-occurrence counts. In practice, we scan the entire corpus with a fixed window. Any time two words fall in the same window, we add one to their pair count. So apple–pie gets a high count if they often appear near each other, while apple–engine stays low. Then we train vectors so their match scores roughly reflect those counts, with a weighting scheme to reduce noise from very frequent or very rare events. In the end, each word has one vector everywhere. It makes strong use of global statistics, trains stably, but it’s not contextual, so it can’t split senses across sentences.
Now FastText. Before that, a quick note on n-grams: an n-gram is a contiguous chunk of n units. The units can be characters or words. For characters, play has bigrams pl, la, ay and trigrams pla, lay. For words, natural language processing has the bigram natural language and the trigram natural language processing.
With that in mind, FastText keeps the word2vec CBOW/Skip-gram training style but changes the representation: a word is the sum of its character n-grams. For playing, we add boundary markers and take 3–6 character pieces (like <pla, lay, ayi, ing>), then sum their vectors. This shares prefixes and suffixes across related forms, so rare words learn better, and OOV words can still get a vector by composing their n-grams. Training still uses a context window plus common speed-ups like negative sampling or hierarchical softmax. In short: GloVe fits vectors from global co-occurrence, while FastText builds on word2vec but represents words with subword n-grams, which helps with morphology and long tails.
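To make the subword idea concrete, here is a tiny plain-Python sketch that extracts character n-grams with boundary markers, roughly the way FastText represents a word (real FastText also hashes the n-grams and adds the whole-word vector):

```python
# Extract character n-grams with boundary markers, as FastText does internally.
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"          # boundary markers so prefixes/suffixes are distinct
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("playing")[:8])  # ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>', '<pla']
# A word's vector is the sum of its n-gram vectors, so unseen words still get a vector.
```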
Position Embedding
Q9. What is Position Embedding in NLP, and why is it important?
Here’s the core idea: position embeddings give the model a sense of order. Self-attention by itself is order-blind. Without positions, dog bites man can look like man bites dog. Position info tags each token with “where am I” and “how far am I from others.”
We need this because language depends on order, and the meaning changes when order changes. With position signals, attention scores don’t just reflect which words are related; they also factor in who comes first and how many steps apart the words are. That helps the model capture syntax, coreference, and long-range links in long sentences.
Common approaches include: absolute positions (give each index a vector and add it to the token), trigonometric/sine–cosine encodings (fixed patterns), and relative positions (make attention care about distance rather than the absolute index). There are also variants like rotary position embeddings (RoPE) and ALiBi that inject distance information in different ways.
Q10. Explain absolute position encoding, trigonometric encoding, and relative position encoding
Absolute position encoding means we learn one vector for index 0, one for index 1, and so on. At input time, we add the token vector and its position vector, then feed that into the Transformer. The model knows “which index I am at” from the start. It’s simple and stable. The trade-off is a fixed max length during training; going longer needs interpolation or an extended table. It’s good when sequences are short or have a known length.
Trigonometric (sine–cosine) encoding uses fixed sine and cosine waves to turn an index into a vector; no training needed. We also add it to the token vector. Different frequencies act like rulers: some capture coarse order, others capture fine steps. It extrapolates to longer inputs, which is nice, but it’s less flexible than learned tables, so the model must adapt to these fixed patterns.
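A short numpy sketch of the sinusoidal encoding (dimensions kept small just for illustration):

```python
# Sinusoidal position encoding: even dims use sin, odd dims use cos,
# with wavelengths forming a geometric progression.
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even indices
    pe[:, 1::2] = np.cos(angles)                     # odd indices
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16); this matrix is added to the token embeddings
```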
Relative position encoding doesn’t say “I am index k.” Instead, it feeds distance between tokens into attention. When computing attention scores, the model becomes sensitive to “how many steps apart” and “who comes first.” Example: in the cat sat on the mat, cat should attend more to sat because they’re close; if we move the words but keep distances similar, attention can behave similarly. Upsides: it stays stable when words move around, works better on long texts, and still runs on inputs longer than those seen in training. The trade-off: it’s a bit harder to implement, because you need to adjust how attention scores are computed (or tweak the vectors).
Q11. What is Rotary Position Embedding (RoPE), and where is it used?
Rotary Position Embedding (RoPE) works like this: you first compute the usual Q and K. Then you rotate each of them by a small angle that depends on the token’s index. Early tokens rotate less; later tokens rotate more. When Q and K take a dot product, the score now naturally encodes relative shifts—how many steps apart, and who comes first. A simple picture is two arrows on a plane: position changes rotate the arrows; the angle between them changes the attention score. There’s no big learned position table; RoPE injects distance right into attention, which tends to be stable on long contexts and extrapolates beyond training length (many implementations also apply RoPE scaling to push the context further).
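Here is a minimal numpy sketch of the rotation idea, using interleaved pairs of dimensions; real implementations batch this over heads and precompute the cos/sin tables:

```python
# RoPE sketch: rotate each (even, odd) pair of Q/K dimensions by an angle
# that grows with the token position; the Q.K dot product then depends on
# relative position.
import numpy as np

def rope(x, base=10000.0):
    seq_len, dim = x.shape                        # x: (seq_len, head_dim), head_dim even
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per dimension pair
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # split into interleaved pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(6, 8)   # 6 positions, head_dim 8
k = np.random.randn(6, 8)
q_rot, k_rot = rope(q), rope(k)
scores = q_rot @ k_rot.T    # attention logits now encode relative offsets
print(scores.shape)
```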
Where is it used? It started with RoFormer for Transformers. Today it’s in many open-source LLMs, for example the LLaMA family, plus Google Gemma 3, Mistral, Qwen, and more. In short, most modern LLMs use RoPE or a close variant, because it encodes relative position directly inside attention with a clean design.
Q12. Besides token and position embeddings, what other embeddings are used in NLP?
Besides token and position embeddings, these three are the ones I reach for most.
Segment embedding: when the input has two parts (e.g., NLI pairs or “question + passage”), add one vector to tokens in part A and another to part B. The model can separate the two spans instead of mixing them.
Task embedding: in multitask or instruction settings, add a small learned vector (or a task tag) so the model knows which task it is solving. It acts like a mode switch: share the backbone but steer the behavior.
Speaker embedding: in dialogue, tag who is speaking (User/Assistant or speaker A/B) and add that to the tokens from that speaker. It helps with turn tracking and coreference, and reduces confusion between speakers.
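As a sketch, a BERT-style model simply sums these embeddings per token before the first layer (PyTorch, with illustrative sizes and IDs):

```python
# BERT-style input: token embedding + position embedding + segment embedding.
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, dim = 30_000, 512, 2, 768
tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)
seg_emb = nn.Embedding(num_segments, dim)

token_ids = torch.tensor([[101, 2054, 2003, 102, 2023, 102]])  # illustrative "[CLS] ... [SEP] ... [SEP]"
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1]])               # part A vs part B
positions = torch.arange(token_ids.size(1)).unsqueeze(0)       # 0, 1, 2, ...

x = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segment_ids)
print(x.shape)  # (1, 6, 768): what the first Transformer layer actually receives
```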
Attention
Q13. What is the attention mechanism, and what is its role?
Attention lets the model focus. For a token, the model asks: which parts of the input matter most right now? It scores all positions, looks more at the important ones, and less at the rest.
Here’s the simple loop: make a Query (Q) for the current token, compare it with all Keys (K) to get scores, turn scores into weights with softmax, then take a weighted sum of Values (V). That sum is the context vector—what the model actually “looks at.” Self-attention looks within the same sequence; cross-attention lets one sequence look at another (e.g., decoder looking at the encoder in translation).
What does it give us? Long-range dependencies, dynamic filtering of noise, and with multi-head it can track different relations at the same time. It also enables fast parallel training in Transformers, and offers some interpretability via the weights. Two notes: attention itself has no word order, so we add position embeddings; and its cost grows roughly with the square of sequence length, which is why long-context tricks are common.
Q14. What is the encoder–decoder framework? Which LLMs use encoder–decoder, encoder-only, or decoder-only?
The encoder–decoder setup splits the job in two. The encoder reads the whole input and builds contextual representations. The decoder then generates the output token by token: it uses masked self-attention over what it has written so far, and cross-attention to the encoder outputs. This is great for turning one sequence into another—machine translation, summarization, and generative QA.
An encoder-only model only understands. It encodes the sentence bidirectionally and a small head does classification, extraction, retrieval, or scoring. Well-known examples: BERT, RoBERTa, DeBERTa, ALBERT, ELECTRA, plus many embedding/retrieval models (Sentence-BERT, E5).
A decoder-only model only generates. Given a prompt, it writes left-to-right with masked self-attention—no separate encoder—so inference is simple and it shines for chat, writing, and code. Common families: GPT-2/3, LLaMA, Mistral, and Google’s PaLM and Gemma; most chat LLMs follow this style.
For encoder–decoder models, the two parts work together via cross-attention to map input into output. Typical models include Google Translate’s Transformer-based NMT system, OpenAI Whisper (speech encoder + text decoder), and the original Transformer, widely used in translation and summarization.
A practical way to choose is this: if you only need understanding, pick an encoder-only model; if you need free-form generation like chat or code, go decoder-only; if you’re mapping one sequence to another (e.g., translation, summarization), an encoder–decoder model is usually best.
Q15. Explain self-attention, multi-head attention (MHA), and cross-attention.
What is self-attention?
It means “a sequence looks at itself.” From the previous layer, we project into three parts: Q (what I’m looking for), K (what I can offer), and V (what I return if picked). Each token matches its Q against all Ks; high match → read more of that token’s V. In the encoder it’s bidirectional; in the decoder it’s causal-masked, so you only look left.
The famous formula (Scaled Dot-Product Attention):
$$
\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$
$d_k$ is the key dimension. The scale $\sqrt{d_k}$ keeps scores stable; in the decoder we also add a causal mask so future tokens are hidden.
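A direct translation of the formula into code (numpy, single head, with an optional additive mask):

```python
# Scaled dot-product attention with an optional additive mask.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_q, seq_k) similarity logits
    if mask is not None:
        scores = scores + mask               # masked positions get -inf
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ V                       # weighted sum of values

seq, d = 4, 8
Q, K, V = (np.random.randn(seq, d) for _ in range(3))
causal = np.triu(np.full((seq, seq), -np.inf), k=1)  # hide future positions
print(attention(Q, K, V, mask=causal).shape)         # (4, 8)
```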
What about Multi-Head Attention (MHA)?
We don’t run attention once—we run many heads in parallel. Each head has its own $W_Q^i, W_K^i, W_V^i$ and can focus on different patterns (short vs. long range, syntax, coreference, etc.). Then we concat heads and apply a final linear map:
$$
\begin{aligned}
Q_i &= QW_i^Q,\quad K_i = KW_i^K,\quad V_i = VW_i^V,\quad i=1,2,\dots,h \\
\text{head}_i &= \text{Attention}(Q_i, K_i, V_i) \\
\text{MultiHead}(Q,K,V) &= \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W^O
\end{aligned}
$$
Think of it as multiple views looking at the same sentence.
And cross-attention?
Self-attention is within one sequence. Cross-attention lets one sequence look at another. In encoder–decoder models, Q comes from the decoder, while K and V come from the encoder outputs. This lets the decoder read the whole input while generating the output—perfect for translation and summarization.
Q16. What are attention improvements like MHA → MQA → GQA → MLA?
Let me walk through the path from MHA to MQA to GQA to MLA with the serving pain point in mind: KV cache size and bandwidth.
With MHA, every head has its own $W_Q, W_K, W_V$, so each head produces its own K and V. Great expressiveness, but during autoregressive decoding you must store K/V for all past tokens per head, so memory and bandwidth grow with heads × sequence length.
MQA is the first big savings: all heads share one K and one V; only Q stays per-head. Now each new token stores just one set of K/V, cutting the KV cache from “H copies” to “1 copy.” You save a lot of bandwidth; the trade-off is a small drop in modeling flexibility.
GQA is the middle ground: share K/V within groups of heads. For example, 32 heads split into 8 groups means you store 8 K/V copies. Quality stays closer to MHA, while memory is far lower than MHA—commonly used in modern LLM serving.
MLA (Multi-Head Latent Attention from DeepSeek) goes one step further. Think of it as: compress K/V into a small shared latent, cache only that, then expand per head on the fly. Concretely, each token’s representation is projected into a shared low-dim latent K/V for caching; at attention time, small head-specific projections recover head-wise K/V from the latent. You get KV cache near MQA size (or smaller), while keeping per-head diversity closer to MHA/GQA. The cost is a tiny bit of extra compute for the “expand” step, which is usually more than paid back by the bandwidth savings. Quick feel with numbers: 32 heads × 128 dim → MHA caches $32×128×2$ per token; MQA caches $1×128×2$; GQA (8 groups) caches $8×128×2$; MLA with a 96-dim latent caches $1×96×2$, yet still lets each head use its own expansion.
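The same accounting in a few lines of Python, following the simplified per-token, per-layer numbers above:

```python
# Per-token KV cache entries (one layer) for MHA / GQA / MQA / MLA,
# using the illustrative numbers from the example above.
heads, head_dim = 32, 128
latent_dim = 96           # illustrative MLA latent size

mha = heads * head_dim * 2  # every head stores its own K and V
gqa = 8 * head_dim * 2      # 8 groups share K/V
mqa = 1 * head_dim * 2      # one shared K and V for all heads
mla = 1 * latent_dim * 2    # one small shared latent, expanded per head at compute time

for name, n in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa), ("MLA", mla)]:
    print(f"{name}: {n} cached values per token per layer")
```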
With MHA, each head keeps its own K and V. With MQA, all heads share a single K and a single V. GQA sits in the middle: you group the heads, and each group shares one set of K and V. MLA goes a step further: you cache one small shared latent K/V and, when you compute attention, each head expands that latent into its own K and V. None of these change the core attention formula—they only change how K and V are built and cached, so you cut memory and bandwidth while trying to keep quality high.
As a rule of thumb: if quality is your top priority and you have plenty of memory, choose MHA or GQA. If memory is tight, choose MQA or GQA. And if you want the tiny cache of MQA but still want richer per-head behavior, go with MLA, as long as your stack supports it.
FFN (Feed-Forward Network)
Q17. Explain what a Feed-Forward Network (FFN) is in the Transformer, and what role it plays.
So, FFN (Feed-Forward Network) is a small but key part inside the Transformer. It is basically a two-layer fully connected network, with a non-linear activation in between, like ReLU or GELU. The process is simple: the input is expanded to a higher dimension, then passed through the activation, and finally reduced back to the original size.
Its main roles are:
- Non-linear transformation: it captures complex patterns that a simple linear layer cannot.
- Increase model capacity: by expanding the hidden size, it helps the model learn more complex features.
- Complement attention: attention looks at relations across positions, while FFN focuses on each position itself.
So in short, FFN gives the Transformer more “brain power” — it makes the model not only good at finding relationships between tokens, but also at understanding each token’s own representation.
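A minimal position-wise FFN sketch in PyTorch; the hidden size is typically about 4× the model dimension:

```python
# Position-wise feed-forward network: expand -> non-linearity -> project back.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to a higher dimension
            nn.GELU(),                  # non-linear activation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back to the model size
        )

    def forward(self, x):               # x: (batch, seq_len, d_model)
        return self.net(x)              # applied to every position independently

x = torch.randn(2, 10, 512)
print(FeedForward()(x).shape)           # (2, 10, 512)
```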
Q18. What is the role of activation functions? Introduce common ones such as Sigmoid, Softmax, Tanh, and ReLU.
In a neural network, the activation function adds non-linearity. Without it, the whole network is just linear, so it cannot learn complex patterns.
Here, a linear transformation means something like:
$$
y = W \cdot x + b
$$
This is basically multiplying the input by a weight matrix and adding a bias. The key point is: input and output keep a straight-line relation. It can do things like scaling, rotation, and shifting, but it cannot make curves or capture complex shapes.
Here are the main ones:
Sigmoid:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
It squashes values into 0 to 1, like an S curve. Good for binary classification (as probability). But it has vanishing gradient issues in deep nets.
Tanh:
$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
Output is from -1 to 1, centered around 0. Usually better than Sigmoid, but still has vanishing gradient problems.
ReLU:
$$
f(x) = \max(0, x)
$$
If input > 0, keep it; if ≤ 0, set to 0. It’s simple, fast, and reduces gradient vanishing. But it can cause “dead neurons.”
Softmax:
$$
\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$
It turns a vector into probabilities (0 to 1), with sum = 1. Very common in multi-class classification outputs.
So basically: activation functions add non-linearity. ReLU is the most used in hidden layers, while Sigmoid and Softmax are mostly used in the output.
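The four functions in a few lines of numpy, just to make the behavior concrete:

```python
# Basic activation functions on a small example vector.
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

sigmoid = 1 / (1 + np.exp(-x))   # squashes into (0, 1)
tanh = np.tanh(x)                # squashes into (-1, 1), zero-centered
relu = np.maximum(0, x)          # keeps positives, zeros out negatives

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()           # probabilities that sum to 1

print(sigmoid, tanh, relu, softmax(x), sep="\n")
```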
Q19. Introduce advanced activation functions such as Leaky ReLU, ELU, GELU, Swish, GLU, and SwiGLU, and explain why they are used.
So, many new activation functions are built on top of ReLU to fix its issues, like dead neurons. Let me go one by one:
Leaky ReLU:
$$
f(x) = \begin{cases}
x, & x > 0 \\
\alpha x, & x \leq 0
\end{cases}
$$
For negative inputs, instead of 0, it keeps a small slope ($\alpha$, often 0.01). This helps avoid dead neurons.
ELU (Exponential Linear Unit):
$$
f(x) = \begin{cases}
x, & x > 0 \\
\alpha (e^x - 1), & x \leq 0
\end{cases}
$$
Negative part is exponential, making the curve smoother and mean closer to 0. That helps training stability.
GELU (Gaussian Error Linear Unit):
$$
f(x) = x \cdot \Phi(x)
$$
where $\Phi(x)$ is the Gaussian CDF. It can be seen as input times a probability factor. It’s smoother than ReLU, and is widely used in Transformers now.
Swish:
$$
f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
$$
It multiplies input by its Sigmoid. Smooth curve, often better than ReLU in practice.
GLU (Gated Linear Unit):
$$
\text{GLU}(a, b) = a \cdot \sigma(b)
$$
The input splits into two parts, $a$ and $b$. $b$ acts as a gate through Sigmoid to control $a$. This lets the model control information flow.
SwiGLU (Swish-Gated Linear Unit):
$$
\text{SwiGLU}(a, b) = a \cdot \text{Swish}(b) = a \cdot b \cdot \sigma(b)
$$
It’s a variant of GLU where the Sigmoid gate is replaced by Swish. Very common in large models such as LLaMA and PaLM.
So in short: these functions make things smoother, more stable, or more flexible. Today, GELU and SwiGLU are the most popular in big models.
Add & Norm
Q20. What is a residual connection in neural networks? What are its main advantages?
Residual connection is basically a “skip connection”, first introduced in ResNet. The formula is:
$$
y = F(x) + x
$$
where $F(x)$ is the transformation by some layers, and $x$ is the original input. So the network doesn’t just learn $F(x)$, it learns $F(x)+x$.
Advantages are:
- Reduce vanishing gradients: gradients can flow directly through the skip path.
- Enable very deep networks: without residuals, deeper nets often perform worse. With residuals, we can train 50, 100, or even more layers.
- Preserve information: the input is passed forward directly, so the model learns the “residual difference” instead of starting from scratch.
- Faster convergence: learning the residual is easier, so training becomes more stable.
In short, residual connections are like a fast lane inside deep networks, making training easier and more effective.
Q21. Explain why residual connections help alleviate the vanishing gradient problem and why they speed up convergence.
1. Why does it reduce vanishing gradients?
In a deep network, gradients usually shrink as they go backward through many layers. In a residual block:
$$
y = F(x) + x
$$
The derivative is:
$$
\frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + 1
$$
That “+1” is the key. Even if $\frac{\partial F(x)}{\partial x}$ is very small, the gradient can still flow through the skip path. So gradients do not vanish completely.
2. Why does it speed up convergence?
Normally, the network must learn a complex mapping $H(x)$. But with residuals, it only needs to learn:
$$
F(x) = H(x) - x
$$
This means it learns the “residual difference,” which is usually smaller and easier to optimize. As a result, training is faster and more stable.
Q22. What is normalization? What are the common normalization techniques used in NLP models?
Normalization in deep learning means adjusting the input distribution to be more stable. In simple words, it keeps the values balanced so training is faster, gradients are more stable, and we avoid vanishing or exploding.
The key idea is to solve internal covariate shift — when updates in one layer change the input distribution of the next. Normalization acts like a “calibrator.”
In NLP, the common normalization methods are:
- Batch Normalization (BN): Normalizes across the batch dimension, same channel over a batch. Used in early CNN/RNN models, but not good when batch size is small.
- Layer Normalization (LN): Normalizes across hidden dimensions for each sample independently. This is the most common in NLP, especially in Transformers, since it doesn’t depend on batch size.
- Instance Normalization (IN): Normalizes per sample over spatial dimensions. More common in vision; in NLP it is rarely used, sometimes for style transfer tasks.
- Group Normalization (GN): Splits channels into groups and normalizes inside each group. Less common in NLP, but sometimes used in special architectures.
Summary: In NLP, LayerNorm is the main one, almost always used in Transformers. BatchNorm was used earlier but is now mostly replaced.
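A from-scratch LayerNorm sketch in numpy, normalizing each token over its hidden dimension:

```python
# LayerNorm: per-token normalization over the hidden dimension,
# followed by a learnable scale (gamma) and shift (beta).
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)   # statistics per token, not per batch
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch, seq_len, hidden = 2, 4, 8
x = np.random.randn(batch, seq_len, hidden)
gamma, beta = np.ones(hidden), np.zeros(hidden)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1).round(6))  # ~0 for every token
print(y.std(axis=-1).round(2))   # ~1 for every token
```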
Transformer
Q23. Introduce the full architecture of a Transformer, including tokenization, embeddings, positional encoding, attention, feed-forward networks, residual connections, and normalization.
Transformer is an encoder–decoder architecture. Encoder understands the source, Decoder generates the target, and attention ties them together.
- Encoder
- Tokens go through tokenization, embeddings, and positional encodings, then enter stacked encoder blocks.
- Each block has MHSA → FFN, each sublayer wrapped with Residual + LayerNorm (often Pre-LN).
- The encoder outputs contextual vectors (memory) for all source tokens.
- Decoder
Each decoder block has two attentions:
Masked self-attention on target side.
- Q/K/V all come from the decoder’s current hidden states.
- A causal mask prevents looking at future positions.
Cross-attention to the encoder.
- Q comes from the decoder, K and V come from the encoder outputs (memory).
- Apply a source padding mask on K/V to ignore <PAD> from the source.
Then an FFN, each sublayer with Residual + LayerNorm.
On top: linear + softmax to predict the next token.
In plain words: the decoder first “looks at what it has written” via self-attention, then “looks up the source notes” via cross-attention. Self-attn Q/K/V are from the decoder; cross-attn takes Q from the decoder and K/V from the encoder.
- Masking
- A mask is usually a matrix with 1/0 entries: 1 = allowed to compute, 0 = masked out. In practice we add -inf to the masked logits before softmax.
- Padding mask is used in encoder self-attn and in decoder cross-attn to ignore <PAD> tokens.
- Sequence/Causal mask is used only in decoder self-attn to forbid looking ahead.
- Encoder–Decoder relation and differences
- Relation: the decoder’s cross-attention uses encoder outputs as K/V to align source information while generating.
- Differences: the encoder has only self-attention with full visibility; the decoder has masked self-attention plus cross-attention to the encoder.
Components (input → inside a block)
A) Tokenization → Embedding → Positional Encoding
- Tokenization: The raw text is split into tokens: words, subwords, or characters, typically subwords (BPE/WordPiece). This makes variable-length text easier to process.
- Token Embedding: Each token is mapped into a vector. The embedding captures its meaning. The dimension usually matches the model’s hidden size.
- Positional encoding: Self-attention itself has no sense of order. So we add positional encodings — either fixed sinusoidal ones or learned vectors — so the model knows token positions.
B) Attention (single head)
$$
\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}+\text{mask}\right)V
$$
- Multi-head: Each head computes attention independently to capture different patterns; the head outputs are then concatenated along the last dimension and passed through a final linear layer to fuse information.
- Decoder self-attn: In the decoder’s self-attention, $Q$, $K$, and $V$ all come from the decoder’s current hidden states (the target sequence side). To keep autoregressive behavior, a lower-triangular causal mask is applied so position $i$ can only attend to itself and earlier positions, never to future tokens.
- Cross-attn: In cross-attention, the queries $Q$ come from the decoder (the output of the previous sublayer), while the keys $K$ and values $V$ come from the encoder outputs (the memory). A source padding mask is applied on $K/V$ so the model focuses only on real tokens and ignores <PAD> positions, preventing invalid information from affecting alignment.
C) FFN
$$
\text{FFN}(x)=W_2\,\sigma(W_1 x+b_1)+b_2
$$
- After attention, each token passes through a small two-layer fully connected network with activation (like GELU). This adds non-linearity and more capacity.
D) Residual & LayerNorm
- Residual: $y=x+\text{sublayer}(x)$. Both the attention and the FFN outputs are wrapped with skip connections. This helps information and gradients flow smoothly, reducing vanishing gradients.
- LayerNorm: After each sub-layer (Attention or FFN), LayerNorm is applied to stabilize training and speed up convergence. In NLP, LayerNorm is the default choice.
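Putting the pieces together, here is a minimal Pre-LN encoder block sketch in PyTorch (self-attention and FFN, each wrapped with residual + LayerNorm); a real model stacks many of these and adds dropout, masking options, and so on:

```python
# Minimal Pre-LN Transformer encoder block: LN -> MHSA -> residual, LN -> FFN -> residual.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, key_padding_mask=None):  # x: (batch, seq_len, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + attn_out                          # residual around attention
        x = x + self.ffn(self.ln2(x))             # residual around the FFN
        return x

x = torch.randn(2, 10, 512)                       # embeddings + positional encodings
print(EncoderBlock()(x).shape)                    # (2, 10, 512)
```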
Q24. What is the masking mechanism in Transformers? What purpose does it serve? Introduce Padding Mask and Sequence Mask.
Masking in Transformers is basically a way to “hide” some positions in the sequence. It decides what the attention can see and what it must ignore. This makes training efficient and also keeps the logic correct.
In practice, a mask is usually a matrix with elements of 1 or 0, where 1 means the position can be used in computation, and 0 means the position is masked and cannot be used.
There are two common types:
Padding Mask
- In NLP, input sentences have different lengths. We pad shorter ones with <PAD> tokens.
- The problem is <PAD> is not real content. If the model attends to it, results will be wrong.
- Padding mask hides those padded positions so attention only looks at real words.
Sequence Mask
- This is used in the Decoder. When generating text, each step should only see the past, not the future.
- Sequence mask makes sure position $i$ can only attend to positions ≤ $i$.
- For example, in translation, when predicting the next word, the model cannot peek at future words.
Summary:
The purpose of masking is to block invalid or illegal information paths. The main role of the padding mask is to handle different sequence lengths by masking <PAD> tokens, and the main role of the sequence mask is to ensure the decoder can only look at past tokens and not cheat by looking at future ones.
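A short sketch of how the two masks are typically built, assuming PyTorch; here True means “keep,” though some framework APIs use the opposite convention:

```python
# Building a padding mask and a causal (sequence) mask.
import torch

pad_id = 0
token_ids = torch.tensor([[5, 8, 13, 0, 0],   # batch of 2 sentences, padded to length 5
                          [7, 2,  9, 4, 0]])

padding_mask = token_ids != pad_id            # True = real token, False = <PAD>
print(padding_mask)
# (Note: PyTorch's key_padding_mask uses the opposite convention, True = ignore.)

seq_len = token_ids.size(1)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)                            # position i may attend only to j <= i

# Before softmax, blocked positions get -inf so their attention weight becomes 0:
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))
```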
Q25. What are decoding strategies in Transformers? Introduce four common decoding strategies.
At a high level, a decoding strategy is the way the model turns the per-step probability distribution over next tokens into an actual sequence, and different strategies make different trade-offs between determinism and diversity as well as quality and speed.
In greedy search, the model selects the highest-probability token at every step, which makes generation fast and highly reproducible. At the same time, this approach can be short-sighted and prone to repetition, so practical setups often add a minimum length and a repetition penalty to reduce these issues.
In beam search, the process keeps the top-B partial hypotheses, expands each of them at the next step, and then keeps only the best B again, which usually yields more globally coherent text. At the same time, the method is more expensive and tends to favor short outputs, so a length penalty is commonly applied to balance the behavior.
In top-k sampling, the distribution is truncated to the k most likely tokens and a random draw is made within that set, which introduces controlled diversity. At the same time, a very small k can be too conservative and a very large k can be unstable, so the choice of k should reflect the task, often together with temperature and a repetition penalty.
In top-p (nucleus) sampling, the draw is taken from the smallest set whose cumulative probability reaches p, so the candidate set adapts to uncertainty and balances plausibility and variety. At the same time, many general-purpose tasks work well with p around 0.9, although the best value still depends on the goal and benefits from light tuning.
From a practical choice standpoint, if a task prioritizes accuracy and reproducibility, greedy search or small-beam search with a length penalty usually fits well; if a task prioritizes naturalness and creativity, top-p or top-k with a suitable temperature and a repetition penalty often produces better results; if a task resembles machine translation where precision matters, beam search with a length penalty tends to be a reliable default.
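A compact sketch of top-k and top-p filtering applied to one step’s probability distribution (numpy; the distribution is made up for illustration):

```python
# Top-k and top-p (nucleus) filtering of a next-token distribution.
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.42, 0.25, 0.12, 0.08, 0.06, 0.04, 0.02, 0.01])  # toy distribution

def top_k_sample(probs, k):
    idx = np.argsort(probs)[::-1][:k]       # keep the k most likely tokens
    p = probs[idx] / probs[idx].sum()       # renormalize
    return rng.choice(idx, p=p)

def top_p_sample(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1    # smallest set reaching cumulative prob p
    idx = order[:cutoff]
    q = probs[idx] / probs[idx].sum()
    return rng.choice(idx, p=q)

print("top-k:", top_k_sample(probs, k=3))
print("top-p:", top_p_sample(probs, p=0.9))
```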
Q26. What are the key differences between Transformers and Convolutional Neural Networks (CNNs)?
Point 1: The way they see context is different.
One-sentence summary: A Transformer can look globally from the start, while a CNN looks locally first and only grows its view by stacking layers.
Details: Self-attention lets any two tokens interact directly, so long-range links appear within a single layer; convolution covers only a neighborhood, so the model needs more depth or larger/dilated kernels to reach far context. This makes Transformers handy when long-range dependencies matter.
Point 2: The built-in bias and order handling are different.
One-sentence summary: A Transformer has weak inductive bias and needs positional encodings for order, while a CNN has strong locality and translation equivariance by design.
Details: Because Transformers do not assume locality, they add sinusoidal or learned (or relative/rotary) positions to inject order; CNNs use small shared kernels that favor local textures, which is data-efficient and stable when data are limited.
Point 3: The compute and memory growth with length are different.
One-sentence summary: As the sequence gets longer, standard Transformer attention grows roughly quadratically, which is usually faster-growing than a CNN with a fixed kernel size.
Details: Self-attention compares “every position with every position,” so cost and memory scale about $O(n^2)$; convolution with fixed kernels is closer to $O(n)$, so doubling input length raises cost roughly linearly. Very large receptive fields still push CNN depth or special kernels, but the scaling curve is usually gentler.
A simple takeaway: When a task needs strong long-range context and quick global linking, a Transformer is usually the better fit; when a task relies on local textures, has limited data, or must run under tight budgets, a CNN is often the safer and more efficient starting point.
Pre-training
Q27. What does model pre-training mean? Introduce common pre-training methods.
So, “pre-training” means we first give the model a basic foundation before using it for real tasks. It’s like a kid learning letters before writing essays. In this stage, the model trains on a huge amount of data to learn general skills. Later, when we use it for a specific task, like translation or sentiment analysis, we only need to fine-tune with small data.
Some common ways are:
- Language Modeling
Goal: It helps the model learn the flow of language, like how words follow each other.
Steps:
- Collect large text data, like Wikipedia or news.
- Take a sentence and let the model predict the next word. For example: “I want to go ___”.
- Keep repeating this, so the model learns grammar and word order.
- Masked Language Model / Autoencoder
Goal: It makes the model better at using context, not just the previous word.
Steps:
- Hide some words in a sentence, like “I go to the [MASK] to eat”.
- Ask the model to guess the missing word, like “restaurant”.
- By doing this, the model learns to use both left and right context.
- BERT uses this method, so it is very good at understanding tasks.
- Contrastive Learning
Goal: It teaches the model to know if two things match or not, like an image and its text.
Steps:
- Prepare matching pairs, like “a dog photo + the text ‘a dog’”.
- Also prepare wrong pairs, like “dog photo + the text ‘a car’”.
- Train the model to pull close the correct pairs and push away the wrong ones.
- CLIP is a good example, useful for search or matching tasks.
- Generative Pre-training
Goal: Not only to understand, but also to generate new text, like writing or chatting.
Steps:
- Give the model a start, like “A little cat is…”.
- The model predicts the next word, then the next, until it makes a full text.
- Training is basically repeating “see → predict → generate” many times.
- GPT models use this method, so they are strong in writing and conversation.
So in short, pre-training is like “building the foundation,” and different methods give the model different skills to be strong in many tasks.
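To make the “see → predict” objective concrete, here is a minimal sketch of one causal language-modeling training step, assuming PyTorch; the tiny model and random data are purely illustrative:

```python
# One training step of next-token prediction: shift the targets by one position
# and apply cross-entropy between the model's logits and the shifted labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))  # toy "LM"

token_ids = torch.randint(0, vocab_size, (2, 16))  # batch of 2 sequences, length 16
logits = model(token_ids)                          # (2, 16, vocab_size)

inputs = logits[:, :-1, :]                         # predictions at positions 0..14
targets = token_ids[:, 1:]                         # the "next word" at each position
loss = F.cross_entropy(inputs.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                    # gradients for a normal optimizer step
print(loss.item())
```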
Q28. What are common data preprocessing methods used in NLP?
What is data preprocessing?
In simple words, data preprocessing means preparing and cleaning the data before giving it to the model. Raw data is usually messy: it may have missing values, wrong formats, very different scales, or even noise. If we feed this directly to the model, the learning will be poor. The goal of preprocessing is to make the data clean, consistent, and suitable for training.
- Missing Value Handling
- Problem: In real datasets, some values are often missing. For example, some people may skip questions in a survey, or a sensor may fail to record data at some time. If we ignore this problem, the model may not train properly.
- Solution: One way is to delete the samples with missing values. Another way is to fill the missing values with statistics such as the mean, median, or mode. A more advanced way is to use another model to predict the missing values, so we can keep the data more complete.
- Data Cleaning
- Problem: Raw data may contain errors, duplicates, or extreme outliers. For instance, in the “age” column, there might be an entry like “300 years old,” which is obviously wrong.
- Solution: Data cleaning often includes removing duplicates, fixing wrong formats, and detecting and removing outliers. This makes the dataset more realistic and reliable.
- Normalization / Standardization
- Problem: Different features often have very different ranges. For example, “height” may be between 100 and 200, while “income” may be in thousands or millions. If we do not handle this, the model may be dominated by large-scale features.
- Solution: Normalization scales all values into a fixed range, usually between 0 and 1. Standardization subtracts the mean and divides by the standard deviation, so the values have mean 0 and variance 1. This way, features are comparable on the same scale.
- Feature Encoding
- Problem: Models can only work with numbers, not directly with text or category labels. For example, categories like “red/blue/green” cannot be used directly.
- Solution: A common method is one-hot encoding, which converts each category into a binary vector. In more advanced settings, we can use embeddings, which map categories into continuous vector spaces and keep more semantic information.
- Text Preprocessing & Tokenization
- Problem: Text cannot be directly used as model input; it needs to be split into smaller units. For example, the sentence “I like learning” cannot be processed as a raw string.
- Solution: A common way is tokenization, splitting the text into words or subwords. We can also remove stop words like “the” or “and,” or apply stemming to reduce words like “running” to “run.” Modern NLP often uses subword methods like BPE (Byte Pair Encoding), which handle rare words more effectively.
- Data Augmentation
- Problem: Many tasks have small labeled datasets, and the model may not learn enough patterns or may overfit.
- Solution: Data augmentation creates new “synthetic” data to enlarge the dataset. For images, we can rotate, crop, or add noise. For text, we can replace words with synonyms or use back-translation. This helps the model learn from more diverse examples.
Q29. What is data augmentation? What are common data augmentation techniques in NLP?
What is data augmentation?
Data augmentation means creating more training data by making controlled changes to existing data. It is used when the dataset is too small or not diverse enough. The main idea is not to create totally new data, but to add variations so that the model becomes more robust and generalizes better.
- Image Data Augmentation
Rotation & Flipping
- Problem: If the model only sees images in one fixed orientation, it may fail when the object appears rotated or flipped.
- Solution: We can randomly rotate images or flip them horizontally and vertically, so the model learns to recognize objects in different angles.
Cropping & Scaling
- Problem: In real life, objects may appear in different sizes or only in part of the image.
- Solution: By randomly cropping parts of the image or scaling them, the model learns to detect objects in different sizes and positions.
Adding Noise
- Problem: Real images often have noise from low light or camera sensors.
- Solution: We can add Gaussian noise or blur to training images, so the model becomes more resistant to noisy environments.
Color Jittering
- Problem: Images from different devices may have changes in brightness, contrast, or color.
- Solution: By changing brightness, saturation, and contrast, the model learns to adapt to different lighting conditions.
- Text Data Augmentation
Synonym Replacement
- Problem: Text datasets are often small, and the model may learn only limited ways of expression.
- Solution: We can replace some words with synonyms. For example, “I am happy” can be changed to “I am glad,” so the model sees more variations.
Back Translation
- Problem: In natural language tasks, limited sentences may make the model memorize fixed patterns.
- Solution: We can translate a sentence into another language and then back again. For example, Chinese → English → Chinese. The meaning stays the same, but the form may change.
Random Insertion or Deletion
- Problem: The model may rely too much on certain words instead of the whole meaning.
- Solution: We can randomly insert synonyms or delete non-key words, so the model learns to understand the full context.
- Numerical Data Augmentation
Noise Injection
- Problem: If numeric features are too “clean,” the model may memorize training data and fail on real-world data.
- Solution: We can add small random noise to numeric values, like slightly changing temperature values, so the model becomes more robust.
Resampling
- Problem: Datasets often have imbalanced classes, such as very few fraud transactions.
- Solution: We can oversample the minority class or undersample the majority class, so the training data becomes more balanced.
Post-training
Q30. What are the differences between pre-training, post-training, and fine-tuning of models?
You can think of these three steps like a person growing from school to a job:
Pre-training
- What it is: Pre-training is like basic education. The model learns general language rules from huge datasets such as Wikipedia, news, and books.
- Role: It gives the model common knowledge and language ability, but it does not yet know how to do specific tasks like sentiment analysis.
Post-training
- What it is: Post-training comes after pre-training, to make the model closer to human needs.
- Common methods: For example, instruction tuning teaches the model to follow instructions. RLHF (reinforcement learning from human feedback) makes the model’s answers more aligned with human preferences.
- Role: It changes the model from “can talk” to “talks in a useful and human-like way.”
Fine-tuning
- What it is: Fine-tuning is like job training. We continue training the model on small, task-specific datasets.
- Example: If we want a model for e-commerce customer service, we fine-tune it on customer chat logs.
- Role: It gives the model domain knowledge and task-specific skills.
In one sentence: Pre-training builds general skills, post-training aligns with human needs, and fine-tuning adapts the model to specific tasks.
Q31. What is prompt tuning? What is instruction tuning? How are they different?
1. Prompt Tuning
- What it is: Prompt tuning means designing good prompts so the model uses its pre-trained knowledge to finish a task. We do not change the model’s parameters; we just change how we ask.
- Example: For sentiment analysis, we can ask: “Review: This restaurant is great. Question: What is the sentiment? Answer:”. The model will likely answer “positive.”
- Key point: It does not update the model, it just uses clever inputs.
2. Instruction Tuning
- What it is: Instruction tuning means training the model with many “instruction + answer” examples, so it learns to follow human instructions.
- Example: Data like: “Instruction: Translate this sentence to English. Input: 我喜欢学习. Output: I like learning.” After thousands of examples, the model learns to follow such tasks.
- Key point: It updates the model parameters and is part of post-training.
3. Fine-tuning
- What it is: Fine-tuning means continuing training on a specific dataset to adapt the model to one task or domain.
- Example: Training a general language model on medical dialogues to get a “medical QA model.”
- Key point: It makes the model strong in one area, not in general instruction following.
Summary of differences:
- Prompt tuning: no parameter update, just smart prompt design.
- Instruction tuning: parameter update, teaches the model to follow natural instructions.
- Fine-tuning: parameter update, adapts the model to a single task or domain.
One-sentence analogy: Prompt tuning is “asking questions smartly,” instruction tuning is “teaching the student how to understand questions,” and fine-tuning is “training the student for one subject.”
Q32. What is the core idea of parameter-efficient fine-tuning (PEFT), such as LoRA? Compare full fine-tuning with PEFT methods like LoRA and Adapters in terms of advantages and disadvantages.
1. Core idea of PEFT
The key idea of Parameter-Efficient Fine-Tuning (PEFT) is: do not update all the model parameters, but only add small trainable modules or update a very small part of the model.
- Full fine-tuning is like rewriting the whole book.
- PEFT is like adding notes or sticky papers to some pages.
Common methods:
- LoRA (Low-Rank Adaptation): add small low-rank matrices next to the main weight matrix, and only train those. Advantage: Very few parameters (often <1%), with performance close to full fine-tuning. Widely used in large models. (A minimal code sketch follows below.)
- Adapters: insert small adapter layers inside each layer, and only update these layers. Advantage: Works like a “plugin.” You can keep multiple adapters for different tasks and switch easily.
- Prompt-tuning: Originally, prompt-tuning meant just designing prompts by hand, which did not change any model parameters. Later, researchers created trainable prompt-tuning: we add some virtual tokens at the input side, and only their embeddings are trainable. In practice, these tokens are treated as extra embeddings fed into the model, while the main model parameters stay frozen.
- Prefix-tuning: Prefix-tuning is similar to prompt tuning, but instead of adding tokens only at the input side, it adds trainable prefix vectors to every Transformer layer’s attention. This way, the model receives task-specific hints at each layer, which is often more powerful. In practice, these prefix embeddings are inserted into the key and value of self-attention, so the model can use them across all layers.
The main goal is: adapt big models to new tasks with fewer parameters and lower cost.
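To make the LoRA idea concrete, here is a minimal sketch in PyTorch: a frozen base linear layer plus a trainable low-rank update scaled by alpha/r. The sizes are illustrative; real libraries such as Hugging Face PEFT add dropout, weight merging, and more:

```python
# LoRA sketch: y = base(x) + x A^T B^T * (alpha / r); only A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(out_features, r))         # starts at zero: no change at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # a tiny fraction of the full weight
```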
2. Pros and Cons: Full Fine-tuning vs PEFT
Full Fine-tuning
Pros:
- Very flexible; the model can fully adapt to the target task.
- Best performance for tasks very different from the pre-training data.
Cons:
- Very expensive; you need to update and store the whole model (hundreds of GB).
- High cost in memory, time, and compute.
- Hard to switch between tasks because each one needs a full copy of the model.
PEFT (like LoRA, Adapters)
Pros:
- Only train a very small number of parameters (often <1%), so cost is much lower.
- Easy to switch tasks: just load different small modules.
- More efficient for training and deployment, very practical for real use.
Cons:
- Less flexible; the performance depends heavily on the original pre-trained model.
- For very different tasks (like language → code, or cross-modal tasks), results may be weaker than full fine-tuning.
One-sentence summary:
Full fine-tuning is like “rebuilding the whole house,” expensive but complete; PEFT is like “renovating with furniture,” cheaper and faster but with limits.
Q33. Can you give an example of when fine-tuning is needed and when prompt engineering alone is sufficient?
When to use fine-tuning?
For example, if you want a medical QA model, the medical domain is very different from general language. A simple prompt is not enough, and the answers may be wrong. So in this case, you fine-tune the model on medical conversations or papers, so it really learns domain knowledge.
When to just use a prompt?
For example, if you want sentiment analysis, the model already has strong semantic understanding from pre-training. You can just design a prompt like: “Review: This restaurant is great. Please tell me the sentiment:”. The model can answer “positive” without extra training.
In one sentence: Fine-tuning is better for domain-specific or very different tasks, while prompting is enough for general, simple tasks the model already knows.
Q34. What are common post-training methods, such as Reinforcement Learning from Human Feedback (RLHF)?
In the post-training stage, the three most common methods are: Instruction Tuning, RLHF (Reinforcement Learning from Human Feedback), and DPO (Direct Preference Optimization).
- Instruction Tuning
What it is: The goal is to make the model follow human instructions. Pre-training makes the model good at writing, but it may only “continue text” and not treat a request as a task.
How it works: We collect many “instruction + input + output” examples and train the model with supervised learning. For example:
- Instruction: Summarize the following review in one sentence.
- Input: This phone has a great camera but the battery life is too short.
- Output: The phone takes good pictures but has poor battery life.
After such training, the model learns to follow tasks instead of just generating text.
Effect: The model can now not only generate text, but also complete tasks like summarization, coding, and Q&A.
- RLHF (Reinforcement Learning from Human Feedback)
What it is: RLHF makes the model’s answers more aligned with human preferences. Without it, the model might give strange, unsafe, or unhelpful answers.
How it works:
- Supervised Fine-tuning (SFT): Humans write ideal answers, and the model is trained on them.
- Reward Model: Humans rank several model answers; a reward model learns to score which one is better.
- Reinforcement Learning: The main model is optimized using the reward model as feedback. A common algorithm here is PPO (Proximal Policy Optimization).
What is reinforcement learning?: It is “learning by trial and error.” The model tries an action, gets a reward or penalty, and updates to do better next time.
What is PPO?: PPO is a reinforcement learning algorithm that updates the model step by step but avoids changing it too much at once. This makes learning stable while still improving.
Effect: After RLHF, the model gives safer, more natural, and more helpful answers.
- DPO (Direct Preference Optimization)
- What it is: DPO is a simpler alternative to RLHF. The goal is still human alignment, but it avoids reinforcement learning.
- How it works: We collect preference pairs, for example, given one question and two answers A and B, humans choose A. DPO trains the model directly to prefer A without training a reward model.
- Effect: Compared to RLHF, DPO is simpler, cheaper, and faster, while still giving similar results.
One-sentence summary:
- Instruction tuning → teaches the model to follow tasks.
- RLHF → aligns outputs with human preferences using reinforcement learning and PPO.
- DPO → a simpler way to reach alignment, almost as good but with lower cost.
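To make the DPO idea concrete, here is a minimal sketch of its loss in PyTorch, assuming the summed token log-probabilities of each answer under the policy and under a frozen reference model are already computed (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss over a batch of preference pairs (sketch).

    Each argument is a tensor of per-answer log-probabilities (summed over tokens).
    """
    # How much the policy prefers each answer relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the margin (chosen - rejected) up; -logsigmoid is the pairwise logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs with random log-probabilities.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```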
PEFT & RL
Q35. Introduce common Parameter-Efficient Fine-Tuning (PEFT) methods.
Parameter-Efficient Fine-Tuning, or PEFT, is built on a simple idea: we don’t need to retrain all the parameters of a big model every time we face a new task. That would be too costly. Instead, we update only a small number of parameters, often by adding tiny modules or vectors around the model. This makes training much cheaper and switching between tasks much easier.
One of the most common methods is LoRA. The intuition is that the weight matrices in big models are huge, but the needed changes are usually low-rank. So LoRA adds a small low-rank matrix next to the original weight matrix and only trains this small piece while freezing the big one. This reduces the number of trainable parameters to less than one percent, yet the performance is close to full fine-tuning. That’s why LoRA is so popular in today’s large model applications.
Another classic method is the Adapter. The idea here is to insert a small bottleneck layer into each Transformer block. This layer projects the dimension down and then back up. During training, only the adapter’s parameters are updated. The nice part is that it’s very modular: you can train different adapters for different tasks and plug them in like extensions.
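As a rough sketch (the class name and sizes here are illustrative, not from a specific adapter library), the bottleneck structure looks like this:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, nonlinearity, project up, residual add (sketch)."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden):
        # The residual connection keeps the original representation intact.
        return hidden + self.up(self.act(self.down(hidden)))

adapter = Adapter()
x = torch.randn(2, 16, 768)      # (batch, sequence, hidden)
print(adapter(x).shape)          # torch.Size([2, 16, 768])
```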
Then we have Prefix-tuning. This method adds trainable prefix vectors to the self-attention in every layer. These prefixes act like extra context and influence the model throughout all layers, not just at the input. Because of this deeper influence, prefix-tuning can sometimes work better on more complex tasks.
Finally, there is Prompt-tuning, which is the lightest method. Here, we simply add some trainable virtual tokens at the beginning of the input sequence. Only these embeddings are trained, and the main model stays frozen. The advantage is that it uses very few parameters, so it’s good when resources are limited, though it is usually less powerful than LoRA or prefix-tuning.
So to sum up, the common PEFT methods are LoRA, Adapter, Prefix-tuning, and Prompt-tuning. They all share the same principle: don’t change the whole model, just tweak small parts, and you still get performance close to full fine-tuning with much lower cost.
Q36. What is an n-gram in NLP, and what are its common applications?
In NLP, an n-gram is simply a sequence of n consecutive units from text. The unit can be a word, a character, or even a subword, depending on how the text is tokenized. For example, if n=1, it’s a unigram (single word); n=2 is a bigram (two words together); n=3 is a trigram.
For example, if we use words, the sentence “I love NLP” gives unigrams like “I”, “love”, “NLP”; bigrams like “I love”, “love NLP”; and the trigram “I love NLP”. If we use characters, the word “chat” with trigrams becomes “cha” and “hat”. In modern NLP, subwords are very common: the word “unhappiness” may be split into “un”, “happi”, and “ness”, so its bigrams are “un happi” and “happi ness”.
The main idea of an n-gram model is that the probability of a word depends only on the previous n-1 units. Before deep learning, this was the standard way to build language models. For example, in a bigram model, the probability of “NLP” after “love” is estimated based on how often “love NLP” appears. It’s simple but works well for local context.
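A tiny plain-Python sketch of how bigram counts turn into a probability estimate on a toy corpus:

```python
from collections import Counter

corpus = ["I love NLP", "I love coffee", "I study NLP"]
tokens = [sentence.split() for sentence in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

# Maximum-likelihood estimate: P(NLP | love) = count("love NLP") / count("love")
p = bigrams[("love", "NLP")] / unigrams["love"]
print(p)  # 0.5 on this toy corpus
```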
In terms of applications, n-gram models have been used widely. The most common is language modeling, predicting the next word. They are also used in text classification, where n-grams serve as features — for instance, the bigram “free money” is a strong spam signal. In early machine translation and speech recognition, n-gram models helped keep outputs fluent. And in information retrieval, n-grams improved matching, like treating “New York” as a single bigram instead of two separate words.
Of course, n-grams have limitations. Small n misses context, while large n leads to data sparsity and combinatorial explosion. That’s why modern models like LSTMs and Transformers replaced them. Still, n-grams remain a key concept and a useful starting point in NLP.
Q37. What is DreamBooth? Explain its purpose and basic structure.
DreamBooth is a personalization technique for text-to-image diffusion models (like Stable Diffusion). With only a few images (usually 3–10) of your subject, it teaches the model a new identifier token. Later, when you use that token in a prompt, the model can generate that exact subject in any scene or style—not just the generic class.
Architecturally, it uses the same base parts: the text encoder to turn prompts into embeddings, the U-Net denoiser, the VAE for mapping between pixels and latent space, and the scheduler for the noise steps. DreamBooth adds two important ideas: first, it creates a new rare token with a trainable embedding; second, it combines instance loss and class prior preservation loss during training, so the model remembers your subject but still keeps general knowledge of the class.
Now, let me explain how it works in practice. You prepare a few instance images of your subject and pair them with prompts like “a photo of sks dog,” where sks is the new token and “dog” is the class word. You also collect or generate class prior images with prompts like “a photo of a dog.” During training, the model sees both sets. The loss is the standard diffusion MSE, split into $L_\text{instance}$ and $L_\text{prior}$, then combined as $L = L_\text{instance} + \lambda L_\text{prior}$. The weight $\lambda$ balances how much the model focuses on the subject versus the general class.
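As a schematic sketch of how the two terms are combined in a training step (in a real run the noise predictions come from the U-Net on noised instance and prior latents; here random tensors stand in for them):

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(pred_noise_instance, noise_instance,
                    pred_noise_prior, noise_prior, prior_weight: float = 1.0):
    """Combine instance and class-prior diffusion losses (schematic sketch).

    Both terms are the standard denoising MSE between predicted and true noise;
    prior_weight is the lambda that balances subject fidelity vs. class knowledge.
    """
    loss_instance = F.mse_loss(pred_noise_instance, noise_instance)
    loss_prior = F.mse_loss(pred_noise_prior, noise_prior)
    return loss_instance + prior_weight * loss_prior

# Toy latents shaped (batch, channels, height, width) stand in for U-Net outputs.
loss = dreambooth_loss(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64),
                       torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64))
print(loss.item())
```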
In terms of parameters, you usually fine-tune the U-Net and the embedding of the new token. This way, the token gains meaning in text space, and the U-Net learns to map it to the right visual features. To avoid overfitting, people use data augmentation, small learning rates (like 1e-6 to 1e-5), only a few thousand steps, and a reasonable prior weight. After that, you can use the new token in any prompt—say, “a watercolor painting of sks dog wearing sunglasses, in Tokyo at night”—and the model puts your subject into that scene.
There are also lighter versions. Some train only the token embedding and a few U-Net layers; others combine DreamBooth with LoRA/Adapters so the main model stays frozen and only tiny extra modules are trained. This saves memory and keeps training efficient. Choosing a good class word is also important, like “dog,” “vase,” or “car,” so the prior images cover enough variety.
In short, DreamBooth is not about retraining the whole model. It’s about adding a new token, using prior preservation, and doing light fine-tuning to reliably bring a specific subject into the model with just a handful of images.
Q38. Explain agentic reinforcement learning methods, including PPO, DPO, and GRPO.
Let’s set the ground first. Reinforcement learning (RL) is about an agent interacting with an environment, taking actions, getting feedback, and improving a policy to maximize long-term return.
Agentic means we treat the model as an active agent. It has goals, plans multi-step actions, interacts with tools and the environment, and updates itself from feedback.
Now the three optimization styles and how they differ in practice.
PPO (Proximal Policy Optimization) is policy gradient with guardrails. Standard policy gradients can push updates too far and break a good policy. PPO measures the change with a probability ratio $r=\pi_{\text{new}}/\pi_{\text{old}}$ and clips it if the step is too large, which keeps training stable. In RLHF, you first train a reward model from human ratings, then use PPO to move the policy toward higher-reward answers, usually with a KL penalty to stay close to the base model. It’s battle-tested and controllable, but it needs a full pipeline (SFT → reward model → rollouts → PPO) and solid compute.
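As a sketch, the clipped part of the PPO policy objective looks roughly like this in PyTorch (per-token log-probabilities and advantages are assumed to be given; names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Clipped PPO policy loss (sketch). All inputs are per-token tensors."""
    ratio = torch.exp(logp_new - logp_old)          # r = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective, then negate because we minimize.
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(torch.randn(8), torch.randn(8), torch.randn(8))
print(loss.item())
```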
DPO (Direct Preference Optimization) is more direct. It skips the reward model and RL rollouts and trains from preference pairs: for the same prompt, “chosen” vs “rejected.” The objective effectively increases the log-likelihood of the chosen answer relative to the rejected one, with a temperature/regularizer to avoid drifting too far from the base model. Think of it as turning human pairwise choices into a supervised relative-likelihood objective. It’s simple, efficient, and stable, but explores less and relies more on good preference coverage and a decent base model.
GRPO (Group Relative Policy Optimization) sits between PPO and DPO in spirit. For each prompt it samples a group of answers and scores them (with a reward model or rule-based checks), and each answer's advantage is its reward relative to the group average, usually normalized by the group's standard deviation. It then applies PPO-style clipped updates, so it keeps PPO's stability while dropping the separate value (critic) model, which makes it cheaper for long-sequence generative LMs. In practice you'll also see a KL term to the base model and per-token advantage assignment to reduce variance and avoid mode collapse.
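A tiny sketch of the group-relative advantage computation described above (the rewards for a group of sampled answers to the same prompt are assumed to be given):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """Advantage of each sampled answer: its reward minus the group mean,
    scaled by the group standard deviation (sketch)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four answers to the same prompt, scored by a reward model or rule-based check.
print(group_relative_advantages(torch.tensor([0.2, 0.9, 0.5, 0.1])))
```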
Putting it together: choose PPO (RLHF) when you want industrial-grade stability and can afford the reward-model pipeline; choose DPO when you want a lean, fast alignment method that learns directly from preferences; consider GRPO when you want PPO-like stability from reward signals but without the cost of a separate critic model. And remember, agentic is the system view—these optimizers are how we align a model so it can act usefully, safely, and in line with human preferences during multi-step interaction.
Q39. Explain what agent MCP is.
Let’s start with the background. A large model by itself is just a “language engine.” It can read and write text, but to become a real agent, it needs to interact with the outside world—tools, data, services. Without that, it’s limited to what it already knows.
MCP (Model Context Protocol) is an open protocol designed to make this interaction easier. You can think of it as a standard interface that connects the model to external resources: APIs, databases, file systems, or even other apps. Instead of writing a custom integration for every tool, MCP gives a unified way to expose tools and data to the model.
For example, if you ask an MCP-enabled agent, “Get me the latest financial report and plot a chart,” the model doesn’t know the data itself. But through MCP, it can call a financial data source and a plotting tool. MCP structures the request, sends it to the right service, and feeds the results back into the model’s context.
So, Agent MCP means an agent that uses the Model Context Protocol as its bridge to the external world. It makes agents more composable and extensible, turning them from simple chatbots into systems that can plan, fetch data, and take real actions.
Q40. Explain in detail the concept of RLHF (Reinforcement Learning from Human Feedback) and its common applications.
What it is
RLHF means Reinforcement Learning from Human Feedback. It is a way to turn people’s preferences into a training signal for the model. After pretraining and supervised fine-tuning, a model can already give okay answers, but they are not always what people really want. RLHF brings human judgment into the loop. We first collect people’s choices between answers, then we train a reward model to copy those choices, and finally we use reinforcement learning to push the model toward answers that score higher with that reward.
Why we need it
The problem with supervised fine-tuning is that it makes the model complete tasks, but not necessarily in the way humans prefer. Sometimes the answers are too long, too vague, or unsafe when the question is sensitive. These things are hard to fix with just labels. RLHF solves this by letting the model know “what kind of answer people like more.” This makes the outputs not only correct, but also polite, safe, and natural. So it fills the gap that normal training cannot cover.
How it works
The process usually has three steps. First, supervised fine-tuning: train a base model on instruction data so it can follow tasks. Second, collect human preferences: for the same question, show a few answers and ask people to pick which one is better. Use those choices to train a reward model. Third, optimize the main model with reinforcement learning, most often PPO, while adding a KL penalty so the model does not drift too far from its base style. By repeating this loop, the model learns to give answers that are both useful and more aligned with human taste.
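As a small sketch of the KL-penalized reward that PPO optimizes in this loop (the reward-model score and the sequence log-probabilities are assumed to be given; the coefficient name is illustrative):

```python
import torch

def rlhf_reward(rm_score, logp_policy, logp_ref, kl_coef: float = 0.1):
    """Per-response reward used in PPO-based RLHF (sketch):
    reward-model score minus a KL-style penalty for drifting from the base model."""
    kl_penalty = kl_coef * (logp_policy - logp_ref)   # approximate per-sequence KL term
    return rm_score - kl_penalty

# Toy values: the policy assigns slightly lower log-probability than the base model.
print(rlhf_reward(torch.tensor(1.2), torch.tensor(-35.0), torch.tensor(-33.0)))
```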
Common applications
In practice, RLHF is most used in three areas. The first is aligning dialogue style, so the model sounds polite, natural, and can say no to harmful requests. The second is safety, where feedback reduces harmful content and lowers the chance of the model making things up. The third is task quality, like making summaries clearer, translations more faithful, or code answers more reliable. These three uses are almost everywhere large language models are deployed.
Pros and cons
The big advantage of RLHF is that it makes human satisfaction something the model can actually optimize for, so the outputs feel more human-friendly. But there are also issues. The reward model can be biased if the feedback data is small or inconsistent, and if you optimize too much, the model may “game the reward” and give weird answers. On top of that, collecting high-quality human feedback is expensive and time-consuming. So RLHF is very effective right now, but it is not the final solution. We still need new methods to improve on it.
Classic NLP Models
Q41. Introduce the BERT model and its major variants.
The structure of BERT
BERT stands for Bidirectional Encoder Representations from Transformers. It was introduced by Google in 2018 and became a milestone in NLP. Structurally, it is a stack of Transformer encoders. Training has two stages: first pretraining on large unlabeled text, then fine-tuning on specific tasks. For inputs, BERT sums three embeddings: token, segment, and position. It also adds two special tokens: [CLS] at the start, whose output is used for classification, and [SEP] at the end of each sentence (segment) to mark boundaries. The sequence is padded or truncated to a fixed length. The embeddings then pass through multiple encoder blocks with multi-head attention, feed-forward layers, residual connections, and layer normalization. This design lets BERT capture rich contextual information.
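To see the special tokens and segment IDs in practice, here is a small example using the Hugging Face `transformers` tokenizer, assuming the library and the `bert-base-uncased` checkpoint are available:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair is encoded as: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("The movie was great.", "I would watch it again.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])   # segment IDs: 0 for sentence A, 1 for sentence B
```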
BERT vs GPT
The key difference is their goal and design. BERT uses encoders and masked language modeling. It learns by predicting missing words, which makes it very good at understanding meaning, but it cannot generate natural text. GPT uses decoders and autoregressive language modeling, predicting the next word step by step, so it is great for generation. In short: BERT is for understanding, GPT is for generating.
Main applications of BERT
BERT is most common in three types of tasks. First is text classification, like sentiment analysis or spam detection, where the [CLS] output is used. Second is sequence labeling, such as named entity recognition or part-of-speech tagging, where each token’s output is predicted. Third is sentence pair tasks, for example natural language inference (NLI), which checks if one sentence follows from another, or semantic similarity, which checks if two sentences mean the same thing. Here, BERT uses segment embeddings to separate sentences and [CLS] to capture the relation.
Variants of BERT
Two famous variants are RoBERTa and DeBERTa. RoBERTa means Robustly Optimized BERT Pretraining Approach. It improves training by removing the next sentence prediction, using more data, and training longer, which gives stronger results. DeBERTa means Decoding-enhanced BERT with Disentangled Attention. It changes the attention so that content and position information are separated, which helps the model understand both “what the word is” and “where it is” more clearly, leading to better results.
Pros and cons
The main strength of BERT is that it captures bidirectional context and pushes NLP understanding tasks to a new level. But it also has limits. It cannot generate text, so it is weak in generation tasks. It is expensive to train, needing large compute. And some pretraining tasks, like next sentence prediction, turned out to be less useful. So overall, BERT is very strong in understanding tasks, but less good in generation and efficiency.
Q42. Introduce the GPT family of models.
The history of GPT models
GPT means Generative Pre-trained Transformer. The first version, GPT-1, was released by OpenAI in 2018. It had about 117 million parameters, with a pure Transformer decoder structure. Training was autoregressive, predicting the next word step by step. It was small but proved the idea of "pretrain first, then finetune." GPT-1's code and weights were released, so the research community could fully use it.
In 2019, GPT-2 came with 1.5 billion parameters. It showed big improvements in generating long, coherent text and could generalize to many tasks. It still used the decoder-only structure, but the training scale was much bigger. At first OpenAI did not fully release GPT-2 because of misuse concerns. Later they opened smaller versions and finally the full one.
In 2020, GPT-3 was launched with 175 billion parameters. It was still decoder-only, but this was the first time people saw “emergent abilities” like zero-shot and few-shot learning. You could give just a few examples and the model could do tasks without finetuning. From GPT-3 onward, OpenAI did not open source the model. Instead, access was only through the API.
GPT-4 came in 2023. It was not only larger, but also much better at reasoning, multimodal input (like images), and safer alignment. This version was fully closed and only available via API or ChatGPT.
In 2024, GPT-4o (Omni) was released. It is a natively multimodal model that handles text, images, and speech with low latency, so it works for real-time conversations. It is also closed source, available only through ChatGPT or the API.
The latest GPT-5 arrived in 2025. It improved reasoning, long context, and tool use, moving closer to general intelligence. Like GPT-4, it is closed and only accessible via API or products.
Common applications of GPT
GPT is widely used. First, as a chat assistant like ChatGPT for Q&A, conversations, and search. Second, for content generation, such as writing articles, code, emails, and creative text. Third, for task help, like summarizing documents, translation, writing SQL, debugging code, and brainstorming. With GPT-4o’s multimodal ability, it can also handle speech, images, and even video, so the applications keep expanding.
Pros and cons
The strengths are clear: GPT is general, one model can do many tasks. It has strong context understanding, can generate fluent text, and now supports multimodal interaction. It also makes advanced AI easy for normal users. But there are clear downsides. Training and running costs are very high. The model can hallucinate, giving answers that sound right but are wrong. It is not fully controllable, so alignment is always needed for sensitive topics. And one more limitation: since GPT-3, the models have been closed source and only available through APIs, so the research community cannot directly study them anymore.
Q43. Introduce the LLaMA family of models.
The history of LLaMA models
LLaMA stands for Large Language Model Meta AI. It was introduced by Meta in 2023 as an open-source large model family. Structurally, it is based on a decoder-only Transformer architecture, similar to GPT. That means it generates text autoregressively, one word at a time, which makes it strong for text generation and flexible for fine-tuning on understanding tasks.
The first version, LLaMA-1, was released in February 2023, with sizes from 7B up to 65B parameters. It focused on efficiency and openness, reaching near GPT-3 level with fewer parameters, and it was shared with the research community; open-source ChatGPT alternatives like Alpaca and Vicuna were fine-tuned from it. In July 2023, LLaMA-2 came with 7B, 13B, and 70B models. It improved training data, alignment, and safety, and its license allowed commercial use, which pushed the open-source ecosystem even further.
In 2024–2025, Meta started releasing LLaMA-3. These models aimed for larger scale, longer context, stronger multilingual ability, and support for multimodal tasks. Overall, LLaMA has become the most influential open-source alternative to GPT.
Common applications of LLaMA
Because it is open-source, LLaMA is widely used in research and industry. A first common use is as a base model for fine-tuned chatbots and instruction models, like Vicuna or Alpaca. A second use is in vertical domains such as healthcare, law, or coding, where teams fine-tune LLaMA with domain-specific data. A third use is in academic research, since it is open and downloadable, making it a strong baseline for NLP and multimodal experiments.
Pros and cons
LLaMA has two main strengths. First, it is open and free, so researchers and companies can download and modify it, fueling the open-source ecosystem. Second, it balances size and efficiency: the 7B and 13B models can run on consumer GPUs, lowering the entry barrier compared to GPT. But there are weaknesses too. Compared to huge closed models like GPT-4, LLaMA is weaker in reasoning, robustness, and multimodality. And while openness is a strength, it also brings misuse risks, since people can fine-tune it for harmful purposes. So overall, LLaMA is hugely important in the open-source space, but still behind closed-source giants.
Q44. Introduce the Qwen family of models.
The history of Qwen models
Qwen, short for Tongyi Qianwen, is a model family from Alibaba (later Tongyi Lab). It was first released in 2023. Structurally, it is a decoder-only Transformer, like GPT or LLaMA, generating text autoregressively. The first Qwen-1.0 series had sizes from 1.8B to 72B parameters, and it was fully open-source. A key feature was strong support for both Chinese and English, especially optimized for Chinese tasks.
Later in 2023, Alibaba released Qwen-Chat and Qwen-Code, tuned for dialogue and coding. In 2024, the Qwen-1.5 series improved long-context ability (tens of thousands of tokens) and safety alignment. By late 2024, Qwen-2 was released with bigger and broader training data, stronger reasoning and math skills, and also multimodal versions that could handle images. From the start, Qwen followed an open-source path, which made it very popular in the Chinese community, contrasting with GPT’s closed model.
Common applications of Qwen
Qwen is used widely. First is as a chat assistant in Chinese scenarios, like customer service, Q&A, or enterprise assistants, where its Chinese fluency stands out. Second is content generation, including writing, marketing copy, translation, and coding, with Qwen-Code especially useful for programmers. Third is in industry-specific models, where companies fine-tune Qwen with domain data for education, healthcare, finance, and so on.
Pros and cons
Qwen has two big strengths. One is strong Chinese ability, since it was trained with lots of Chinese data, making it more natural in Chinese. The second is open-source access, which helps both research and business and boosts the local open-source ecosystem. But there are weaknesses too. For advanced reasoning, multimodality, and very complex tasks, it is still behind closed giants like GPT-4 or GPT-4o. And being open also means more misuse risks. Overall, Qwen is one of the most important Chinese model families and a key player in the open-source community.
Q45. Introduce the DeepSeek family of models.
The history and structure of DeepSeek
DeepSeek is a model family from a Chinese AI team; its first models appeared in late 2023, and its widely known versions (such as DeepSeek-V2 and V3) arrived in 2024. Like GPT and LLaMA, it is built on a decoder-only Transformer architecture for autoregressive generation. Its special feature is efficiency: the later versions use a Mixture-of-Experts (MoE) design, which means only part of the parameters are active for each token. This keeps the capacity large but reduces training and inference cost. DeepSeek also made improvements in distributed training and inference efficiency, making better use of hardware.
The first release was DeepSeek LLM, and later versions expanded into chat and multimodal models. Its goal is clear: to achieve performance close to GPT-4 but with much lower cost.
Common applications of DeepSeek
DeepSeek is used in ways similar to GPT, Qwen, and LLaMA. One is as a chat assistant, for customer service, Q&A, or enterprise bots. Another is content generation, like writing articles, code, marketing text, or translations. A third is research and open-source projects, since its efficiency makes it attractive for teams with limited resources.
Pros and cons of DeepSeek
The pros are mainly two. First is cost-effectiveness: thanks to MoE and engineering tricks, it delivers strong performance with less compute. Second is openness, since some weights and details are shared with the community, which helps spread large model use. The cons are also clear. Compared to GPT-4, GPT-4o, or GPT-5, DeepSeek is weaker in reasoning, multimodality, and safety alignment. And MoE, while efficient, can add complexity in latency and distributed scheduling. Overall, DeepSeek is a very competitive open-source series in China, focusing on being a “high-efficiency alternative.”
Model Optimization
Q46. Introduce common methods for model optimization.
I think model optimization can be grouped into three main types: architecture optimization, training and inference efficiency, and compression and deployment. These cover the design, the training process, and the final real-world use.
The first type is architecture optimization.
This means improving the Transformer itself. For example, sparse attention only looks at some important positions instead of all tokens, which cuts down computation. Efficient attention methods use tricks like low-rank approximations to make long text processing faster. Another idea is MoE (Mixture of Experts), where the model has many expert subnetworks, but in each step only a few are active. This keeps the capacity large but saves compute and memory.
The second type is training and inference efficiency.
A common method is mixed precision training. Normally, numbers are stored as 32-bit floats (FP32). With mixed precision, we use 16-bit floats (FP16 or BF16), which means fewer bits per number, so less memory use and faster speed. For very large models, we use distributed training: data parallel (different GPUs see different data), model parallel (split the model across GPUs), or pipeline parallel (split training into stages like an assembly line). At inference, we use tricks like KV cache (key-value cache), which stores past attention results so we don’t recompute them, speeding up generation.
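As a small illustration of mixed precision training in PyTorch (a toy model and random data stand in for a real workload, and a CUDA GPU is assumed):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

x = torch.randn(32, 512, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), y)   # forward pass runs in FP16
scaler.scale(loss).backward()                         # backward pass on the scaled loss
scaler.step(optimizer)
scaler.update()
```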
The third type is compression and deployment.
Here the goal is to make the model lighter. Pruning removes weights that don’t matter much. Quantization stores numbers with fewer bits, for example INT8 (8-bit integer) or even INT4 (4-bit integer), instead of full floats, to save memory. Distillation means training a smaller model to learn from a large one, so it’s faster but still strong. And for deployment, compilers like TensorRT or ONNX Runtime optimize the model to run faster on hardware.
To sum up, model optimization usually works in three ways: make the structure smarter, make training and inference more efficient, and make the model smaller for deployment. Together, these methods make large models both powerful and practical.
Q47. What is model compression? Introduce common model compression techniques.
Model compression means making a model smaller and faster, while keeping performance as good as possible. The three common methods are pruning, quantization, and distillation.
The first is pruning.
Pruning means removing the parts of the model that don’t matter much. In a neural network, some weights are very small and have little effect, so we set them to zero or cut them out. For example, in a convolution, we can drop less important channels. This makes the model sparser and cheaper to run. Think of it like cutting off useless branches of a tree.
The second is quantization.
Quantization means storing parameters with fewer bits. Normally, models use 32-bit floats (FP32). We can replace them with lower precision numbers. For example, INT8 means 8-bit integers, INT4 means 4-bit integers. These take less memory and make inference faster. The accuracy may drop slightly, but usually not too much. It’s like compressing a high-res photo into JPG—smaller size but still looks fine.
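A tiny sketch of symmetric per-tensor INT8 quantization, showing just the arithmetic rather than a production scheme:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: store int8 values plus one FP scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # small rounding error remains
```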
The third is distillation.
Distillation means training a small model to learn from a large one. The big model is the teacher, and the small one is the student. The student copies the teacher’s outputs and style. This way the small model keeps much of the teacher’s knowledge, but is lighter and faster. It’s like a student learning from a master.
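A minimal sketch of the classic soft-label distillation loss, assuming teacher and student logits are already available (`T` is the temperature, `alpha` balances soft and hard targets):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Mix a soft loss (match the teacher's softened distribution) with the usual hard loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 5), torch.randn(8, 5), torch.randint(0, 5, (8,)))
print(loss.item())
```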
To sum up, model compression is all about making models lighter without losing too much performance. The main ways are pruning (remove unneeded weights), quantization (use lower-precision numbers), and distillation (teach a smaller model with a bigger one). Together, these methods make big models practical for real use.
Q48. What is Retrieval-Augmented Generation (RAG)?
What is RAG
RAG means Retrieval-Augmented Generation. The idea is to combine external retrieval with a large language model. We need it because language models have fixed knowledge. They don’t know new facts and sometimes hallucinate. RAG solves this by searching an external knowledge base first, then using the results to generate more accurate answers.
The process of RAG
The workflow has three steps. First, the user asks a question, like “What was Tesla’s stock price yesterday?”
Second, the retrieval module searches an external database or search engine. This involves how we build the database and how we search:
- To build the database, we split documents (like webpages, PDFs, reports) into smaller chunks. Then we use an embedding model to turn each chunk into a vector, and store it in a vector database such as FAISS, Milvus, or Pinecone. Similar text ends up close in this space.
- To search, we also turn the user query into a vector, and do a vector similarity search to find the closest chunks. Sometimes we combine this with keyword search methods like BM25. This mix is called hybrid retrieval, which often works better.
Third, the retrieved documents are passed to the language model, which generates an answer using both its own ability and the retrieved knowledge.
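A minimal sketch of the retrieval side using FAISS, where random vectors stand in for real embeddings from an embedding model (the chunk texts and dimension are illustrative):

```python
import faiss
import numpy as np

dim = 384                                   # embedding dimension (illustrative)
chunks = ["Loan rates were updated in May.",
          "The report covers Q2 revenue.",
          "Office hours are 9 to 5."]

# In a real system these vectors come from an embedding model; here they are random.
chunk_vectors = np.random.rand(len(chunks), dim).astype("float32")

index = faiss.IndexFlatL2(dim)              # exact L2 search over the chunk vectors
index.add(chunk_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 2)
print([chunks[i] for i in ids[0]])          # top-2 chunks are fed into the LLM prompt
```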
Pros and cons
The pros: RAG gives more accurate and timely answers, reduces hallucination, and lets companies plug in their private knowledge. The cons: if retrieval is poor, the final answer may also be poor; and building and maintaining a vector database adds extra cost.
RAG vs Finetuning
RAG and finetuning are very different. RAG uses external retrieval, so the model parameters don’t change. Finetuning retrains the model itself, writing the new knowledge inside.
When to use which? If knowledge is changing often (like finance data or new laws), RAG is better—you just update the database. If knowledge is stable and long-term (like product manuals or style of reply), finetuning is better. In practice, they can be combined: finetuning for style/domain, RAG for real-time facts.
Example
For example, a bank wants a chatbot. If a user asks, “What is the current loan interest rate?”, a plain LLM may give an outdated answer. With RAG, the system first retrieves the latest rate from internal policy documents, then the model says: “According to the latest notice, the current loan interest rate is X.X%.” If the bank also wants the bot to always sound formal and polite, they can finetune the style and still use RAG for the latest info.
Q49. What is an AI agent?
What is an Agent
An Agent is an “intelligent actor” that can plan and take actions, not just answer questions. A plain language model is good at text, but limited in the real world. An Agent gives the model a way to plan steps, call tools or APIs, and use the results. We need Agents because models alone can’t search the web, run code, or send emails.
The process of an Agent
The workflow goes like this: Step one, the user gives a goal, like “Check New York’s weather for next week and email me the result.” Step two, the Agent breaks the task into steps: check weather → draft email → call email API. Step three, it calls external tools or APIs, like a search engine or a mail service. Finally, it combines the results and replies. Many Agents can also loop and self-correct if something goes wrong in the middle.
Pros and cons
The good part is that Agents extend model power. They turn a model from “just talking” into “actually doing.” They can plan steps automatically. The bad part is they are not perfect at planning complex tasks—they may go off track. And calling many tools adds latency and possible security risks.
Is an Agent just prompt engineering?
Some people ask: is an Agent just a model with a smart prompt? The answer is no. Prompt engineering means carefully designing one input so the model answers in a certain way. An Agent is more than that. It runs a decision loop: it makes a plan, generates the next prompt, calls a tool, reads the result, then decides the next step. This “sense–plan–act–feedback” cycle is more complex than prompt engineering. So yes, prompts are part of it, but an Agent is much more than prompts.
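A schematic version of this decision loop in Python (the planner and tools here are hypothetical stand-ins, not a real framework API):

```python
def run_agent(goal: str, tools: dict, llm_plan_next_step, max_steps: int = 5):
    """Minimal sense-plan-act-feedback loop (schematic, not a real framework)."""
    context = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # Plan: the LLM reads the context so far and proposes the next action.
        action = llm_plan_next_step(context)   # e.g. {"tool": "weather", "args": {...}}
        if action["tool"] == "finish":
            return action["args"]["answer"]
        # Act: call the chosen tool, then feed the observation back into the context.
        observation = tools[action["tool"]](**action["args"])
        context.append(f"Called {action['tool']}, got: {observation}")
    return "Stopped: step limit reached."

# Toy usage with stub tools and a hard-coded "planner" for illustration.
steps = iter([{"tool": "weather", "args": {"city": "New York"}},
              {"tool": "finish", "args": {"answer": "Sunny all week."}}])
answer = run_agent("Check New York's weather",
                   tools={"weather": lambda city: f"{city}: sunny"},
                   llm_plan_next_step=lambda ctx: next(steps))
print(answer)
```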
Example
For example, think of a travel assistant Agent. You say: “Plan me a 3-day trip to Paris with flights, hotels, and attractions.” A normal model may just give a generic plan. But an Agent will search flights, find hotels, check attraction reviews, and then combine them into a practical travel plan.