System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning
1. Overview
An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.
In real products, you typically combine:
- LLM: generates the next token (text output)
- RAG: retrieves external documents and injects them into context
- Agent: software that uses LLM + tools + feedback loop to complete tasks
LLM in one line
A practical mental model:
- 1) Learned experience = Parameters / Weights
  - This is the “knowledge compressed into numbers” learned during training (e.g., 7B/8B/70B parameters).
- 2) What it can produce = Vocabulary
  - The fixed set of tokens the model can output.
- 3) Temporary memory = Context window
  - The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).
Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).
Glossary quick notes
- parameter /pəˈræmɪtər/: a learned value in the network (a learned weight)
- weight /weɪt/: a weight (one kind of parameter)
- tokenizer /ˈtoʊkəˌnaɪzər/: component that splits and encodes text into tokens
- vocabulary /vəˈkæbjəˌleri/: the token dictionary
- context window /ˈkɑːntekst ˈwɪndoʊ/: the context window (short-term memory)
2. Does an LLM “understand” or only generate probabilistic text?
Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.
Key implications:
- It does not have consciousness or human-like understanding.
- It can *appear* to understand because it learned patterns from massive data.
- It does not inherently know whether its output is true or false.
- Bad / biased / outdated training data can lead to bad answers.
- Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).
3. What are tokens, and where do generated tokens come from?
3.1 Tokens are not necessarily “words”
Tokens can be:
- word pieces, characters, punctuation, whitespace
- e.g., “Ha” + “ Noi” can be separate tokens
3.2 Where does the next token come from?
Generated tokens come from the model’s vocabulary (token dictionary):
- Vocabulary is fixed at model creation time.
- At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).
So:
- The model is not “fetching text from the internet” while generating.
- If you want fresh facts, you must use tools (RAG / browsing / APIs).
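The "probabilities across the entire vocabulary" step can be sketched in a few lines. The vocabulary and logit values below are toy placeholders, not from any real model:

```python
import math
import random

# Toy vocabulary and unnormalized scores (logits) a model might emit
# for the next token. Values are illustrative only.
vocab = ["Ha", " Noi", " is", " the", " capital", "."]
logits = [2.0, 0.5, 3.1, 0.2, 1.7, 0.1]

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token according to the distribution.
sampled_token = random.choices(vocab, weights=probs, k=1)[0]

print(greedy_token)  # " is" — it has the highest logit here
```

Real decoders add temperature, top-k, and top-p on top of this, but the core loop is exactly this: score every token in the vocabulary, pick one, append it, repeat.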
4. Three frequently confused concepts
4.1 Vocabulary size
Number of distinct tokens the model can output.
- Fixed (e.g., ~100K tokens depending on tokenizer/model).
4.2 Training tokens
Number of tokens in the dataset used for training.
- Often measured in trillions (T).
- Represents “experience volume,” not a stored Q&A database.
4.3 Context length
Maximum number of tokens the model can “see” at once.
- Includes prompt + chat history + retrieved docs + the output being generated.
- Exceed it and the model loses earlier parts.
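The context-length budget above is simple arithmetic; a minimal sketch (the token counts are made up, and real systems would count tokens with the model's tokenizer):

```python
def fits_in_context(prompt_tokens, history_tokens, retrieved_tokens,
                    max_output_tokens, context_length):
    """Everything the model 'sees' plus what it may generate must fit
    inside one context window."""
    used = prompt_tokens + history_tokens + retrieved_tokens + max_output_tokens
    return used <= context_length, context_length - used

# Example: 8K context, with room left over
ok, remaining = fits_in_context(500, 1500, 3000, 1024, 8192)
print(ok, remaining)  # True 2168
```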
5. What do 7B / 8B / 13B / 70B mean?
These refer to parameter count:
- B = Billion parameters (weights).
- 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.
Parameters are the model’s learned numeric weights—its compressed “experience.”
6. What is inside an LLM model?
A typical LLM package includes:
- Tokenizer: rules to split text into tokens
- Vocabulary: mapping token ↔ ID
- Embedding table: maps token IDs to vectors
- Transformer layers: attention + feed-forward blocks
- Parameters (weights): billions of learned numbers
- Output head: converts final representation to next-token probabilities (softmax)
What is NOT inside:
- No database of memorized articles or Q&A pairs.
- No built-in truth checker.
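The components listed above can be wired into a toy forward pass. This is a sketch with random weights and tiny dimensions, and a single matmul standing in for the attention + feed-forward blocks; it only shows how the pieces connect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (real models: vocab ~100K, d_model in the thousands, dozens of layers)
vocab_size, d_model = 10, 4

embedding = rng.normal(size=(vocab_size, d_model))  # embedding table
W_layer   = rng.normal(size=(d_model, d_model))     # stand-in for transformer layers
W_out     = embedding.T                             # output head (often tied to embeddings)

def next_token_probs(token_ids):
    # 1) token IDs -> vectors (embedding lookup)
    x = embedding[token_ids]
    # 2) mix context (real models use attention; one matmul stands in here)
    h = np.tanh(x @ W_layer).mean(axis=0)
    # 3) final representation -> scores over the whole vocabulary
    logits = h @ W_out
    # 4) softmax -> a probability for every token in the vocabulary
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = next_token_probs([3, 1, 4])
print(probs.shape)  # (10,) — one probability per vocabulary token
```

Note what is absent: no database lookup, no fact table. The output is purely a function of the weights and the current context.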
7. Comparing image vector search vs LLM generation
7.1 Image similarity search
- Image → encoder → vector
- Store vectors in a DB (FAISS / Milvus / pgvector)
- Query vector → nearest neighbors via cosine similarity
- Returns an existing item from the DB
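The retrieval step can be sketched without FAISS/Milvus: normalize the vectors, take dot products, sort. The vectors and labels below are random stand-ins for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are encoder outputs for images already stored in the DB.
db_vectors = rng.normal(size=(5, 8))
db_labels = ["cat", "dog", "car", "tree", "house"]

def cosine_search(query, vectors, labels, k=2):
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    top = np.argsort(-sims)[:k]  # indices of the k most similar items
    return [(labels[i], float(sims[i])) for i in top]

# Query with a slightly perturbed copy of the "car" vector:
query = db_vectors[2] + 0.01 * rng.normal(size=8)
print(cosine_search(query, db_vectors, db_labels))  # "car" ranks first
```

Dedicated vector DBs do the same thing with approximate-nearest-neighbor indexes so it scales to millions of vectors.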
7.2 LLM generation
- Text → tokens → embeddings
- Attention compares vectors internally to build a context representation
- Output is a probability distribution over the vocabulary
- Select next token, repeat
Core difference:
- Image search returns a stored item.
- LLM produces new text token-by-token.
7.3 When LLM becomes “search-like” (RAG)
RAG adds:
- Text → embedding → vector DB retrieval → inject documents → LLM answers based on retrieved sources.
So:
- Vector DB = searching
- LLM = composing/explaining
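The RAG pipeline above, as a minimal sketch. Word-overlap scoring stands in for the embedding + vector DB step, and the final prompt would go to an LLM in a real system:

```python
def retrieve(question, docs, k=2):
    # Stand-in for embedding + vector DB search: score docs by word overlap.
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, docs):
    # Inject the retrieved documents into the context.
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (f"Answer using only the sources below.\n"
            f"Sources:\n{context}\n\n"
            f"Question: {question}")

docs = [
    "Hanoi is the capital of Vietnam.",
    "FAISS stores vectors for similarity search.",
    "LoRA trains a small adapter on top of frozen weights.",
]
print(build_prompt("What is the capital of Vietnam?", docs))
```

The split of responsibilities is visible in the code: `retrieve` does the searching, the LLM that consumes the prompt does the composing.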
8. What is an AI Agent?
An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.
Agent formula:
- Agent = program logic + LLM + tools + memory + feedback loop
Compared to a normal chatbot:
- Chatbot: one-shot answer
- Agent: plan → act → check → fix → repeat
9. Building a minimal agent
Required components:
- Goal: clear objective and success criteria
- Tools: shell/API/DB/filesystem/etc.
- Memory: logs / state (short-term and optionally long-term)
- Loop: repeat until success or stop condition
- Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
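The components above fit together as a loop. In this sketch `call_llm` is a hard-coded placeholder planner and the single tool is a toy; a real agent would prompt an actual model and expose real (sandboxed) tools:

```python
# Guardrail: allowlist of tools the agent may call.
ALLOWED_TOOLS = {
    "add": lambda a, b: a + b,  # toy tool standing in for shell/API/DB access
}
MAX_STEPS = 5  # guardrail: never loop forever

def call_llm(goal, observations):
    # Placeholder planner: real code would prompt an LLM for the next action.
    if not observations:
        return {"tool": "add", "args": [2, 3]}
    return {"done": True, "answer": observations[-1]}

def run_agent(goal):
    observations = []  # short-term memory / log
    for step in range(MAX_STEPS):
        action = call_llm(goal, observations)
        if action.get("done"):
            return action["answer"]
        tool = action["tool"]
        if tool not in ALLOWED_TOOLS:  # enforce the allowlist
            observations.append(f"error: tool {tool!r} not allowed")
            continue
        result = ALLOWED_TOOLS[tool](*action["args"])
        observations.append(result)  # observe the result, then iterate
    raise RuntimeError("stopped: max steps reached")

print(run_agent("add 2 and 3"))  # 5
```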
9.1 Why enforce structured outputs?
Because free-form text is hard to validate.
Common patterns:
- JSON-only output
- JSON + schema validation
- Function calling / tool calling
This allows:
- parse → validate → execute → verify → retry if needed
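A sketch of the parse → validate → retry pattern. The schema check is hand-rolled here for brevity; a real system might use `jsonschema` or Pydantic, and the "retry" would re-prompt the model instead of reading the next canned reply:

```python
import json

def validate(action):
    # Minimal schema check: a known tool name and a list of args.
    if not isinstance(action, dict):
        return "not an object"
    if action.get("tool") not in {"search", "calculate"}:
        return "unknown tool"
    if not isinstance(action.get("args"), list):
        return "args must be a list"
    return None  # valid

def parse_llm_output(replies):
    # Try each model reply until one parses and validates.
    for text in replies:
        try:
            action = json.loads(text)
        except json.JSONDecodeError:
            continue  # free-form text fails here -> retry
        if validate(action) is None:
            return action
    raise ValueError("no valid action after retries")

# Simulated model outputs: free text fails, then valid JSON succeeds.
replies = ['Sure! I will search.', '{"tool": "search", "args": ["LoRA"]}']
print(parse_llm_output(replies))
```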
10. Popular models and which ones can run locally
Cloud-only (generally not downloadable):
- GPT (OpenAI)
- Claude (Anthropic)
- Gemini (Google)
Open-weight/open-source (often runnable locally):
- LLaMA family (Meta)
- Mistral / Mixtral
- Qwen
- DeepSeek (code-focused variants are especially strong)
- Phi (small, efficient)
Local runtimes on macOS:
- Ollama
- LM Studio
- llama.cpp
11. Local model size estimates on Mac
Disk/RAM depends heavily on quantization.
Typical sizes (rough guidance):
- FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
- Q8: ~8–9 GB
- Q4/Q5 (most common for local use):
  - 7B Q4: ~4–5 GB
  - 8B Q4: ~4.5–5.5 GB
  - Q5 adds ~0.5–1.5 GB
Mac uses unified memory, so CPU+GPU share RAM.
12. Can you “build an LLM” on MacBook Air M3?
Practical reality:
- Training a foundation model from scratch: not realistic on a laptop.
- Running (inference) 7B/8B quantized locally: realistic.
- Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.
13. Adapting a model to a domain like healthcare
There are three main approaches:
13.1 Prompt/System instructions
- Changes only the context (runtime behavior).
- Does not change weights.
13.2 RAG (recommended first)
- Adds domain documents (guidelines, protocols, FAQ) into context.
- No weight changes.
- Easy to update and safer for fresh knowledge.
13.3 LoRA/QLoRA fine-tuning (Approach #3)
- Changes parameters via a small adapter.
- Best for behavior: formatting, refusal policy, questioning flow, tone.
- Not ideal for injecting large factual knowledge (use RAG for that).
14. Approach #3: How LoRA fine-tuning works
14.1 Concept
- Freeze base model weights.
- Train a small LoRA adapter (delta weights).
- During inference: base + delta.
- Can enable/disable adapters or swap adapters per domain.
14.2 Practical workflow
- Step 1: Choose a base model (7B/8B common).
- Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
- Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
- Step 4: Export adapter files (e.g., safetensors).
- Step 5: Load base model + adapter locally for inference.
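The "base + delta" idea from 14.1 is just low-rank matrix math. A toy sketch with random weights (real layers have dimensions in the thousands, and A/B are learned by gradient descent rather than set by hand):

```python
import numpy as np

rng = np.random.default_rng(3)

d = 8  # layer dimension (real layers: thousands)
r = 2  # LoRA rank, r << d

W = rng.normal(size=(d, d))          # frozen base weight — never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection (starts at zero)

def forward(x, use_adapter=True):
    # Base output plus the low-rank delta; swap/disable adapters by skipping it.
    y = x @ W.T
    if use_adapter:
        y = y + x @ (B @ A).T  # delta weights = B @ A (d x d matrix, but rank r)
    return y

x = rng.normal(size=d)
# With B initialized to zero, the adapter starts as a no-op:
print(np.allclose(forward(x, True), forward(x, False)))  # True
```

The payoff is in the parameter count: the adapter trains 2·d·r numbers instead of d·d, and the adapter file (Step 4) stores only A and B, which is why it is small enough to swap per domain.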
14.3 What to train for healthcare
Prefer training:
- structured response format
- asking clarifying questions
- refusal behavior and escalation cues
- citing sources (especially combined with RAG)
Avoid training:
- private patient records
- uncontrolled “medical knowledge dumps”
15. What comes after single-agent systems?
Common directions:
- Multi-agent setups: planner, executor, critic, safety checker
- Self-critique / verification loops
- World models / simulation before action
- Stronger human-in-the-loop controls
16. Vocabulary notes: terms + IPA
- token /ˈtoʊkən/: unit of text used by the model (often a word piece)
- vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
- context /ˈkɑːntekst/: the visible token window during inference
- parameter /pəˈræmɪtər/: a learned weight value in the network
- quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
- embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
- attention /əˈtenʃən/: mechanism that mixes information across tokens
- softmax /ˈsɔːftˌmæks/: converts scores into probabilities
- hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
- agent /ˈeɪdʒənt/: autonomous software using LLM + tools
- tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
- RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
- fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
- LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
- QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training
17. Quick implementation checklist
- Need accurate, updatable domain knowledge: do RAG first.
- Need consistent behavior/formatting: add LoRA.
- Need task execution: build an Agent with:
  - strict JSON/tool outputs
  - schema validation
  - allowlisted tools + sandbox
  - logs + max steps + stop rules
