System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning
1. Overview
An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.
In real products, you typically combine:
- LLM: generates the next token (text output)
- RAG: retrieves external documents and injects them into context
- Agent: software that uses LLM + tools + feedback loop to complete tasks
LLM in one line
A practical mental model:
- 1) Learned experience = Parameters / Weights
  - This is the “knowledge compressed into numbers” learned during training (e.g., 7B/8B/70B parameters).
- 2) What it can produce = Vocabulary
  - The fixed set of tokens the model can output.
- 3) Temporary memory = Context window
  - The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).
Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).
Glossary quick notes
- parameter /pəˈræmɪtər/: a learned value in the network (a learned weight)
- weight /weɪt/: a weight (one kind of parameter)
- tokenizer /ˈtoʊkəˌnaɪzər/: component that splits and encodes text into tokens
- vocabulary /vəˈkæbjəˌleri/: the token dictionary
- context window /ˈkɑːntekst ˈwɪndoʊ/: the context window (short-term memory)
2. Does an LLM “understand” or only generate probabilistic text?
Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.
Key implications:
- It does not have consciousness or human-like understanding.
- It can *appear* to understand because it learned patterns from massive data.
- It does not inherently know whether its output is true or false.
- Bad / biased / outdated training data can lead to bad answers.
- Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).
3. What are tokens, and where do generated tokens come from?
3.1 Tokens are not necessarily “words”
Tokens can be:
- word pieces, characters, punctuation, whitespace
- e.g., “Ha” + “ Noi” can be separate tokens
3.2 Where does the next token come from?
Generated tokens come from the model’s vocabulary (token dictionary):
- Vocabulary is fixed at model creation time.
- At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).
So:
- The model is not “fetching text from the internet” while generating.
- If you want fresh facts, you must use tools (RAG / browsing / APIs).
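The "probabilities across the entire vocabulary" step can be sketched in a few lines. The vocabulary and logit values below are toy placeholders, not from any real model:

```python
import math
import random

# Toy vocabulary and unnormalized scores (logits) a model might emit
# for the next token. Values are illustrative only.
vocab = ["Ha", " Noi", " is", " the", " capital", "."]
logits = [2.0, 0.5, 3.1, 0.2, 1.7, 0.1]

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token according to the distribution.
sampled_token = random.choices(vocab, weights=probs, k=1)[0]

print(greedy_token)  # " is" — it has the highest logit here
```

Real decoders add temperature, top-k, and top-p on top of this, but the core loop is exactly this: score every token in the vocabulary, pick one, append it, repeat.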
4. Three frequently confused concepts
4.1 Vocabulary size
Number of distinct tokens the model can output.
- Fixed (e.g., ~100K tokens depending on tokenizer/model).
4.2 Training tokens
Number of tokens in the dataset used for training.
- Often measured in trillions (T).
- Represents “experience volume,” not a stored Q&A database.
4.3 Context length
Maximum number of tokens the model can “see” at once.
- Includes prompt + chat history + retrieved docs + the output being generated.
- Exceed it and the model loses earlier parts.
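The context-length budget above is simple arithmetic; a minimal sketch (the token counts are made up, and real systems would count tokens with the model's tokenizer):

```python
def fits_in_context(prompt_tokens, history_tokens, retrieved_tokens,
                    max_output_tokens, context_length):
    """Everything the model 'sees' plus what it may generate must fit
    inside one context window."""
    used = prompt_tokens + history_tokens + retrieved_tokens + max_output_tokens
    return used <= context_length, context_length - used

# Example: 8K context, with room left over
ok, remaining = fits_in_context(500, 1500, 3000, 1024, 8192)
print(ok, remaining)  # True 2168
```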
5. What do 7B / 8B / 13B / 70B mean?
These refer to parameter count:
- B = Billion parameters (weights).
- 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.
Parameters are the model’s learned numeric weights—its compressed “experience.”
6. What is inside an LLM model?
A typical LLM package includes:
- Tokenizer: rules to split text into tokens
- Vocabulary: mapping token ↔ ID
- Embedding table: maps token IDs to vectors
- Transformer layers: attention + feed-forward blocks
- Parameters (weights): billions of learned numbers
- Output head: converts final representation to next-token probabilities (softmax)
What is NOT inside:
- No database of memorized articles or Q&A pairs.
- No built-in truth checker.
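The components listed above can be wired into a toy forward pass. This is a sketch with random weights and tiny dimensions, and a single matmul standing in for the attention + feed-forward blocks; it only shows how the pieces connect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (real models: vocab ~100K, d_model in the thousands, dozens of layers)
vocab_size, d_model = 10, 4

embedding = rng.normal(size=(vocab_size, d_model))  # embedding table
W_layer   = rng.normal(size=(d_model, d_model))     # stand-in for transformer layers
W_out     = embedding.T                             # output head (often tied to embeddings)

def next_token_probs(token_ids):
    # 1) token IDs -> vectors (embedding lookup)
    x = embedding[token_ids]
    # 2) mix context (real models use attention; one matmul stands in here)
    h = np.tanh(x @ W_layer).mean(axis=0)
    # 3) final representation -> scores over the whole vocabulary
    logits = h @ W_out
    # 4) softmax -> a probability for every token in the vocabulary
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = next_token_probs([3, 1, 4])
print(probs.shape)  # (10,) — one probability per vocabulary token
```

Note what is absent: no database lookup, no fact table. The output is purely a function of the weights and the current context.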
7. Comparing image vector search vs LLM generation
7.1 Image similarity search
- Image → encoder → vector
- Store vectors in a DB (FAISS / Milvus / pgvector)
- Query vector → nearest neighbors via cosine similarity
- Returns an existing item from the DB
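The retrieval step can be sketched without FAISS/Milvus: normalize the vectors, take dot products, sort. The vectors and labels below are random stand-ins for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are encoder outputs for images already stored in the DB.
db_vectors = rng.normal(size=(5, 8))
db_labels = ["cat", "dog", "car", "tree", "house"]

def cosine_search(query, vectors, labels, k=2):
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    top = np.argsort(-sims)[:k]  # indices of the k most similar items
    return [(labels[i], float(sims[i])) for i in top]

# Query with a slightly perturbed copy of the "car" vector:
query = db_vectors[2] + 0.01 * rng.normal(size=8)
print(cosine_search(query, db_vectors, db_labels))  # "car" ranks first
```

Dedicated vector DBs do the same thing with approximate-nearest-neighbor indexes so it scales to millions of vectors.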
7.2 LLM generation
- Text → tokens → embeddings
- Attention compares vectors internally to build a context representation
- Output is a probability distribution over the vocabulary
- Select next token, repeat
Core difference:
- Image search returns a stored item.
- LLM produces new text token-by-token.
7.3 When LLM becomes “search-like” (RAG)
RAG adds:
- Text → embedding → vector DB retrieval → inject documents → LLM answers based on retrieved sources.
So:
- Vector DB = searching
- LLM = composing/explaining
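The RAG pipeline above, as a minimal sketch. Word-overlap scoring stands in for the embedding + vector DB step, and the final prompt would go to an LLM in a real system:

```python
def retrieve(question, docs, k=2):
    # Stand-in for embedding + vector DB search: score docs by word overlap.
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, docs):
    # Inject the retrieved documents into the context.
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return (f"Answer using only the sources below.\n"
            f"Sources:\n{context}\n\n"
            f"Question: {question}")

docs = [
    "Hanoi is the capital of Vietnam.",
    "FAISS stores vectors for similarity search.",
    "LoRA trains a small adapter on top of frozen weights.",
]
print(build_prompt("What is the capital of Vietnam?", docs))
```

The split of responsibilities is visible in the code: `retrieve` does the searching, the LLM that consumes the prompt does the composing.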
8. What is an AI Agent?
An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.
Agent formula:
- Agent = program logic + LLM + tools + memory + feedback loop
Compared to a normal chatbot:
- Chatbot: one-shot answer
- Agent: plan → act → check → fix → repeat
9. Building a minimal agent
Required components:
- Goal: clear objective and success criteria
- Tools: shell/API/DB/filesystem/etc.
- Memory: logs / state (short-term and optionally long-term)
- Loop: repeat until success or stop condition
- Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
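The components above fit together as a loop. In this sketch `call_llm` is a hard-coded placeholder planner and the single tool is a toy; a real agent would prompt an actual model and expose real (sandboxed) tools:

```python
# Guardrail: allowlist of tools the agent may call.
ALLOWED_TOOLS = {
    "add": lambda a, b: a + b,  # toy tool standing in for shell/API/DB access
}
MAX_STEPS = 5  # guardrail: never loop forever

def call_llm(goal, observations):
    # Placeholder planner: real code would prompt an LLM for the next action.
    if not observations:
        return {"tool": "add", "args": [2, 3]}
    return {"done": True, "answer": observations[-1]}

def run_agent(goal):
    observations = []  # short-term memory / log
    for step in range(MAX_STEPS):
        action = call_llm(goal, observations)
        if action.get("done"):
            return action["answer"]
        tool = action["tool"]
        if tool not in ALLOWED_TOOLS:  # enforce the allowlist
            observations.append(f"error: tool {tool!r} not allowed")
            continue
        result = ALLOWED_TOOLS[tool](*action["args"])
        observations.append(result)  # observe the result, then iterate
    raise RuntimeError("stopped: max steps reached")

print(run_agent("add 2 and 3"))  # 5
```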
9.1 Why enforce structured outputs?
Because free-form text is hard to validate.
Common patterns:
- JSON-only output
- JSON + schema validation
- Function calling / tool calling
This allows:
- parse → validate → execute → verify → retry if needed
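A sketch of the parse → validate → retry pattern. The schema check is hand-rolled here for brevity; a real system might use `jsonschema` or Pydantic, and the "retry" would re-prompt the model instead of reading the next canned reply:

```python
import json

def validate(action):
    # Minimal schema check: a known tool name and a list of args.
    if not isinstance(action, dict):
        return "not an object"
    if action.get("tool") not in {"search", "calculate"}:
        return "unknown tool"
    if not isinstance(action.get("args"), list):
        return "args must be a list"
    return None  # valid

def parse_llm_output(replies):
    # Try each model reply until one parses and validates.
    for text in replies:
        try:
            action = json.loads(text)
        except json.JSONDecodeError:
            continue  # free-form text fails here -> retry
        if validate(action) is None:
            return action
    raise ValueError("no valid action after retries")

# Simulated model outputs: free text fails, then valid JSON succeeds.
replies = ['Sure! I will search.', '{"tool": "search", "args": ["LoRA"]}']
print(parse_llm_output(replies))
```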
10. Popular models and which ones can run locally
Cloud-only (generally not downloadable):
- GPT (OpenAI)
- Claude (Anthropic)
- Gemini (Google)
Open-weight/open-source (often runnable locally):
- LLaMA family (Meta)
- Mistral / Mixtral
- Qwen
- DeepSeek (code-focused variants are especially strong)
- Phi (small, efficient)
Local runtimes on macOS:
- Ollama
- LM Studio
- llama.cpp
11. Local model size estimates on Mac
Disk/RAM depends heavily on quantization.
Typical sizes (rough guidance):
- FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
- Q8: ~8–9 GB
- Q4/Q5 (most common for local use):
  - 7B Q4: ~4–5 GB
  - 8B Q4: ~4.5–5.5 GB
  - Q5 adds ~0.5–1.5 GB
Mac uses unified memory, so CPU+GPU share RAM.
12. Can you “build an LLM” on MacBook Air M3?
Practical reality:
- Training a foundation model from scratch: not realistic on a laptop.
- Running (inference) 7B/8B quantized locally: realistic.
- Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.
13. Adapting a model to a domain like healthcare
There are three main approaches:
13.1 Prompt/System instructions
- Changes only the context (runtime behavior).
- Does not change weights.
13.2 RAG (recommended first)
- Adds domain documents (guidelines, protocols, FAQ) into context.
- No weight changes.
- Easy to update and safer for fresh knowledge.
13.3 LoRA/QLoRA fine-tuning (Approach #3)
- Changes parameters via a small adapter.
- Best for behavior: formatting, refusal policy, questioning flow, tone.
- Not ideal for injecting large factual knowledge (use RAG for that).
14. Approach #3: How LoRA fine-tuning works
14.1 Concept
- Freeze base model weights.
- Train a small LoRA adapter (delta weights).
- During inference: base + delta.
- Can enable/disable adapters or swap adapters per domain.
14.2 Practical workflow
- Step 1: Choose a base model (7B/8B common).
- Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
- Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
- Step 4: Export adapter files (e.g., safetensors).
- Step 5: Load base model + adapter locally for inference.
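The "base + delta" idea from 14.1 is just low-rank matrix math. A toy sketch with random weights (real layers have dimensions in the thousands, and A/B are learned by gradient descent rather than set by hand):

```python
import numpy as np

rng = np.random.default_rng(3)

d = 8  # layer dimension (real layers: thousands)
r = 2  # LoRA rank, r << d

W = rng.normal(size=(d, d))          # frozen base weight — never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection (starts at zero)

def forward(x, use_adapter=True):
    # Base output plus the low-rank delta; swap/disable adapters by skipping it.
    y = x @ W.T
    if use_adapter:
        y = y + x @ (B @ A).T  # delta weights = B @ A (d x d matrix, but rank r)
    return y

x = rng.normal(size=d)
# With B initialized to zero, the adapter starts as a no-op:
print(np.allclose(forward(x, True), forward(x, False)))  # True
```

The payoff is in the parameter count: the adapter trains 2·d·r numbers instead of d·d, and the adapter file (Step 4) stores only A and B, which is why it is small enough to swap per domain.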
14.3 What to train for healthcare
Prefer training:
- structured response format
- asking clarifying questions
- refusal behavior and escalation cues
- citing sources (especially combined with RAG)
Avoid training:
- private patient records
- uncontrolled “medical knowledge dumps”
15. What comes after single-agent systems?
Common directions:
- Multi-agent setups: planner, executor, critic, safety checker
- Self-critique / verification loops
- World models / simulation before action
- Stronger human-in-the-loop controls
16. Vocabulary notes: terms + IPA
- token /ˈtoʊkən/: unit of text used by the model (often a word piece)
- vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
- context /ˈkɑːntekst/: the visible token window during inference
- parameter /pəˈræmɪtər/: a learned weight value in the network
- quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
- embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
- attention /əˈtenʃən/: mechanism that mixes information across tokens
- softmax /ˈsɔːftˌmæks/: converts scores into probabilities
- hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
- agent /ˈeɪdʒənt/: autonomous software using LLM + tools
- tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
- RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
- fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
- LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
- QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training
17. Quick implementation checklist
- Need accurate, updatable domain knowledge: do RAG first.
- Need consistent behavior/formatting: add LoRA.
- Need task execution: build an Agent with:
  - strict JSON/tool outputs
  - schema validation
  - allowlisted tools + sandbox
  - logs + max steps + stop rules
