System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning
1. Overview
An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.
In real products, you typically combine:
- LLM: generates the next token (text output)
- RAG: retrieves external documents and injects them into context
- Agent: software that uses LLM + tools + feedback loop to complete tasks
LLM in one line
A practical mental model:
- 1) Learned experience = Parameters / Weights
  - The "knowledge compressed into numbers" learned during training (e.g., 7B/8B/70B parameters).
- 2) What it can produce = Vocabulary
  - The fixed set of tokens the model can output.
- 3) Temporary memory = Context window
  - The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).
Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).
Glossary quick notes
- parameter /pəˈræmɪtər/: a learned value in the network
- weight /weɪt/: a learned numeric value (one kind of parameter)
- tokenizer /ˈtoʊkəˌnaɪzər/: component that splits and encodes text into tokens
- vocabulary /vəˈkæbjəˌleri/: the model's token dictionary
- context window /ˈkɑːntekst ˈwɪndoʊ/: the model's short-term memory span
2. Does an LLM “understand” or only generate probabilistic text?
Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.
Key implications:
- It does not have consciousness or human-like understanding.
- It can *appear* to understand because it learned patterns from massive data.
- It does not inherently know whether its output is true or false.
- Bad / biased / outdated training data can lead to bad answers.
- Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).
3. What are tokens, and where do generated tokens come from?
3.1 Tokens are not necessarily “words”
Tokens can be:
- word pieces, characters, punctuation, whitespace
- e.g., “Ha” + “ Noi” can be separate tokens
3.2 Where does the next token come from?
Generated tokens come from the model’s vocabulary (token dictionary):
- Vocabulary is fixed at model creation time.
- At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).
So:
- The model is not “fetching text from the internet” while generating.
- If you want fresh facts, you must use tools (RAG / browsing / APIs).
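The per-step selection described above can be sketched in a few lines. This is a toy illustration — the vocabulary and logits are made up, and a real model computes logits with billions of weights:

```python
import math
import random

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and one step of raw model scores (logits) -- both invented.
vocab = ["Ha", " Noi", " is", " the", "."]
logits = [2.0, 1.5, 0.3, 0.1, -1.0]

probs = softmax(logits)

# Greedy decoding: always pick the highest-probability token.
greedy_token = vocab[probs.index(max(probs))]   # "Ha"

# Sampling with temperature: rescale logits, then draw at random.
def sample_token(logits, temperature=0.8):
    p = softmax([x / temperature for x in logits])
    return random.choices(vocab, weights=p, k=1)[0]
```

Lower temperature sharpens the distribution toward the greedy choice; higher temperature flattens it and increases diversity.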
4. Three frequently confused concepts
4.1 Vocabulary size
Number of distinct tokens the model can output.
- Fixed (e.g., ~100K tokens depending on tokenizer/model).
4.2 Training tokens
Number of tokens in the dataset used for training.
- Often measured in trillions (T).
- Represents “experience volume,” not a stored Q&A database.
4.3 Context length
Maximum number of tokens the model can “see” at once.
- Includes prompt + chat history + retrieved docs + the output being generated.
- Exceed it and the model loses earlier parts.
5. What do 7B / 8B / 13B / 70B mean?
These refer to parameter count:
- B = Billion parameters (weights).
- 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.
Parameters are the model’s learned numeric weights—its compressed “experience.”
6. What is inside an LLM model?
A typical LLM package includes:
- Tokenizer: rules to split text into tokens
- Vocabulary: mapping token ↔ ID
- Embedding table: maps token IDs to vectors
- Transformer layers: attention + feed-forward blocks
- Parameters (weights): billions of learned numbers
- Output head: converts final representation to next-token probabilities (softmax)
What is NOT inside:
- No database of memorized articles or Q&A pairs.
- No built-in truth checker.
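The components above can be wired together in a toy sketch. Everything here is invented for illustration (tiny vocabulary, random weights, and embedding-averaging standing in for the transformer layers, which really use stacked attention + feed-forward blocks):

```python
import math
import random

random.seed(0)

VOCAB = ["<bos>", "Ha", " Noi", "."]   # vocabulary: token <-> ID mapping
EMBED_DIM = 4

# Embedding table: one learned vector per token ID (random stand-ins here).
embedding = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

# Output head: projects the final hidden vector back onto vocabulary scores.
output_head = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

def forward(token_ids):
    # Stand-in for the transformer layers: just average the token embeddings.
    hidden = [sum(embedding[t][d] for t in token_ids) / len(token_ids)
              for d in range(EMBED_DIM)]
    # Output head + softmax -> next-token probability distribution.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_head]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = forward([0, 1])   # IDs for "<bos>" and "Ha"
```

Note what is absent: no lookup table of facts, no truth check — just numbers flowing from embeddings to a probability distribution.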
7. Comparing image vector search vs LLM generation
7.1 Image similarity search
- Image → encoder → vector
- Store vectors in a DB (FAISS / Milvus / pgvector)
- Query vector → nearest neighbors via cosine similarity
- Returns an existing item from the DB
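A minimal sketch of this retrieval flow, using hand-rolled cosine similarity and an in-memory dict as a stand-in for FAISS/Milvus/pgvector (the IDs and vectors are made up):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in vector DB: precomputed image embeddings keyed by item ID.
db = {
    "img_001": [0.9, 0.1, 0.0],
    "img_002": [0.0, 1.0, 0.2],
    "img_003": [0.7, 0.6, 0.1],
}

def nearest(query, k=2):
    # Rank all stored vectors by similarity to the query; return top-k IDs.
    scored = sorted(db.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [item_id for item_id, _ in scored[:k]]

top = nearest([1.0, 0.0, 0.0])   # returns existing items, never new content
```

The key property: the result is always something already in the DB.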
7.2 LLM generation
- Text → tokens → embeddings
- Attention compares vectors internally to build a context representation
- Output is a probability distribution over the vocabulary
- Select next token, repeat
Core difference:
- Image search returns a stored item.
- LLM produces new text token-by-token.
7.3 When LLM becomes “search-like” (RAG)
RAG adds:
- Text → embedding → vector DB retrieval → inject documents → LLM answers based on retrieved sources.
So:
- Vector DB = searching
- LLM = composing/explaining
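A minimal end-to-end sketch of that pipeline, with word overlap standing in for embedding similarity (a real system would embed the query and search a vector DB; the documents here are invented):

```python
# Toy document store; in practice these would live in a vector DB.
DOCS = [
    "Hypertension guideline: first-line treatment is lifestyle change.",
    "Diabetes protocol: check HbA1c every three months.",
]

def retrieve(query, k=1):
    # Stand-in for embedding search: score docs by word overlap with the query.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(DOCS, key=score, reverse=True)[:k]

def build_prompt(query):
    # Inject the retrieved documents into the context ahead of the question.
    context = "\n".join(retrieve(query))
    return f"Answer using ONLY these sources:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What is the hypertension treatment?")
# The assembled prompt (retrieved docs + question) is what gets sent to the LLM.
```

The split of labor is visible: retrieval finds the sources, the LLM only composes the answer from them.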
8. What is an AI Agent?
An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.
Agent formula:
- Agent = program logic + LLM + tools + memory + feedback loop
Compared to a normal chatbot:
- Chatbot: one-shot answer
- Agent: plan → act → check → fix → repeat
9. Building a minimal agent
Required components:
- Goal: clear objective and success criteria
- Tools: shell/API/DB/filesystem/etc.
- Memory: logs / state (short-term and optionally long-term)
- Loop: repeat until success or stop condition
- Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
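These components fit together in a loop like the following sketch. The goal, tool, and hard-coded `call_llm` stub are all hypothetical — a real agent would call an actual LLM and expose a richer tool set:

```python
ALLOWED_TOOLS = {"add"}   # guardrail: tool allowlist
MAX_STEPS = 5             # guardrail: hard step limit

def add_tool(a, b):
    return a + b

def call_llm(goal, memory):
    # Stand-in for a real LLM call: always proposes the "add" tool.
    return {"tool": "add", "args": [2, 3]}

def run_agent(goal, target):
    memory = []   # short-term memory: a log of actions and observations
    for _ in range(MAX_STEPS):
        action = call_llm(goal, memory)          # plan
        if action["tool"] not in ALLOWED_TOOLS:  # guardrail check
            memory.append(("refused", action))
            continue
        result = add_tool(*action["args"])       # act
        memory.append((action, result))          # observe / log
        if result == target:                     # check success criterion
            return result, memory
    return None, memory                          # stop condition reached

result, log = run_agent("compute 2 + 3", target=5)
```

Every guardrail from the list above appears explicitly: the allowlist, the step cap, the log, and the stop condition.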
9.1 Why enforce structured outputs?
Because free-form text is hard to validate.
Common patterns:
- JSON-only output
- JSON + schema validation
- Function calling / tool calling
This allows:
- parse → validate → execute → verify → retry if needed
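A sketch of that parse → validate → retry pipeline, using a hand-written validator (in practice you might use a JSON Schema library; the expected output shape here is an assumed example):

```python
import json

def validate(obj):
    # Assumed schema: the model must emit {"tool": str, "args": list}.
    return (isinstance(obj, dict)
            and isinstance(obj.get("tool"), str)
            and isinstance(obj.get("args"), list))

def parse_with_retry(raw_outputs):
    # Try each model output in turn; a real agent would re-prompt on failure.
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue              # malformed JSON -> retry
        if validate(obj):
            return obj            # parse -> validate succeeded
    return None

# First attempt is free-form text (rejected); the second is valid JSON.
attempts = ['Sure! I will call the add tool.',
            '{"tool": "add", "args": [2, 3]}']
action = parse_with_retry(attempts)
```

Only after validation does the agent execute the tool, which is what makes the verify/retry steps possible.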
10. Popular models and which ones can run locally
Quick mental-model checklist (applies to every LLM)
For each model below, capture:
- 1) Learned experience = Parameters / Weights (e.g., 7B/8B/70B)
- 2) What it can produce = Vocabulary (tokenizer + vocab size; fixed per model)
- 3) Temporary memory = Context window (max tokens visible at once)
Cloud-only (generally not downloadable)
GPT (OpenAI)
- Parameters/Weights: Not publicly disclosed (varies by product/version).
- Vocabulary: Proprietary tokenizer; vocab size not consistently published.
- Context window: Varies by model tier/version (check product docs).
- Notes: Strong general reasoning + tool ecosystem.
Claude (Anthropic)
- Parameters/Weights: Not publicly disclosed.
- Vocabulary: Proprietary tokenizer.
- Context window: Varies by model tier/version (check product docs).
- Notes: Strong long-form writing and code assistance.
Gemini (Google)
- Parameters/Weights: Not publicly disclosed.
- Vocabulary: Proprietary tokenizer.
- Context window: Varies by model tier/version (check product docs).
- Notes: Strong multimodal and large-context options (depending on version).
Open-weight/open-source (often runnable locally)
LLaMA family (Meta)
- Parameters/Weights: Common sizes include 7B/8B/13B/70B (depends on generation).
- Vocabulary: Fixed per LLaMA generation (tokenizer + vocab size depends on version).
- Context window: Varies by generation (older versions smaller; newer may be larger).
- Local use: Best with quantized GGUF via llama.cpp / Ollama / LM Studio.
Mistral / Mixtral
- Parameters/Weights: Mistral commonly 7B-class; Mixtral uses MoE (Mixture-of-Experts) variants.
- Vocabulary: Fixed per model release (tokenizer-specific).
- Context window: Varies by release/version.
- Local use: Mistral 7B-class is popular for fast local inference.
Qwen
- Parameters/Weights: Multiple sizes (small → large; common local picks: ~7B-class).
- Vocabulary: Fixed per Qwen generation (tokenizer-specific).
- Context window: Varies by release/version (some versions support larger contexts).
- Local use: Often strong multilingual performance.
DeepSeek (especially strong for code variants)
- Parameters/Weights: Multiple sizes; common local coder models are ~6–7B-class.
- Vocabulary: Fixed per model/tokenizer version.
- Context window: Varies by release/version.
- Local use: Code-focused variants are widely used for dev tasks.
Phi (small, efficient)
- Parameters/Weights: Small models (often ~2–4B-class depending on version).
- Vocabulary: Fixed per Phi release/tokenizer.
- Context window: Varies by version.
- Local use: Great for low-resource devices; fast inference.
Local runtimes on macOS
Ollama
- Purpose: Simplest local runner (download + run models easily).
- Works best with: Quantized GGUF models.
LM Studio
- Purpose: GUI app to download, run, and chat with local models.
- Works best with: Quantized GGUF models, easy model management.
llama.cpp
- Purpose: High-performance local inference engine for GGUF models.
- Works best with: Fine-grained control and optimization on CPU/Metal.
Glossary (hard terms)
- parameter /pəˈræmɪtər/: a learned value (weight) in the network
- weight /weɪt/: a learned numeric value
- vocabulary /vəˈkæbjəˌleri/: the token dictionary (set of tokens the model can generate)
- token /ˈtoʊkən/: a small unit (piece) of text
- context window /ˈkɑːntekst ˈwɪndoʊ/: the visible token span (temporary memory)
- proprietary /prəˈpraɪəˌteri/: closed; not publicly released
- open-weight /ˌoʊpən ˈweɪt/: with publicly released weights
- quantized /ˈkwɑːntaɪzd/: with numeric precision reduced (compressed)
- runtime /ˈrʌnˌtaɪm/: the environment/program that runs the model
- variant /ˈveriənt/: a variant or version
- Mixture-of-Experts (MoE) /ˈmɪkstʃər əv ˈekspɜːrts/: "many experts" architecture (only part of the model activates per step)
11. Local model size estimates on Mac
Disk/RAM depends heavily on quantization.
Typical sizes (rough guidance):
- FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
- Q8: ~8–9 GB
- Q4/Q5 (most common for local use):
  - 7B Q4: ~4–5 GB
  - 8B Q4: ~4.5–5.5 GB
  - Q5 adds ~0.5–1.5 GB
Apple-silicon Macs use unified memory, so the CPU and GPU share the same RAM pool; the model must fit alongside everything else running.
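The rough sizes above follow from simple arithmetic: parameter count × bits per weight. A quick sanity check, keeping in mind that real quantized files run larger than this lower bound because of mixed-precision layers and metadata:

```python
def model_size_gb(params_billion, bits_per_weight):
    """Lower-bound size in GB: parameters x bits, ignoring file overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(7, 16)   # 14.0 GB -- matches the ~14-16 GB estimate
q8   = model_size_gb(7, 8)    # 7.0 GB  -- real Q8 files land nearer 8-9 GB
q4   = model_size_gb(7, 4)    # 3.5 GB  -- real Q4 files land nearer 4-5 GB
```

The gap between the arithmetic and the observed file sizes is the quantization overhead (scales, non-quantized layers, tokenizer metadata).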
12. Can you “build an LLM” on MacBook Air M3?
Practical reality:
- Training a foundation model from scratch: not realistic on a laptop.
- Running (inference) 7B/8B quantized locally: realistic.
- Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.
13. Adapting a model to a domain like healthcare
There are three main approaches:
13.1 Prompt/System instructions
- Changes only the context (runtime behavior).
- Does not change weights.
13.2 RAG (recommended first)
- Adds domain documents (guidelines, protocols, FAQ) into context.
- No weight changes.
- Easy to update and safer for fresh knowledge.
13.3 LoRA/QLoRA fine-tuning (Approach #3)
- Changes parameters via a small adapter.
- Best for behavior: formatting, refusal policy, questioning flow, tone.
- Not ideal for injecting large factual knowledge (use RAG for that).
14. Approach #3: How LoRA fine-tuning works
14.1 Concept
- Freeze base model weights.
- Train a small LoRA adapter (delta weights).
- During inference: base + delta.
- Can enable/disable adapters or swap adapters per domain.
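The idea can be shown with plain matrices: the effective weight is the frozen base W plus a low-rank product A·B. The dimensions and random values below are toy stand-ins (real adapters attach to attention/FFN weight matrices inside the model):

```python
import random

random.seed(0)
D, R = 4, 2   # hidden size D, LoRA rank R (in practice R << D)

def rand_matrix(rows, cols, scale=1.0):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W = rand_matrix(D, D)        # frozen base weight (never updated)
A = rand_matrix(D, R, 0.1)   # trainable LoRA factor (down-projection)
B = rand_matrix(R, D, 0.1)   # trainable LoRA factor (up-projection)

# Effective weight at inference: base plus low-rank delta (W + A @ B).
delta = matmul(A, B)
W_eff = [[W[i][j] + delta[i][j] for j in range(D)] for i in range(D)]

# Disabling the adapter just means skipping the delta, so W_eff == W again.
```

Because only A and B are trained (D·R + R·D values instead of D·D), the adapter is tiny relative to the base model, which is why adapters are cheap to store and swap per domain.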
14.2 Practical workflow
- Step 1: Choose a base model (7B/8B common).
- Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
- Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
- Step 4: Export adapter files (e.g., safetensors).
- Step 5: Load base model + adapter locally for inference.
14.3 What to train for healthcare
Prefer training:
- structured response format
- asking clarifying questions
- refusal behavior and escalation cues
- citing sources (especially combined with RAG)
Avoid training:
- private patient records
- uncontrolled “medical knowledge dumps”
15. What comes after single-agent systems?
Common directions:
- Multi-agent setups: planner, executor, critic, safety checker
- Self-critique / verification loops
- World models / simulation before action
- Stronger human-in-the-loop controls
16. Vocabulary notes: terms + IPA
- token /ˈtoʊkən/: unit of text used by the model (often a word piece)
- vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
- context /ˈkɑːntekst/: the visible token window during inference
- parameter /pəˈræmɪtər/: a learned weight value in the network
- quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
- embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
- attention /əˈtenʃən/: mechanism that mixes information across tokens
- softmax /ˈsɔːftˌmæks/: converts scores into probabilities
- hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
- agent /ˈeɪdʒənt/: autonomous software using LLM + tools
- tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
- RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
- fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
- LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
- QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training
17. Quick implementation checklist
- Need accurate, updatable domain knowledge: do RAG first.
- Need consistent behavior/formatting: add LoRA.
- Need task execution: build an Agent with:
  - strict JSON/tool outputs
  - schema validation
  - allowlisted tools + sandbox
  - logs + max steps + stop rules
