System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning
1. Overview
An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.
In real products, you typically combine:
LLM: generates the next token (text output)
RAG: retrieves external documents and injects them into context
Agent: software that uses LLM + tools + feedback loop to complete tasks
LLM in one line
A practical mental model: LLM = learned parameters (compressed experience) + tokenizer/vocabulary (what it can output) + context window (what it can see right now).
Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).
Glossary quick notes
parameter /pəˈræmɪtər/: a learned numeric value in the network
weight /weɪt/: a learned weight (one kind of parameter)
tokenizer /ˈtoʊkəˌnaɪzər/: splits and encodes text into tokens
vocabulary /vəˈkæbjəˌleri/: the model's token dictionary
context window /ˈkɑːntekst ˈwɪndoʊ/: the visible context (short-term memory)
2. Does an LLM “understand” or only generate probabilistic text?
Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.
Key implications:
It does not have consciousness or human-like understanding.
It can *appear* to understand because it learned patterns from massive data.
It does not inherently know whether its output is true or false.
Bad / biased / outdated training data can lead to bad answers.
Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).
3. What are tokens, and where do generated tokens come from?
3.1 Tokens are not necessarily “words”
Tokens can be:
word pieces, characters, punctuation, whitespace
e.g., “Ha” + “ Noi” can be separate tokens
3.2 Where does the next token come from?
Generated tokens come from the model’s vocabulary (token dictionary):
Vocabulary is fixed at model creation time.
At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).
So:
The model is not “fetching text from the internet” while generating.
If you want fresh facts, you must use tools (RAG / browsing / APIs).
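The decoding step described above (probabilities over the entire vocabulary, then greedy or sampled selection) can be sketched as follows. The vocabulary and logit values are invented for illustration:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities over the vocabulary."""
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary and made-up logits for one decoding step.
vocab = ["Ha", " Noi", " is", " the", " capital"]
logits = np.array([0.2, 2.5, 0.7, 0.1, 1.1])

probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy_token = vocab[int(np.argmax(probs))]   # " Noi" has the highest logit

# Sampling with temperature: >1 flattens the distribution, <1 sharpens it.
temperature = 0.8
t_probs = softmax(logits / temperature)
rng = np.random.default_rng(0)
sampled_token = vocab[rng.choice(len(vocab), p=t_probs)]
```

Note that every candidate comes from the fixed vocabulary; nothing outside it can ever be emitted.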
4. Three frequently confused concepts
4.1 Vocabulary size
Number of distinct tokens the model can output.
4.2 Training tokens
Number of tokens in the dataset used for training.
4.3 Context length
Maximum number of tokens the model can “see” at once.
5. What do 7B / 8B / 13B / 70B mean?
These refer to parameter count:
Parameters are the model’s learned numeric weights—its compressed “experience.”
6. What is inside an LLM model?
A typical LLM package includes:
Tokenizer: rules to split text into tokens
Vocabulary: mapping token ↔ ID
Embedding table: maps token IDs to vectors
Transformer layers: attention + feed-forward blocks
Parameters (weights): billions of learned numbers
Output head: converts final representation to next-token probabilities (softmax)
What is NOT inside: the raw training data, a database of facts, or live internet access.
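The included components can be sketched as a toy forward pass. Shapes and values are invented, and the transformer layers are stubbed out; a real model applies many attention + feed-forward blocks there:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, d_model = 10, 4

# Embedding table: token ID -> vector (learned parameters in a real model).
embedding = rng.normal(size=(vocab_size, d_model))

def transformer_layers(x):
    # Stand-in: real models apply attention + feed-forward blocks here.
    return x

# Output head: final hidden vector -> logits over the whole vocabulary.
output_head = rng.normal(size=(d_model, vocab_size))

token_ids = [3, 7, 1]                 # output of the tokenizer
x = embedding[token_ids]              # (3, d_model)
h = transformer_layers(x)
logits = h[-1] @ output_head          # next-token scores, shape (vocab_size,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax -> next-token probabilities
```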
7. Comparing image vector search vs LLM generation
7.1 Image similarity search (example)
Image → encoder → vector
Store vectors in a DB (FAISS / Milvus / pgvector)
Query vector → nearest neighbors via cosine similarity
Returns an existing item from the DB
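The retrieval step above reduces to a nearest-neighbor search by cosine similarity. A minimal numpy version (the vectors stand in for encoder outputs; a real system would use FAISS / Milvus / pgvector at scale):

```python
import numpy as np

def cosine_top_k(query, db, k=2):
    """Indices of the k most similar DB vectors by cosine similarity."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity to each stored vector
    return np.argsort(-sims)[:k]

# Pretend these came from an image encoder.
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])

idx = cosine_top_k(query, db, k=2)
# Nearest neighbors are the stored items pointing in almost the same direction.
```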
7.2 LLM generation
Text → tokens → embeddings
Attention compares vectors internally to build a context representation
Output is a probability distribution over the vocabulary
Select next token, repeat
Core difference: vector search returns an existing item from a database; an LLM generates a new token sequence that may not exist anywhere verbatim.
7.3 When LLM becomes “search-like” (RAG)
With RAG, a retriever first finds relevant documents (often via the same vector-similarity search as above) and injects them into the context; the LLM then generates an answer grounded in those documents.
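A toy RAG sketch. Keyword overlap stands in for real embedding search, and the document strings are invented; in practice you would embed the query and documents and query a vector DB:

```python
docs = [
    "Paracetamol maximum adult dose is typically 4 g per day.",
    "Hanoi is the capital of Vietnam.",
    "LoRA trains a small adapter on top of frozen base weights.",
]

def retrieve(query, docs, k=1):
    """Toy retrieval: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Inject the retrieved documents into the LLM's context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Use only the context below.\nContext:\n{context}\nQuestion: {query}"

prompt = build_prompt("What is the capital of Vietnam?", docs)
# The LLM then generates an answer conditioned on the injected context.
```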
8. What is an AI Agent?
An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.
Agent formula: Agent = LLM (decide) + Tools (act) + Observation (feedback) + Loop (iterate until done or stopped).
Compared to a normal chatbot: a chatbot only produces text; an agent executes actions and adapts based on their results.
9. Building a minimal agent
Required components:
Goal: clear objective and success criteria
Tools: shell / API / DB / filesystem / etc.
Memory: logs / state (short-term and optionally long-term)
Loop: repeat until success or stop condition
Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
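The components above can be wired into a minimal loop. `llm_decide` is a hypothetical stand-in for a model call (a real agent would parse structured LLM output), and `eval` here is demo-only, never production-safe:

```python
def llm_decide(goal, observations):
    """Hypothetical LLM policy: return the next action as a dict."""
    if any("42" in o for o in observations):
        return {"tool": "finish", "arg": "42"}
    return {"tool": "calc", "arg": "6 * 7"}

TOOLS = {"calc": lambda expr: str(eval(expr))}   # allowlist (guardrail)
MAX_STEPS = 5                                    # guardrail: step budget

def run_agent(goal):
    observations = []                            # short-term memory / log
    for _ in range(MAX_STEPS):
        action = llm_decide(goal, observations)
        if action["tool"] == "finish":
            return action["arg"]                 # success criterion met
        result = TOOLS[action["tool"]](action["arg"])
        observations.append(result)              # observe, then iterate
    return None                                  # stop condition hit

answer = run_agent("Compute 6 * 7")
```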
9.1 Why enforce structured outputs?
Because free-form text is hard to validate.
Common patterns: JSON-only replies, function/tool-calling schemas, enumerated action types.
This allows: automatic parsing, schema validation, safe tool dispatch, and retries when output is malformed.
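A minimal validation sketch, using stdlib `json` and a hand-rolled check in place of a full schema library:

```python
import json

def validate_action(text):
    """Parse a model reply and check it against a tiny 'schema'."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None                    # free-form prose: reject
    if not isinstance(obj, dict) or "tool" not in obj or "arg" not in obj:
        return None                    # JSON, but wrong shape: reject
    return obj

# Free-form reply is rejected; structured reply is machine-usable.
bad = validate_action("Sure! I'll use the calculator to do 6*7.")
good = validate_action('{"tool": "calc", "arg": "6 * 7"}')

# On None, the agent can re-prompt ("reply with valid JSON only")
# instead of trying to parse prose.
```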
10. Popular models and which ones can run locally
Quick mental model fields (apply to every LLM)
For each model below, capture:
1) Learned experience = Parameters / Weights (e.g., 7B/8B/70B)
2) What it can produce = Vocabulary (tokenizer + vocab size; fixed per model)
3) Temporary memory = Context window (max tokens visible at once)
Cloud-only (generally not downloadable)
GPT (OpenAI)
Parameters/Weights: Not publicly disclosed (varies by product/version).
Vocabulary: Proprietary tokenizer; vocab size not consistently published.
Context window: Varies by model tier/version (check product docs).
Notes: Strong general reasoning + tool ecosystem.
Claude (Anthropic)
Parameters/Weights: Not publicly disclosed.
Vocabulary: Proprietary tokenizer.
Context window: Varies by model tier/version (check product docs).
Notes: Strong long-form writing and code assistance.
Gemini (Google)
Parameters/Weights: Not publicly disclosed.
Vocabulary: Proprietary tokenizer.
Context window: Varies by model tier/version (check product docs).
Notes: Strong multimodal and large-context options (depending on version).
Open-weight/open-source (often runnable locally)
LLaMA (Meta)
Parameters/Weights: Common sizes include 7B/8B/13B/70B (depends on generation).
Vocabulary: Fixed per LLaMA generation (tokenizer + vocab size depends on version).
Context window: Varies by generation (older versions smaller; newer may be larger).
Local use: Best with quantized GGUF via llama.cpp / Ollama / LM Studio.
Mistral / Mixtral
Parameters/Weights: Mistral commonly 7B-class; Mixtral uses MoE (Mixture-of-Experts) variants.
Vocabulary: Fixed per model release (tokenizer-specific).
Context window: Varies by release/version.
Local use: Mistral 7B-class is popular for fast local inference.
Qwen
Parameters/Weights: Multiple sizes (small → large; common local picks: ~7B-class).
Vocabulary: Fixed per Qwen generation (tokenizer-specific).
Context window: Varies by release/version (some versions support larger contexts).
Local use: Often strong multilingual performance.
DeepSeek (especially strong for code variants)
Parameters/Weights: Multiple sizes; common local coder models are ~6–7B-class.
Vocabulary: Fixed per model/tokenizer version.
Context window: Varies by release/version.
Local use: Code-focused variants are widely used for dev tasks.
Phi (small, efficient)
Parameters/Weights: Small models (often ~2–4B-class depending on version).
Vocabulary: Fixed per Phi release/tokenizer.
Context window: Varies by version.
Local use: Great for low-resource devices; fast inference.
Local runtimes on macOS
Ollama: simple CLI + local server for pulling and running quantized models
LM Studio: GUI app for downloading and chatting with local models
llama.cpp: C/C++ inference engine (GGUF format) behind many local tools
Glossary (hard terms)
parameter /pəˈræmɪtər/: a learned numeric value (learned weight)
weight /weɪt/: a learned weight
vocabulary /vəˈkæbjəˌleri/: token dictionary (the set of tokens the model can generate)
token /ˈtoʊkən/: a small unit (piece) of text
context window /ˈkɑːntekst ˈwɪndoʊ/: the visible context (temporary memory)
proprietary /prəˈpraɪəˌteri/: closed, not publicly released
open-weight /ˌoʊpən ˈweɪt/: weights are published
quantized /ˈkwɑːntaɪzd/: numeric precision reduced (compressed)
runtime /ˈrʌnˌtaɪm/: the environment/program that runs the model
variant /ˈveriənt/: a variant/version
Mixture-of-Experts (MoE) /ˈmɪkstʃər əv ˈekspɜːrts/: “many experts” architecture (only part of the model activates per step)
11. Local model size estimates on Mac
Disk/RAM depends heavily on quantization.
Typical sizes (rough guidance): a 7B model at 4-bit quantization is roughly 4 GB, 13B roughly 8 GB, 70B roughly 40 GB, plus working memory for the context (KV cache).
Mac uses unified memory, so CPU+GPU share RAM.
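The sizes above follow from a simple rule of thumb: parameters × bits per weight ÷ 8. A back-of-envelope sketch (real GGUF files add some overhead, so treat these as floors):

```python
def rough_model_size_gb(params_billions, bits_per_weight):
    """Floor estimate of file/RAM size: parameters * bits per weight."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9   # decimal GB

for p in (7, 13, 70):
    print(p, "B @ 4-bit ≈", round(rough_model_size_gb(p, 4), 1), "GB")
# 7B at 4-bit ≈ 3.5 GB, 13B ≈ 6.5 GB, 70B ≈ 35 GB, before overhead
```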
12. Can you “build an LLM” on MacBook Air M3?
Practical reality:
Training a foundation model from scratch: not realistic on a laptop.
Running (inference) 7B/8B quantized locally: realistic.
Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.
13. Adapting a model to a domain like healthcare
There are three main approaches:
13.1 Prompt/System instructions
Cheapest option: steer tone and behavior via the system prompt. No weights change, but everything must fit in the context window.
13.2 RAG (recommended first)
Retrieve domain documents at query time and inject them into context. Knowledge stays external, auditable, and updatable without retraining.
13.3 LoRA/QLoRA fine-tuning (Approach #3)
Changes parameters via a small adapter.
Best for behavior: formatting, refusal policy, questioning flow, tone.
Not ideal for injecting large factual knowledge (use RAG for that).
14. Approach #3: How LoRA fine-tuning works
14.1 Concept
Freeze base model weights.
Train a small LoRA adapter (delta weights).
During inference: base + delta.
Can enable/disable adapters or swap adapters per domain.
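The concept above is just low-rank matrix math: the effective weight is W + B·A, where only the small A and B matrices are trained. A numpy sketch with invented sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size 8, LoRA rank 2

W = rng.normal(size=(d, d))        # frozen base weight (never updated)
A = rng.normal(size=(r, d)) * 0.01 # trained adapter half
B = np.zeros((d, r))               # B starts at zero, so the delta starts at 0

# Effective weight during inference: base + low-rank delta.
scaling = 1.0                      # alpha / r in real implementations
W_eff = W + scaling * (B @ A)

# Only A and B are trained: 2*d*r = 32 numbers here instead of d*d = 64;
# at 7B scale the saving is orders of magnitude larger. Swapping adapters
# per domain means swapping A and B while W stays on disk unchanged.
```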
14.2 Practical workflow
Step 1: Choose a base model (7B/8B common).
Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
Step 4: Export adapter files (e.g., safetensors).
Step 5: Load base model + adapter locally for inference.
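Step 2 above can be sketched as JSONL shaping, a shape most LoRA trainers can consume. The field names and the example pair are illustrative, not a framework's required schema:

```python
import json

# Instruction -> response pairs, domain-shaped (healthcare example invented).
examples = [
    {
        "instruction": "A patient reports a mild headache. What should you ask first?",
        "response": "Ask about onset, duration, severity (1-10), and any "
                    "red-flag symptoms before giving guidance. Not a diagnosis.",
    },
]

# One JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
# Write `jsonl` to train.jsonl and point the trainer's dataset config at it.
```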
14.3 What to train for healthcare
Prefer training:
structured response format
asking clarifying questions
refusal behavior and escalation cues
citing sources (especially combined with RAG)
Avoid training: large bodies of factual medical knowledge into the weights (facts go stale and are hard to audit; keep them in RAG).
15. What comes after single-agent systems?
Common directions:
Multi-agent setups: planner, executor, critic, safety checker
Self-critique / verification loops
World models / simulation before action
Stronger human-in-the-loop controls
16. Vocabulary notes: terms + IPA
token /ˈtoʊkən/: unit of text used by the model (often a word piece)
vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
context /ˈkɑːntekst/: the visible token window during inference
parameter /pəˈræmɪtər/: a learned weight value in the network
quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
attention /əˈtenʃən/: mechanism that mixes information across tokens
softmax /ˈsɔːftˌmæks/: converts scores into probabilities
hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
agent /ˈeɪdʒənt/: autonomous software using LLM + tools
tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training
17. Quick implementation checklist
Need accurate, updatable domain knowledge: do RAG first.
Need consistent behavior/formatting: add LoRA.
Need task execution: build an Agent with a goal, tools, memory, a loop, and guardrails (see Section 9).