System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning
1. Overview
An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.
In real products, you typically combine:
LLM: generates the next token (text output)
RAG: retrieves external documents and injects them into context
Agent: software that uses LLM + tools + feedback loop to complete tasks
LLM in one line
A practical mental model: LLM = learned parameters (compressed experience) + tokenizer/vocabulary (what it can output) + context window (what it can see right now).
Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).
Glossary quick notes
parameter /pəˈræmɪtər/: a learned numeric value in the network
weight /weɪt/: a learned weight (one kind of parameter)
tokenizer /ˈtoʊkəˌnaɪzər/: splits and encodes text into tokens
vocabulary /vəˈkæbjəˌleri/: the model's token dictionary
context window /ˈkɑːntekst ˈwɪndoʊ/: the visible context (short-term memory)
2. Does an LLM “understand” or only generate probabilistic text?
Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.
Key implications:
It does not have consciousness or human-like understanding.
It can *appear* to understand because it learned patterns from massive data.
It does not inherently know whether its output is true or false.
Bad / biased / outdated training data can lead to bad answers.
Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).
3. What are tokens, and where do generated tokens come from?
3.1 Tokens are not necessarily “words”
Tokens can be:
word pieces, characters, punctuation, whitespace
e.g., “Ha” + “ Noi” can be separate tokens
3.2 Where does the next token come from?
Generated tokens come from the model’s vocabulary (token dictionary):
Vocabulary is fixed at model creation time.
At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).
So:
The model is not “fetching text from the internet” while generating.
If you want fresh facts, you must use tools (RAG / browsing / APIs).
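The decoding step described above (probabilities over the entire vocabulary, then greedy or sampled selection) can be sketched as follows. The vocabulary and logit values are invented for illustration:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities over the vocabulary."""
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary and made-up logits for one decoding step.
vocab = ["Ha", " Noi", " is", " the", " capital"]
logits = np.array([0.2, 2.5, 0.7, 0.1, 1.1])

probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy_token = vocab[int(np.argmax(probs))]   # " Noi" has the highest logit

# Sampling with temperature: >1 flattens the distribution, <1 sharpens it.
temperature = 0.8
t_probs = softmax(logits / temperature)
rng = np.random.default_rng(0)
sampled_token = vocab[rng.choice(len(vocab), p=t_probs)]
```

Note that every candidate comes from the fixed vocabulary; nothing outside it can ever be emitted.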
4. Three frequently confused concepts
4.1 Vocabulary size
Number of distinct tokens the model can output.
4.2 Training tokens
Number of tokens in the dataset used for training.
4.3 Context length
Maximum number of tokens the model can “see” at once.
5. What do 7B / 8B / 13B / 70B mean?
These refer to parameter count:
Parameters are the model’s learned numeric weights—its compressed “experience.”
6. What is inside an LLM model?
A typical LLM package includes:
Tokenizer: rules to split text into tokens
Vocabulary: mapping token ↔ ID
Embedding table: maps token IDs to vectors
Transformer layers: attention + feed-forward blocks
Parameters (weights): billions of learned numbers
Output head: converts final representation to next-token probabilities (softmax)
What is NOT inside: the raw training data, a database of facts, or live internet access.
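The included components can be sketched as a toy forward pass. Shapes and values are invented, and the transformer layers are stubbed out; a real model applies many attention + feed-forward blocks there:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, d_model = 10, 4

# Embedding table: token ID -> vector (learned parameters in a real model).
embedding = rng.normal(size=(vocab_size, d_model))

def transformer_layers(x):
    # Stand-in: real models apply attention + feed-forward blocks here.
    return x

# Output head: final hidden vector -> logits over the whole vocabulary.
output_head = rng.normal(size=(d_model, vocab_size))

token_ids = [3, 7, 1]                 # output of the tokenizer
x = embedding[token_ids]              # (3, d_model)
h = transformer_layers(x)
logits = h[-1] @ output_head          # next-token scores, shape (vocab_size,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax -> next-token probabilities
```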
7. Comparing image vector search vs LLM generation
7.1 Image similarity search (example)
Image → encoder → vector
Store vectors in a DB (FAISS / Milvus / pgvector)
Query vector → nearest neighbors via cosine similarity
Returns an existing item from the DB
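The retrieval step above reduces to a nearest-neighbor search by cosine similarity. A minimal numpy version (the vectors stand in for encoder outputs; a real system would use FAISS / Milvus / pgvector at scale):

```python
import numpy as np

def cosine_top_k(query, db, k=2):
    """Indices of the k most similar DB vectors by cosine similarity."""
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity to each stored vector
    return np.argsort(-sims)[:k]

# Pretend these came from an image encoder.
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])

idx = cosine_top_k(query, db, k=2)
# Nearest neighbors are the stored items pointing in almost the same direction.
```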
7.2 LLM generation
Text → tokens → embeddings
Attention compares vectors internally to build a context representation
Output is a probability distribution over the vocabulary
Select next token, repeat
Core difference: vector search returns an existing item from a database; an LLM generates a new token sequence that may not exist anywhere verbatim.
7.3 When LLM becomes “search-like” (RAG)
With RAG, a retriever first finds relevant documents (often via the same vector-similarity search as above) and injects them into the context; the LLM then generates an answer grounded in those documents.
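A toy RAG sketch. Keyword overlap stands in for real embedding search, and the document strings are invented; in practice you would embed the query and documents and query a vector DB:

```python
docs = [
    "Paracetamol maximum adult dose is typically 4 g per day.",
    "Hanoi is the capital of Vietnam.",
    "LoRA trains a small adapter on top of frozen base weights.",
]

def retrieve(query, docs, k=1):
    """Toy retrieval: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Inject the retrieved documents into the LLM's context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Use only the context below.\nContext:\n{context}\nQuestion: {query}"

prompt = build_prompt("What is the capital of Vietnam?", docs)
# The LLM then generates an answer conditioned on the injected context.
```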
8. What is an AI Agent?
An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.
Agent formula: Agent = LLM (decide) + Tools (act) + Observation (feedback) + Loop (iterate until done or stopped).
Compared to a normal chatbot: a chatbot only produces text; an agent executes actions and adapts based on their results.
9. Building a minimal agent
Required components:
Goal: clear objective and success criteria
Tools: shell / API / DB / filesystem / etc.
Memory: logs / state (short-term and optionally long-term)
Loop: repeat until success or stop condition
Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
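The components above can be wired into a minimal loop. `llm_decide` is a hypothetical stand-in for a model call (a real agent would parse structured LLM output), and `eval` here is demo-only, never production-safe:

```python
def llm_decide(goal, observations):
    """Hypothetical LLM policy: return the next action as a dict."""
    if any("42" in o for o in observations):
        return {"tool": "finish", "arg": "42"}
    return {"tool": "calc", "arg": "6 * 7"}

TOOLS = {"calc": lambda expr: str(eval(expr))}   # allowlist (guardrail)
MAX_STEPS = 5                                    # guardrail: step budget

def run_agent(goal):
    observations = []                            # short-term memory / log
    for _ in range(MAX_STEPS):
        action = llm_decide(goal, observations)
        if action["tool"] == "finish":
            return action["arg"]                 # success criterion met
        result = TOOLS[action["tool"]](action["arg"])
        observations.append(result)              # observe, then iterate
    return None                                  # stop condition hit

answer = run_agent("Compute 6 * 7")
```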
9.1 Why enforce structured outputs?
Because free-form text is hard to validate.
Common patterns: JSON-only replies, function/tool-calling schemas, enumerated action types.
This allows: automatic parsing, schema validation, safe tool dispatch, and retries when output is malformed.
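A minimal validation sketch, using stdlib `json` and a hand-rolled check in place of a full schema library:

```python
import json

def validate_action(text):
    """Parse a model reply and check it against a tiny 'schema'."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None                    # free-form prose: reject
    if not isinstance(obj, dict) or "tool" not in obj or "arg" not in obj:
        return None                    # JSON, but wrong shape: reject
    return obj

# Free-form reply is rejected; structured reply is machine-usable.
bad = validate_action("Sure! I'll use the calculator to do 6*7.")
good = validate_action('{"tool": "calc", "arg": "6 * 7"}')

# On None, the agent can re-prompt ("reply with valid JSON only")
# instead of trying to parse prose.
```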
10. Popular models and which ones can run locally
Quick mental model fields (apply to every LLM)
For each model below, capture:
1) Learned experience = Parameters / Weights (e.g., 7B/8B/70B)
2) What it can produce = Vocabulary (tokenizer + vocab size; fixed per model)
3) Temporary memory = Context window (max tokens visible at once)
Cloud-only (generally not downloadable)
GPT (OpenAI)
Parameters/Weights: Not publicly disclosed (varies by product/version).
Vocabulary: Proprietary tokenizer; vocab size not consistently published.
Context window: Varies by model tier/version (check product docs).
Notes: Strong general reasoning + tool ecosystem.
Claude (Anthropic)
Parameters/Weights: Not publicly disclosed.
Vocabulary: Proprietary tokenizer.
Context window: Varies by model tier/version (check product docs).
Notes: Strong long-form writing and code assistance.
Gemini (Google)
Parameters/Weights: Not publicly disclosed.
Vocabulary: Proprietary tokenizer.
Context window: Varies by model tier/version (check product docs).
Notes: Strong multimodal and large-context options (depending on version).
Open-weight/open-source (often runnable locally)
LLaMA (Meta)
Parameters/Weights: Common sizes include 7B/8B/13B/70B (depends on generation).
Vocabulary: Fixed per LLaMA generation (tokenizer + vocab size depends on version).
Context window: Varies by generation (older versions smaller; newer may be larger).
Local use: Best with quantized GGUF via llama.cpp / Ollama / LM Studio.
Mistral / Mixtral
Parameters/Weights: Mistral commonly 7B-class; Mixtral uses MoE (Mixture-of-Experts) variants.
Vocabulary: Fixed per model release (tokenizer-specific).
Context window: Varies by release/version.
Local use: Mistral 7B-class is popular for fast local inference.
Qwen
Parameters/Weights: Multiple sizes (small → large; common local picks: ~7B-class).
Vocabulary: Fixed per Qwen generation (tokenizer-specific).
Context window: Varies by release/version (some versions support larger contexts).
Local use: Often strong multilingual performance.
DeepSeek (especially strong for code variants)
Parameters/Weights: Multiple sizes; common local coder models are ~6–7B-class.
Vocabulary: Fixed per model/tokenizer version.
Context window: Varies by release/version.
Local use: Code-focused variants are widely used for dev tasks.
Phi (small, efficient)
Parameters/Weights: Small models (often ~2–4B-class depending on version).
Vocabulary: Fixed per Phi release/tokenizer.
Context window: Varies by version.
Local use: Great for low-resource devices; fast inference.
Local runtimes on macOS
Ollama: simple CLI + local server for pulling and running quantized models
LM Studio: GUI app for downloading and chatting with local models
llama.cpp: C/C++ inference engine (GGUF format) behind many local tools
Glossary (hard terms)
parameter /pəˈræmɪtər/: a learned numeric value (learned weight)
weight /weɪt/: a learned weight
vocabulary /vəˈkæbjəˌleri/: token dictionary (the set of tokens the model can generate)
token /ˈtoʊkən/: a small unit (piece) of text
context window /ˈkɑːntekst ˈwɪndoʊ/: the visible context (temporary memory)
proprietary /prəˈpraɪəˌteri/: closed, not publicly released
open-weight /ˌoʊpən ˈweɪt/: weights are published
quantized /ˈkwɑːntaɪzd/: numeric precision reduced (compressed)
runtime /ˈrʌnˌtaɪm/: the environment/program that runs the model
variant /ˈveriənt/: a variant/version
Mixture-of-Experts (MoE) /ˈmɪkstʃər əv ˈekspɜːrts/: “many experts” architecture (only part of the model activates per step)
11. Local model size estimates on Mac
Disk/RAM depends heavily on quantization.
Typical sizes (rough guidance): a 7B model at 4-bit quantization is roughly 4 GB, 13B roughly 8 GB, 70B roughly 40 GB, plus working memory for the context (KV cache).
Mac uses unified memory, so CPU+GPU share RAM.
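The sizes above follow from a simple rule of thumb: parameters × bits per weight ÷ 8. A back-of-envelope sketch (real GGUF files add some overhead, so treat these as floors):

```python
def rough_model_size_gb(params_billions, bits_per_weight):
    """Floor estimate of file/RAM size: parameters * bits per weight."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9   # decimal GB

for p in (7, 13, 70):
    print(p, "B @ 4-bit ≈", round(rough_model_size_gb(p, 4), 1), "GB")
# 7B at 4-bit ≈ 3.5 GB, 13B ≈ 6.5 GB, 70B ≈ 35 GB, before overhead
```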
12. Can you “build an LLM” on MacBook Air M3?
Practical reality:
Training a foundation model from scratch: not realistic on a laptop.
Running (inference) 7B/8B quantized locally: realistic.
Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.
13. Adapting a model to a domain like healthcare
There are three main approaches:
13.1 Prompt/System instructions
Cheapest option: steer tone and behavior via the system prompt. No weights change, but everything must fit in the context window.
13.2 RAG (recommended first)
Retrieve domain documents at query time and inject them into context. Knowledge stays external, auditable, and updatable without retraining.
13.3 LoRA/QLoRA fine-tuning (Approach #3)
Changes parameters via a small adapter.
Best for behavior: formatting, refusal policy, questioning flow, tone.
Not ideal for injecting large factual knowledge (use RAG for that).
14. Approach #3: How LoRA fine-tuning works
14.1 Concept
Freeze base model weights.
Train a small LoRA adapter (delta weights).
During inference: base + delta.
Can enable/disable adapters or swap adapters per domain.
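The concept above is just low-rank matrix math: the effective weight is W + B·A, where only the small A and B matrices are trained. A numpy sketch with invented sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size 8, LoRA rank 2

W = rng.normal(size=(d, d))        # frozen base weight (never updated)
A = rng.normal(size=(r, d)) * 0.01 # trained adapter half
B = np.zeros((d, r))               # B starts at zero, so the delta starts at 0

# Effective weight during inference: base + low-rank delta.
scaling = 1.0                      # alpha / r in real implementations
W_eff = W + scaling * (B @ A)

# Only A and B are trained: 2*d*r = 32 numbers here instead of d*d = 64;
# at 7B scale the saving is orders of magnitude larger. Swapping adapters
# per domain means swapping A and B while W stays on disk unchanged.
```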
14.2 Practical workflow
Step 1: Choose a base model (7B/8B common).
Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
Step 4: Export adapter files (e.g., safetensors).
Step 5: Load base model + adapter locally for inference.
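Step 2 above can be sketched as JSONL shaping, a shape most LoRA trainers can consume. The field names and the example pair are illustrative, not a framework's required schema:

```python
import json

# Instruction -> response pairs, domain-shaped (healthcare example invented).
examples = [
    {
        "instruction": "A patient reports a mild headache. What should you ask first?",
        "response": "Ask about onset, duration, severity (1-10), and any "
                    "red-flag symptoms before giving guidance. Not a diagnosis.",
    },
]

# One JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
# Write `jsonl` to train.jsonl and point the trainer's dataset config at it.
```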
14.3 What to train for healthcare
Prefer training:
structured response format
asking clarifying questions
refusal behavior and escalation cues
citing sources (especially combined with RAG)
Avoid training: large bodies of factual medical knowledge into the weights (facts go stale and are hard to audit; keep them in RAG).
15. What comes after single-agent systems?
Common directions:
Multi-agent setups: planner, executor, critic, safety checker
Self-critique / verification loops
World models / simulation before action
Stronger human-in-the-loop controls
16. Vocabulary notes: terms + IPA
token /ˈtoʊkən/: unit of text used by the model (often a word piece)
vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
context /ˈkɑːntekst/: the visible token window during inference
parameter /pəˈræmɪtər/: a learned weight value in the network
quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
attention /əˈtenʃən/: mechanism that mixes information across tokens
softmax /ˈsɔːftˌmæks/: converts scores into probabilities
hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
agent /ˈeɪdʒənt/: autonomous software using LLM + tools
tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training
17. Quick implementation checklist
Need accurate, updatable domain knowledge: do RAG first.
Need consistent behavior/formatting: add LoRA.
Need task execution: build an Agent with a goal, tools, memory, a loop, and guardrails (see Section 9).