System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning

1. Overview

An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.

In real products, you typically combine:

  • LLM: generates the next token (text output)
  • RAG: retrieves external documents and injects them into context
  • Agent: software that uses LLM + tools + feedback loop to complete tasks

LLM in one line

A practical mental model:

  • 1) Learned experience = Parameters / Weights
    • This is the “knowledge compressed into numbers” learned during training (e.g., 7B/8B/70B parameters).
  • 2) What it can produce = Vocabulary
    • The fixed set of tokens the model can output.
  • 3) Temporary memory = Context window
    • The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).

Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).
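A toy sketch of the tokenizer's role, using a tiny hand-made vocabulary (real tokenizers learn subword rules such as BPE/SentencePiece from data; this vocabulary is purely illustrative):

```python
# Toy tokenizer: maps text to token IDs via a tiny hand-made vocabulary.
# Real tokenizers learn subword merges (BPE, SentencePiece); this is a sketch.
VOCAB = {"Ha": 0, " Noi": 1, " is": 2, " a": 3, " city": 4, ".": 5}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    """Greedy longest-match tokenization against VOCAB."""
    ids = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("Ha Noi is a city.")
print(ids)           # → [0, 1, 2, 3, 4, 5] — IDs, not words
print(decode(ids))   # round-trips back to "Ha Noi is a city."
```

Note that the model itself only ever sees the ID sequence; the mapping rules live in the tokenizer, not in the learned weights.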

Glossary quick notes

  • parameter /pəˈræmɪtər/: a learned weight in the network
  • weight /weɪt/: a weight (one kind of parameter)
  • tokenizer /ˈtoʊkəˌnaɪzər/: component that splits and encodes text into tokens
  • vocabulary /vəˈkæbjəˌleri/: the model's token dictionary
  • context window /ˈkɑːntekst ˈwɪndoʊ/: the visible context (short-term memory)

2. Does an LLM “understand” or only generate probabilistic text?

Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.

Key implications:

  • It does not have consciousness or human-like understanding.
  • It can *appear* to understand because it learned patterns from massive data.
  • It does not inherently know whether its output is true or false.
  • Bad / biased / outdated training data can lead to bad answers.
  • Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).
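The "probabilistic sequence model" idea can be sketched with a toy bigram model: count which token follows which in a tiny corpus, then turn counts into a next-token distribution. (A real LLM conditions on the whole context with a neural network, not just the previous token, but the output is the same kind of object: a probability distribution.)

```python
from collections import Counter, defaultdict

# Toy corpus; real training data is measured in trillions of tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count next-token occurrences for each token (bigram statistics).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev: str) -> dict[str, float]:
    """Estimate P(next | prev) from the counts."""
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

print(next_token_distribution("the"))  # → {'cat': 0.666..., 'mat': 0.333...}
```

Nothing in this model "knows" whether "cat" is true; it only reflects frequencies in the data it saw — which is exactly why bad or biased data leads to bad answers.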

3. What are tokens, and where do generated tokens come from?

3.1 Tokens are not necessarily “words”

Tokens can be:

  • word pieces, characters, punctuation, whitespace
  • e.g., “Ha” + “ Noi” can be separate tokens

3.2 Where does the next token come from?

Generated tokens come from the model’s vocabulary (token dictionary):

  • Vocabulary is fixed at model creation time.
  • At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).

So:

  • The model is not “fetching text from the internet” while generating.
  • If you want fresh facts, you must use tools (RAG / browsing / APIs).

4. Three frequently confused concepts

4.1 Vocabulary size

Number of distinct tokens the model can output.

  • Fixed (e.g., ~100K tokens depending on tokenizer/model).

4.2 Training tokens

Number of tokens in the dataset used for training.

  • Often measured in trillions (T).
  • Represents “experience volume,” not a stored Q&A database.

4.3 Context length

Maximum number of tokens the model can “see” at once.

  • Includes prompt + chat history + retrieved docs + the output being generated.
  • Exceed it and the model loses earlier parts.
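A context budget is worth checking before every call. A minimal sketch — the token counts are assumed inputs here; in practice you would count them with the model's own tokenizer:

```python
def fits_in_context(prompt_tokens: int, history_tokens: int,
                    doc_tokens: int, max_output_tokens: int,
                    context_length: int) -> bool:
    """Everything the model must 'see' plus the output must fit the window."""
    used = prompt_tokens + history_tokens + doc_tokens + max_output_tokens
    return used <= context_length

# 8192-token model: 1K prompt + 2K history + 4K docs + 1K output = 8000, fits.
print(fits_in_context(1000, 2000, 4000, 1000, 8192))  # → True
```

Reserving room for the output is the step people most often forget: a prompt that fills the whole window leaves no tokens for the answer.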

5. What do 7B / 8B / 13B / 70B mean?

These refer to parameter count:

  • B = Billion parameters (weights).
  • 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.

Parameters are the model’s learned numeric weights—its compressed “experience.”

6. What is inside an LLM model?

A typical LLM package includes:

  • Tokenizer: rules to split text into tokens
  • Vocabulary: mapping token ↔ ID
  • Embedding table: maps token IDs to vectors
  • Transformer layers: attention + feed-forward blocks
  • Parameters (weights): billions of learned numbers
  • Output head: converts final representation to next-token probabilities (softmax)

What is NOT inside:

  • No database of memorized articles or Q&A pairs.
  • No built-in truth checker.
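The components above can be strung together as a toy forward pass: embedding lookup, a (here trivial) layer stack, then an output head plus softmax. All numbers are made up; a real model has billions of learned weights and real transformer layers in the middle:

```python
import math

# Toy embedding table: 4-token vocabulary, 2-dimensional vectors (made-up numbers).
EMBED = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4], [0.2, 0.2]]

def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(token_ids: list[int]) -> list[float]:
    # 1) Embedding lookup + trivial "layers": just average the token vectors.
    vecs = [EMBED[i] for i in token_ids]
    hidden = [sum(col) / len(vecs) for col in zip(*vecs)]
    # 2) Output head: score each vocabulary entry (here, dot with its embedding).
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in EMBED]
    # 3) Softmax turns scores into a distribution over the whole vocabulary.
    return softmax(logits)

probs = next_token_probs([0, 2])
print(probs, sum(probs))  # one probability per vocab token; sums to ~1.0
```

Note what is absent: no database lookup, no fact check — only arithmetic over learned numbers ending in a distribution.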

7. Comparing image vector search vs LLM generation

7.1 Image similarity search

  • Image → encoder → vector
  • Store vectors in a DB (FAISS / Milvus / pgvector)
  • Query vector → nearest neighbors via cosine similarity
  • Returns an existing item from the DB
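The retrieval step fits in a few lines; the vectors here are made-up stand-ins for encoder output, and FAISS/Milvus/pgvector do the same nearest-neighbor math at scale:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Tiny "vector DB": item name -> stored embedding (made-up values).
db = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.6, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}

def nearest(query: list[float]) -> str:
    """Return the stored item most similar to the query vector."""
    return max(db, key=lambda name: cosine(query, db[name]))

print(nearest([0.8, 0.2, 0.05]))  # → cat.jpg (an EXISTING stored item)
```

The key point for the comparison below: the result is always something already in the DB, never newly generated content.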

7.2 LLM generation

  • Text → tokens → embeddings
  • Attention compares vectors internally to build a context representation
  • Output is a probability distribution over the vocabulary
  • Select next token, repeat

Core difference:

  • Image search returns a stored item.
  • LLM produces new text token-by-token.

7.3 When LLM becomes “search-like” (RAG)

RAG adds:

  • Text → embedding → vector DB retrieval → inject documents → LLM answers based on retrieved sources.

So:

  • Vector DB = searching
  • LLM = composing/explaining
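A minimal RAG sketch, with retrieval faked by word overlap and the final LLM call left as a placeholder (a real pipeline would embed the query and search a vector DB instead):

```python
# Tiny document store (stand-in for a vector DB). Contents are illustrative.
docs = [
    "Paracetamol max adult dose is typically 4 g/day.",
    "Ibuprofen should be taken with food.",
    "Our clinic opens at 8:00 on weekdays.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Fake retrieval: rank docs by word overlap with the query.
    A real pipeline uses embeddings + cosine similarity."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    """Inject retrieved documents into the context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return (f"Answer using ONLY these sources:\n{context}\n\n"
            f"Question: {query}")

prompt = build_prompt("What is the max dose of paracetamol?")
print(prompt)
# Next step: send `prompt` to the LLM (call omitted here).
```

The division of labor is visible in the code: the retriever searches, and the LLM only composes an answer from what was injected into its context.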

8. What is an AI Agent?

An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.

Agent formula:

  • Agent = program logic + LLM + tools + memory + feedback loop

Compared to a normal chatbot:

  • Chatbot: one-shot answer
  • Agent: plan → act → check → fix → repeat

9. Building a minimal agent

Required components:

  • Goal: clear objective and success criteria
  • Tools: shell/API/DB/filesystem/etc.
  • Memory: logs / state (short-term and optionally long-term)
  • Loop: repeat until success or stop condition
  • Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
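The loop and guardrails above can be sketched as follows; `llm_decide` and the tool table are hypothetical placeholders, not a real API:

```python
# Minimal agent loop with guardrails: allowlisted tools, max steps, a log.
# `llm_decide` is a hypothetical stand-in; a real agent would call an LLM here.

TOOLS = {  # tool allowlist: the agent may only call these
    "add": lambda a, b: a + b,
}

def llm_decide(goal: str, observations: list) -> dict:
    """Stand-in for an LLM choosing the next action as structured output."""
    if not observations:
        return {"tool": "add", "args": (2, 3)}
    return {"tool": "stop", "result": observations[-1]}

def run_agent(goal: str, max_steps: int = 5):
    observations, log = [], []
    for step in range(max_steps):              # guardrail: bounded steps
        action = llm_decide(goal, observations)
        log.append(action)
        if action["tool"] == "stop":
            return action["result"], log
        if action["tool"] not in TOOLS:        # guardrail: allowlist
            observations.append("error: tool not allowed")
            continue
        observations.append(TOOLS[action["tool"]](*action["args"]))
    return None, log                           # stop condition reached

result, log = run_agent("compute 2 + 3")
print(result)  # → 5
```

The loop structure (decide → execute → observe → repeat) is the whole difference from a one-shot chatbot; everything else is safety plumbing.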

9.1 Why enforce structured outputs?

Because free-form text is hard to validate.

Common patterns:

  • JSON-only output
  • JSON + schema validation
  • Function calling / tool calling

This allows:

  • parse → validate → execute → verify → retry if needed

10. Model landscape: quick mental model fields (apply to every LLM)

For each model below, capture:

  • 1) Learned experience = Parameters / Weights (e.g., 7B/8B/70B)
  • 2) What it can produce = Vocabulary (tokenizer + vocab size; fixed per model)
  • 3) Temporary memory = Context window (max tokens visible at once)

Cloud-only (generally not downloadable)

GPT (OpenAI)

  • Parameters/Weights: Not publicly disclosed (varies by product/version).
  • Vocabulary: Proprietary tokenizer; vocab size not consistently published.
  • Context window: Varies by model tier/version (check product docs).
  • Notes: Strong general reasoning + tool ecosystem.

Claude (Anthropic)

  • Parameters/Weights: Not publicly disclosed.
  • Vocabulary: Proprietary tokenizer.
  • Context window: Varies by model tier/version (check product docs).
  • Notes: Strong long-form writing and code assistance.

Gemini (Google)

  • Parameters/Weights: Not publicly disclosed.
  • Vocabulary: Proprietary tokenizer.
  • Context window: Varies by model tier/version (check product docs).
  • Notes: Strong multimodal and large-context options (depending on version).

Open-weight/open-source (often runnable locally)

LLaMA family (Meta)

  • Parameters/Weights: Common sizes include 7B/8B/13B/70B (depends on generation).
  • Vocabulary: Fixed per LLaMA generation (tokenizer + vocab size depends on version).
  • Context window: Varies by generation (older versions smaller; newer may be larger).
  • Local use: Best with quantized GGUF via llama.cpp / Ollama / LM Studio.

Mistral / Mixtral

  • Parameters/Weights: Mistral commonly 7B-class; Mixtral uses MoE (Mixture-of-Experts) variants.
  • Vocabulary: Fixed per model release (tokenizer-specific).
  • Context window: Varies by release/version.
  • Local use: Mistral 7B-class is popular for fast local inference.

Qwen

  • Parameters/Weights: Multiple sizes (small → large; common local picks: ~7B-class).
  • Vocabulary: Fixed per Qwen generation (tokenizer-specific).
  • Context window: Varies by release/version (some versions support larger contexts).
  • Local use: Often strong multilingual performance.

DeepSeek (especially strong for code variants)

  • Parameters/Weights: Multiple sizes; common local coder models are ~6–7B-class.
  • Vocabulary: Fixed per model/tokenizer version.
  • Context window: Varies by release/version.
  • Local use: Code-focused variants are widely used for dev tasks.

Phi (small, efficient)

  • Parameters/Weights: Small models (often ~2–4B-class depending on version).
  • Vocabulary: Fixed per Phi release/tokenizer.
  • Context window: Varies by version.
  • Local use: Great for low-resource devices; fast inference.

Local runtimes on macOS

Ollama

  • Purpose: Simplest local runner (download + run models easily).
  • Works best with: Quantized GGUF models.

LM Studio

  • Purpose: GUI app to download, run, and chat with local models.
  • Works best with: Quantized GGUF models, easy model management.

llama.cpp

  • Purpose: High-performance local inference engine for GGUF models.
  • Works best with: Fine-grained control and optimization on CPU/Metal.

Glossary (hard terms)

  • parameter /pəˈræmɪtər/: a learned weight in the network
  • weight /weɪt/: a weight value
  • vocabulary /vəˈkæbjəˌleri/: the token dictionary (set of tokens the model can generate)
  • token /ˈtoʊkən/: a small unit (piece) of text
  • context window /ˈkɑːntekst ˈwɪndoʊ/: the context window (temporary memory)
  • proprietary /prəˈpraɪəˌteri/: closed, not publicly released
  • open-weight /ˌoʊpən ˈweɪt/: weights are publicly released
  • quantized /ˈkwɑːntaɪzd/: reduced numeric precision (compressed)
  • runtime /ˈrʌnˌtaɪm/: the environment/program that runs the model
  • variant /ˈveriənt/: a variant/version
  • Mixture-of-Experts (MoE) /ˈmɪkstʃər əv ˈekspɜːrts/: "many experts" architecture (only part of the model activates per step)

11. Local model size estimates on Mac

Disk/RAM depends heavily on quantization.

Typical sizes (rough guidance):

  • FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
  • Q8: ~8–9 GB
  • Q4/Q5 (most common for local use):
    • 7B Q4: ~4–5 GB
    • 8B Q4: ~4.5–5.5 GB
    • Q5 adds ~0.5–1.5 GB

Mac uses unified memory, so CPU+GPU share RAM.
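The sizes above follow from simple arithmetic: parameter count × bits per weight, plus overhead for activations and KV cache. A sketch — the bit widths are rule-of-thumb averages, not exact GGUF format sizes:

```python
# Rough model-size estimate: parameters × bits-per-weight / 8 → bytes.
# Bit widths are rule-of-thumb averages; real quant formats vary slightly.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8.5, "Q5": 5.5, "Q4": 4.5}

def model_size_gb(params_billion: float, quant: str) -> float:
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9  # decimal GB, weights only

print(f"7B FP16 ≈ {model_size_gb(7, 'FP16'):.1f} GB")  # → 14.0 GB
print(f"7B Q4   ≈ {model_size_gb(7, 'Q4'):.1f} GB")    # ≈ 3.9 GB + overhead
```

Add roughly 1 GB or more on top for the KV cache and runtime overhead, which is why the practical numbers above land a bit higher than the raw weight size.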

12. Can you “build an LLM” on MacBook Air M3?

Practical reality:

  • Training a foundation model from scratch: not realistic on a laptop.
  • Running (inference) 7B/8B quantized locally: realistic.
  • Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.

13. Adapting a model to a domain like healthcare

There are three main approaches:

13.1 Prompt/system instructions (Approach #1)

  • Changes only the context (runtime behavior).
  • Does not change weights.

13.2 RAG (Approach #2)

  • Injects domain documents (guidelines, protocols, FAQ) into context at query time.
  • No weight changes.
  • Easy to update and safer for fresh knowledge.

13.3 LoRA/QLoRA fine-tuning (Approach #3)

  • Changes parameters via a small adapter.
  • Best for behavior: formatting, refusal policy, questioning flow, tone.
  • Not ideal for injecting large factual knowledge (use RAG for that).

14. Approach #3: How LoRA fine-tuning works

14.1 Concept

  • Freeze base model weights.
  • Train a small LoRA adapter (delta weights).
  • During inference: base + delta.
  • Can enable/disable adapters or swap adapters per domain.
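The base + delta idea in miniature: the frozen weight matrix W receives a low-rank update B·A scaled by alpha/r. Matrices here are toy 2×2 values; a real adapter targets selected attention/MLP projections inside the transformer:

```python
# LoRA in miniature: W_effective = W + (alpha / r) * (B @ A).
# W stays frozen; only the small matrices A and B are trained.

def matmul(X, Y):
    """Plain-Python matrix multiply for tiny matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 0.0],      # frozen base weights (toy values)
     [0.0, 1.0]]
B = [[0.5], [0.0]]    # 2x1 trained matrix (rank r = 1)
A = [[0.2, 0.4]]      # 1x2 trained matrix

alpha, r = 2.0, 1
delta = matmul(B, A)  # 2x2 low-rank update from tiny trained factors
scale = alpha / r
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(2)] for i in range(2)]

print(W_eff)  # base weights plus the scaled low-rank delta
```

Because W never changes, disabling the adapter is just using W again, and swapping domains is swapping in a different (A, B) pair — which is why per-domain adapters are cheap to store and switch.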

14.2 Practical workflow

  1. Step 1: Choose a base model (7B/8B common).
  2. Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
  3. Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
  4. Step 4: Export adapter files (e.g., safetensors).
  5. Step 5: Load base model + adapter locally for inference.

14.3 What to train for healthcare

Prefer training:

  • structured response format
  • asking clarifying questions
  • refusal behavior and escalation cues
  • citing sources (especially combined with RAG)

Avoid training:

  • private patient records
  • uncontrolled “medical knowledge dumps”

15. What comes after single-agent systems?

Common directions:

  • Multi-agent setups: planner, executor, critic, safety checker
  • Self-critique / verification loops
  • World models / simulation before action
  • Stronger human-in-the-loop controls

16. Vocabulary notes: terms + IPA

  • token /ˈtoʊkən/: unit of text used by the model (often a word piece)
  • vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
  • context /ˈkɑːntekst/: the visible token window during inference
  • parameter /pəˈræmɪtər/: a learned weight value in the network
  • quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
  • embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
  • attention /əˈtenʃən/: mechanism that mixes information across tokens
  • softmax /ˈsɔːftˌmæks/: converts scores into probabilities
  • hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
  • agent /ˈeɪdʒənt/: autonomous software using LLM + tools
  • tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
  • RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
  • fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
  • LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
  • QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training

17. Quick implementation checklist

  1. Need accurate, updatable domain knowledge: do RAG first.
  2. Need consistent behavior/formatting: add LoRA.
  3. Need task execution: build an Agent with:
    • strict JSON/tool outputs
    • schema validation
    • allowlisted tools + sandbox
    • logs + max steps + stop rules
ai/llm.txt · Last modified: by phong2018