System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning

1. Overview

An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.

In real products, you typically combine:

  • LLM: generates the next token (text output)
  • RAG: retrieves external documents and injects them into context
  • Agent: software that uses LLM + tools + feedback loop to complete tasks

LLM in one line

A practical mental model:

  • 1) Learned experience = Parameters / Weights
    • This is the “knowledge compressed into numbers” learned during training (e.g., 7B/8B/70B parameters).
    • Not the tokenizer.
  • 2) What it can produce = Vocabulary
    • The fixed set of tokens the model can output.
  • 3) Temporary memory = Context window
    • The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).

Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).

Glossary quick notes

  • parameter /pəˈræmɪtər/: a learned weight value
  • weight /weɪt/: a weight (one kind of parameter)
  • tokenizer /ˈtoʊkəˌnaɪzər/: splits and encodes text into tokens
  • vocabulary /vəˈkæbjəˌleri/: the token dictionary
  • context window /ˈkɑːntekst ˈwɪndoʊ/: the context window (short-term memory)

2. Does an LLM “understand” or only generate probabilistic text?

Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.

Key implications:

  • It does not have consciousness or human-like understanding.
  • It can *appear* to understand because it learned patterns from massive data.
  • It does not inherently know whether its output is true or false.
  • Bad / biased / outdated training data can lead to bad answers.
  • Ambiguous prompts can trigger “plausible but wrong” outputs (hallucination).

3. What are tokens, and where do generated tokens come from?

3.1 Tokens are not necessarily “words”

Tokens can be:

  • word pieces, characters, punctuation, whitespace
  • e.g., “Ha” + “ Noi” can be separate tokens

3.2 Where does the next token come from?

Generated tokens come from the model’s vocabulary (token dictionary):

  • Vocabulary is fixed at model creation time.
  • At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).

So:

  • The model is not “fetching text from the internet” while generating.
  • If you want fresh facts, you must use tools (RAG / browsing / APIs).
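The per-step selection described above can be sketched with a toy four-token vocabulary (a real model scores its entire ~100K-token vocabulary every step; all names and numbers here are made up):

```python
import math, random

# Toy vocabulary and logits; the logits are raw scores from the output head.
vocab = ["Ha", " Noi", " is", "."]
logits = [2.0, 0.5, 1.0, -1.0]

def softmax(scores):
    # Convert raw scores into a probability distribution over the vocabulary.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw a token according to the distribution (temperature omitted).
sampled_token = random.choices(vocab, weights=probs, k=1)[0]

print(greedy_token)  # "Ha" has the highest logit here
```

Either way, the chosen token must come from the fixed vocabulary; nothing is fetched from outside the model.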

4. Three frequently confused concepts

4.1 Vocabulary size

Number of distinct tokens the model can output.

  • Fixed (e.g., ~100K tokens depending on tokenizer/model).

4.2 Training tokens

Number of tokens in the dataset used for training.

  • Often measured in trillions (T).
  • Represents “experience volume,” not a stored Q&A database.

4.3 Context length

Maximum number of tokens the model can “see” at once.

  • Includes prompt + chat history + retrieved docs + the output being generated.
  • Exceed it and the model loses earlier parts.
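The budget arithmetic is worth making explicit. A rough check, assuming a hypothetical 8K-token context window and made-up component sizes:

```python
# Token-budget check; all numbers are illustrative assumptions.
context_length = 8192          # model's maximum visible tokens (assumed)
system_prompt = 400
chat_history = 2500
retrieved_docs = 3000
max_output = 1024              # room reserved for the generated answer

used = system_prompt + chat_history + retrieved_docs + max_output
remaining = context_length - used
print(used, remaining)  # 6924 1268
```

If `used` exceeds `context_length`, something must be dropped or summarized, usually the oldest history.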

5. What do 7B / 8B / 13B / 70B mean?

These refer to parameter count:

  • B = Billion parameters (weights).
  • 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.

Parameters are the model’s learned numeric weights—its compressed “experience.”

6. What is inside an LLM model?

A typical LLM package includes:

  • Tokenizer: rules to split text into tokens
  • Vocabulary: mapping token ↔ ID
  • Embedding table: maps token IDs to vectors
  • Transformer layers: attention + feed-forward blocks
  • Parameters (weights): billions of learned numbers
  • Output head: converts final representation to next-token probabilities (softmax)

What is NOT inside:

  • No database of memorized articles or Q&A pairs.
  • No built-in truth checker.
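The components above can be sketched end to end at miniature scale. This is a deliberately tiny stand-in (vocab of 4, embedding dim of 2, transformer layers skipped entirely); real models use ~100K-token vocabularies and thousands of dimensions:

```python
# Tokenizer + vocabulary: text -> token IDs (rules here are fake/simplified).
vocab = {"Ha": 0, " Noi": 1, " is": 2, ".": 3}

# Embedding table: one small learned vector per token ID.
embeddings = [
    [0.1, 0.3],   # "Ha"
    [0.2, -0.1],  # " Noi"
    [0.0, 0.5],   # " is"
    [-0.2, 0.1],  # "."
]

# Output head: projects a hidden vector back to one score per vocab entry.
output_head = [
    [0.5, -0.2],
    [0.1, 0.4],
    [0.3, 0.3],
    [-0.1, 0.2],
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

hidden = embeddings[vocab["Ha"]]  # pretend this passed through the layers
scores = [dot(row, hidden) for row in output_head]  # logits over the vocab
print(len(scores))  # one score per vocabulary token
```

Note there is no lookup table of answers anywhere in this pipeline: only numbers that transform other numbers.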

7. Comparing image vector search vs LLM generation

7.1 Image similarity search (example)

  • Image → encoder → vector
  • Store vectors in a DB (FAISS / Milvus / pgvector)
  • Query vector → nearest neighbors via cosine similarity
  • Returns an existing item from the DB
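The retrieval step reduces to nearest-neighbor search by cosine similarity. A minimal sketch with made-up vectors (a real system would store these in FAISS / Milvus / pgvector):

```python
import math

# Pretend outputs of an image encoder; names and numbers are fabricated.
db = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [0.8, 0.2, 0.1]  # encoder output for the query image
best = max(db, key=lambda name: cosine(query, db[name]))
print(best)  # an EXISTING stored item is returned, not new content
```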

7.2 LLM generation

  • Text → tokens → embeddings
  • Attention compares vectors internally to build a context representation
  • Output is a probability distribution over the vocabulary
  • Select next token, repeat

Core difference:

  • Image search returns a stored item.
  • LLM produces new text token-by-token.

7.3 When LLM becomes “search-like” (RAG)

RAG adds:

  • Text → embedding → vector DB retrieval → inject documents → LLM answers based on retrieved sources.

So:

  • Vector DB = searching
  • LLM = composing/explaining
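The retrieve-then-inject flow can be sketched with word overlap standing in for real embedding similarity; the documents and question below are fabricated:

```python
# Toy RAG pipeline: retrieve the closest doc, then inject it into the prompt.
docs = {
    "doc1": "Paracetamol max adult dose is 4 g per day.",
    "doc2": "Ibuprofen should be taken with food.",
}

def score(query, text):
    # Crude stand-in for cosine similarity on embeddings: shared words.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t)

question = "What is the max dose of paracetamol?"
best_doc = max(docs, key=lambda d: score(question, docs[d]))

prompt = (
    "Answer using ONLY the context below.\n"
    f"Context: {docs[best_doc]}\n"
    f"Question: {question}"
)
# `prompt` would now be sent to the LLM, which composes the answer.
print(best_doc)
```

The division of labor is visible: the vector DB (here, `score` + `max`) does the searching; the LLM only composes text from what was retrieved.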

8. What is an AI Agent?

An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.

Agent formula:

  • Agent = program logic + LLM + tools + memory + feedback loop

Compared to a normal chatbot:

  • Chatbot: one-shot answer
  • Agent: plan → act → check → fix → repeat

9. Building a minimal agent

Required components:

  • Goal: clear objective and success criteria
  • Tools: shell/API/DB/filesystem/etc.
  • Memory: logs / state (short-term and optionally long-term)
  • Loop: repeat until success or stop condition
  • Guardrails: tool allowlist, sandboxing, max steps, timeouts, logging
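The components above can be wired together in a few lines. In this sketch the "LLM decision" is stubbed out with a fixed plan (a real agent would call a model each iteration), and all tool names are made up:

```python
# Minimal agent loop with guardrails: allowlist, max steps, logging.
ALLOWED_TOOLS = {"search", "calculator"}
MAX_STEPS = 5

def run_tool(name, arg):
    if name not in ALLOWED_TOOLS:               # guardrail: allowlist
        raise PermissionError(f"tool not allowed: {name}")
    if name == "calculator":
        return eval(arg, {"__builtins__": {}})  # toy sandboxed eval
    return f"results for {arg}"

plan = [("search", "LLM context window"), ("calculator", "7 * 1000")]
log = []  # short-term memory / audit trail

for step, (tool, arg) in enumerate(plan):
    if step >= MAX_STEPS:                       # guardrail: step budget
        break
    log.append((tool, run_tool(tool, arg)))

print(log[-1])
```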

9.1 Why enforce structured outputs?

Because free-form text is hard to validate.

Common patterns:

  • JSON-only output
  • JSON + schema validation
  • Function calling / tool calling

This allows:

  • parse → validate → execute → verify → retry if needed
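That loop can be sketched with the standard `json` module; the "model replies" below are hard-coded stand-ins for LLM outputs, and the schema check is deliberately tiny:

```python
import json

# parse -> validate -> retry over candidate model outputs.
REQUIRED_KEYS = {"tool", "args"}

def validate(raw):
    # Reject anything that is not a JSON object with the required keys.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        return None
    return obj

replies = [
    "Sure! Here is the plan...",                 # free-form text: rejected
    '{"tool": "search", "args": {"q": "RAG"}}',  # valid JSON: accepted
]

action = None
for raw in replies:                              # retry loop
    action = validate(raw)
    if action is not None:
        break

print(action["tool"])
```

In production the second attempt would come from re-prompting the model with the validation error, not from a fixed list.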

10. Model landscape and local runtimes

Cloud-only (generally not downloadable):

  • GPT (OpenAI)
  • Claude (Anthropic)
  • Gemini (Google)

Open-weight/open-source (often runnable locally):

  • LLaMA family (Meta)
  • Mistral / Mixtral
  • Qwen
  • DeepSeek (code-focused variants are especially strong)
  • Phi (small, efficient)

Local runtimes on macOS:

  • Ollama
  • LM Studio
  • llama.cpp

11. Local model size estimates on Mac

Disk/RAM depends heavily on quantization.

Typical sizes (rough guidance):

  • FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
  • Q8: ~8–9 GB
  • Q4/Q5 (most common for local use):
    • 7B Q4: ~4–5 GB
    • 8B Q4: ~4.5–5.5 GB
    • Q5 adds ~0.5–1.5 GB

Mac uses unified memory, so CPU+GPU share RAM.
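The estimates above follow from simple arithmetic: parameters × bits per parameter. Real GGUF files add some overhead, so treat these as rough lower bounds:

```python
# Back-of-envelope model size from quantization level.
BITS_PER_PARAM = {"FP16": 16, "Q8": 8, "Q5": 5, "Q4": 4}

def size_gb(params_billions, fmt):
    # params * bits / 8 -> bytes, then to decimal gigabytes.
    bytes_total = params_billions * 1e9 * BITS_PER_PARAM[fmt] / 8
    return bytes_total / 1e9

print(round(size_gb(7, "FP16"), 1))  # 14.0 -> matches the ~14-16 GB above
print(round(size_gb(7, "Q4"), 1))   # 3.5  -> plus overhead, ~4-5 GB on disk
```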

12. Can you “build an LLM” on MacBook Air M3?

Practical reality:

  • Training a foundation model from scratch: not realistic on a laptop.
  • Running (inference) 7B/8B quantized locally: realistic.
  • Fine-tuning with LoRA: usually done on cloud GPU; laptop can prep data and run inference tests.

13. Adapting a model to a domain like healthcare

There are three main approaches:

13.1 Prompt/System instructions (Approach #1)

  • Changes only the context (runtime behavior).
  • Does not change weights.

13.2 RAG (Approach #2)

  • Adds domain documents (guidelines, protocols, FAQ) into context.
  • No weight changes.
  • Easy to update and safer for fresh knowledge.

13.3 LoRA/QLoRA fine-tuning (Approach #3)

  • Changes parameters via a small adapter.
  • Best for behavior: formatting, refusal policy, questioning flow, tone.
  • Not ideal for injecting large factual knowledge (use RAG for that).

14. Approach #3: How LoRA fine-tuning works

14.1 Concept

  • Freeze base model weights.
  • Train a small LoRA adapter (delta weights).
  • During inference: base + delta.
  • Can enable/disable adapters or swap adapters per domain.
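The "base + delta" idea is just low-rank matrix arithmetic: the effective weight is W + (alpha / r) · B·A, where B and A are the small trained adapter matrices. A miniature sketch with made-up 2×2 numbers (real adapters apply this to attention projections in every layer):

```python
def matmul(X, Y):
    # Plain nested-list matrix multiply.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (never updated)
B = [[0.5], [0.0]]             # trained adapter, shape (d, r) with r = 1
A = [[0.0, 2.0]]               # trained adapter, shape (r, d)
alpha, r = 2, 1                # LoRA scaling hyperparameters

delta = matmul(B, A)           # low-rank update, rank <= r
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(2)]
         for i in range(2)]

print(W_eff)  # base + delta; drop the adapter and you get W back
```

Because B and A are tiny compared to W, adapters are cheap to store and swap, which is what makes per-domain adapters practical.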

14.2 Practical workflow

  1. Step 1: Choose a base model (7B/8B common).
  2. Step 2: Prepare instruction→response dataset (high-quality, domain-shaped).
  3. Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
  4. Step 4: Export adapter files (e.g., safetensors).
  5. Step 5: Load base model + adapter locally for inference.

14.3 What to train for healthcare

Prefer training:

  • structured response format
  • asking clarifying questions
  • refusal behavior and escalation cues
  • citing sources (especially combined with RAG)

Avoid training:

  • private patient records
  • uncontrolled “medical knowledge dumps”

15. What comes after single-agent systems?

Common directions:

  • Multi-agent setups: planner, executor, critic, safety checker
  • Self-critique / verification loops
  • World models / simulation before action
  • Stronger human-in-the-loop controls

16. Vocabulary notes: terms + IPA

  • token /ˈtoʊkən/: unit of text used by the model (often a word piece)
  • vocabulary /vəˈkæbjəˌleri/: the model’s token dictionary
  • context /ˈkɑːntekst/: the visible token window during inference
  • parameter /pəˈræmɪtər/: a learned weight value in the network
  • quantization /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
  • embedding /ɪmˈbedɪŋ/: vector representation of tokens/items
  • attention /əˈtenʃən/: mechanism that mixes information across tokens
  • softmax /ˈsɔːftˌmæks/: converts scores into probabilities
  • hallucination /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
  • agent /ˈeɪdʒənt/: autonomous software using LLM + tools
  • tool calling /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
  • RAG /ræɡ/: Retrieval-Augmented Generation (retrieve docs then generate)
  • fine-tune /ˌfaɪnˈtuːn/: adapting a model with additional training
  • LoRA /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
  • QLoRA /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training

17. Quick implementation checklist

  1. Need accurate, updatable domain knowledge: do RAG first.
  2. Need consistent behavior/formatting: add LoRA.
  3. Need task execution: build an Agent with:
    • strict JSON/tool outputs
    • schema validation
    • allowlisted tools + sandbox
    • logs + max steps + stop rules