====== System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning ======

===== 1. Overview =====

An **LLM** (Large Language Model) is a model that generates text by predicting the **next token** based on the current **context**. In real products, you typically combine:

* **LLM**: generates the next token (text output)
* **RAG**: retrieves external documents and injects them into context
* **Agent**: software that uses LLM + tools + feedback loop to complete tasks

===== LLM in one line =====

A practical mental model:

* **1) Learned experience = Parameters / Weights**
  * The "knowledge compressed into numbers" learned during training (e.g., 7B/8B/70B parameters).
* **2) What it can produce = Vocabulary**
  * The fixed set of tokens the model can output.
* **3) Temporary memory = Context window**
  * The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).

**Note:** The **tokenizer** is a separate component that converts text into token IDs (it is not the model's learned experience).

===== Glossary quick notes =====

* **parameter** /pəˈræmɪtər/: a learned weight value
* **weight** /weɪt/: a weight (one kind of parameter)
* **tokenizer** /ˈtoʊkəˌnaɪzər/: component that splits and encodes text into tokens
* **vocabulary** /vəˈkæbjəˌleri/: the token dictionary
* **context window** /ˈkɑːntekst ˈwɪndoʊ/: the model's short-term memory span

===== 2. Does an LLM "understand" or only generate probabilistic text? =====

Technically, an LLM is a **probabilistic sequence model**: it estimates a distribution over the next token given the previous tokens.

Key implications:

* It does not have consciousness or human-like understanding.
* It can *appear* to understand because it learned patterns from massive data.
* It does not inherently know whether its output is true or false.
* Bad, biased, or outdated training data can lead to bad answers.
* Ambiguous prompts can trigger "plausible but wrong" outputs (hallucination).

===== 3. What are tokens, and where do generated tokens come from? =====

==== 3.1 Tokens are not necessarily "words" ====

Tokens can be:

* word pieces, characters, punctuation, whitespace
* e.g., "Ha" + " Noi" can be separate tokens

==== 3.2 Where does the next token come from? ====

Generated tokens come from the model's **vocabulary** (token dictionary):

* The vocabulary is fixed at model creation time.
* At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).

So:

* The model is not "fetching text from the internet" while generating.
* If you want fresh facts, you must use tools (RAG / browsing / APIs).

===== 4. Three frequently confused concepts =====

==== 4.1 Vocabulary size ====

The number of distinct tokens the model can output.

* Fixed (e.g., ~100K tokens, depending on tokenizer/model).

==== 4.2 Training tokens ====

The number of tokens in the dataset used for training.

* Often measured in trillions (T).
* Represents "experience volume," not a stored Q&A database.

==== 4.3 Context length ====

The maximum number of tokens the model can "see" at once.

* Includes prompt + chat history + retrieved docs + the output being generated.
* Exceed it and the model loses the earlier parts.

===== 5. What do 7B / 8B / 13B / 70B mean? =====

These refer to parameter count:

* **B = billion** parameters (weights).
* 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.

Parameters are the model's learned numeric weights: its compressed "experience."

===== 6. What is inside an LLM model? =====

A typical LLM package includes:

* **Tokenizer**: rules to split text into tokens
* **Vocabulary**: mapping token <-> ID
* **Embedding table**: maps token IDs to vectors
* **Transformer layers**: attention + feed-forward blocks
* **Parameters (weights)**: billions of learned numbers
* **Output head**: converts the final representation to next-token probabilities (softmax)

What is NOT inside:

* No database of memorized articles or Q&A pairs.
* No built-in truth checker.

===== 7. Comparing image vector search vs LLM generation =====

==== 7.1 Image similarity search (your example) ====

* Image -> encoder -> vector
* Store vectors in a DB (FAISS / Milvus / pgvector)
* Query vector -> nearest neighbors via cosine similarity
* Returns an existing item from the DB

==== 7.2 LLM generation ====

* Text -> tokens -> embeddings
* Attention compares vectors internally to build a context representation
* Output is a probability distribution over the vocabulary
* Select the next token, repeat

Core difference:

* Image search returns a stored item.
* An LLM produces new text token by token.

==== 7.3 When an LLM becomes "search-like" (RAG) ====

RAG adds:

* Text -> embedding -> vector DB retrieval -> inject documents -> LLM answers based on retrieved sources.

So:

* Vector DB = searching
* LLM = composing/explaining

===== 8. What is an AI Agent? =====

An **Agent** is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.

Agent formula:

* **Agent = program logic + LLM + tools + memory + feedback loop**

Compared to a normal chatbot:

* Chatbot: one-shot answer
* Agent: plan -> act -> check -> fix -> repeat

===== 9. Building a minimal agent =====

Required components:

* **Goal**: clear objective and success criteria
* **Tools**: shell / API / DB / filesystem / etc.
* **Memory**: logs / state (short-term and optionally long-term)
* **Loop**: repeat until success or a stop condition
* **Guardrails**: tool allowlist, sandboxing, max steps, timeouts, logging

==== 9.1 Why enforce structured outputs? ====

Because free-form text is hard to validate. Common patterns:

* JSON-only output
* JSON + schema validation
* Function calling / tool calling

This allows:

* parse -> validate -> execute -> verify -> retry if needed

===== 10. Popular models and which ones can run locally =====

==== Quick mental model fields (apply to every LLM) ====

For each model below, capture:

* **1) Learned experience = Parameters / Weights** (e.g., 7B/8B/70B)
* **2) What it can produce = Vocabulary** (tokenizer + vocab size; fixed per model)
* **3) Temporary memory = Context window** (max tokens visible at once)

==== Cloud-only (generally not downloadable) ====

=== GPT (OpenAI) ===

* **Parameters/Weights:** Not publicly disclosed (varies by product/version).
* **Vocabulary:** Proprietary tokenizer; vocab size not consistently published.
* **Context window:** Varies by model tier/version (check product docs).
* Notes: Strong general reasoning + tool ecosystem.

=== Claude (Anthropic) ===

* **Parameters/Weights:** Not publicly disclosed.
* **Vocabulary:** Proprietary tokenizer.
* **Context window:** Varies by model tier/version (check product docs).
* Notes: Strong long-form writing and code assistance.

=== Gemini (Google) ===

* **Parameters/Weights:** Not publicly disclosed.
* **Vocabulary:** Proprietary tokenizer.
* **Context window:** Varies by model tier/version (check product docs).
* Notes: Strong multimodal and large-context options (depending on version).

==== Open-weight / open-source (often runnable locally) ====

=== LLaMA family (Meta) ===

* **Parameters/Weights:** Common sizes include 7B/8B/13B/70B (depends on generation).
* **Vocabulary:** Fixed per LLaMA generation (tokenizer + vocab size depend on version).
* **Context window:** Varies by generation (older versions smaller; newer may be larger).
* Local use: Best with quantized GGUF via llama.cpp / Ollama / LM Studio.

=== Mistral / Mixtral ===

* **Parameters/Weights:** Mistral is commonly 7B-class; Mixtral uses MoE (Mixture-of-Experts) variants.
* **Vocabulary:** Fixed per model release (tokenizer-specific).
* **Context window:** Varies by release/version.
* Local use: Mistral 7B-class is popular for fast local inference.
=== Qwen ===

* **Parameters/Weights:** Multiple sizes (small -> large; common local picks: ~7B-class).
* **Vocabulary:** Fixed per Qwen generation (tokenizer-specific).
* **Context window:** Varies by release/version (some versions support larger contexts).
* Local use: Often strong multilingual performance.

=== DeepSeek (especially strong for code variants) ===

* **Parameters/Weights:** Multiple sizes; common local coder models are ~6–7B-class.
* **Vocabulary:** Fixed per model/tokenizer version.
* **Context window:** Varies by release/version.
* Local use: Code-focused variants are widely used for dev tasks.

=== Phi (small, efficient) ===

* **Parameters/Weights:** Small models (often ~2–4B-class, depending on version).
* **Vocabulary:** Fixed per Phi release/tokenizer.
* **Context window:** Varies by version.
* Local use: Great for low-resource devices; fast inference.

==== Local runtimes on macOS ====

=== Ollama ===

* Purpose: Simplest local runner (download + run models easily).
* Works best with: Quantized GGUF models.

=== LM Studio ===

* Purpose: GUI app to download, run, and chat with local models.
* Works best with: Quantized GGUF models; easy model management.

=== llama.cpp ===

* Purpose: High-performance local inference engine for GGUF models.
* Works best with: Fine-grained control and optimization on CPU/Metal.
==== Glossary (hard terms) ====

* **parameter** /pəˈræmɪtər/: a learned weight value
* **weight** /weɪt/: a weight
* **vocabulary** /vəˈkæbjəˌleri/: the token dictionary (the set of tokens the model can emit)
* **token** /ˈtoʊkən/: a small unit (piece) of text
* **context window** /ˈkɑːntekst ˈwɪndoʊ/: the model's temporary memory span
* **proprietary** /prəˈpraɪəˌteri/: closed, not publicly released
* **open-weight** /ˌoʊpən ˈweɪt/: the weights are published
* **quantized** /ˈkwɑːntaɪzd/: numeric precision reduced (compressed)
* **runtime** /ˈrʌnˌtaɪm/: the environment/program that runs a model
* **variant** /ˈveriənt/: a variant/version
* **Mixture-of-Experts (MoE)** /ˈmɪkstʃər əv ˈekspɜːrts/: "many experts" architecture (only part of the model is activated per step)

===== 11. Local model size estimates on Mac =====

Disk/RAM use depends heavily on **quantization**. Typical sizes (rough guidance):

* FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
* Q8: ~8–9 GB
* Q4/Q5 (most common for local use):
  * 7B Q4: ~4–5 GB
  * 8B Q4: ~4.5–5.5 GB
  * Q5 adds ~0.5–1.5 GB

Macs use **unified memory**, so the CPU and GPU share RAM.

===== 12. Can you "build an LLM" on a MacBook Air M3? =====

Practical reality:

* Training a foundation model from scratch: not realistic on a laptop.
* Running (inference) a 7B/8B quantized model locally: realistic.
* Fine-tuning with LoRA: usually done on cloud GPUs; the laptop can prep data and run inference tests.

===== 13. Adapting a model to a domain like healthcare =====

There are three main approaches:

==== 13.1 Prompt / system instructions ====

* Changes only the **context** (runtime behavior).
* Does not change weights.

==== 13.2 RAG (recommended first) ====

* Adds domain documents (guidelines, protocols, FAQ) into context.
* No weight changes.
* Easy to update and safer for fresh knowledge.

==== 13.3 LoRA/QLoRA fine-tuning (Approach #3) ====

* Changes **parameters** via a small adapter.
* Best for **behavior**: formatting, refusal policy, questioning flow, tone.
* Not ideal for injecting large factual knowledge (use RAG for that).

===== 14. Approach #3: How LoRA fine-tuning works =====

==== 14.1 Concept ====

* Freeze the base model weights.
* Train a small **LoRA adapter** (delta weights).
* During inference: base + delta.
* Adapters can be enabled/disabled or swapped per domain.

==== 14.2 Practical workflow ====

- Step 1: Choose a base model (7B/8B is common).
- Step 2: Prepare an instruction->response dataset (high-quality, domain-shaped).
- Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
- Step 4: Export the adapter files (e.g., safetensors).
- Step 5: Load base model + adapter locally for inference.

==== 14.3 What to train for healthcare ====

Prefer training:

* structured response format
* asking clarifying questions
* refusal behavior and escalation cues
* citing sources (especially combined with RAG)

Avoid training on:

* private patient records
* uncontrolled "medical knowledge dumps"

===== 15. What comes after single-agent systems? =====

Common directions:

* **Multi-agent** setups: planner, executor, critic, safety checker
* **Self-critique** / verification loops
* **World models** / simulation before action
* Stronger **human-in-the-loop** controls

===== 16. Vocabulary notes: terms + IPA =====

* **token** /ˈtoʊkən/: unit of text used by the model (often a word piece)
* **vocabulary** /vəˈkæbjəˌleri/: the model's token dictionary
* **context** /ˈkɑːntekst/: the visible token window during inference
* **parameter** /pəˈræmɪtər/: a learned weight value in the network
* **quantization** /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
* **embedding** /ɪmˈbedɪŋ/: vector representation of tokens/items
* **attention** /əˈtenʃən/: mechanism that mixes information across tokens
* **softmax** /ˈsɔːftˌmæks/: converts scores into probabilities
* **hallucination** /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
* **agent** /ˈeɪdʒənt/: autonomous software using LLM + tools
* **tool calling** /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
* **RAG** /ræɡ/: Retrieval-Augmented Generation (retrieve docs, then generate)
* **fine-tune** /ˌfaɪnˈtuːn/: adapting a model with additional training
* **LoRA** /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
* **QLoRA** /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training

===== 17. Quick implementation checklist =====

- Need accurate, updatable domain knowledge: do **RAG** first.
- Need consistent behavior/formatting: add **LoRA**.
- Need task execution: build an **Agent** with:
  * strict JSON/tool outputs
  * schema validation
  * allowlisted tools + sandbox
  * logs + max steps + stop rules
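The agent items in the checklist (strict JSON outputs, schema validation, allowlisted tools, max steps) can be sketched as one small loop. This is a toy illustration: `fake_llm`, the tool table, and all names are invented for the example, and the fake model deliberately returns free-form text on its first attempt to exercise the parse -> validate -> retry path:

```python
import json

ALLOWED_TOOLS = {"add": lambda a, b: a + b}  # allowlist: only vetted tools
MAX_STEPS = 3                                # stop rule

def fake_llm(prompt: str, attempt: int) -> str:
    # Stand-in for a real model call. First reply is free-form on purpose
    # so the loop has to reject it and retry.
    if attempt == 0:
        return "Sure! I think the answer involves add."       # rejected
    return json.dumps({"tool": "add", "args": [2, 3]})        # accepted

def validate(raw: str):
    """parse -> validate: must be a JSON object naming an allowlisted tool."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    if action.get("tool") in ALLOWED_TOOLS and isinstance(action.get("args"), list):
        return action
    return None

def run_agent(goal: str):
    for step in range(MAX_STEPS):
        action = validate(fake_llm(goal, step))
        if action is None:
            continue  # invalid output: retry (a real agent would feed back the error)
        return ALLOWED_TOOLS[action["tool"]](*action["args"])  # execute
    raise RuntimeError("agent gave up after MAX_STEPS")

print(run_agent("compute 2 + 3"))  # -> 5
```

In a real agent, `fake_llm` becomes an actual model call (e.g., via a local runtime), `validate` grows into full schema validation, and each step is logged.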