====== System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning ======

===== 1. Overview =====

An **LLM** (Large Language Model) is a model that generates text by predicting the **next token** based on the current **context**. In real products, you typically combine:

* **LLM**: generates the next token (text output)
* **RAG**: retrieves external documents and injects them into context
* **Agent**: software that uses LLM + tools + feedback loop to complete tasks

===== LLM in one line =====

A practical mental model:

* **1) Learned experience = Parameters / Weights**
  * The "knowledge compressed into numbers" learned during training (e.g., 7B/8B/70B parameters).
* **2) What it can produce = Vocabulary**
  * The fixed set of tokens the model can output.
* **3) Temporary memory = Context window**
  * The maximum number of tokens the model can see at once (prompt + history + retrieved docs + output).

**Note:** The **tokenizer** is a separate component that converts text into token IDs (it is not the model's learned experience).

===== Glossary quick notes =====

* **parameter** /pəˈræmɪtər/: a learned weight value
* **weight** /weɪt/: a weight (one kind of parameter)
* **tokenizer** /ˈtoʊkəˌnaɪzər/: component that splits and encodes text into tokens
* **vocabulary** /vəˈkæbjəˌleri/: the token dictionary
* **context window** /ˈkɑːntekst ˈwɪndoʊ/: the model's short-term memory span

===== 2. Does an LLM "understand" or only generate probabilistic text? =====

Technically, an LLM is a **probabilistic sequence model**: it estimates a distribution over the next token given the previous tokens.

Key implications:

* It does not have consciousness or human-like understanding.
* It can *appear* to understand because it learned patterns from massive data.
* It does not inherently know whether its output is true or false.
* Bad, biased, or outdated training data can lead to bad answers.
* Ambiguous prompts can trigger "plausible but wrong" outputs (hallucination).

===== 3. What are tokens, and where do generated tokens come from? =====

==== 3.1 Tokens are not necessarily "words" ====

Tokens can be:

* word pieces, characters, punctuation, whitespace
* e.g., "Ha" + " Noi" can be separate tokens

==== 3.2 Where does the next token come from? ====

Generated tokens come from the model's **vocabulary** (token dictionary):

* The vocabulary is fixed at model creation time.
* At each step, the model outputs probabilities across the entire vocabulary and selects a token (greedy or sampling).

So:

* The model is not "fetching text from the internet" while generating.
* If you want fresh facts, you must use tools (RAG / browsing / APIs).

===== 4. Three frequently confused concepts =====

==== 4.1 Vocabulary size ====

The number of distinct tokens the model can output.

* Fixed (e.g., ~100K tokens, depending on tokenizer/model).

==== 4.2 Training tokens ====

The number of tokens in the dataset used for training.

* Often measured in trillions (T).
* Represents "experience volume," not a stored Q&A database.

==== 4.3 Context length ====

The maximum number of tokens the model can "see" at once.

* Includes prompt + chat history + retrieved docs + the output being generated.
* Exceed it and the model loses the earlier parts.

===== 5. What do 7B / 8B / 13B / 70B mean? =====

These refer to parameter count:

* **B = billion** parameters (weights).
* 7B ≈ 7 billion parameters, 70B ≈ 70 billion parameters.

Parameters are the model's learned numeric weights: its compressed "experience."

===== 6. What is inside an LLM model? =====

A typical LLM package includes:

* **Tokenizer**: rules to split text into tokens
* **Vocabulary**: mapping token <-> ID
* **Embedding table**: maps token IDs to vectors
* **Transformer layers**: attention + feed-forward blocks
* **Parameters (weights)**: billions of learned numbers
* **Output head**: converts the final representation to next-token probabilities (softmax)

What is NOT inside:

* No database of memorized articles or Q&A pairs.
* No built-in truth checker.

===== 7. Comparing image vector search vs LLM generation =====

==== 7.1 Image similarity search (your example) ====

* Image -> encoder -> vector
* Store vectors in a DB (FAISS / Milvus / pgvector)
* Query vector -> nearest neighbors via cosine similarity
* Returns an existing item from the DB

==== 7.2 LLM generation ====

* Text -> tokens -> embeddings
* Attention compares vectors internally to build a context representation
* Output is a probability distribution over the vocabulary
* Select the next token, repeat

Core difference:

* Image search returns a stored item.
* An LLM produces new text token by token.

==== 7.3 When an LLM becomes "search-like" (RAG) ====

RAG adds:

* Text -> embedding -> vector DB retrieval -> inject documents -> LLM answers based on retrieved sources.

So:

* Vector DB = searching
* LLM = composing/explaining

===== 8. What is an AI Agent? =====

An **Agent** is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.

Agent formula:

* **Agent = program logic + LLM + tools + memory + feedback loop**

Compared to a normal chatbot:

* Chatbot: one-shot answer
* Agent: plan -> act -> check -> fix -> repeat

===== 9. Building a minimal agent =====

Required components:

* **Goal**: clear objective and success criteria
* **Tools**: shell / API / DB / filesystem / etc.
* **Memory**: logs / state (short-term and optionally long-term)
* **Loop**: repeat until success or a stop condition
* **Guardrails**: tool allowlist, sandboxing, max steps, timeouts, logging

==== 9.1 Why enforce structured outputs? ====

Because free-form text is hard to validate. Common patterns:

* JSON-only output
* JSON + schema validation
* Function calling / tool calling

This allows:

* parse -> validate -> execute -> verify -> retry if needed

===== 10. Popular models and which ones can run locally =====

==== Quick mental model fields (apply to every LLM) ====

For each model below, capture:

* **1) Learned experience = Parameters / Weights** (e.g., 7B/8B/70B)
* **2) What it can produce = Vocabulary** (tokenizer + vocab size; fixed per model)
* **3) Temporary memory = Context window** (max tokens visible at once)

==== Cloud-only (generally not downloadable) ====

=== GPT (OpenAI) ===

* **Parameters/Weights:** Not publicly disclosed (varies by product/version).
* **Vocabulary:** Proprietary tokenizer; vocab size not consistently published.
* **Context window:** Varies by model tier/version (check product docs).
* Notes: Strong general reasoning + tool ecosystem.

=== Claude (Anthropic) ===

* **Parameters/Weights:** Not publicly disclosed.
* **Vocabulary:** Proprietary tokenizer.
* **Context window:** Varies by model tier/version (check product docs).
* Notes: Strong long-form writing and code assistance.

=== Gemini (Google) ===

* **Parameters/Weights:** Not publicly disclosed.
* **Vocabulary:** Proprietary tokenizer.
* **Context window:** Varies by model tier/version (check product docs).
* Notes: Strong multimodal and large-context options (depending on version).

==== Open-weight / open-source (often runnable locally) ====

=== LLaMA family (Meta) ===

* **Parameters/Weights:** Common sizes include 7B/8B/13B/70B (depends on generation).
* **Vocabulary:** Fixed per LLaMA generation (tokenizer + vocab size depend on version).
* **Context window:** Varies by generation (older versions smaller; newer may be larger).
* Local use: Best with quantized GGUF via llama.cpp / Ollama / LM Studio.

=== Mistral / Mixtral ===

* **Parameters/Weights:** Mistral is commonly 7B-class; Mixtral uses MoE (Mixture-of-Experts) variants.
* **Vocabulary:** Fixed per model release (tokenizer-specific).
* **Context window:** Varies by release/version.
* Local use: Mistral 7B-class is popular for fast local inference.
=== Qwen ===

* **Parameters/Weights:** Multiple sizes (small -> large; common local picks: ~7B-class).
* **Vocabulary:** Fixed per Qwen generation (tokenizer-specific).
* **Context window:** Varies by release/version (some versions support larger contexts).
* Local use: Often strong multilingual performance.

=== DeepSeek (especially strong for code variants) ===

* **Parameters/Weights:** Multiple sizes; common local coder models are ~6–7B-class.
* **Vocabulary:** Fixed per model/tokenizer version.
* **Context window:** Varies by release/version.
* Local use: Code-focused variants are widely used for dev tasks.

=== Phi (small, efficient) ===

* **Parameters/Weights:** Small models (often ~2–4B-class, depending on version).
* **Vocabulary:** Fixed per Phi release/tokenizer.
* **Context window:** Varies by version.
* Local use: Great for low-resource devices; fast inference.

==== Local runtimes on macOS ====

=== Ollama ===

* Purpose: Simplest local runner (download + run models easily).
* Works best with: Quantized GGUF models.

=== LM Studio ===

* Purpose: GUI app to download, run, and chat with local models.
* Works best with: Quantized GGUF models; easy model management.

=== llama.cpp ===

* Purpose: High-performance local inference engine for GGUF models.
* Works best with: Fine-grained control and optimization on CPU/Metal.
==== Glossary (hard terms) ====

* **parameter** /pəˈræmɪtər/: a learned weight value
* **weight** /weɪt/: a weight
* **vocabulary** /vəˈkæbjəˌleri/: the token dictionary (the set of tokens the model can emit)
* **token** /ˈtoʊkən/: a small unit (piece) of text
* **context window** /ˈkɑːntekst ˈwɪndoʊ/: the model's temporary memory span
* **proprietary** /prəˈpraɪəˌteri/: closed, not publicly released
* **open-weight** /ˌoʊpən ˈweɪt/: the weights are published
* **quantized** /ˈkwɑːntaɪzd/: numeric precision reduced (compressed)
* **runtime** /ˈrʌnˌtaɪm/: the environment/program that runs a model
* **variant** /ˈveriənt/: a variant/version
* **Mixture-of-Experts (MoE)** /ˈmɪkstʃər əv ˈekspɜːrts/: "many experts" architecture (only part of the model is activated per step)

===== 11. Local model size estimates on Mac =====

Disk/RAM use depends heavily on **quantization**. Typical sizes (rough guidance):

* FP16 (no quantization): ~14–16 GB for 7B/8B (not recommended on thin laptops)
* Q8: ~8–9 GB
* Q4/Q5 (most common for local use):
  * 7B Q4: ~4–5 GB
  * 8B Q4: ~4.5–5.5 GB
  * Q5 adds ~0.5–1.5 GB

Macs use **unified memory**, so the CPU and GPU share RAM.

===== 12. Can you "build an LLM" on a MacBook Air M3? =====

Practical reality:

* Training a foundation model from scratch: not realistic on a laptop.
* Running (inference) a 7B/8B quantized model locally: realistic.
* Fine-tuning with LoRA: usually done on cloud GPUs; the laptop can prep data and run inference tests.

===== 13. Adapting a model to a domain like healthcare =====

There are three main approaches:

==== 13.1 Prompt / system instructions ====

* Changes only the **context** (runtime behavior).
* Does not change weights.

==== 13.2 RAG (recommended first) ====

* Adds domain documents (guidelines, protocols, FAQ) into context.
* No weight changes.
* Easy to update and safer for fresh knowledge.

==== 13.3 LoRA/QLoRA fine-tuning (Approach #3) ====

* Changes **parameters** via a small adapter.
* Best for **behavior**: formatting, refusal policy, questioning flow, tone.
* Not ideal for injecting large factual knowledge (use RAG for that).

===== 14. Approach #3: How LoRA fine-tuning works =====

==== 14.1 Concept ====

* Freeze the base model weights.
* Train a small **LoRA adapter** (delta weights).
* During inference: base + delta.
* Adapters can be enabled/disabled or swapped per domain.

==== 14.2 Practical workflow ====

- Step 1: Choose a base model (7B/8B is common).
- Step 2: Prepare an instruction->response dataset (high-quality, domain-shaped).
- Step 3: Train LoRA/QLoRA using a framework (e.g., HuggingFace PEFT / Axolotl), typically on cloud GPUs.
- Step 4: Export the adapter files (e.g., safetensors).
- Step 5: Load base model + adapter locally for inference.

==== 14.3 What to train for healthcare ====

Prefer training:

* structured response format
* asking clarifying questions
* refusal behavior and escalation cues
* citing sources (especially combined with RAG)

Avoid training on:

* private patient records
* uncontrolled "medical knowledge dumps"

===== 15. What comes after single-agent systems? =====

Common directions:

* **Multi-agent** setups: planner, executor, critic, safety checker
* **Self-critique** / verification loops
* **World models** / simulation before action
* Stronger **human-in-the-loop** controls

===== 16. Vocabulary notes: terms + IPA =====

* **token** /ˈtoʊkən/: unit of text used by the model (often a word piece)
* **vocabulary** /vəˈkæbjəˌleri/: the model's token dictionary
* **context** /ˈkɑːntekst/: the visible token window during inference
* **parameter** /pəˈræmɪtər/: a learned weight value in the network
* **quantization** /ˌkwɑːntəˈzeɪʃən/: reducing numeric precision to shrink models
* **embedding** /ɪmˈbedɪŋ/: vector representation of tokens/items
* **attention** /əˈtenʃən/: mechanism that mixes information across tokens
* **softmax** /ˈsɔːftˌmæks/: converts scores into probabilities
* **hallucination** /həˌluːsɪˈneɪʃən/: plausible-sounding but incorrect output
* **agent** /ˈeɪdʒənt/: autonomous software using LLM + tools
* **tool calling** /tuːl ˈkɔːlɪŋ/: structured calls to tools/functions
* **RAG** /ræɡ/: Retrieval-Augmented Generation (retrieve docs, then generate)
* **fine-tune** /ˌfaɪnˈtuːn/: adapting a model with additional training
* **LoRA** /ˈloʊrə/: Low-Rank Adaptation (lightweight fine-tune adapter)
* **QLoRA** /kjuːˈloʊrə/: LoRA + quantization for memory-efficient training

===== 17. Quick implementation checklist =====

- Need accurate, updatable domain knowledge: do **RAG** first.
- Need consistent behavior/formatting: add **LoRA**.
- Need task execution: build an **Agent** with:
  * strict JSON/tool outputs
  * schema validation
  * allowlisted tools + sandbox
  * logs + max steps + stop rules
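The agent items in the checklist (strict JSON outputs, schema validation, allowlisted tools, max steps) can be sketched as one small loop. This is a toy illustration: `fake_llm`, the tool table, and all names are invented for the example, and the fake model deliberately returns free-form text on its first attempt to exercise the parse -> validate -> retry path:

```python
import json

ALLOWED_TOOLS = {"add": lambda a, b: a + b}  # allowlist: only vetted tools
MAX_STEPS = 3                                # stop rule

def fake_llm(prompt: str, attempt: int) -> str:
    # Stand-in for a real model call. First reply is free-form on purpose
    # so the loop has to reject it and retry.
    if attempt == 0:
        return "Sure! I think the answer involves add."       # rejected
    return json.dumps({"tool": "add", "args": [2, 3]})        # accepted

def validate(raw: str):
    """parse -> validate: must be a JSON object naming an allowlisted tool."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    if action.get("tool") in ALLOWED_TOOLS and isinstance(action.get("args"), list):
        return action
    return None

def run_agent(goal: str):
    for step in range(MAX_STEPS):
        action = validate(fake_llm(goal, step))
        if action is None:
            continue  # invalid output: retry (a real agent would feed back the error)
        return ALLOWED_TOOLS[action["tool"]](*action["args"])  # execute
    raise RuntimeError("agent gave up after MAX_STEPS")

print(run_agent("compute 2 + 3"))  # -> 5
```

In a real agent, `fake_llm` becomes an actual model call (e.g., via a local runtime), `validate` grows into full schema validation, and each step is logged.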