
System Notes: LLM, Tokens, Agents, Local Models, RAG & LoRA Fine-Tuning

1. Overview

An LLM (Large Language Model) is a model that generates text by predicting the next token based on the current context.

In real products, you typically combine:

LLM in one line

A practical mental model:

Note: The Tokenizer is a separate component that converts text into token IDs (it is not the model’s learned experience).

Glossary quick notes

2. Does an LLM “understand” or only generate probabilistic text?

Technically, an LLM is a probabilistic sequence model: it estimates a distribution over the next token given previous tokens.
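A minimal sketch of that idea: turn a vector of logits into a probability distribution with a softmax and sample one token ID from it. The logits, vocabulary size, and temperature value here are all made up for illustration; a real model produces logits over tens of thousands of tokens.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Sample a token id from a softmax over logits (toy illustration)."""
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for token_id, p in enumerate(probs):
        cum += p
        if r < cum:
            return token_id, probs
    return len(probs) - 1, probs             # guard against float rounding

# Toy logits for a 4-token vocabulary; a higher logit means "more likely next".
token_id, probs = sample_next_token([2.0, 1.0, 0.1, -1.0], temperature=0.7, seed=0)
```

Lower temperature sharpens the distribution (more deterministic); higher temperature flattens it (more diverse output).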

Key implications:

3. What are tokens, and where do generated tokens come from?

3.1 Tokens are not necessarily “words”

Tokens can be:

3.2 Where does the next token come from?

Generated tokens come from the model’s vocabulary (token dictionary):

So:
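To make "tokens are not words" concrete, here is a toy tokenizer with a hand-written vocabulary and a greedy longest-match rule. Both the `VOCAB` entries and the matching rule are invented for illustration; real tokenizers (BPE, SentencePiece) learn their subword vocabulary from data.

```python
# Toy vocabulary: string pieces -> integer token IDs.
VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}

def encode(text):
    """Greedy longest-match tokenization against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

ids = encode("unbelievable tokens")  # -> [0, 1, 2, 5, 3, 4]
```

Note that one word ("unbelievable") became three tokens, and generation works in reverse: the model emits IDs, and the tokenizer's inverse mapping turns them back into text.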

4. Three frequently confused concepts

4.1 Vocabulary size

Number of distinct tokens the model can output.

4.2 Training tokens

Number of tokens in the dataset used for training.

4.3 Context length

Maximum number of tokens the model can “see” at once.
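One practical consequence of a fixed context length: anything outside the window is invisible to the model, so long conversations must be truncated (or summarized). A sketch, with hypothetical numbers (8192-token window, 512 tokens reserved for the reply):

```python
def fit_to_context(token_ids, context_length, reserve_for_output=64):
    """Keep only the most recent tokens that fit in the model's window.
    Older tokens simply fall out of context and cannot influence output."""
    budget = context_length - reserve_for_output
    if budget <= 0:
        raise ValueError("context too small for the requested output reserve")
    return token_ids[-budget:]

history = list(range(10_000))   # pretend this is a 10k-token conversation
window = fit_to_context(history, context_length=8192, reserve_for_output=512)
# keeps the last 7680 token ids; the first 2320 are no longer "seen"
```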

5. What do 7B / 8B / 13B / 70B mean?

These refer to parameter count:

Parameters are the model’s learned numeric weights—its compressed “experience.”
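The parameter count directly drives memory needs. A rough back-of-envelope (weights only, ignoring activations and KV cache):

```python
def param_memory_gb(n_params, bytes_per_param):
    """Approximate memory to hold just the weights."""
    return n_params * bytes_per_param / 1e9

# A "7B" model has ~7 billion weights. In fp16 (2 bytes per weight):
fp16_gb = param_memory_gb(7e9, 2)   # ~14 GB
```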

6. What is inside an LLM model?

A typical LLM package includes:

What is NOT inside:

7. Comparing image vector search vs LLM generation

7.1 Image similarity search (your example)

7.2 LLM generation

Core difference:

7.3 When LLM becomes “search-like” (RAG)

RAG adds:

So:
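The retrieval half of RAG can be sketched in a few lines: embed documents and the query, rank by cosine similarity, and stuff the top hits into the prompt. The 3-dimensional "embeddings" and document names below are toy values for illustration; a real system uses an embedding model with hundreds of dimensions and a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy document embeddings (made-up numbers).
docs = {
    "drug interactions for warfarin": [0.9, 0.1, 0.0],
    "pasta recipes":                  [0.0, 0.2, 0.9],
    "anticoagulant dosing guidance":  [0.8, 0.3, 0.1],
}
query_vec = [0.85, 0.2, 0.05]   # embedding of the user's question

# Retrieve the top-k most similar documents, then build the prompt from them.
top = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:2]
prompt = "Answer using only these sources:\n" + "\n".join(top) + "\n\nQ: ..."
```

This is the sense in which RAG makes an LLM "search-like": generation is still probabilistic, but it is conditioned on retrieved text.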

8. What is an AI Agent?

An Agent is a software system that uses an LLM to decide actions, execute tools, observe results, and iterate.

Agent formula:

Compared to a normal chatbot:
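The decide → act → observe → iterate loop can be sketched end to end. Everything here is a stand-in: `fake_llm` is a scripted function where a real system would call a model API, and the two allowlisted tools are invented for the example. What matters is the shape: JSON actions, an allowlist, a max-steps stop rule.

```python
import json

# Allowlisted tools the agent may call; anything else is rejected.
TOOLS = {
    "add":   lambda args: args["a"] + args["b"],
    "upper": lambda args: args["text"].upper(),
}

def fake_llm(observations):
    """Stand-in for a real LLM call; must return a JSON action string.
    Scripted here: first request a tool call, then finish with the result."""
    if not observations:
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return json.dumps({"tool": "finish", "args": {"answer": observations[-1]}})

def run_agent(max_steps=5):
    observations = []
    for _ in range(max_steps):               # stop rule: bounded iterations
        action = json.loads(fake_llm(observations))
        if action["tool"] == "finish":
            return action["args"]["answer"]
        if action["tool"] not in TOOLS:      # enforce the allowlist
            observations.append("error: tool not allowed")
            continue
        result = TOOLS[action["tool"]](action["args"])
        observations.append(result)          # observe, then iterate
    return None                              # gave up: hit max steps

answer = run_agent()  # -> 5
```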

9. Building a minimal agent

Required components:

9.1 Why enforce structured outputs?

Because free-form text is hard to validate.

Common patterns:

This allows:
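A minimal validation layer, assuming a hypothetical action schema with a `tool` string and an `args` object: parse the model's output as JSON, check required fields and types, and return an error message (suitable for a retry prompt) instead of acting on malformed output.

```python
import json

# Hypothetical action schema: required fields and their expected types.
SCHEMA = {"tool": str, "args": dict}

def parse_action(raw):
    """Parse and validate an LLM's JSON output before acting on it.
    Returns (action, None) on success or (None, error_message) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, "not valid JSON; ask the model to retry"
    for field, ftype in SCHEMA.items():
        if field not in data:
            return None, f"missing field: {field}"
        if not isinstance(data[field], ftype):
            return None, f"wrong type for field: {field}"
    return data, None

action, err = parse_action('{"tool": "search", "args": {"query": "x"}}')
bad, err2 = parse_action('not json at all')
```

For anything beyond a toy, a real schema validator (e.g., a JSON Schema library or Pydantic) is the usual choice; the point is that validation happens before execution.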

Quick mental model fields (apply to every LLM)

For each model below, capture:

Cloud-only (generally not downloadable)

GPT (OpenAI)

Claude (Anthropic)

Gemini (Google)

Open-weight/open-source (often runnable locally)

LLaMA family (Meta)

Mistral / Mixtral

Qwen

DeepSeek (the code-focused variants are especially strong)

Phi (small, efficient)

Local runtimes on macOS

Ollama

LM Studio

llama.cpp

Glossary (hard terms)

11. Local model size estimates on Mac

Disk and RAM requirements depend heavily on quantization.

Typical sizes (rough guidance):

Macs use unified memory, so the CPU and GPU share the same RAM pool.
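The rough sizes follow from simple arithmetic: parameters × bits per weight. A sketch (quantization formats add a small overhead for scales and metadata that this ignores):

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate on-disk / in-RAM size of just the weights.
    Ignores quantization metadata overhead and the KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at common precisions:
sizes = {bits: round(weights_size_gb(7e9, bits), 1) for bits in (16, 8, 4)}
# {16: 14.0, 8: 7.0, 4: 3.5}
```

This is why 4-bit quantization is the default for laptops: a 7B model drops from ~14 GB to ~3.5 GB of weights.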

12. Can you “build an LLM” on MacBook Air M3?

Practical reality:

13. Adapting a model to a domain like healthcare

There are three main approaches:

13.1 Prompt/System instructions (Approach #1)

13.2 RAG (Approach #2)

13.3 LoRA/QLoRA fine-tuning (Approach #3)

14. Approach #3: How LoRA fine-tuning works

14.1 Concept

14.2 Practical workflow

  1. Choose a base model (7B/8B is common).
  2. Prepare an instruction→response dataset (high-quality, domain-shaped).
  3. Train LoRA/QLoRA with a framework (e.g., Hugging Face PEFT or Axolotl), typically on cloud GPUs.
  4. Export the adapter files (e.g., safetensors).
  5. Load the base model + adapter locally for inference.
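Why LoRA is cheap enough for this workflow: instead of training a full weight update dW (d_out × d_in), it trains two low-rank factors B (d_out × r) and A (r × d_in) with dW ≈ B·A. The layer size and rank below are illustrative, but typical for 7B-class models:

```python
def lora_params(d_in, d_out, rank):
    """Trainable values for a full update vs. a LoRA update of one layer."""
    full = d_in * d_out            # dense dW: d_out x d_in
    lora = rank * (d_in + d_out)   # B: d_out x r, plus A: r x d_in
    return full, lora

# A 4096x4096 attention projection at rank 8:
full, lora = lora_params(4096, 4096, rank=8)
# full = 16,777,216 values; lora = 65,536 (~0.4% of the dense update)
```

Summed over all adapted layers, this is why the exported adapter is megabytes while the base model is gigabytes.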

14.3 What to train for healthcare

Prefer training:

Avoid training:

15. What comes after single-agent systems?

Common directions:

16. Vocabulary notes: terms + IPA

17. Quick implementation checklist

  1. Need accurate, updatable domain knowledge: do RAG first.
  2. Need consistent behavior/formatting: add LoRA.
  3. Need task execution: build an Agent with:
    • strict JSON/tool outputs
    • schema validation
    • allowlisted tools + sandbox
    • logs + max steps + stop rules