COMPUTATIONAL ARCHITECTURE FOR AI
The Physicality of Computation
- "The Cloud" is a marketing abstraction for remote data centers. All software runs on physical hardware.
- Computation is a physical process requiring energy, matter, and time.
- AI is particularly resource-intensive, pushing the limits of modern semiconductor physics.
Intent: Dispel the "magic" of software. AI performance is limited by physics, not just code.
COMPUTATIONAL ARCHITECTURE FOR AI
The Computational Stack: An Overview
- Logic (CPU): General sequential processing.
- Memory (RAM): Volatile system workspace.
- Acceleration (GPU): Parallel processing units.
- Throughput (VRAM Bandwidth): Speed of data movement.
- Storage (SSD): Non-volatile persistence.
- Bottleneck Principle: The system is only as fast as its slowest component.
Intent: Introduce the system architecture as a series of constraints and bottlenecks.
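A back-of-envelope sketch of the bottleneck principle, in Python: for a square matrix multiply, compare how long the arithmetic would take against how long the data movement would take. The peak-FLOPS and bandwidth figures are illustrative assumptions, not measurements of any specific chip.

```python
# Roofline-style estimate: is an n x n matrix multiply limited by compute
# or by memory bandwidth? Hardware numbers below are assumed for illustration.

def limiting_factor(n, peak_flops=1e13, mem_bandwidth=5e11, bytes_per_elem=2):
    flops = 2 * n**3                          # multiply-adds in the matmul
    bytes_moved = 3 * n**2 * bytes_per_elem   # read A and B, write C once
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return "compute-bound" if compute_time > memory_time else "memory-bound"

for n in (32, 4096):
    print(n, limiting_factor(n))   # small matmuls starve on data movement
```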
COMPUTATIONAL ARCHITECTURE FOR AI
CPU (Central Processing Unit): Sequential Logic
- Optimized for Sequential Processing (complex logic, one step at a time).
- Role: The "Conductor" — handles OS tasks, data preprocessing, and logic flow.
- Limitation: Low core count makes it inefficient for the massive matrix multiplication required by AI.
Intent: Define the CPU's role. It coordinates but lacks the parallelism needed for AI at scale.
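To make the limitation concrete, a small comparison (an illustration, not a benchmark of course hardware): a pure-Python triple loop performs one multiply-add at a time, while NumPy hands the same matrix product to optimized, parallel BLAS routines.

```python
import time
import numpy as np

def matmul_sequential(a, b):
    """Naive matrix multiply: one scalar multiply-add at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out

n = 200
a, b = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.perf_counter(); matmul_sequential(a.tolist(), b.tolist()); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); a @ b; t_vec = time.perf_counter() - t0
print(f"sequential loop: {t_seq:.3f}s   vectorized BLAS: {t_vec:.5f}s")
```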
COMPUTATIONAL ARCHITECTURE FOR AI
RAM (System Memory): Volatile Workspace
- Volatile memory that stores data currently in use by the CPU.
- Role: Buffering data before it is sent to the GPU.
- Constraint: Insufficient RAM leads to "swapping" (spilling memory to the much slower SSD), which cripples performance.
Intent: Explain the role of system memory in the pipeline and the penalty of running out.
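A minimal sketch of guarding against swapping, assuming the psutil library is available; the 20% headroom factor and the example model size are illustrative assumptions.

```python
import psutil  # cross-platform system-memory queries

def fits_in_ram(model_bytes, headroom=1.2):
    """Return True if the model plus some headroom fits in free RAM."""
    return model_bytes * headroom <= psutil.virtual_memory().available

model_bytes = 7_000_000_000 * 2   # e.g. 7B parameters stored at 16-bit
print("safe to load" if fits_in_ram(model_bytes) else "risk of swapping")
```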
COMPUTATIONAL ARCHITECTURE FOR AI
GPU (Graphics Processing Unit): Parallel Acceleration
- Optimized for Parallel Processing (thousands of simple operations simultaneously).
- SIMD Architecture: Single Instruction, Multiple Data.
- Originally built for rendering pixels, now the primary engine for the matrix operations at the heart of neural networks.
Intent: Explain why AI requires GPUs. Their architecture is mathematically aligned with matrix operations.
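A minimal data-parallel sketch using PyTorch (an assumed tool choice, not part of the slide): one matrix product issued from Python, executed as the same multiply-add across thousands of GPU cores; it falls back to the CPU if no CUDA device is present.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Weights and activations of a single neural-network layer.
W = torch.randn(4096, 4096, device=device)
x = torch.randn(4096, 4096, device=device)

y = W @ x   # one instruction stream, applied to millions of elements at once
print(y.shape, "on", device)
```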
COMPUTATIONAL ARCHITECTURE FOR AI
VRAM (Video Random Access Memory): The Critical Bottleneck
- High-bandwidth memory located on the graphics card, directly alongside the GPU.
- Constraint: The entire active model must usually fit into VRAM for efficient inference.
- If the model size exceeds VRAM capacity, inference either fails outright or falls back to offloading with extreme latency.
Intent: Identify VRAM as the single most critical factor for offline AI performance.
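A back-of-envelope check of the constraint; the 20% overhead allowance for activations and the KV cache is a rough assumption, not a measured value.

```python
def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead=1.2):
    """Estimate weight footprint and compare it to the card's VRAM."""
    weight_gb = params_billion * 1e9 * (bits_per_weight / 8) / 1e9
    return weight_gb * overhead <= vram_gb, weight_gb

for params in (7, 70):
    ok, gb = fits_in_vram(params, bits_per_weight=16, vram_gb=24)
    print(f"{params}B model at 16-bit: ~{gb:.0f} GB of weights, fits in 24 GB: {ok}")
```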
COMPUTATIONAL ARCHITECTURE FOR AI
Storage (NVMe SSD): Throughput & Latency
- Non-volatile storage for model weights when not in use.
- Metric: Read/Write Speed (Throughput).
- Relevance: Affects Model Loading Time. Slow drives result in long wait times before inference begins.
Intent: Distinguish between "storage" and "memory." SSD speed impacts the "wake up" time of a model.
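A rough estimate of that "wake up" time: model size on disk divided by sustained read throughput. The drive speeds below are typical published figures, used here only as assumptions.

```python
def load_time_seconds(model_gb, read_gb_per_s):
    """Ideal-case load time: file size over sustained read speed."""
    return model_gb / read_gb_per_s

model_gb = 4.5  # e.g. a 7B model quantized to ~4 bits per weight
for drive, speed in [("SATA SSD (~0.55 GB/s)", 0.55), ("NVMe Gen4 (~7 GB/s)", 7.0)]:
    print(f"{drive}: ~{load_time_seconds(model_gb, speed):.1f} s to load")
```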
COMPUTATIONAL ARCHITECTURE FOR AI
Compute Constraints: Parameter Count vs. Hardware
- Larger Models (More Parameters) = Higher VRAM requirement.
- Quantization: Reducing the precision of weights (e.g., 16-bit to 4-bit) to fit larger models into smaller VRAM.
- Engineering Trade-off: Precision vs. Performance.
Intent: Link software complexity to hardware requirements. Introduce quantization as an optimization technique.
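The arithmetic behind the trade-off, for a 7-billion-parameter model; real quantized formats add small per-block scale factors, so treat these numbers as lower bounds.

```python
PARAMS = 7e9  # 7-billion-parameter model

for bits in (32, 16, 8, 4):
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:4.1f} GB")   # 28 / 14 / 7 / 3.5 GB
```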
COMPUTATIONAL ARCHITECTURE FOR AI
Lab Infrastructure: The "Local-First" Spec
- Student Nodes: Ryzen 7 / 32GB RAM. Optimized for CPU-based inference and efficient quantized models.
- Objective: Zero-latency, privacy-preserving, offline capability.
- Democratization: Access to professional-grade compute without recurring subscription barriers.
Intent: Justify the specific hardware chosen for the course in terms of access and capability.
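One possible way to exercise such a node, sketched here: CPU-only inference on a quantized GGUF model via the llama-cpp-python bindings. The model file, context size, and thread count are placeholders rather than the course's mandated setup; any comparable local runner (e.g. Ollama) would serve the same purpose.

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

llm = Llama(
    model_path="models/llama-7b-q4.gguf",  # hypothetical quantized model file
    n_ctx=2048,    # context window
    n_threads=8,   # roughly matches an 8-core Ryzen 7
)

result = llm("Explain the bottleneck principle in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```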
COMPUTATIONAL ARCHITECTURE FOR AI
Server Infrastructure: High-Performance Compute (HPC)
- Teacher Core: NVIDIA RTX 4090 / 5070 Ti.
- Role: Handling "Heavy" Inference (large parameter models) and Batch Processing.
- Networked Inference: Offloading tasks from student nodes to the central core when local hardware is insufficient.
Intent: Explain the tiered architecture of the lab and the role of the central server.
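A minimal sketch of offloading, assuming the teacher core exposes an OpenAI-compatible completions endpoint (as llama.cpp's built-in server and similar tools can); the address, port, and model name are placeholders.

```python
import requests

SERVER = "http://192.168.1.10:8080/v1/completions"  # hypothetical teacher core

def remote_complete(prompt, max_tokens=128):
    """Send a prompt to the central server when local hardware is insufficient."""
    payload = {"model": "local-70b", "prompt": prompt, "max_tokens": max_tokens}
    response = requests.post(SERVER, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]

print(remote_complete("Summarise why VRAM is the critical constraint for local AI."))
```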
COMPUTATIONAL ARCHITECTURE FOR AI
Distributed vs. Local Compute
- Distributed (Cloud): Massive GPU clusters, high scalability, but high cost and privacy risk.
- Local (Edge): Fixed hardware constraints, zero variable cost, total data sovereignty.
- The "Edge AI" movement: Bringing intelligence to the device rather than the data center.
Intent: Contrast the lab model with the industry standard cloud model.
COMPUTATIONAL ARCHITECTURE FOR AI
Data Sovereignty & Privacy
- Sovereignty: You own the input, the model processing, and the output.
- No telemetry, no data scraping, no "training on user data."
- Critical for sensitive fields (medical, legal, proprietary research).
Intent: Explain the "Why" of offline AI beyond just cost. It's about ownership.
COMPUTATIONAL ARCHITECTURE FOR AI
The Trade-offs of Local AI
- Hardware Constraints: You cannot run massive models (like GPT-4) locally.
- Maintenance: You are the sysadmin. Drivers, environments, and updates are your responsibility.
- Thermodynamics: High-performance compute draws significant power and dissipates it as heat.
Intent: Maintain intellectual honesty. Great power comes with great maintenance responsibility.
COMPUTATIONAL ARCHITECTURE FOR AI
Computer Science Alignment
- Understanding Hardware/Software Integration.
- Resource Management & Optimization.
- System Architecture Design.
Intent: Map these concepts to the curriculum standards.
COMPUTATIONAL ARCHITECTURE FOR AI
Transition to Prompt Engineering
- "We have the theory and the machine. Now, how do we program the probabilistic engine?"
Intent: Set up the next module on prompting and spec-driven development.