Project 1 - HPC: Scaling Laws for Agentic Systems

Preliminary Step

Before starting the project, convert the course notebooks into standalone Python scripts. Jupyter is impractical on the cluster, and batch submissions require executable .py files.

Produce the following scripts:

06_full_agent.py: the complete single-agent Physics Research Assistant
08_multi_agent_team.py: the 3-agent sequential CrewAI team
09_debate_agent.py: the 3-agent Debate crew
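
One way to do the conversion is `jupyter nbconvert --to script`. Where that tool is not available, the sketch below does a rough equivalent with the standard library only: it reads the notebook JSON, keeps the code cells, and comments out IPython magics and shell escapes. The function name `notebook_to_script` is illustrative, and the magic detection is a simple heuristic, so the output still needs a manual review.

```python
import json
from pathlib import Path


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a notebook to a plain .py file.

    Returns the number of code cells written.
    """
    nb = json.loads(Path(ipynb_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Heuristic: lines starting with % or ! are IPython magics or
            # shell escapes, which are not valid Python. Comment them out.
            if line.lstrip().startswith(("%", "!")):
                line = "# " + line
            lines.append(line)
        if lines:
            chunks.append("\n".join(lines))
    Path(py_path).write_text("\n\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)
```

A call such as `notebook_to_script("06_full_agent.ipynb", "06_full_agent.py")` would produce the first script; the `.ipynb` file name here is an assumption, so substitute the actual notebook names from the course material.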

Goal

Empirically measure how agent performance scales along two axes: model size (vertical scaling) and number of agents (horizontal scaling). Produce quantitative scaling plots and discuss whether agentic systems exhibit predictable scaling behaviour analogous to LLM scaling laws.

Part 1 - Model Size Scaling (single agent)

Run the single-agent Physics Research Assistant (06_full_agent.py) on a fixed set of tasks across 4 model sizes:

Model	Parameters	Approx. GPU memory
`qwen2.5:1.5b`	1.5B	~1 GB
`qwen2.5:3b`	3B	~2 GB
`qwen2.5:7b`	7B	~4.5 GB
`qwen2.5:14b`	14B	~9 GB
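
The sweep over the four model tags above can be organised as a small driver loop. The sketch below is one possible layout, not the required one: `run_agent` is a hypothetical callable taking a model tag and a prompt and returning the answer text, and it has to be wired to whatever entry point the converted script exposes. The record fields are likewise illustrative.

```python
import time

MODELS = ["qwen2.5:1.5b", "qwen2.5:3b", "qwen2.5:7b", "qwen2.5:14b"]


def run_sweep(run_agent, tasks, models=MODELS):
    """Run every (model, task) pair once and return one record per run.

    run_agent: hypothetical callable (model_tag, prompt) -> answer text.
    tasks: dict mapping a task id to its prompt.
    """
    records = []
    for model in models:
        for task_id, prompt in tasks.items():
            # perf_counter is a monotonic clock; time.time() works as well.
            start = time.perf_counter()
            try:
                answer, error = run_agent(model, prompt), None
            except Exception as exc:
                # A crashed run is still a data point, so record it.
                answer, error = None, repr(exc)
            records.append({
                "model": model,
                "task": task_id,
                "seconds": time.perf_counter() - start,
                "answer": answer,
                "error": error,
            })
    return records
```

Catching exceptions per run keeps one failure on a small model from aborting the whole batch job.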

Use the same 3 tasks for every model, ordered by difficulty:

Task E (Easy) - knowledge recall:

"What is the critical temperature of the 2D Ising model on a square lattice with nearest-neighbour interactions?"

Task M (Medium) - tool use + calculation:

"Compute the exact magnetisation |m| at T = 2.0 J/k_B for the 2D Ising model using Onsager's formula. Show your calculation."

Task H (Hard) - multi-step reasoning:

"Run Monte Carlo simulations of the 2D Ising model at T = 2.00, 2.269, and 3.00 on a 32×32 lattice with 10,000 steps. Compare the simulated magnetisation with the theoretical prediction at each temperature. Flag any inconsistencies and explain their origin."

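Reference values for the three tasks follow from the exact solution of the square-lattice model: T_c = 2 / ln(1 + √2) in units of J/k_B, and |m| = [1 - sinh(2J / k_B T)^(-4)]^(1/8) below T_c, zero above. The sketch below evaluates both, with J = k_B = 1; the function names are illustrative.

```python
import math


def critical_temperature():
    """Exact T_c of the square-lattice Ising model, in units of J/k_B."""
    return 2.0 / math.log(1.0 + math.sqrt(2.0))


def onsager_magnetisation(T):
    """Spontaneous magnetisation |m| at temperature T (units of J/k_B)."""
    if T >= critical_temperature():
        return 0.0  # no spontaneous magnetisation above T_c
    return (1.0 - math.sinh(2.0 / T) ** -4) ** 0.125


if __name__ == "__main__":
    print(f"T_c = {critical_temperature():.6f}")
    for T in (2.00, 2.269, 3.00):
        print(f"T = {T}: |m| = {onsager_magnetisation(T):.4f}")
```

Note that T = 2.269 lies slightly below the exact T_c, so the formula returns a non-zero magnetisation there. This is worth keeping in mind when comparing against simulation output for Task H.
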
For each (model, task) pair, record:

Metric	How to measure
Wall-clock time	`time.time()` around the agent run
Number of agent steps	Count tool calls + retries + reflection loops
Tokens consumed	Input + output tokens (from the Ollama response or LiteLLM callback)
Correctness score	Task E: 1 if T_c ≈ 2.269 mentioned, 0 otherwise. Task M: 1 if