Preliminary Step

Before starting the project, convert the course notebooks into standalone Python scripts. This is a necessary step for running experiments on the cluster, where Jupyter is impractical and batch submissions require executable .py files.
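
The usual route for this conversion is the `jupyter nbconvert --to script` command. Where that tool is not installed (for example on a cluster login node), the sketch below does a rough equivalent with the standard library only. It assumes the notebooks are ordinary nbformat 4 JSON files. The function name `notebook_to_script` is hypothetical and not part of the course material. Lines that begin with `%` or `!` are commented out on the assumption that they are IPython magics or shell escapes, which will misfire on the rare Python continuation line that starts with `%`.

```python
import json
from pathlib import Path


def notebook_to_script(nb_path, out_path=None):
    """Write the code cells of a .ipynb file to a .py script; return its path."""
    nb_path = Path(nb_path)
    out_path = Path(out_path) if out_path else nb_path.with_suffix(".py")
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores source as a string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        kept = []
        for line in text.splitlines():
            # Assumption: a leading % or ! marks an IPython magic or shell
            # escape, neither of which is valid in a plain Python script.
            if line.lstrip().startswith(("%", "!")):
                kept.append("# " + line)
            else:
                kept.append(line)
        chunks.append("\n".join(kept))

    out_path.write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return out_path
```

Calling `notebook_to_script("some_notebook.ipynb")` writes `some_notebook.py` next to the notebook. The converted script should still be run once by hand, since cells that relied on notebook state or inline plotting may need small fixes.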

Convert the following notebooks into standalone .py scripts:

Goal

Empirically measure how agent performance scales along two axes: model size (vertical scaling) and number of agents (horizontal scaling). Produce quantitative scaling plots and discuss whether agentic systems exhibit predictable scaling behaviour analogous to LLM scaling laws.

Part 1 - Model Size Scaling (single agent)

Run the single-agent Physics Research Assistant (06_full_agent.py) on a fixed set of tasks across 4 model sizes:

| Model | Parameters | Approx. GPU memory |
| --- | --- | --- |
| qwen2.5:1.5b | 1.5B | ~1 GB |
| qwen2.5:3b | 3B | ~2 GB |
| qwen2.5:7b | 7B | ~4.5 GB |
| qwen2.5:14b | 14B | ~9 GB |

Use the same 3 tasks for every model, ordered by difficulty:

Task E (Easy) - knowledge recall:

"What is the critical temperature of the 2D Ising model on a square lattice with nearest-neighbour interactions?"

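The reference value for Task E follows from Onsager's exact result, T_c = 2 / ln(1 + √2) in units of J/k_B. The sketch below computes it and shows one possible way to automate the 0/1 check for this task. The function name `score_task_e` and the tolerance of 0.01 are assumptions made for illustration, not part of the assignment.

```python
import math
import re

# Exact critical temperature of the square-lattice Ising model, in J/k_B.
T_C = 2.0 / math.log(1.0 + math.sqrt(2.0))


def score_task_e(answer, tol=0.01):
    """Return 1 if any decimal number in the answer is within tol of T_c."""
    # Assumption: the model writes T_c as a plain decimal such as 2.269.
    numbers = re.findall(r"\d+\.\d+", answer)
    return int(any(abs(float(n) - T_C) <= tol for n in numbers))


print(round(T_C, 3))  # 2.269
```

A numeric match of this kind is only a first filter. An answer that quotes 2.269 in the wrong context would still pass, so a manual spot check of the recorded answers remains worthwhile.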
Task M (Medium) - tool use + calculation:

"Compute the exact magnetisation |m| at T = 2.0 J/k_B for the 2D Ising model using Onsager's formula. Show your calculation."

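For reference when grading Task M, the exact spontaneous magnetisation of the square-lattice model is |m| = [1 - sinh(2J / (k_B T))^(-4)]^(1/8) below T_c, and zero at or above T_c. The sketch below evaluates it, assuming k_B = 1 so that T is measured in units of J/k_B. The function name `onsager_magnetisation` is hypothetical.

```python
import math


def onsager_magnetisation(T, J=1.0):
    """Exact |m| per spin for the 2D square-lattice Ising model (k_B = 1)."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    t_c = 2.0 * J / math.log(1.0 + math.sqrt(2.0))
    if T >= t_c:
        return 0.0  # no spontaneous magnetisation at or above T_c
    s = math.sinh(2.0 * J / T)
    return (1.0 - s ** -4) ** 0.125


print(round(onsager_magnetisation(2.0), 4))  # 0.9113
```

The same function gives the theoretical predictions needed for the temperature comparison in the hard task, including the zero value in the disordered phase.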
Task H (Hard) - multi-step reasoning:

"Run Monte Carlo simulations of the 2D Ising model at T = 2.00, 2.269, and 3.00 on a 32×32 lattice with 10,000 steps. Compare the simulated magnetisation with the theoretical prediction at each temperature. Flag any inconsistencies and explain their origin."

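The sweep over the four models and three tasks can be sketched as a plain nested loop. The sketch assumes a callable `run_agent(model, prompt)` that runs the agent once and returns its final answer. That callable is hypothetical and stands in for whatever entry point 06_full_agent.py actually exposes. Only wall-clock time is recorded here; step and token counts depend on how the agent reports them.

```python
import time

MODELS = ["qwen2.5:1.5b", "qwen2.5:3b", "qwen2.5:7b", "qwen2.5:14b"]

TASKS = {
    "E": ("What is the critical temperature of the 2D Ising model on a "
          "square lattice with nearest-neighbour interactions?"),
    "M": ("Compute the exact magnetisation |m| at T = 2.0 J/k_B for the 2D "
          "Ising model using Onsager's formula. Show your calculation."),
    "H": ("Run Monte Carlo simulations of the 2D Ising model at T = 2.00, "
          "2.269, and 3.00 on a 32×32 lattice with 10,000 steps. Compare "
          "the simulated magnetisation with the theoretical prediction at "
          "each temperature. Flag any inconsistencies and explain their "
          "origin."),
}


def run_sweep(run_agent, models=MODELS, tasks=TASKS):
    """Run every (model, task) pair once and time each run."""
    records = []
    for model in models:
        for task_id, prompt in tasks.items():
            start = time.time()
            answer = run_agent(model, prompt)  # hypothetical agent wrapper
            elapsed = time.time() - start
            records.append({
                "model": model,
                "task": task_id,
                "wall_time_s": elapsed,
                "answer": answer,
            })
    return records
```

Because the script name starts with a digit, a normal `import 06_full_agent` statement is a syntax error. Loading it with `importlib.import_module("06_full_agent")` is one way around that, as is renaming the file.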
For each (model, task) pair, record:

| Metric | How to measure |
| --- | --- |
| Wall-clock time | time.time() around the agent run |
| Number of agent steps | Count tool calls + retries + reflection loops |
| Tokens consumed | Input + output tokens (from the Ollama response or LiteLLM callback) |
| Correctness score | Task E: 1 if T_c ≈ 2.269 mentioned, 0 otherwise. Task M: 1 if