Before starting the project, convert the course notebooks into standalone Python scripts. This is a necessary step for running experiments on the cluster, where Jupyter is impractical and batch submissions require executable .py files.
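One way to do the conversion is `jupyter nbconvert --to script <notebook>.ipynb`. Where Jupyter is not installed (for example on a cluster login node), the following is a minimal standard-library sketch of the same idea. The function names `notebook_to_source` and `convert` are hypothetical, and the `.ipynb` file names are assumed to mirror the script names. It keeps only code cells and comments out IPython magics and shell escapes, so an indented magic line or a cell that relied on a magic's side effects will still need manual cleanup.

```python
import json
from pathlib import Path


def notebook_to_source(nb: dict) -> str:
    """Concatenate the code cells of a parsed .ipynb document into Python source."""
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are dropped
        src = cell.get("source", "")
        if isinstance(src, list):  # nbformat stores source as a list of lines
            src = "".join(src)
        lines = []
        for line in src.splitlines():
            # IPython magics (%...) and shell escapes (!...) are not valid Python
            if line.lstrip().startswith(("%", "!")):
                line = "# " + line
            lines.append(line)
        if lines:
            chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def convert(nb_path: str, py_path: str) -> None:
    """Read a notebook from disk and write its code cells to a .py file."""
    nb = json.loads(Path(nb_path).read_text(encoding="utf-8"))
    Path(py_path).write_text(notebook_to_source(nb), encoding="utf-8")


if __name__ == "__main__":
    # Assumed notebook names; adjust to the actual course files.
    for stem in ("06_full_agent", "08_multi_agent_team", "09_debate_agent"):
        if Path(f"{stem}.ipynb").exists():
            convert(f"{stem}.ipynb", f"{stem}.py")
```

After converting, run each script once from the command line to confirm it executes without the notebook kernel state it may have depended on.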
Convert the following notebooks into standalone .py scripts:
- 06_full_agent.py: the complete single-agent Physics Research Assistant
- 08_multi_agent_team.py: the 3-agent sequential CrewAI team
- 09_debate_agent.py: the 3-agent Debate crew

Empirically measure how agent performance scales along two axes: model size (vertical scaling) and number of agents (horizontal scaling). Produce quantitative scaling plots and discuss whether agentic systems exhibit predictable scaling behaviour analogous to LLM scaling laws.
Run the single-agent Physics Research Assistant (06_full_agent.py) on a fixed set of tasks across 4 model sizes:
| Model | Parameters | Approx. GPU memory |
|---|---|---|
| qwen2.5:1.5b | 1.5B | ~1 GB |
| qwen2.5:3b | 3B | ~2 GB |
| qwen2.5:7b | 7B | ~4.5 GB |
| qwen2.5:14b | 14B | ~9 GB |
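The sweep over these four models could be organised along the following lines. This is only a sketch: `run_agent` is a hypothetical callable standing in for whatever entry point 06_full_agent.py exposes (assumed here to take a model tag and a prompt and return the answer text), and the record layout is an illustrative choice, not a required format.

```python
import time
from typing import Callable, Dict, List, Sequence

# Model tags taken from the table above.
MODELS = ("qwen2.5:1.5b", "qwen2.5:3b", "qwen2.5:7b", "qwen2.5:14b")


def sweep(
    run_agent: Callable[[str, str], str],  # hypothetical: (model, prompt) -> answer
    tasks: Dict[str, str],                 # task id -> prompt text
    models: Sequence[str] = MODELS,
) -> List[dict]:
    """Run every task on every model and record the wall-clock time of each run."""
    records = []
    for model in models:
        for task_id, prompt in tasks.items():
            start = time.time()
            answer = run_agent(model, prompt)
            elapsed = time.time() - start
            records.append(
                {
                    "model": model,
                    "task": task_id,
                    "wall_time_s": elapsed,
                    "answer": answer,
                }
            )
    return records


if __name__ == "__main__":
    # Dummy agent so the harness can be exercised without a running model server.
    demo = sweep(lambda model, prompt: f"[{model}] {prompt}", {"E": "placeholder prompt"})
    for rec in demo:
        print(rec["model"], rec["task"], f"{rec['wall_time_s']:.4f}s")
```

Keeping the prompts in a single dictionary and passing the same dictionary for every model makes it harder to introduce accidental differences between runs.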
Use the same 3 tasks for every model, ordered by difficulty:
Task E (Easy) - knowledge recall:
"What is the critical temperature of the 2D Ising model on a square lattice with nearest-neighbour interactions?"
Task M (Medium) - tool use + calculation:
"Compute the exact magnetisation |m| at T = 2.0 J/k_B for the 2D Ising model using Onsager's formula. Show your calculation."
Task H (Hard) - multi-step reasoning:
"Run Monte Carlo simulations of the 2D Ising model at T = 2.00, 2.269, and 3.00 on a 32×32 lattice with 10,000 steps. Compare the simulated magnetisation with the theoretical prediction at each temperature. Flag any inconsistencies and explain their origin."
For each (model, task) pair, record:
| Metric | How to measure |
|---|---|
| Wall-clock time | time.time() around the agent run |
| Number of agent steps | Count tool calls + retries + reflection loops |
| Tokens consumed | Input + output tokens (from the Ollama response or LiteLLM callback) |
| Correctness score | Task E: 1 if T_c ≈ 2.269 mentioned, 0 otherwise. Task M: 1 if |