Abstract
We present Ollama-Judge, a Rust-based multi-agent system that simulates courtroom adjudication through a panel of nine Supreme Court justices and twelve jurors, each powered by locally-deployed large language models (LLMs) via Ollama. The framework implements a novel transparent reasoning engine, Super Why, which decomposes judicial decision-making into four analytical levels (Letters, Words, Story, Solution), executed as two combined phases, that together produce a complete audit trail from raw evidence to final verdict.
A tokio-based asynchronous channel architecture enables inner-core communication among agents, supporting a deliberation protocol where justices first produce independent opinions and then revise after reviewing peer reasoning. The system supports both criminal and civil case formats, produces structured verdicts in Markdown or JSON, and leverages Metal GPU acceleration on Apple Silicon through Ollama’s native backend.
We evaluate the framework on two complete synthetic cases—a burglary trial and a negligence suit—demonstrating consistent verdict generation with full reasoning transparency. The implementation requires only six Rust crate dependencies and produces a 4 MB release binary.
I. Introduction
The application of large language models to legal reasoning presents both opportunities and challenges. While LLMs demonstrate impressive capabilities in text understanding and generation, their use in high-stakes domains such as adjudication requires careful consideration of transparency, reproducibility, and structured reasoning [1]. Prior work has explored LLMs for legal document analysis [2], case outcome prediction [3], and argument mining [4], but few systems attempt to simulate the full adjudicative process with multiple interacting agents.
We introduce Ollama-Judge, a multi-agent framework that simulates courtroom adjudication through the following contributions:
The remainder of this paper is organized as follows. Section II describes the system architecture. Section III details the agent design. Section IV presents the communication protocol. Section V discusses implementation. Section VI presents case studies. Section VII concludes.
II. System Architecture
Ollama-Judge follows a phased pipeline architecture where each stage processes the case transcript through a different agentic lens. Fig. 1 illustrates the overall flow.
Fig. 1. Ollama-Judge pipeline architecture showing the four processing phases and inter-agent communication channels.
A. Case Model
Cases are represented as JSON documents containing structured fields for parties, opening testimony, cross-examination transcripts (with witness Q&A and objections), and closing testimony. The framework supports two case types—Criminal (beyond reasonable doubt) and Civil (preponderance of the evidence)—with appropriate burden-of-proof instructions automatically injected into agent prompts.
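The mapping from case type to burden-of-proof instruction can be sketched as follows. This is a minimal illustration, not the framework's actual code: the names `CaseType`, `burden_of_proof`, and `inject_burden`, and the wording of the injected instruction, are assumptions made for the example.

```rust
/// The two case formats described in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CaseType {
    Criminal,
    Civil,
}

impl CaseType {
    /// Standard of proof associated with each case type.
    pub fn burden_of_proof(self) -> &'static str {
        match self {
            CaseType::Criminal => "beyond a reasonable doubt",
            CaseType::Civil => "a preponderance of the evidence",
        }
    }
}

/// Hypothetical helper: prepend the burden-of-proof instruction to an agent prompt.
/// The instruction wording is illustrative only.
pub fn inject_burden(case_type: CaseType, base_prompt: &str) -> String {
    format!(
        "Apply the following standard of proof: {}.\n\n{}",
        case_type.burden_of_proof(),
        base_prompt
    )
}
```

Keeping the standard in one place means every justice and juror prompt for a given case receives the same instruction text.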
B. LLM Backend
All LLM inference is performed through the Ollama REST API [5], which runs locally and manages model lifecycle. The default model is Llama 3.2 (8B parameters), though any Ollama-compatible model may be substituted. Metal GPU acceleration is provided transparently by Ollama on Apple Silicon hardware; no additional GPU compute crates are required in the Rust binary.
C. Throttling and Resource Management
To prevent resource exhaustion on consumer hardware, the system employs a tokio-based semaphore that limits concurrent LLM requests. Default settings restrict to three simultaneous calls with a 200 ms inter-request delay. These parameters are user-configurable and recommended ranges are provided for different hardware profiles.
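The throttling idea can be illustrated with a small counting semaphore. To keep the sketch free of external crates it uses threads and `std::sync` primitives as a stand-in for the tokio semaphore named above; the type `Throttle`, the helper `peak_concurrency`, and the choice to apply the delay before releasing the permit are all assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Counting semaphore plus a fixed delay (std-only stand-in for tokio::sync::Semaphore).
pub struct Throttle {
    permits: Mutex<usize>,
    available: Condvar,
    delay: Duration,
}

impl Throttle {
    pub fn new(max_concurrent: usize, delay_ms: u64) -> Arc<Self> {
        Arc::new(Self {
            permits: Mutex::new(max_concurrent),
            available: Condvar::new(),
            delay: Duration::from_millis(delay_ms),
        })
    }

    /// Run `request` while holding one permit, then pause for the configured delay.
    pub fn run<T>(&self, request: impl FnOnce() -> T) -> T {
        {
            let mut permits = self.permits.lock().unwrap();
            while *permits == 0 {
                permits = self.available.wait(permits).unwrap();
            }
            *permits -= 1;
        } // the mutex is released here; the permit itself stays held
        let result = request();
        thread::sleep(self.delay); // inter-request delay (assumed to precede release)
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
        result
    }
}

/// Run `n` dummy requests on separate threads and report the peak concurrency observed.
pub fn peak_concurrency(throttle: Arc<Throttle>, n: usize) -> usize {
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let (t, a, p) = (throttle.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                t.run(|| {
                    let now = a.fetch_add(1, Ordering::SeqCst) + 1;
                    p.fetch_max(now, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(5)); // placeholder for an LLM call
                    a.fetch_sub(1, Ordering::SeqCst);
                })
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With `Throttle::new(3, 200)` the sketch mirrors the default settings: at most three requests are in flight, and each permit is held for an extra 200 ms after its request finishes.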
III. Agent Design
A. Supreme Court Justices
Nine justice agents are instantiated, each assigned a distinct judicial ideology drawn from a linear spectrum ranging from strict constructionist (Justices 1–3) through moderate pragmatist (4–6) to broad interpreter (7–9). This ideological diversity is injected via system prompts that describe each justice’s interpretive framework, producing varied legal reasoning even when analyzing identical evidence.
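The assignment of ideologies to justice indices could look like the sketch below. The enum `Ideology`, the functions `ideology_for` and `system_prompt`, and the prompt wording are hypothetical; only the three bands (1–3, 4–6, 7–9) come from the description above.

```rust
/// The three bands of the ideology spectrum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ideology {
    StrictConstructionist,
    ModeratePragmatist,
    BroadInterpreter,
}

/// Map a justice index (1 through 9) to its band; other indices yield `None`.
pub fn ideology_for(justice: u8) -> Option<Ideology> {
    match justice {
        1..=3 => Some(Ideology::StrictConstructionist),
        4..=6 => Some(Ideology::ModeratePragmatist),
        7..=9 => Some(Ideology::BroadInterpreter),
        _ => None,
    }
}

/// Hypothetical system prompt; the real prompts are not given in the text.
pub fn system_prompt(justice: u8) -> Option<String> {
    let framework = match ideology_for(justice)? {
        Ideology::StrictConstructionist => "strict constructionist",
        Ideology::ModeratePragmatist => "moderate pragmatist",
        Ideology::BroadInterpreter => "broad interpreter",
    };
    Some(format!(
        "You are Justice {justice}. Your interpretive framework is that of a {framework}."
    ))
}
```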
Each justice follows a two-phase workflow:
- Phase 1 (independent opinion): The justice receives the full case transcript and produces a complete legal opinion including analysis of key facts and evidence, application of relevant legal standards, a binary verdict, and a confidence score in the range [0, 1].
- Phase 2 (deliberation): Each justice receives a summary of peer votes and confidence scores, then produces a final opinion, potentially revising their position. This mimics the Supreme Court conference procedure.
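One way to represent a preliminary opinion and the peer summary used in the second phase is sketched below. The struct `Opinion`, its field names, the encoding of the binary verdict as a `bool`, and the summary format are assumptions; the text specifies only a binary verdict and a confidence score in [0, 1].

```rust
/// A justice's vote: a binary verdict plus a confidence score in [0, 1].
#[derive(Debug, Clone, PartialEq)]
pub struct Opinion {
    pub justice: u8,
    /// Assumed encoding: `true` means the justice finds for the prosecution or plaintiff.
    pub verdict: bool,
    pub confidence: f64,
}

impl Opinion {
    /// Reject confidence values outside [0, 1] (NaN is rejected as well).
    pub fn new(justice: u8, verdict: bool, confidence: f64) -> Result<Self, String> {
        if (0.0..=1.0).contains(&confidence) {
            Ok(Self { justice, verdict, confidence })
        } else {
            Err(format!("confidence {confidence} is outside [0, 1]"))
        }
    }
}

/// Build the summary of peer votes shown to `viewer`, omitting the viewer's own opinion.
pub fn peer_summary(viewer: u8, prelim: &[Opinion]) -> String {
    prelim
        .iter()
        .filter(|o| o.justice != viewer)
        .map(|o| {
            format!(
                "Justice {}: {} (confidence {:.2})",
                o.justice,
                if o.verdict { "for" } else { "against" },
                o.confidence
            )
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Validating the score at construction time keeps malformed model output from reaching the deliberation prompts.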
[Figure: Judicial ideology spectrum, Justices 1–9]
B. Jury Panel
Twelve juror agents operate independently, each receiving the full case transcript and producing a factual verdict. Jurors are instructed on the appropriate burden of proof—beyond a reasonable doubt for criminal cases, preponderance of the evidence for civil cases—and do not participate in the justice deliberation phase. This separation of legal and factual determination mirrors the distinction between judge and jury in common law systems.
C. Super Why Reasoning Engine
The Super Why engine provides transparent, step-by-step reasoning by decomposing the decision process into two combined phases that together cover four analytical levels:
- Letters and Words: Extract every individual fact and evidence item; map each to relevant legal standards and burden of proof.
- Story and Solution: Reconstruct the event narrative; identify contradictions; apply law to facts; produce final decision.
The system makes two sequential LLM calls—reduced from four in earlier versions—with each call building on the output of the previous one. This produces a complete reasoning chain where each step is independently verifiable, addressing the “black box” critique of LLM-based decision systems.
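The two-call chaining can be sketched with the LLM abstracted as a closure, so that the example runs without an Ollama server. The function `super_why`, the struct `ReasoningTrail`, and the prompt wording are hypothetical; the point illustrated is only that the second call consumes the first call's output and that both outputs are retained for the audit trail.

```rust
/// Both intermediate outputs are kept so that each step can be inspected later.
pub struct ReasoningTrail {
    pub letters_words: String,
    pub story_solution: String,
}

/// Chain two LLM calls; `llm` is any function from a prompt to a completion.
pub fn super_why<F>(transcript: &str, mut llm: F) -> ReasoningTrail
where
    F: FnMut(&str) -> String,
{
    // Call 1 (Letters and Words): fact extraction and mapping to legal standards.
    let prompt1 = format!(
        "Extract every fact and evidence item and map each to the relevant \
         legal standard and burden of proof.\n\nTRANSCRIPT:\n{transcript}"
    );
    let letters_words = llm(&prompt1);

    // Call 2 (Story and Solution): builds on the output of call 1.
    let prompt2 = format!(
        "Using the extracted facts below, reconstruct the narrative, identify \
         contradictions, apply law to facts, and give a final decision.\n\n\
         FACTS:\n{letters_words}"
    );
    let story_solution = llm(&prompt2);

    ReasoningTrail { letters_words, story_solution }
}
```

In a real run the closure would wrap an HTTP request to the Ollama endpoint; passing a stub closure instead makes the chaining logic testable offline.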
IV. Inner-Core Communication Protocol
Agent communication is implemented using Rust’s async channel primitives from the tokio crate [6]. The protocol employs two channel types:
- Multi-producer, single-consumer (mpsc): Used for collecting preliminary opinions from nine justices and final votes from both justices and jurors. Each agent sends its opinion through a shared channel endpoint.
- One-to-all distribution (broadcast): Used for the deliberation phase, where collected preliminary opinions must be delivered to every justice simultaneously via a single broadcast channel.
The protocol guarantees clean termination without deadlocks: the orchestrator drops its own sender endpoints after spawning all agents, allowing each receiver to terminate naturally once all agents have delivered their results. Because agents run concurrently, results arrive in completion order rather than in justice order.
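The sender-drop termination pattern described above can be demonstrated in a few lines. This sketch uses `std::sync::mpsc` and OS threads as a dependency-free stand-in for tokio's channels and tasks; the closing behavior is analogous, in that the receiver stops once every sender clone has been dropped. The function name `collect_opinions` and the message format are illustrative.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn `n_agents` workers, collect one message from each, and return them sorted by id.
pub fn collect_opinions(n_agents: usize) -> Vec<(usize, String)> {
    let (tx, rx) = mpsc::channel();

    for id in 1..=n_agents {
        let tx = tx.clone(); // each agent owns its own sender clone
        thread::spawn(move || {
            // Placeholder for the agent's LLM call.
            tx.send((id, format!("opinion from agent {id}")))
                .expect("receiver should still be alive");
        });
    }

    // Drop the orchestrator's own sender. Without this, the channel never closes
    // and the collection loop below would block forever.
    drop(tx);

    // The iterator ends once every agent has sent its message and dropped its sender.
    let mut opinions: Vec<(usize, String)> = rx.iter().collect();

    // Sort by agent id so the result is deterministic regardless of arrival order.
    opinions.sort_by_key(|(id, _)| *id);
    opinions
}
```

Calling `collect_opinions(9)` models the preliminary round: nine messages are gathered and the function returns without needing an explicit count of expected results.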
procedure Deliberation(case, justices, client)
    prelim ← ∅
    (tx_prelim, rx_prelim) ← mpsc(9)
    (tx_final, rx_final) ← mpsc(9)
    tx_broadcast ← broadcast()
    for each j ∈ justices
        spawn analyze(j, case, client, clone(tx_prelim))
    drop(tx_prelim) // Orchestrator releases its own sender
    // Collect all preliminary opinions
    for i ← 1 to 9
        prelim ← prelim ∪ recv(rx_prelim)
    // Subscribe every justice before broadcasting
    for each j ∈ justices
        spawn deliberate(j, case, client, subscribe(tx_broadcast), clone(tx_final))
    send(tx_broadcast, prelim) // Broadcast for deliberation
    drop(tx_final)
    return collect(rx_final)
V. Implementation
Ollama-Judge is implemented in Rust (edition 2021) and compiles with Rust 1.75+. The implementation emphasizes minimal dependencies and small binary size.
| Crate | Purpose |
|---|---|
| tokio | Async runtime, I/O, channels, semaphore |
| reqwest | HTTP client for Ollama REST API |
| serde / serde_json | Case deserialization, verdict serialization |
| clap | CLI argument parsing |
| thiserror | Error type derivation |
Table 2. Direct Rust dependencies.
The release binary is 4 MB and has zero runtime dependencies beyond the Ollama HTTP endpoint. The total source code is approximately 780 lines across 17 source files.
A. Metal GPU Acceleration
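On Apple Silicon hardware, Ollama automatically uses the Metal Performance Shaders framework for GPU-accelerated inference. The Rust application checks the target platform at compile time using the cfg!(target_os = "macos") macro and reports the resulting acceleration status at startup. Because this check reflects the build target rather than a runtime probe of the GPU, it serves as a proxy for Metal availability, not a guarantee.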
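A compile-time platform check of this kind might look like the following sketch. The function name `acceleration_status` and the message strings are hypothetical, and the additional `target_arch = "aarch64"` condition is an assumption added here to separate Apple Silicon from Intel Macs; the text mentions only the `target_os` check.

```rust
/// Report the acceleration status implied by the compilation target.
/// `cfg!` is evaluated at compile time, so no GPU is probed at runtime.
pub fn acceleration_status() -> &'static str {
    if cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        "Metal acceleration expected (macOS on Apple Silicon)"
    } else if cfg!(target_os = "macos") {
        "macOS on a non-ARM target; Metal support depends on the GPU and the Ollama build"
    } else {
        "non-macOS target; acceleration, if any, is determined by Ollama"
    }
}
```

Since the branch is fixed when the binary is built, a binary cross-compiled for macOS reports the macOS message wherever it was compiled.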
B. Throttle Configuration
The system provides two resource management parameters:
| Flag | Default | Description |
|---|---|---|
| --throttle | 3 | Maximum concurrent LLM requests |
| --delay-ms | 200 | Inter-request delay in milliseconds |
VI. Case Studies
We evaluated Ollama-Judge on two synthetic cases designed to test different aspects of legal reasoning. Both cases include complete opening testimony, cross-examination (6–8 witness sessions with Q&A), and closing testimony.
The defendant is charged with second-degree burglary. The prosecution’s case rests on circumstantial evidence: proximity to the scene, matching shoe prints, and possession of consistent merchandise.
- Proximity versus proof
- Identification reliability (150 ft, night, rain)
- Shoe print consistency vs. exact match
The plaintiff alleges negligence after a semi-truck sideswiped her vehicle on I-80 during rainy conditions. The defense argues an unavoidable accident from a sudden rain squall.
- Causation versus weather
- Pre-existing conditions
- FMCSA lane-change regulations
VII. Results and Discussion
Each trial generates a complete verdict document containing:
The system consistently produces structured verdicts with all required components. The deliberation phase typically produces some vote changes as justices respond to peer reasoning, though the majority opinion remains stable across rounds.
A. Performance
With default throttle settings (3 concurrent requests, 200 ms delay), a full trial with 9 justices (2 rounds) and 12 jurors requires 30 LLM calls plus 2 Super Why calls. On an Apple M2 Max with 64 GB RAM, a complete trial completes in approximately 8–12 minutes.
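The call count follows directly from the panel sizes: 9 justices × 2 rounds + 12 jurors = 30 agent calls, plus 2 Super Why calls for 32 in total. A small helper makes the arithmetic explicit; the function name `total_llm_calls` is illustrative.

```rust
/// Total LLM calls for one trial: each justice is called once per round,
/// each juror once, plus the reasoning-engine calls.
pub fn total_llm_calls(justices: u32, rounds: u32, jurors: u32, super_why_calls: u32) -> u32 {
    justices * rounds + jurors + super_why_calls
}
```

Here `total_llm_calls(9, 2, 12, 2)` evaluates to 32, which is useful when estimating how runtime scales with a different panel size or number of deliberation rounds.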
B. Limitations
VIII. Conclusion and Future Work
We presented Ollama-Judge, a multi-agent framework for simulated courtroom adjudication that combines nine justice agents, twelve juror agents, and a transparent reasoning engine into a cohesive pipeline. The implementation demonstrates that complex multi-agent legal reasoning can be achieved with minimal dependencies and a small binary footprint.
Future work includes:
References
[1] R. Bommasani et al., “On the opportunities and risks of foundation models,” arXiv:2108.07258, 2021.
[2] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos, “LEGAL-BERT: The muppets straight out of law school,” in Proc. EMNLP, 2020.
[3] M. Medvedeva, M. Vols, and M. Wieling, “Using machine learning to predict decisions of the European Court of Human Rights,” Artificial Intelligence and Law, vol. 28, no. 2, 2020.
[4] I. Habernal and I. Gurevych, “Argumentation mining in user-generated web discourse,” Computational Linguistics, vol. 43, no. 1, 2017.
[5] Ollama, “Ollama: Get up and running with large language models locally,” 2024. https://ollama.com
[6] Tokio Contributors, “Tokio: An asynchronous runtime for the Rust programming language,” 2024. https://tokio.rs
[7] A. Vaswani et al., “Attention is all you need,” in Proc. NeurIPS, 2017.
[8] T. Brown et al., “Language models are few-shot learners,” in Proc. NeurIPS, 2020.
[9] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Proc. NeurIPS, 2022.
[10] X. Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in Proc. ICLR, 2023.
