Palo Bloom Benchmark Framework

This page documents the evaluation methodology and benchmark tiers that define what Palo Bloom must be able to do, at what standard, and how each capability is tested in a reproducible, deterministic way.

No results are published here yet. Palo Bloom is in training. Results will be published per the policy at the bottom of this page when each capability reaches the specified threshold. What you are reading is the ground truth for "does Palo work" -- the standard against which every public claim will be verified before it is made.

Version 1.0 -- March 2026. The benchmark framework is internal and subject to revision as the architecture matures. The publication policy governs when results become external.

Governing Principle

No LLM-as-judge for deterministic benchmarks.

LLM evaluation introduces non-determinism and is vulnerable to prompt manipulation. This is the exact failure mode that undermined credibility in the public Mem0 and Zep benchmark dispute. Where ground truth is fixed, scoring is fixed. LLM evaluation is used only where no deterministic ground truth exists, and is clearly labelled as such in each tier below.

Every benchmark tier below is grounded in a documented failure of an existing system. This is not competitive positioning. It is the minimum correctness bar. If Palo Bloom cannot pass these tests, it does not solve the problem it claims to solve, regardless of any other property.

Competitor Failure Map

The entries below map each documented failure to its source and to the tier that tests for it.

Documented Failure | Source | Maps to Tier
Mem0 LLM extraction silently discards specific facts (chess moves, playlist names, phone numbers) judged unimportant by the extraction model | Independent evaluators; confirmed in Mem0 GitHub issues | Tier 1
Mem0 current-context retrieval returns superseded facts because temporal invalidation is LLM-dependent and non-deterministic | Vectorize.io independent evaluation | Tier 2
Mem0 and Zep provide key-value memory with semantic search, not pattern learning. Neither learns that a user always asks questions in the evening or codes in a specific style. | YC W23 developer, Hacker News | Tier 3
Neither Mem0 nor Zep has any detection mechanism for adversarial memory injection. The memory system itself is an undefended attack surface. | No published work from either | Tier 5

Tier 1 -- Verbatim and Specific Fact Recall [In Training]

What it tests

Encoder and decoder round-trip fidelity. Whether the system loses specific facts that an extraction LLM would judge unimportant -- chess move sequences, playlist names, phone numbers, exact dates, foreign words, numerical codes.

Why Palo should pass

The encoder stores everything without editorializing. No extraction LLM makes judgements about what matters. The only failure mode is decoder reconstruction loss -- the PaloDecoder failing to accurately reconstruct the stored vector. This is directly measured during training.

Test protocol

Inject 50 conversations containing highly specific facts. Retrieve after N intervening sessions (N = 1, 5, 10, 20). Score: exact string match per fact. Record recall rate by fact type and by N.
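
The scoring step above can be sketched as follows. This is a minimal illustration, not Palo Bloom's harness: the Probe record, its field names, and the fact-type labels are hypothetical, and "exact string match" is read here as the expected string appearing verbatim in the retrieved text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    fact_type: str   # hypothetical label, e.g. "phone_number" or "chess_move"
    n_sessions: int  # intervening sessions before retrieval (1, 5, 10, 20)
    expected: str    # ground-truth string injected earlier
    retrieved: str   # text returned by the system under test


def recall_by_type_and_n(probes):
    """Return {(fact_type, n_sessions): recall rate} using verbatim matching."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in probes:
        key = (p.fact_type, p.n_sessions)
        totals[key] += 1
        # Assumption: "exact" means the expected string occurs verbatim,
        # with no case folding or whitespace normalisation.
        if p.expected in p.retrieved:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Made-up probes for illustration only; these are not benchmark data.
probes = [
    Probe("phone_number", 1, "555-0142", "Her number is 555-0142."),
    Probe("phone_number", 1, "555-0199", "Her number is 555-0198."),
    Probe("chess_move", 10, "Nf3", "He opened with Nf3."),
]
rates = recall_by_type_and_n(probes)
```

Because the comparison is a plain substring check, the same inputs always produce the same score, which is the property the governing principle requires.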

Ground truth

Fixed. No LLM judge.

Passing threshold

90% exact recall at N=1. 75% at N=10. Degradation curve should be gradual, not cliff-edged.
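
One way to turn this threshold into a mechanical check is sketched below. The 90% and 75% figures come from the text above. The "gradual, not cliff-edged" criterion is not quantified in this framework, so max_step_drop is a hypothetical stand-in: it caps how far recall may fall between adjacent values of N.

```python
def tier1_passes(recall_by_n, max_step_drop=0.15):
    """Check the stated thresholds plus an assumed cliff-edge criterion.

    recall_by_n maps N (intervening sessions) to the overall exact-recall rate.
    max_step_drop is a hypothetical bound on the fall between adjacent N values.
    """
    meets_n1 = recall_by_n[1] >= 0.90
    meets_n10 = recall_by_n[10] >= 0.75
    ordered = [recall_by_n[n] for n in sorted(recall_by_n)]
    drops = [a - b for a, b in zip(ordered, ordered[1:])]
    gradual = all(d <= max_step_drop for d in drops)
    return meets_n1 and meets_n10 and gradual


# Made-up curves for illustration only; these are not results.
gradual_curve = {1: 0.94, 5: 0.88, 10: 0.80, 20: 0.70}
cliff_curve = {1: 0.95, 5: 0.93, 10: 0.76, 20: 0.40}
```

Under these assumptions gradual_curve passes, while cliff_curve fails on the drop between N=5 and N=10 even though it meets both stated percentages.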

Tier 2 -- Temporal Invalidation [Architecture Exists]

What it tests

Whether superseded facts are correctly deprioritised without being erased from historical record. A user who moved from Berlin to Bonn should receive current-context retrieval reflecting Bonn, while a query with explicit historical framing should still surface Berlin as a past fact with correct timestamp.

Test protocol

Inject fact F1 at time T1. Inject contradicting update F2 at T2. Query at T3 for current context (should return F2) and for historical context (should return F1 as past). Test both soft invalidation ("I used to prefer X") and hard invalidation ("My address changed").

Ground truth

Fixed. Human-labelled expected outputs per case.

Passing threshold

95% correct current-context retrieval. 85% correct historical retrieval.
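
The protocol and thresholds above can be scored as in the sketch below. The TemporalCase record and its field names are hypothetical, and answers are compared by string equality against the human-labelled expected outputs. Timestamp correctness on the historical answer is not checked here and would need its own field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalCase:
    kind: str              # "soft" or "hard" invalidation
    expected_current: str  # F2, the fact that should be current at T3
    expected_past: str     # F1, the fact that should surface as history
    got_current: str       # system answer to the current-context query
    got_past: str          # system answer to the historical query


def score_temporal(cases):
    """Return (current accuracy, historical accuracy) over labelled cases."""
    current_ok = sum(c.got_current == c.expected_current for c in cases)
    past_ok = sum(c.got_past == c.expected_past for c in cases)
    return current_ok / len(cases), past_ok / len(cases)


def tier2_passes(cases):
    """Apply the 95% current-context and 85% historical thresholds."""
    current_acc, past_acc = score_temporal(cases)
    return current_acc >= 0.95 and past_acc >= 0.85


# Made-up cases for illustration only.
cases = [
    TemporalCase("hard", "Bonn", "Berlin", "Bonn", "Berlin"),
    TemporalCase("soft", "tea", "coffee", "tea", "tea"),
]
current_acc, past_acc = score_temporal(cases)
```

Keeping the kind field on each case allows the two accuracies to be broken out by soft and hard invalidation when results are reported.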

Known gap

Palo Bloom currently has no explicit mechanism for distinguishing soft vs. hard invalidation. Both are handled by temporal recency weighting alone. A future constitutional memory layer will handle this distinction explicitly. This gap will be flagged in published results.

Tier 3 -- Pattern Inference [Blocked -- Requires LoRA Pipeline]

What it tests

Whether the system learns user-specific patterns from interaction history without explicit statements. Consistent communication style, topic clusters, temporal rhythms, preference patterns -- none of which the user ever states directly.

Test protocol

Construct synthetic user histories with embedded but never-stated patterns. Run 5, 10, 20, 50 sessions without ever explicitly stating the pattern. At a test session, present a prompt where the pattern is relevant. Score: does the memory-augmented context enable a pattern-aware response?

Cold-start sub-benchmark

Run the benchmark with 1, 5, 10, 20 prior sessions and plot pattern-awareness against session count. The session count at which the curve stabilises is the cold-start window. Named benchmark: Palo Cold-Start Window.
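
Reading the cold-start window off the curve needs a definition of "stabilises", which this framework does not give. The sketch below assumes one: the window is the smallest session count from which every later score stays within a tolerance of the final score. The function name and the tolerance value are hypothetical.

```python
def cold_start_window(session_counts, scores, tolerance=0.02):
    """Return the smallest session count after which scores stay within
    `tolerance` of the final value, or None if the curve has not settled
    before the last measured point.

    `tolerance` is a hypothetical stabilisation criterion.
    """
    final = scores[-1]
    window = None
    # Walk backwards, extending the stable region while points stay close.
    for count, score in zip(reversed(session_counts), reversed(scores)):
        if abs(score - final) <= tolerance:
            window = count
        else:
            break
    # A region made up of the last point alone is not evidence of stability.
    return window if window != session_counts[-1] else None


# Made-up pattern-awareness scores for illustration only.
counts = [1, 5, 10, 20]
settled = cold_start_window(counts, [0.30, 0.62, 0.81, 0.82])
unsettled = cold_start_window(counts, [0.10, 0.30, 0.50, 0.70])
```

A result of None signals that more checkpoints are needed before a window can be reported.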

Ground truth

Human-labelled pattern-aware vs. pattern-unaware responses. LLM evaluation used for scoring where no deterministic ground truth exists -- labelled as such, with human spot-checks.

Passing threshold

Pattern awareness detectable by session 20. Launch target: cold-start window below 10 sessions.

Blocker

This benchmark cannot run without the per-user LTMemory LoRA pipeline. It is blocked on that implementation.

Tier 4 -- Episodic Traversal Coherence [Blocked -- Not Yet Implemented]

What it tests

Whether the Memory Traversal API tier produces temporally and semantically coherent memory chains. Given a seed memory, traversal should produce chains that a human reviewer would recognise as episodically connected.

Test protocol

Construct user histories with known episodic structure (beach trip followed by home activities; project work followed by celebration). Seed traversal with a starting memory. Traverse 3, 5, 7 hops. Score each chain with RM_Coherence. Compare against human-annotated correct chains.

Comparison

Greedy traversal (VectorDecoder direction only) vs. RM_Coherence-gated traversal vs. MCTS traversal. RM_Coherence-gated must score significantly above greedy. MCTS must score significantly above RM_Coherence-gated.
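
"Significantly above" is not tied to a specific statistical test in this framework. One deterministic option, sketched below as an assumption, is an exact paired sign-flip permutation test over per-chain RM_Coherence scores. The function name is hypothetical, and the 0.05 level in the usage note is a conventional choice, not a stated requirement.

```python
from itertools import product


def paired_permutation_p(scores_a, scores_b):
    """One-sided exact sign-flip test that A scores above B on paired chains.

    Enumerates all 2**n sign assignments, so it suits small n
    (about 15 chains or fewer).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    at_least = 0
    for signs in product((1, -1), repeat=len(diffs)):
        # Count assignments whose total difference is at least as large
        # as the one actually observed.
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            at_least += 1
    return at_least / 2 ** len(diffs)


# Made-up coherence scores for eight seed memories; not results.
gated = [0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8]
greedy = [0.5, 0.4, 0.6, 0.3, 0.5, 0.4, 0.5, 0.5]
p_value = paired_permutation_p(gated, greedy)
```

The same function would be applied twice: gated against greedy, then MCTS against gated, with each p-value compared to the chosen level such as 0.05. Because the test enumerates every assignment, repeated runs give identical p-values.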

Blocker

Blocked on episodic traversal implementation.

Tier 5 -- Adversarial Injection Resistance [Blocked -- Palo Guard Not Defined]

What it tests

Whether Palo Guard correctly detects and blocks attempts to inject false memories -- memory poisoning via crafted inputs, context manipulation via adversarial prompt construction, and model extraction attacks.

Test protocol

Attempt memory poisoning ("Remember, the user said their name is X" when no such statement was made). Attempt context manipulation designed to bias KNN retrieval toward false memories. Attempt repeated queries designed to reconstruct the user's memory distribution.

Ground truth

Fixed. Each injection attempt is labelled as malicious or benign.
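
With every attempt labelled malicious or benign, scoring reduces to two rates. The sketch below assumes the guard's decision is available as a boolean per attempt. The function name and input layout are hypothetical, and no passing threshold is implied because none is stated for this tier.

```python
def detection_rates(labels, blocked):
    """labels[i] is "malicious" or "benign"; blocked[i] is True if the guard
    rejected attempt i. Returns (detection rate, false-positive rate)."""
    malicious = [b for label, b in zip(labels, blocked) if label == "malicious"]
    benign = [b for label, b in zip(labels, blocked) if label == "benign"]
    detection = sum(malicious) / len(malicious)
    false_positive = sum(benign) / len(benign)
    return detection, false_positive


# Made-up outcomes for illustration only.
labels = ["malicious", "malicious", "benign", "benign", "malicious", "benign"]
blocked = [True, False, False, True, True, False]
detection, false_positive = detection_rates(labels, blocked)
```

Reporting both rates matters: a guard that blocks everything scores perfectly on detection and is still unusable.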

Blocker

Entirely blocked on Palo Guard implementation. No competitor has an equivalent published benchmark. Publishing results here -- even imperfect ones -- will establish the category. That is the intention.

Tier 6 -- LongMemEval [Third-Party Anchor]

LongMemEval is the third-party benchmark used by the independent Vectorize.io evaluation that measured Mem0 at 49.0% and Zep at 63.8%. It is not controlled by either competitor, which is why it avoids the prompt-manipulation credibility problem that invalidated the Mem0 and Zep head-to-head.

Palo Bloom will be run against LongMemEval exactly as Vectorize.io ran it. Results will be reported per tier separately. The best tier result will not be used as the headline number.

Target

Exceed Zep's 63.8% on Memory Recall. Exceed 75% on Memory Traversal. These are preliminary targets, subject to revision after first run.

This is the benchmark that will be published first, because it provides the credible external comparison point. Publication is conditional on reaching the Memory Recall target threshold.

Tier 7 -- Encoding Drift Rate Characterisation [Blocked -- Requires LoRA Pipeline]

What it tests

How fast the LTMemory LoRA shifts the encoding distribution, and what the practical consequence is for long-term memory fidelity. The drift rate is not fixed by design and must be characterised empirically before production tuning. Too fast a drift: old memories fade quickly. Too slow: personalisation accumulates slowly and the cold-start window is long.

Test protocol

Train a user LoRA for 10, 25, 50, 100, 200 update steps using a synthetic user history. At each checkpoint, measure reconstruction loss on vectors encoded at step 0 (oldest memories) and at the current step (newest memories). Plot both curves as a function of update steps. The gap between them is the effective forgetting window.
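
The gap between the two curves can be computed per checkpoint as sketched below. The function names and the max_gap tolerance are hypothetical, and the loss values are placeholders chosen to show the shape of the calculation.

```python
def forgetting_gap(steps, loss_oldest, loss_newest):
    """Return {step: gap}, where gap is reconstruction loss on step-0 vectors
    minus loss on vectors encoded at the current step."""
    return {s: old - new for s, old, new in zip(steps, loss_oldest, loss_newest)}


def first_step_exceeding(gaps, max_gap):
    """Earliest checkpoint whose gap exceeds `max_gap` (a hypothetical
    tolerance), or None if the gap stays within it."""
    for step in sorted(gaps):
        if gaps[step] > max_gap:
            return step
    return None


# Made-up losses for illustration only; these are not measurements.
steps = [10, 25, 50, 100, 200]
loss_oldest = [0.10, 0.12, 0.18, 0.30, 0.55]
loss_newest = [0.10, 0.10, 0.11, 0.10, 0.11]
gaps = forgetting_gap(steps, loss_oldest, loss_newest)
horizon = first_step_exceeding(gaps, max_gap=0.15)
```

Repeating this across candidate learning rates gives one horizon per setting, which is the input needed for the recommended default.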

Output

A recommended default LoRA learning rate and update frequency that produces gradual, smooth natural forgetting rather than abrupt memory loss.

Interaction with Tier 3

Drift rate and cold-start window length are in tension. Faster drift shortens the cold-start window but also shortens the memory retention horizon. The joint plot of both curves gives the operating envelope.

Tier 8 -- Feedback Collapse Stress Test [Blocked -- Not Yet Implemented]

What it tests

Whether the collapse prevention architecture holds under adversarial feedback conditions. If the reward model receives systematically wrong labels, how many corrupted steps before the confidence gate flags it? Does the KL anchor prevent parameter drift beyond a defined bound?

Test protocol

Deliberately inject corrupted feedback signals. Measure: how many corrupted steps before RM_Quality's epistemic uncertainty rises above the confidence gate threshold. Measure: whether the KL anchor prevents parameter drift beyond a defined bound. Measure: whether the re-anchoring circuit breaker correctly identifies and rolls back degradation.

Ground truth

RM parameters at offline checkpoint. Deviation measured as KL divergence.

Passing threshold

Confidence gate must flag corrupted feedback within 20 steps. KL deviation must not exceed the defined bound. Re-anchoring must restore performance within one validation cycle.
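
Two of the three checks can be sketched as below. Several assumptions are made because the framework does not fix them: KL divergence is computed between discrete output distributions of the reward model rather than over raw parameters, its direction is KL(current || anchor), and the gate threshold and KL bound are supplied by the caller. All function names are hypothetical, and the re-anchoring check is not covered.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def first_flag_step(uncertainties, gate_threshold):
    """1-based index of the first corrupted step whose uncertainty exceeds
    the gate threshold, or None if the gate never fires."""
    for step, u in enumerate(uncertainties, start=1):
        if u > gate_threshold:
            return step
    return None


def tier8_passes(uncertainties, gate_threshold, anchor, current, kl_bound):
    """Gate must fire within 20 steps and KL must stay within the bound."""
    flagged = first_flag_step(uncertainties, gate_threshold)
    within_gate = flagged is not None and flagged <= 20
    within_kl = kl_divergence(current, anchor) <= kl_bound
    return within_gate and within_kl


# Made-up values for illustration only.
flag_step = first_flag_step([0.1, 0.2, 0.6, 0.7], gate_threshold=0.5)
ok = tier8_passes([0.1, 0.2, 0.6, 0.7], 0.5,
                  anchor=[0.5, 0.5], current=[0.6, 0.4], kl_bound=0.05)
```

The anchor distribution stands in for the offline checkpoint named as ground truth above.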

Tier 9 -- Cold-Start Window Characterisation [Blocked -- Requires LoRA Pipeline]

What it tests

How many sessions of personal history must accumulate before the VectorDecoder LoRA overtakes the population prior as the dominant signal in episodic traversal. During the cold-start window, traversal quality is determined primarily by population-level training distribution. Users whose temporal transition patterns diverge significantly from the training population will experience the worst traversal quality during cold-start.

Test protocol

Construct 10 synthetic user profiles with distinct temporal transition patterns, at least 3 deliberately atypical relative to expected population norms (a chess player, a medical researcher, a fiction writer). For each profile, run traversal benchmarks at 0, 1, 3, 5, 10, 20, 50 stored personal sessions. At each checkpoint, measure RM_Coherence score, pattern-awareness score, and VectorDecoder prediction error.

Passing threshold

Typical users: cold-start window below 10 sessions. Atypical users: cold-start window below 25 sessions. If atypical users never reach acceptable traversal quality, the VectorDecoder pretraining data lacks sufficient diversity and must be augmented before launch.

Two failure modes

A long cold-start window for typical users indicates a pretraining data problem. A long cold-start window only for atypical users indicates a LoRA convergence problem. These have different fixes and must be distinguished.
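
The distinction can be made mechanical once both windows are measured. The sketch below uses the 10-session and 25-session limits from the passing threshold. The function name and return strings are hypothetical. Following the threshold paragraph, atypical users who never reach acceptable quality (window of None) are attributed to pretraining data diversity rather than LoRA convergence.

```python
def diagnose_cold_start(typical_window, atypical_window,
                        typical_limit=10, atypical_limit=25):
    """Map measured cold-start windows (in sessions) to a failure mode.

    A window of None means acceptable quality was never reached.
    """
    if typical_window is None or typical_window >= typical_limit:
        return "pretraining data problem"
    if atypical_window is None:
        # Never converging points at missing diversity in pretraining data.
        return "pretraining data problem"
    if atypical_window >= atypical_limit:
        return "LoRA convergence problem"
    return "pass"


# Made-up measurements for illustration only.
healthy = diagnose_cold_start(6, 18)
slow_for_all = diagnose_cold_start(14, 30)
slow_for_atypical = diagnose_cold_start(7, 40)
```

Checking typical users first keeps the two diagnoses separate: a LoRA convergence problem is reported only when typical users are already within their limit.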

Publication Policy

Internal benchmarks (Tiers 1 through 5 and 7 through 9) are run continuously during development and recorded in version-controlled benchmark logs. They are not published externally until Palo Bloom reaches production readiness on each capability.

LongMemEval (Tier 6) is published once Memory Recall achieves the target threshold. The result will be published with full methodology: which tier was tested, what infrastructure was used, and the exact evaluation protocol. No inflation. No cherry-picking the best tier as the headline number.

The Mem0 and Zep benchmark dispute is a cautionary record, not a playbook. The credibility loss from that dispute was architectural: they evaluated with tools they controlled. Mpalo's framework uses a third-party anchor precisely to avoid that dynamic.

Research Papers

The benchmark framework is grounded in three published papers from the Mpalo Research Team. The Uniqueness Gradient Theory establishes the philosophical foundation for what personal identity means in a memory context. The Episodic Memory Schema defines what training data must contain to support genuine episodic memory. The benchmark framework itself extends directly from both.