Engineering·Apr 29, 2026·Johnny Dunn

AgentOS beats Mastra on LongMemEval-S at gpt-4o: 85.6% vs 84.23%

Plus 70.2% on the 1.5M-token M variant — the only open-source memory lib above 65% on M with reproducible methodology.

"Not everything that can be counted counts, and not everything that counts can be counted."

— William Bruce Cameron, Informal Sociology, 1963

Memory benchmarks for AI agents reward retrieval over inference. The score goes up when the system dumps more context into the reader's window and lets the LLM sort the result out. That's not what most people mean by "memory" when they ask for it. It's a search engine paired with a reader smart enough to compensate for the noise.

The numbers people quote (LongMemEval, LOCOMO) inherit this. So when Dhravya publishes 99% and Mastra publishes 95%, the right reaction is not "huh, our 85.6% looks bad" but "what reader, what retrieval, what judge, and can I rerun it." Most of the time, at least one of those answers is missing. (MemPalace also publishes 100% on every memory bench, but that's a broken evaluator that returns 100% no matter what you feed it, so it's not a competitor result, it's a methodology bug.)

AgentOS is the open-source TypeScript runtime I work on. It implements nine cognitive mechanisms from published neuroscience (Ebbinghaus decay, retrieval-induced forgetting, reconsolidation, source-confidence decay, more) so the agent forgets on purpose. The bench numbers below are how that design holds up against everyone else's at the same gpt-4o answer model — the comparison the headline percentages alone don't give you.

Two results, both at the gpt-4o reader, both at full N=500.

LongMemEval-S: 85.6% at $0.0090 per correct answer, 3.6-second median latency. That's +1.4 points above Mastra Observational Memory at gpt-4o (84.23%), the strongest published memory-library number at this reader. EmergenceMem Internal publishes 86.0% (0.4 points above us), but Internal is their closed-source SaaS (emergence.ai/web-automation-api), not a library you can pull into your own project. Their public reference repo emergence_simple_fast scores 79% on the same benchmark and ships with no license, meaning the code is publicly readable but not legally redistributable. AgentOS at 85.6% is the highest published number from a memory framework that ships under a permissive license (Apache-2.0), which is how most production teams actually consume a library.

LongMemEval-M: 70.2% at $0.0078 per correct answer. M is the harder variant: 1.5M tokens of conversation per question, 500 sessions per haystack, bigger than any production LLM context window I've seen. Of the 14 memory-library vendors I audited, none of the others publish an M number at all. AgentOS at 70.2% is competitive with the strongest published M results in the original LongMemEval paper (Wu et al., ICLR 2025, Table 3) [1]. At reader-Top-5, that's +4.5 points above the round-level configuration (65.7%) and 1.2 points below the session-level configuration (71.4%); the paper's strongest GPT-4o result is 72.0% at round-level Top-10. The closest published external M number is AgentBrain's 71.7% from their closed-source SaaS.

(Aside: M's $0.0078 per correct is lower than S's $0.0090, which sounds backwards because M's haystacks are 13× larger. The reader never sees the haystack. Retrieval narrows it. M's headline config uses reader-Top-K=5, so the reader receives 5 chunks per question regardless of corpus size. S's headline runs a per-category classifier + reader router that sends some categories to gpt-4o with larger top-K and an extra classifier call per case. M's per-call reader cost is structurally smaller; the corpus size never enters the bill because the retriever absorbs it.)

Both numbers ship with per-case run JSONs at seed 42. Anyone can rerun the same configuration and compare per-question against my results. The runtime is Apache-2.0 at github.com/framerslab/agentos; the bench harness is Apache-2.0 at github.com/framerslab/agentos-bench. One CLI command at the bottom of this post reproduces each headline.

The rest of the post covers: the architecture changes that produced each number, the audit of the vendor landscape (including the 99% / 95% claims that fall apart once you check what answer LLM the competitor used), the methodology checks behind every number above, and the reproduction commands. There's a lot of skepticism baked in. I built the bench harness because I wasn't going to trust my own numbers without a way to make them break.

TL;DR for the busy reader

| Variant | AgentOS | Closest published competitor | Cost-per-correct | License | Status |
| --- | --- | --- | --- | --- | --- |
| LongMemEval-S (115K tokens, 50 sessions) | 85.6% | EmergenceMem Internal (closed-source) 86.0%, Mastra OM gpt-4o 84.23%, Supermemory 81.6% | $0.0090 | Apache-2.0 | +1.4 over Mastra; +5.0 over EmergenceMem's open Simple Fast (80.6%) |
| LongMemEval-M (1.5M tokens, 500 sessions) | 70.2% | AgentBrain 71.7% (closed-source SaaS). No other open-source library publishes M. | $0.0078 | Apache-2.0 | first open-source above 65% |

Full benchmarks reference · Reproducible run JSONs · Transparency audit framework


Part 1: LongMemEval-S at the gpt-4o reader

LongMemEval-S has 115K tokens of conversation per question and roughly 50 sessions per haystack. It fits in a single gpt-4o call. Every memory-library vendor with a public LongMemEval claim publishes on S.

The table below holds the reader model constant at gpt-4o, so the comparison isolates memory architecture from base-LLM capability. Full run at N=500 questions, gpt-4o-2024-08-06 as judge, rubric 2026-04-18.1 (judge false-positive rate 1%).

| System (gpt-4o-class reader) | Accuracy | $/correct | p50 latency | p95 latency | Source |
| --- | --- | --- | --- | --- | --- |
| EmergenceMem Internal (closed-source proprietary) | 86.0% | not published | 5,650 ms | not published | emergence.ai |
| AgentOS canonical-hybrid + reader-router | 85.6% | $0.0090 | 3,558 ms | 7,264 ms | this work |
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | not published | mastra.ai |
| Supermemory gpt-4o | 81.6% | not published | not published | not published | supermemory.ai |
| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | 9,200 ms | adapter |
| Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | 632 ms p95 search | self / arXiv:2512.13564 |

AgentOS is 1.4 points above the Mastra OM gpt-4o number and 0.4 points below EmergenceMem Internal. Internal is closed-source SaaS behind emergence.ai/web-automation-api, not a library you can install. Their public reference, EmergenceAI/emergence_simple_fast, publishes 79% (we reproduced it at 80.6%), but the repo has no license: the code is publicly visible but not legally usable in derivative work. The 86% Internal number cannot be reproduced from public code at all.

So the practical comparison: the highest "you can install this and use it in your product under a permissive license" memory-library number on LongMemEval-S at the gpt-4o reader is AgentOS at 85.6%. Mastra Observational Memory (Apache-2.0) is next at 84.23%. EmergenceMem's 86% Internal is a SaaS endpoint; their 79% public reference is a license-less repo.

Median latency: AgentOS p50 is 3,558 ms; EmergenceMem's published median is 5,650 ms. The remaining vendors do not publish per-case latency.

Architecture

Every question flows through a single retrieval path: BM25 + dense + cross-encoder rerank (canonical-hybrid). With text-embedding-3-small as the dense embedder, recall@10 sits at 0.981 across the full N=500 set, so the reader sees the relevant chunks on essentially every query. Verbatim temporal detail and preference statements survive the pipeline intact, which is what the multi-session and single-session-preference categories require.

A lightweight classifier (gpt-5-mini, one extra LLM call per case at ~$0.000138) picks the reader model per category. Temporal-reasoning and single-session-user run through gpt-4o; the other four categories run through gpt-5-mini. Reader-model selection is bounded by the classifier and explicit per-category measurements, not by guesswork. The calibration table is below.

Cost at scale: at $0.0090 per memory-grounded answer, 1,000 RAG calls cost $9. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45.
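The arithmetic behind that estimate is small enough to write down. The sketch below follows the post's simplification of treating the per-correct cost as the cost of every call; `ragCostUsd` and the constant names are illustrative, not part of AgentOS.

```typescript
// Cost per memory-grounded answer, taken from the headline figures in this post.
const COST_PER_ANSWER_USD = { S: 0.009, M: 0.0078 } as const;

// Total spend for a number of RAG calls, rounded to the nearest cent.
// Assumption: every call is billed at the per-correct average.
function ragCostUsd(costPerAnswer: number, calls: number): number {
  return Math.round(costPerAnswer * calls * 100) / 100;
}

// 1,000 standalone RAG calls on the S configuration.
const perThousandCallsS = ragCostUsd(COST_PER_ANSWER_USD.S, 1_000);

// A chatbot averaging 5 RAG calls per conversation across 1,000 conversations.
const chatbotS = ragCostUsd(COST_PER_ANSWER_USD.S, 5 * 1_000);

console.log({ perThousandCallsS, chatbotS });
```

Real bills will drift from this because wrong answers still cost tokens; the per-correct figure is an average over the benchmark run, not a price list.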

Reader-router calibration

export const MIN_COST_BEST_CAT_2026_04_28_TABLE = {
  preset: 'min-cost-best-cat-2026-04-28',
  mapping: {
    'temporal-reasoning': 'gpt-4o',            // +11.8 points on TR vs gpt-5-mini
    'single-session-user': 'gpt-4o',           // +4.3 points on SSU
    'single-session-preference': 'gpt-5-mini', // +23.4 points on SSP
    'single-session-assistant': 'gpt-5-mini',  // +1.8 points + cheaper
    'knowledge-update': 'gpt-5-mini',          // +1.5 points + cheaper
    'multi-session': 'gpt-5-mini',             // +3.5 points + cheaper
  },
};
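For readers who want to see how a table like this is consumed, here is a minimal sketch of the dispatch step. It is not the AgentOS ReaderRouter: `pickReader`, the injected `classify` callback (standing in for the gpt-5-mini classifier call), and the fallback-to-gpt-4o behaviour for unknown labels are all assumptions made for illustration.

```typescript
type Category =
  | "temporal-reasoning"
  | "single-session-user"
  | "single-session-preference"
  | "single-session-assistant"
  | "knowledge-update"
  | "multi-session";

type ReaderModel = "gpt-4o" | "gpt-5-mini";

// Mirrors the mapping in the calibration table above.
const READER_BY_CATEGORY: Record<Category, ReaderModel> = {
  "temporal-reasoning": "gpt-4o",
  "single-session-user": "gpt-4o",
  "single-session-preference": "gpt-5-mini",
  "single-session-assistant": "gpt-5-mini",
  "knowledge-update": "gpt-5-mini",
  "multi-session": "gpt-5-mini",
};

// Hypothetical helper: classify the question once, then look up the reader.
// Unknown labels fall back to a default reader (an assumption, not documented behaviour).
async function pickReader(
  question: string,
  classify: (q: string) => Promise<string>,
  fallback: ReaderModel = "gpt-4o",
): Promise<ReaderModel> {
  const label = await classify(question);
  const known = Object.prototype.hasOwnProperty.call(READER_BY_CATEGORY, label);
  return known ? READER_BY_CATEGORY[label as Category] : fallback;
}
```

Keeping the classifier behind a callback makes the routing decision testable without any API call, which is also what makes a per-category calibration like the one above cheap to re-measure.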

Per-category at the 85.6% headline:

| Category | Accuracy | n |
| --- | --- | --- |
| single-session-assistant | 98.2% | 56 |
| single-session-user | 94.3% | 70 |
| knowledge-update | 91.0% | 78 |
| single-session-preference | 86.7% | 30 |
| temporal-reasoning | 84.2% | 133 |
| multi-session | 74.4% | 133 |

15 adjacent configurations all regressed

Each of the following single-variable variants was tested against the 85.6% baseline. None lifted the aggregate score.

| Probe | Result | Δ vs baseline |
| --- | --- | --- |
| --reader-top-k 30 | 81.5% (Phase A) | -3.7 points |
| --hyde | 83.3% (Phase A) | -1.9 points |
| --rerank-candidate-multiplier 5 | 75.9% (Phase A) | -9.3 points |
| --retrieval-config-router minimize-cost-augmented | 77.8% (Phase A) | -7.4 points |
| --policy-router-preset balanced | 74.1% (Phase A) | -11.1 points |
| --policy-router-preset maximize-accuracy | 83.3% (Phase A) | -1.9 points |
| text-embedding-3-large | 83.4% (Phase B) | -2.2 points at 20× slower latency |
| --om-classifier-model gpt-4o | 84.0% (Phase B) | -1.6 points at +44% cost |
| --rerank-model rerank-v4.0-pro | 84.6% (Phase B) | -1.0 points; 5/6 categories regress |
| --reader-router min-cost-best-cat-gpt5-tr-2026-04-29 | 83.2% (Phase B) | -2.4 points; TR drops 84.2% → 80.5% |

Fifteen variants tested across Phase A and Phase B; fifteen regressions. The 85.6% configuration is a local optimum in the tested parameter space.

A note on HyDE at S scale

The Gao et al. (2022) HyDE paper introduced hypothetical document embedding as a recall-improver: ask the LLM to draft a plausible answer, embed that instead of the question, and let the answer-form embedding land closer to the answer-form documents in the index. It's a real effect at the right scale. The Lei et al. (2025) Adaptive HyDE paper shows the same on a 3M-post Stack Overflow corpus where the haystack is large and the question/answer registers are far apart.
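The mechanism described above fits in a few lines. This is a minimal sketch, not the AgentOS implementation: `draftAnswer` and `embed` are hypothetical injected functions standing in for an LLM call and an embedding call, and a brute-force cosine ranking stands in for the dense index.

```typescript
type Vec = number[];

interface Doc {
  id: string;
  embedding: Vec;
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// HyDE: embed a drafted answer instead of the question, then rank documents against it.
async function hydeSearch(
  question: string,
  index: Doc[],
  draftAnswer: (q: string) => Promise<string>, // hypothetical LLM call
  embed: (text: string) => Promise<Vec>,       // hypothetical embedding call
  topK = 5,
): Promise<Doc[]> {
  const hypothetical = await draftAnswer(question);
  const queryVec = await embed(hypothetical);
  return index
    .map((doc) => ({ doc, score: cosine(queryVec, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.doc);
}
```

The only difference from plain dense retrieval is the first line of `hydeSearch`; that extra LLM call is the cost discussed in the rest of this section.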

LongMemEval-S is not that setting. The haystack is 50 sessions, the canonical pipeline already saturates retrieval, and BM25 + dense + Cohere rerank-v3.5 is precise enough that the right chunk is in the top-K most of the time. HyDE adds an extra LLM call to produce a hallucinated answer-shaped embedding, and at this scale that embedding is often less precise than the original question embedding for the categories that matter (single-session-user, knowledge-update, single-session-preference). The hypothetical chunks compete with the real ones in the rerank pool and displace correct hits. Net: -1.9 points and added latency. We turn it off.

The lesson isn't "HyDE is bad" — it's "HyDE earns its keep when the haystack outgrows the precision of the surface retrieval, and pays cost the rest of the time." Save it for M.


Part 2: LongMemEval-M at the gpt-4o reader

LongMemEval-M has 1.5M tokens of conversation per question and roughly 500 sessions per haystack. It exceeds every production LLM context window, which forces evaluation through retrieval rather than prompt-stuffing.

What LongMemEval is, and what M means

LongMemEval [2] is an academic memory benchmark introduced in "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory" (Wu et al., ICLR 2025) [1]. The dataset, evaluation harness, and rubric are open source at github.com/xiaowu0162/LongMemEval [2]. The 12 paper authors are academic researchers, none affiliated with a memory-library vendor.

The benchmark ships two variants by haystack scale:

| Variant | Tokens per haystack | Sessions per haystack | Fits in production context window? |
| --- | --- | --- | --- |
| S | ~115K | ~50 | Yes. S fits in every modern long-context LLM: GPT-4o is 128K, Claude Opus is 200K, Gemini 3 Pro is 1M, GPT-5 is 400K |
| M | ~1.5M | ~500 | No. Exceeds every production context window |

The S-to-M jump is a category change rather than a 13× scaling exercise. At S scale a memory architecture competes against the option of dumping the full conversation into the context window. Mastra's full-context baseline [3] at gpt-4o is 60.20%, and their Observational Memory configuration at the same model is 84.23%; the 24-point lift partly reflects token compression rather than memory architecture, because the OM config fits in fewer tokens and the reader has less to process. Penfield Labs makes the same point in their April 2026 LOCOMO audit [4]: when the corpus fits in the context window, the benchmark is partly measuring context-window management.

At M scale the corpus does not fit in context. Retrieval is the only path, and the benchmark measures whether retrieval can find the relevant chunks in roughly 25,000 candidates across 500 sessions.

The vendor landscape on LongMemEval, audited 2026-04-29

The table below covers every memory library or platform with a public LongMemEval claim found in research pages, blog posts, GitHub repos, and peer-reviewed papers as of 2026-04-29.

| Vendor | License | Their published S number | Their published M number |
| --- | --- | --- | --- |
| Mem0 v3 / Mem0 OS | Apache-2.0 | 92-93.4% | not published |
| Mastra Observational Memory | Apache-2.0 | 84.23-94.87% | not published |
| Hindsight (vectorize.io) | open repo | 91.4% | not published |
| Neutrally | public production system | 89.4% (LongMemEval repo issue submission, gpt-4o judge) | not published |
| Zep / Graphiti | Apache-2.0 | 71.2% (independently reproduced at 63.8%) | not published |
| EmergenceMem | 86% Internal is closed-source SaaS; the 79% reference repo emergence_simple_fast has no license (public code, not legally redistributable) | 79% (no-license repo) / 86% (closed SaaS) | not published |
| Supermemory | open | 81.6-99% | not published |
| MemMachine | open repo | 93% | not published |
| Memoria | open | 88.78% | not published |
| agentmemory (JordanMcCann) | MIT | 96.2% (no methodology) | not published |
| Backboard | open | 93.4% | not published |
| ByteRover | closed | 92.8% | not published, explicit "M scales beyond any context window" |
| Letta (formerly MemGPT) | Apache-2.0 | not published on LongMemEval | not published |
| Cognee | Apache-2.0 | not published on LongMemEval | not published |
| AgentBrain | closed-source SaaS | not published | 71.7% (Test 0; requires hosted Brain endpoint to reproduce) |
| agentos-bench (this work) | Apache-2.0 | 85.6% | 70.2% |

The full per-vendor audit is at packages/agentos-bench/docs/COMPETITOR_METHODOLOGY_AUDIT_2026-04-24.md.

Three operational barriers to publishing on M

  1. Context window. S (115K tokens) fits in every modern long-context LLM. M (1.5M tokens) exceeds every production context window. Architectures that rely on prompt-stuffing or compression-then-stuffing score lower on M than on S.

  2. Dataset loading. longmemeval_m.json is 2.7 GB. Node's V8 engine has a max-string-length cap that rejects fs.readFile on a file of that size. The streaming fix is chain([createReadStream, parser(), streamArray()]) from stream-json + stream-chain, routed by a file-size probe at >1 GB. STAGE_J_BLOCKED_2026-04-25.md records the workaround.

  3. Run cost. A memory-augmented full-context M run consumes ~750M input tokens at GPT-4o-128K pricing, roughly $1,250 per run. Retrieval-augmented M runs are $5 to $15.
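The file-size probe in barrier 2 can be sketched as below. This is an illustration under assumptions, not the agentos-bench loader: `chooseLoadStrategy` and the injectable `sizeOf` parameter are hypothetical, and the post's ">1 GB" is read here as 2^30 bytes.

```typescript
import { statSync } from "node:fs";

// Assumption: the ">1 GB" threshold is interpreted as 2^30 bytes.
const STREAM_THRESHOLD_BYTES = 1024 * 1024 * 1024;

type LoadStrategy = "buffered" | "streaming";

// Decide how to load a dataset file before touching its contents.
// "buffered" means fs.readFile + JSON.parse is safe; "streaming" means the file
// should go through the chain([createReadStream, parser(), streamArray()])
// pipeline named above instead of being read into one string.
function chooseLoadStrategy(
  path: string,
  sizeOf: (p: string) => number = (p) => statSync(p).size, // injectable for tests
): LoadStrategy {
  return sizeOf(path) > STREAM_THRESHOLD_BYTES ? "streaming" : "buffered";
}
```

Probing the size first keeps the fast path for the S dataset and only pays the streaming overhead where V8's string cap would otherwise reject the read.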

What the LongMemEval paper reports on M

Wu et al., Table 3 of arXiv:2410.10813, reports academic-baseline configurations on LongMemEval-M. The strongest configuration in the paper:

72.0% on LongMemEval-M with GPT-4o + Stella V5 retriever + Value=Round + K=V+fact + Top-10.

The same Table 3 reports several other GPT-4o configurations: Round Top-5 K=V+fact at 65.7%, Session Top-5 K=V+fact at 71.4%, Session Top-10 K=V+fact at 70.0%. The 72.0% number is the strongest result in Table 3 across all GPT-4o configurations.

The dataset, evaluation harness, and rubric are open source at xiaowu0162/LongMemEval. The paper's GPT-4o results at Top-5 retrieval (the configuration AgentOS uses) span 65.7% (round-level) to 71.4% (session-level). At Top-10 the strongest is 72.0% (round-level).

Where we land

| System | Accuracy | 95% range | License | Source |
| --- | --- | --- | --- | --- |
| AgentBrain | 71.7% (Test 0) | not published | closed-source SaaS | github.com/AgentBrainHQ |
| 🚀 AgentOS (sem-embed + reader-router + top-K=5) | 70.2% | | Apache-2.0 | agentos-bench |
| LongMemEval paper, strongest GPT-4o result | 72.0% (round, Top-10) / 71.4% (session, Top-5) | not published | open repo | Wu et al., ICLR 2025, Table 3 |
| Mem0 v3, Mastra OM, Hindsight, Zep, EmergenceMem, Supermemory, MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta, Cognee | not published | | various | reports S only |

AgentOS at 70.2% is competitive with the strongest published M results in the LongMemEval paper. The paper's strongest GPT-4o result is 72.0% at round-level Top-10; at matched Top-5 retrieval the paper's results span 65.7% (round) to 71.4% (session). The closest published external number is AgentBrain's 71.7% from their closed-source SaaS, which requires access to a hosted endpoint to reproduce. agentos-bench publishes per-case run JSONs and a one-line CLI reproduction at github.com/framerslab/agentos-bench.

Architecture

The 70.2% configuration uses HyDE-augmented BM25 + dense retrieval over text-embedding-3-small, Cohere rerank-v3.5 cross-encoder rerank with a candidate-pool multiplier of 5, reader-top-K=5, and the same per-category reader router used at S scale. A LongMemEval-M haystack contains ~1.5M tokens spread across 500 sessions, producing ~25,000 candidate chunks. The reader sees only the 5 highest-scoring chunks the cross-encoder returns.

Top-K=5 matches the retrieval depth of the Top-5 configurations in the LongMemEval paper (Wu et al., ICLR 2025, Table 3), which span 65.7% to 71.4%; the paper's strongest M result, 72.0%, uses Top-10. At higher K, chunks ranked 6 and below frequently share lexical surface with the query but do not contain the answer; including them lowers the reader's signal-to-noise ratio. Liu et al. (2024), "Lost in the Middle", reports the same shape of failure at the long-context-LLM level.

The takeaway: a memory system that retrieves more is not always one that remembers better. At M scale, a retriever that hands the reader fewer (and better-ranked) chunks scores higher than one that hands it more. Recall is necessary; what the reader does with what it got is the rest of the work.
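The last two stages of that pipeline, narrowing to a rerank pool and then to the reader's top-K, can be sketched as follows. This is a hedged illustration: `selectReaderChunks` is a hypothetical name, `rerankScore` stands in for the Cohere cross-encoder call, and the assumption that the pool size equals top-K times the multiplier is mine, not something the post states.

```typescript
interface Chunk {
  id: string;
  text: string;
  retrievalScore: number; // first-stage (BM25 + dense) score
}

// Keep the best `topK * poolMultiplier` first-stage candidates, rerank only those,
// and hand the reader the `topK` highest reranked chunks.
function selectReaderChunks(
  candidates: Chunk[],
  rerankScore: (chunk: Chunk) => number, // stand-in for a cross-encoder
  topK = 5,
  poolMultiplier = 5,
): Chunk[] {
  const poolSize = topK * poolMultiplier; // assumption: 5 x 5 = 25 candidates
  const pool = [...candidates]
    .sort((a, b) => b.retrievalScore - a.retrievalScore)
    .slice(0, poolSize);
  return pool
    .map((chunk) => ({ chunk, score: rerankScore(chunk) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}
```

Under this reading, a larger multiplier widens what the reranker can promote, while a larger top-K widens what the reader must sift through; the probes below vary each knob separately.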

M cost at scale: at $0.0078 per correct over a 1.5M-token haystack, 1,000 RAG calls cost $7.80.

Per-category at the 70.2% M headline

| Category | Accuracy | n |
| --- | --- | --- |
| single-session-assistant | 96.4% | 56 |
| single-session-user | 91.4% | 70 |
| knowledge-update | 78.2% | 78 |
| temporal-reasoning | 66.2% | 133 |
| single-session-preference | 63.3% | 30 |
| multi-session | 48.9% | 133 |
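As a consistency check, the aggregate can be recomputed from the per-category rows. The sketch assumes each published percentage is a rounded ratio of integer correct counts, so the counts are recovered by rounding; `aggregateAccuracy` is an illustrative helper, not part of the bench.

```typescript
interface CategoryRow {
  category: string;
  accuracyPct: number;
  n: number;
}

// Per-category results at the 70.2% M headline, copied from the table above.
const perCategoryM: CategoryRow[] = [
  { category: "single-session-assistant", accuracyPct: 96.4, n: 56 },
  { category: "single-session-user", accuracyPct: 91.4, n: 70 },
  { category: "knowledge-update", accuracyPct: 78.2, n: 78 },
  { category: "temporal-reasoning", accuracyPct: 66.2, n: 133 },
  { category: "single-session-preference", accuracyPct: 63.3, n: 30 },
  { category: "multi-session", accuracyPct: 48.9, n: 133 },
];

// Recover integer correct counts per category, then pool them.
function aggregateAccuracy(rows: CategoryRow[]) {
  const total = rows.reduce((sum, r) => sum + r.n, 0);
  const correct = rows.reduce(
    (sum, r) => sum + Math.round((r.accuracyPct / 100) * r.n),
    0,
  );
  return { total, correct, accuracyPct: (100 * correct) / total };
}

const aggregateM = aggregateAccuracy(perCategoryM);
console.log(aggregateM.correct, aggregateM.total, aggregateM.accuracyPct.toFixed(1));
```

The aggregate is a case-weighted mean, so the two 133-case categories (multi-session and temporal-reasoning) dominate it.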

Four one-knob probes all regressed on M

Each variant was tested as a single-variable change on top of the 70.2% configuration.

| Probe | Aggregate | Δ | Verdict |
| --- | --- | --- | --- |
| --reader-top-k 3 | 65.2% | -5.0 points; ranges disjoint | refuted |
| --hyde off | 69.2% | -1.0 points; tied within range | marginal |
| --rerank-candidate-multiplier 10 | 60.0% | -10.2 points; ranges disjoint | catastrophically refuted |
| --two-call-reader (Chain-of-Note) | 58.6% | -11.6 points; ranges disjoint | refuted |

Top-K=5 with HyDE on and rerank-multiplier 5 is the local optimum in the tested parameter space.

The HyDE-off ablation sits within statistical noise of the headline (69.2% vs 70.2%, -1.0 within bootstrap range), so I'll call this one marginal rather than decisive. But the directionality matches the Lei et al. (2025) result for Stack Overflow: at 500-session, 1.5M-token haystack scale, BM25 + dense recall starts missing the right bridge sessions on multi-hop synthesis, and the hypothetical-answer embedding lands in the right neighborhood often enough to recover them. This is the regime where the HyDE paper's claim is load-bearing — recall-bound retrieval with a wide question-to-answer register gap. S didn't have that gap; M does.

So the policy across both variants ends up clean: HyDE off on LongMemEval-S, HyDE on for LongMemEval-M. The retriever doesn't decide for you — you do, based on whether your haystack outruns surface-form retrieval. AgentOS ships HyDE as opt-in (enabled: false by default) for exactly this reason.


Part 3: Why M is harder than S

Multi-session and temporal-reasoning together account for 53% of all M cases and post the lowest per-category scores at M scale (48.9% and 66.2%). Across 500 candidate sessions per haystack instead of 50, the relevant session is harder for the cross-encoder to surface. Multi-session bridge queries require the model to combine evidence from two distinct sessions; at S scale this means picking 2 of 50, at M scale 2 of 500. The remaining headroom on M is concentrated in those two categories. Improvements there will move the aggregate.

The single-session categories (assistant, user, preference) translate cleanly between scales because the relevant evidence sits in one session and Top-5 retrieval reaches it. Knowledge-update and temporal-reasoning lose more between S and M because both involve cross-session synthesis where Top-5 sometimes drops the second relevant chunk.


Mastra's gpt-5-mini configuration scores above their gpt-4o configuration

Mastra publishes 84.23% on LongMemEval-S with gpt-4o as the reader, and 94.87% with gpt-5-mini as the reader. The architectural reason for this ordering is that the gpt-5-mini configuration moves reasoning upstream of the reader, into ingest-time observers and reflectors.

In Mastra's gpt-4o-only configuration, the reader handles retrieval-context parsing, cross-session reasoning, and answer generation in a single query-time LLM call.

In Mastra's gpt-5-mini + Observational Memory configuration, the pipeline does additional ingest-time work. A gemini-2.5-flash "observer" runs over each session at ingest and extracts structured observation logs. A second gemini-2.5-flash "reflector" synthesizes the observations into long-term cross-session insights. At query time, the gpt-5-mini reader answers from the pre-distilled observation log plus reflections rather than raw chunks. The architecture is documented on Mastra's Observational Memory research page and described in VentureBeat's coverage.

The 94.87% is excluded from the comparison tables in this post for two reasons:

  1. AgentOS at the same stack (gpt-5-mini reader + gemini-2.5-flash observer) on LongMemEval-S Phase A produced 70.4%, a 24-point gap from the published headline. The methodology disclosed on Mastra's research page does not contain enough detail for direct reproduction.

  2. No confidence range is published on the 94.87%. Mastra's 84.23% gpt-4o headline falls inside the AgentOS 95% range; the two configurations are statistically tied at this resolution.

The cross-vendor comparison at the top of this post uses gpt-4o on both sides.


Part 4: Reproducibility issues in the published vendor record

The patterns documented below are the methodology checks the agentos-bench harness applies before publishing any number, and the gaps between agentos-bench numbers and other vendors' published numbers.

LOCOMO answer-key and judge error rates

Penfield Labs (April 2026) audited LOCOMO and reported 99 errors in 1,540 answer-key entries (6.4% ground-truth error rate), and a 62.81% false-positive rate on LOCOMO's default LLM judge against an intentionally-wrong topically-adjacent answer set.

Implications:

  • LOCOMO scores above 93.6% include benefit from answer-key errors.
  • LOCOMO score differences below ~6 points are within the judge false-positive band.
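The first implication is plain arithmetic on the Penfield counts. A short sketch, with variable names chosen for illustration:

```typescript
// Counts reported by the Penfield Labs audit of the LOCOMO answer key.
const answerKeyErrors = 99;
const answerKeyEntries = 1540;

// Share of ground-truth entries that are themselves wrong.
const errorRatePct = (100 * answerKeyErrors) / answerKeyEntries;

// Highest score reachable by a system that is right on every correctly keyed entry
// and (correctly) disagrees with every wrong one.
const cleanCeilingPct = 100 - errorRatePct;

console.log(errorRatePct.toFixed(1), cleanCeilingPct.toFixed(1));
```

A score above that ceiling means the system agreed with at least some entries the audit flagged as wrong.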

For context, Northcutt et al. (NeurIPS 2021) found a 3.3% label-error rate sufficient to destabilize benchmark rankings.

LongMemEval-S overlap with current context windows

LongMemEval-S uses 115K tokens of conversation per question. GPT-4o (128K), Claude 3.5 (200K), Gemini 1.5 Pro (1M), and GPT-5 (400K) all hold the full S corpus in a single prompt.

Mastra's full-context baseline at gpt-4o is 60.20%; their Observational Memory configuration at the same model is 84.23%. The 24-point delta partly reflects token-level compression rather than retrieval-architecture quality.

The M variant exceeds every production context window, removing this confound.

Cross-vendor reimplementation discrepancies

In May 2025, Mem0 published a research paper positioning their product as state-of-the-art on LOCOMO. Their comparison table scored Zep at 65.99%. Zep responded, reran the evaluation with their own configuration, and reported 75.14% ±0.17 for Zep. Zep attributed the gap to Mem0 running Zep with sequential search instead of concurrent search.

Zep's self-reported LongMemEval-S number is 71.2% at gpt-4o, from their SOTA blog post. An independent reproduction at arXiv:2512.13564 measured Zep at 63.8%, a 7.4-point gap.

Notable methodology-disclosure findings

  • EmergenceMem "Simple Fast" hardcodes top_k=42 in retrieval.
  • Mastra's research page publishes 84.23% at gpt-4o; the observer and reflector models in the same configuration are gemini-2.5-flash (cross-provider).
  • Mem0's research page reports 92.0% on LongMemEval; their research-2 page reports 93.4% on the same benchmark.
  • MemPalace publishes 100% on every memory bench they touch because their evaluator is broken and returns 100% no matter the input. Not a competitor result. Public post-mortem: HackerNoon.

What competitors actually publish on 12 transparency axes

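| Transparency axis | Mem0 | Mastra | Supermemory | Zep | Emergence | Letta | AgentOS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Aggregate accuracy | yes | yes | yes | yes | yes | partial | yes |
| 95% confidence range on headline | no | no | no | partial | no | no | yes |
| Per-category 95% range | no | no | no | no | no | no | yes |
| Reader model disclosed | no | yes | partial | yes | yes | no | yes |
| Observer / ingest model disclosed | no | yes | no | yes | yes | no | yes |
| USD cost per correct | no | no | no | no | no | no | yes |
| Latency avg / p50 / p95 | no | no | no | partial | median only | no | yes |
| Per-category breakdown | no | yes | yes | yes | yes | partial | yes |
| Open-source benchmark runner | yes | partial | yes | partial | yes | no | yes |
| Per-case run JSONs at fixed seed | no | no | no | no | no | no | yes |
| Judge-adversarial probe | no | no | no | no | no | no | yes |
| Cross-vendor comparison table | no | no | partial | partial | yes | no | yes |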

Judge FPR comparison

| Benchmark | AgentOS judge FPR | LOCOMO default judge FPR (Penfield audit) |
| --- | --- | --- |
| LongMemEval-S | 1% [0%, 3%] | not measured |
| LongMemEval-M | 2% [0%, 5%] | not measured |
| LOCOMO | 0% [0%, 0%] | 62.81% |

LOCOMO's default gpt-4o-mini judge measures at 62.81% FPR on the Penfield adversarial set. agentos-bench LOCOMO runs use gpt-4o-2024-08-06 with rubric 2026-04-18.1, which measures at 0% FPR on the same adversarial set.
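The false-positive rate itself is a simple ratio: every answer in an adversarial set is intentionally wrong, so any "correct" verdict from the judge is a false positive. A minimal sketch, with `JudgeProbe` and `judgeFalsePositiveRate` as hypothetical names rather than agentos-bench types:

```typescript
interface JudgeProbe {
  caseId: string;
  judgedCorrect: boolean; // the judge's verdict on an intentionally wrong answer
}

// Fraction of intentionally wrong answers that the judge accepted as correct.
function judgeFalsePositiveRate(probes: JudgeProbe[]): number {
  if (probes.length === 0) {
    throw new Error("adversarial set is empty");
  }
  const falsePositives = probes.filter((p) => p.judgedCorrect).length;
  return falsePositives / probes.length;
}
```

A judge that accepts topically adjacent but wrong answers inflates every score measured with it, which is why the probe runs before any headline number is trusted.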


Part 5: Reproducing both headlines

OpenAI and Cohere API keys are required.

LongMemEval-S 85.6% headline

git clone https://github.com/framerslab/agentos-bench
cd agentos-bench
pnpm install && pnpm build

# Set OPENAI_API_KEY and COHERE_API_KEY in your environment
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-s \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000

LongMemEval-M 70.2% headline

NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-m \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere --rerank-candidate-multiplier 5 \
  --reader-top-k 5 \
  --hyde \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000

Both runs emit per-case run JSONs under seed=42. Cross-run comparison is possible against the leaderboard at packages/agentos-bench/results/LEADERBOARD.md.


Architecture

The AgentOS memory decomposition follows the CoALA framework (Sumers et al., 2023): explicit memory partitions and a decision-making module selecting a strategy per query. The MemoryRouter corresponds to the CoALA memory module; the ReaderRouter corresponds to the decision module.

The closest comparable architecture in the published record is Letta (formerly MemGPT, Packer et al., 2023), which models the LLM as a virtual operating system with paged memory. Letta has not published a LongMemEval number under the post-MemGPT branding.

Eight cognitive-memory mechanisms underlie the architecture (Ebbinghaus decay, reconsolidation, retrieval-induced forgetting, feeling-of-knowing, gist extraction, schema encoding, source confidence decay, emotion regulation) and are documented with primary-source citations in the AgentOS Cognitive Memory docs.

Remaining headroom

Multi-session is the lowest per-category score on both variants. On M it measures at 48.9% (up from 29.3% under top-K=50), against a per-category ceiling of ~96% on SSA/SSU. On S it measures at 74.4% against the 85.6% aggregate. The MS category requires bridge queries across distinct sessions; pure retrieval-broadening does not close the gap.

Two candidate v2 mechanisms are queued:

  1. Stage E: Hindsight 4-network typed-observer, adding a typed-graph signal orthogonal to BM25 + dense + Cohere rerank. Architecture follows Hindsight (vectorize.io, 2025).
  2. K=V+fact key augmentation (Wu et al., Table 3 configuration): index sessions by raw content and extracted facts, with dual-key vector lookup. The Phase B --rerank-candidate-multiplier 10 ablation regressed by 10.2 points on the same retrieval-heavy categories K=V+fact would affect, suggesting bounded expected lift.

Methodology disclosures

What's apples-to-apples in the comparisons above:

  • Same gpt-4o reader as Mastra OM gpt-4o, Supermemory gpt-4o, EmergenceMem.
  • Same benchmark dataset (LongMemEval-S, 500 cases; LongMemEval-M, 500 cases).
  • Same judge harness (gpt-4o-2024-08-06 with rubric 2026-04-18.1); judge false-positive rate 1% on S, 2% on M, 0% on LOCOMO.
  • 95% confidence ranges at 10,000 resamples; most vendors don't publish ranges at all.
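For readers unfamiliar with how a 95% range comes out of per-case results, here is a minimal percentile-bootstrap sketch. The post does not say which bootstrap variant agentos-bench uses, so the percentile method, the `mulberry32` generator, and the `bootstrapCi` name are assumptions for illustration; only the 10,000 resamples and seed 42 come from the text.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is repeatable.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap over per-case correctness (true = judged correct).
function bootstrapCi(outcomes: boolean[], resamples = 10_000, seed = 42) {
  const n = outcomes.length;
  if (n === 0) throw new Error("no outcomes to resample");
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let correct = 0;
    for (let i = 0; i < n; i++) {
      if (outcomes[Math.floor(rand() * n)]) correct++; // sample with replacement
    }
    means.push(correct / n);
  }
  means.sort((a, b) => a - b);
  const at = (q: number) => means[Math.min(resamples - 1, Math.floor(q * resamples))];
  const mean = outcomes.filter(Boolean).length / n;
  return { mean, lower: at(0.025), upper: at(0.975) };
}

// 428 of 500 correct corresponds to an 85.6% aggregate.
const outcomes = Array.from({ length: 500 }, (_, i) => i < 428);
console.log(bootstrapCi(outcomes));
```

Because the generator is seeded, two runs over the same per-case JSONs return the same range, which is the property that makes a published range checkable.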

What isn't, with caveats:

  • Cost and latency comparisons against Mastra, Supermemory, and EmergenceMem aren't directly measurable, because those vendors don't publish $/correct or per-case latency. The cost and latency numbers above are absolute: $0.0090 per correct on S, $0.0078 on M, p50 latency 3,558 ms on S.
  • Mastra's 94.87% headline uses gpt-5-mini + gemini-2.5-flash observer. We can't reproduce it from their public methodology page, so it sits outside our gpt-4o table.
  • Mem0 v3's 93.4% is a managed-platform number with no published confidence range, no judge disclosure, and no reader disclosure. Their own State of AI Agent Memory 2026 post reports 66.9% on LOCOMO for the production stack.
  • Hindsight's 91.4% uses gemini-3-pro as reader. Supermemory's 85.2% uses gemini-3-pro as reader. Both are cross-provider, so they sit outside the gpt-4o table.
  • Managed-platform numbers (Mastra, Mem0 v3, agentmemory) run on infrastructure with platform-specific optimizations that aren't necessarily portable.

Evaluating memory libraries

Three open-source benchmark harnesses cover the LongMemEval / LOCOMO space:

  • agentos-bench [5]: LongMemEval-S/M, LOCOMO, BEAM, and eight cognitive-mechanism micro-benchmarks. 95% confidence ranges, judge-adversarial probes, per-stage retention metric, per-case run JSONs at --seed 42.
  • Supermemory memorybench [6]: LoCoMo, LongMemEval, ConvoMem against Supermemory, Mem0, and Zep with multi-judge support.
  • Mem0 memory-benchmarks [7]: LOCOMO and LongMemEval against Mem0 Cloud and OSS.

Reproducible memory benchmarks require a published seed, configuration, and per-case run JSONs alongside the headline number.

Closing

Two numbers end up here. 85.6% on LongMemEval-S at $0.0090 per correct, +1.4 points above Mastra Observational Memory, the strongest published memory-library number at the same gpt-4o answer model. 70.2% on LongMemEval-M at $0.0078 per correct, the only open-source library on the public record above 65% on the variant whose haystacks no production context window can absorb.

The intent of the design behind both numbers is not perfect recall. AgentOS implements Ebbinghaus decay, retrieval-induced forgetting, reconsolidation, and six other cognitive-science mechanisms precisely so the agent generalizes from what it has seen rather than drowning in it. The benchmark numbers are the measurable part of that argument. The rest of the whitepaper covers the part that can't be reduced to a percentage.

The runtime is Apache-2.0 at github.com/framerslab/agentos. The bench is at github.com/framerslab/agentos-bench. Reproducing the headlines takes the two CLI commands above, run on a dataset anyone can download from the LongMemEval upstream and checked against the per-case run JSONs at seed 42.

FAQ

What's the difference between LongMemEval-S and LongMemEval-M?

S has 115K tokens of conversation per question and ~50 sessions per haystack: it fits in one gpt-4o call. M has 1.5M tokens per question and 500 sessions, exceeding every production LLM context window. S measures retrieval over a single-session-shaped corpus; M measures retrieval at scale where the reader can never see the whole haystack. AgentOS scores 85.6% on S and 70.2% on M.
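
The size difference can be checked with simple arithmetic. The sketch below assumes gpt-4o's 128,000-token context window; the helper name and the optional reserve for prompt and output tokens are hypothetical.

```typescript
// gpt-4o's context window, in tokens.
const GPT_4O_CONTEXT_TOKENS = 128_000;

// True when the whole haystack, plus any tokens reserved for the prompt
// and the answer, fits inside a single reader call.
function fitsInOneCall(
  haystackTokens: number,
  contextTokens: number,
  reservedTokens = 0,
): boolean {
  return haystackTokens + reservedTokens <= contextTokens;
}

const sFits = fitsInOneCall(115_000, GPT_4O_CONTEXT_TOKENS);   // true: S fits
const mFits = fitsInOneCall(1_500_000, GPT_4O_CONTEXT_TOKENS); // false: M does not
```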

What's the highest LongMemEval-S score anyone has claimed?

99% (Dhravya, as a gaming demonstration against the published bench rather than a reproducible architecture claim). Mastra publicly claims 95% but at a different (cheaper) answer LLM and with retrieval config that doesn't match the original paper's evaluation protocol. At the same gpt-4o answer LLM, Mastra Observational Memory posts 84.23%, AgentOS posts 85.6%, and EmergenceMem Internal (closed-source SaaS) posts 86.0%. Headline percentages that don't say which answer LLM produced them are pricing observations, not architecture claims. (MemPalace also publishes 100%, but their evaluator is broken in a way that returns 100% on any input, so it's not really in the conversation.)

Why publish 85.6% when others claim higher numbers?

Because the argument I care about is reproducibility, not headline percentage. Every number above comes with stated answer LLM, stated retrieval config, stated judge, fixed seed, per-case run JSONs, a single CLI to reproduce, and Apache-2.0 code. The 99% / 95% claims that beat AgentOS at face value miss at least one of those. The honest cost rule says I can't compare scores until those gaps close. If a competitor publishes their numbers at the same gpt-4o answer LLM tomorrow and beats 85.6%, I'll cite them and ship a faster bench. That's the deal.
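
The comparison rule in that answer can be written as a small check. This is a sketch of the rule as I read it from the paragraph above, not code from agentos-bench; the `PublishedResult` type and its field names are hypothetical.

```typescript
// Hypothetical disclosure record; field names are illustrative, not an agentos-bench schema.
interface PublishedResult {
  name: string;
  scorePct: number;         // headline accuracy, in percent
  reader?: string;          // answer LLM
  judge?: string;
  retrievalConfig?: string;
  seed?: number;
  perCaseRunsPublished: boolean;
}

// List which reproducibility items a published result leaves unstated.
function missingDisclosures(r: PublishedResult): string[] {
  const missing: string[] = [];
  if (!r.reader) missing.push("reader");
  if (!r.judge) missing.push("judge");
  if (!r.retrievalConfig) missing.push("retrievalConfig");
  if (r.seed === undefined) missing.push("seed");
  if (!r.perCaseRunsPublished) missing.push("perCaseRuns");
  return missing;
}

// Two scores share a table only if both are fully disclosed
// and both were produced by the same answer LLM.
function comparable(a: PublishedResult, b: PublishedResult): boolean {
  return (
    missingDisclosures(a).length === 0 &&
    missingDisclosures(b).length === 0 &&
    a.reader === b.reader
  );
}
```

A higher `scorePct` never enters the check: the score is only read once the disclosures line up.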

Is the AgentOS bench code public?

Yes. github.com/framerslab/agentos-bench, Apache-2.0. Includes the harness, vendor adapters, judge config, seed list, and per-case run JSONs for every reported headline.

What reader model does AgentOS use?

gpt-4o (specifically gpt-4o-2024-08-06) for both S and M headlines. Some categories within S are routed through gpt-5-mini by an explicit per-category classifier with a reader-router; the reader-router is part of the AgentOS architecture, not a separate trick, and is documented in Part 1 above.
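
As a rough illustration of what a per-category reader-router looks like, here is a minimal sketch. It is not the AgentOS implementation: the category labels, the keyword classifier, and the route table are placeholders, and which categories actually go to gpt-5-mini is documented in Part 1, not here.

```typescript
type ReaderModel = "gpt-4o-2024-08-06" | "gpt-5-mini";

// Placeholder category labels; the real categories and routes are described in Part 1.
type Category = "category-a" | "category-b" | "category-c";

type Classifier = (question: string) => Category;

// Build a router: classify the question, look up its category in the route
// table, and fall back to the default reader when no route is listed.
function makeReaderRouter(
  classify: Classifier,
  routes: Partial<Record<Category, ReaderModel>>,
  fallback: ReaderModel = "gpt-4o-2024-08-06",
): (question: string) => ReaderModel {
  return (question) => routes[classify(question)] ?? fallback;
}

// Toy keyword classifier standing in for the real per-category classifier.
const toyClassifier: Classifier = (q) =>
  /\bwhen\b/i.test(q) ? "category-b" : "category-a";

const route = makeReaderRouter(toyClassifier, { "category-b": "gpt-5-mini" });
```

Keeping the route table explicit means the reader used for any given question can be stated in the run record, which is what makes a routed result auditable.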

What about Mem0's claimed numbers?

Mem0 cites 66.9% on LOCOMO (not LongMemEval-S) with their "super memory" preset, and the 93.4% Mem0 v3 headline is a managed-platform number with no reader or judge disclosure; the reader model and config aren't always matched to the LongMemEval paper. I rerun Mem0 OSS in agentos-bench under controlled conditions and the reproduced numbers appear in the Part 4 reproducibility section. Differences between their claim and my reproduction are documented per-case.

How often are these numbers refreshed?

Quarterly. Next refresh date: 2026-08. Each refresh re-runs the bench against the upstream LongMemEval dataset at seed 42, refreshes the same-answer-LLM comparison table, and adds any new competitor entrants (VoltAgent, whichever vendor surfaces between now and then) that publish reproducible numbers. The bench is open; if you ship a memory library and want to be in the next refresh, open a PR with your adapter.

Built by Frame. AgentOS and agentos-bench are open source under Apache-2.0.


References

Vendor research pages cited in the comparison table

The vendor table inline-links each vendor's own published research. Those source links remain inline (per-row attribution rather than prose claims). Canonical entries:

Footnotes

  1. Wu, D., Wang, H., Yu, W., et al. (2024). LongMemEval: Benchmarking chat assistants on long-term interactive memory. ICLR 2025. https://arxiv.org/abs/2410.10813

  2. Wu, D., et al. LongMemEval: Open dataset, evaluation harness, and rubric. GitHub. https://github.com/xiaowu0162/LongMemEval

  3. Mastra. (2025). Observational Memory: Research and methodology. Mastra research blog. https://mastra.ai/research/observational-memory

  4. Penfield Labs. (2026, April). We audited LOCOMO: 64% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers. dev.to. https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg

  5. framerslab. agentos-bench: Open benchmark harness for AgentOS memory and retrieval. GitHub (Apache-2.0). https://github.com/framerslab/agentos-bench

  6. Supermemory. memorybench: Multi-judge benchmarking harness for LoCoMo, LongMemEval, and ConvoMem. GitHub. https://github.com/supermemoryai/memorybench

  7. Mem0. memory-benchmarks: LOCOMO and LongMemEval against Mem0 Cloud and OSS. GitHub. https://github.com/mem0ai/memory-benchmarks
