AgentOS is an open-source TypeScript AI agent runtime: cognitive memory, HEXACO personality traits, runtime tool forging, and 16 LLM providers, end-to-end, for building adaptive emergent agents. This post is about its cognitive memory system, and about two new LongMemEval results from it: both at the matched gpt-4o reader, both at full N=500.
LongMemEval-S: 85.6% at $0.0090 per correct answer, with 3.6-second median latency. That's +1.4 points above Mastra Observational Memory at gpt-4o (84.23%), previously the strongest auditable memory-library number at this reader. EmergenceMem's internal run reports 86.0% (0.4 points above us; cost not published). This makes AgentOS the highest published open-source number at the matched gpt-4o reader from a library that ships an end-to-end agent runtime around it.
LongMemEval-M: 70.2% at $0.0078 per correct answer. M is the harder variant: 1.5M tokens of conversation per question, 500 sessions per haystack, more than any production LLM context window. Of the 14 memory-library vendors we audited, none publishes an M number at all; the only published external M number comes from AgentBrain's closed-source SaaS, at 71.7%. AgentOS at 70.2% is competitive with the strongest published M results in the original LongMemEval paper (Wu et al., ICLR 2025, Table 3): +4.5 points above the paper's round-level reader-Top-5 configuration (65.7%), 1.2 points below its session-level Top-5 configuration (71.4%), and 1.8 points below its strongest GPT-4o result, 72.0% at round-level Top-10.
(M's $0.0078 per correct is lower than S's $0.0090 per correct, which sounds backwards because M's haystacks are 13× larger. But the reader never sees the haystack: retrieval narrows it first. M's headline config uses reader-Top-K=5, so the reader receives only 5 chunks per question regardless of how big the surrounding corpus is. S's headline runs a per-category classifier plus reader router, which sends some categories to gpt-4o with a larger top-K and adds one classifier call per case. M's per-call reader cost is therefore structurally smaller; the corpus size never enters the bill because the retriever absorbs it.)
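A toy model of why corpus size never enters the bill. Every constant below is an illustrative placeholder, not a measured value behind the $0.0090 / $0.0078 figures:

```typescript
// Toy reader-cost model. The price and token counts are illustrative
// placeholders, not the measured values behind the headline numbers.
const PRICE_PER_1K_INPUT_TOKENS = 0.0025; // hypothetical gpt-4o-class rate, USD

function perQuestionReaderCost(opts: {
  readerTopK: number;     // chunks the retriever hands the reader
  avgChunkTokens: number; // size of each retrieved chunk
  overheadTokens: number; // question + instructions
  classifierCost: number; // extra router call, if any
}): number {
  const inputTokens = opts.readerTopK * opts.avgChunkTokens + opts.overheadTokens;
  // Note what is absent: haystack size. The retriever absorbs the corpus.
  return (inputTokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + opts.classifierCost;
}

// M-style call: fixed top-K=5, no extra routing cost in this toy version.
const mCost = perQuestionReaderCost({ readerTopK: 5, avgChunkTokens: 400, overheadTokens: 500, classifierCost: 0 });        // ≈ $0.0063
// S-style call: larger top-K on some categories plus a classifier call per case.
const sCost = perQuestionReaderCost({ readerTopK: 10, avgChunkTokens: 400, overheadTokens: 500, classifierCost: 0.000138 }); // ≈ $0.0114
```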
Both numbers ship with per-case run JSONs at seed 42, so anyone can rerun the same configuration and compare per-question with our results. The runtime is Apache-2.0 at github.com/framersai/agentos; the benchmark harness is Apache-2.0 at github.com/framersai/agentos-bench. One CLI command at the bottom of this post reproduces each headline.
The rest of the post covers: the architecture changes that produced each number, the vendor landscape audit, the methodology checks that drive every number above, and the reproduction commands.
TL;DR for the busy reader
| Variant | AgentOS | Closest published competitor at matched reader | Cost-per-correct | License | Status |
|---|---|---|---|---|---|
| LongMemEval-S (115K tokens, 50 sessions) | 85.6% | EmergenceMem 86.0%, Mastra OM gpt-4o 84.23%, Supermemory 81.6% | $0.0090 | Apache-2.0 | +1.4 over Mastra |
| LongMemEval-M (1.5M tokens, 500 sessions) | 70.2% | AgentBrain 71.7% (closed-source SaaS). No other open-source library publishes M. | $0.0078 | Apache-2.0 | first open-source above 65% |
Full benchmarks reference · Reproducible run JSONs · Transparency audit framework
Part 1: LongMemEval-S at the gpt-4o reader
LongMemEval-S has 115K tokens of conversation per question and roughly 50 sessions per haystack. It fits in a single gpt-4o call. Every memory-library vendor with a public LongMemEval claim publishes on S.
The table below holds the reader model constant at gpt-4o, so the comparison isolates memory architecture from base-LLM capability. Full run at N=500 questions, gpt-4o-2024-08-06 as judge, rubric 2026-04-18.1 (judge false-positive rate 1%).
| System (gpt-4o-class reader) | Accuracy | $/correct | p50 latency | p95 latency | Source |
|---|---|---|---|---|---|
| EmergenceMem internal | 86.0% | not published | 5,650 ms | not published | emergence.ai |
| 🚀 AgentOS canonical-hybrid + reader-router | 85.6% | $0.0090 | 3,558 ms | 7,264 ms | this work |
| AgentOS prior reader-router with Tier 3 policy | 84.8% | $0.0410 | ~5,000 ms | 111,535 ms | prior |
| Mastra OM gpt-4o (gemini-flash observer) | 84.23% | not published | not published | not published | mastra.ai |
| AgentOS Tier 3 min-cost + sem-embed (gpt-4o only) | 83.2% | $0.0521 | ~5,000 ms | not published | prior |
| Supermemory gpt-4o | 81.6% | not published | not published | not published | supermemory.ai |
| EmergenceMem Simple Fast (rerun in agentos-bench) | 80.6% | $0.0586 | 3,703 ms | 9,200 ms | adapter |
| Zep self / independent reproduction | 71.2% / 63.8% | not published | not published | 632 ms p95 search | self / arXiv:2512.13564 |
AgentOS is 1.4 points above the Mastra OM gpt-4o number and 0.4 points below EmergenceMem Internal at the matched reader. Among open-source memory libraries that publish at gpt-4o and ship a methodology readers can audit (judge model, rubric, seed, per-case results), AgentOS is the highest published number.
Median latency: AgentOS p50 is 3,558 ms; EmergenceMem's published median is 5,650 ms. The remaining vendors do not publish per-case latency.
What changed: the policy router was removed
The 84.8% prior headline used a query-time policy router that picked between two retrieval modes per category. Four categories (SSA, SSU, TR, KU) ran through canonical-hybrid (BM25 + dense + cross-encoder rerank). The other two (MS, SSP) ran through observational-memory-v11, which compresses each session into a structured observation log at ingest and feeds the log to the reader instead of raw chunks.
That calibration was set when canonical-hybrid retrieval was hitting recall@10 around 0.62 with CharHash embeddings. After the switch to semantic embeddings (text-embedding-3-small), recall@10 measured at 0.981. At that recall level, the observation-log path strips verbatim temporal and preference detail that the gpt-5-mini reader needs, and reduces accuracy on MS and SSP rather than improving it.
The router was removed. Every category now flows through canonical-hybrid. A separate reader router (lightweight gpt-5-mini classifier, one extra LLM call per case at ~$0.000138) still picks the reader model per category.
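A schematic of the resulting flow. Names and shapes here are illustrative, not the exported AgentOS API:

```typescript
// Schematic of the post-router pipeline; names and shapes are illustrative,
// not the exported AgentOS API.
type Chunk = { id: string; text: string; score: number };
type Search = (query: string) => Promise<Chunk[]>;
type Rerank = (query: string, candidates: Chunk[], k: number) => Promise<Chunk[]>;

// Every category now takes this one path: BM25 leg + dense leg, fused,
// then cross-encoder rerank. The old per-category policy router that
// diverted MS/SSP to observation logs is gone.
function canonicalHybrid(bm25: Search, dense: Search, rerank: Rerank) {
  return async (query: string, topK = 10): Promise<Chunk[]> => {
    const [lexical, semantic] = await Promise.all([bm25(query), dense(query)]);
    const seen = new Map(lexical.concat(semantic).map((c) => [c.id, c] as const));
    return rerank(query, [...seen.values()], topK); // fusion detail elided
  };
}
```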
| Metric | Tier 3 PR + RR | Canonical + RR | Δ |
|---|---|---|---|
| Aggregate accuracy | 84.8% | 85.6% | +0.8 points (ranges overlap) |
| Total LLM cost (full N=500) | $17.38 | $3.84 | -$13.54 |
| Cost per correct | $0.0410 | $0.0090 | -$0.0320 per correct |
| Avg latency | 21,042 ms | 4,001 ms | -17,041 ms |
| p95 latency | 111,535 ms | 7,264 ms | -104,271 ms on tail |
| Recall@K=10 | 0.831 | 0.981 | +0.150 |
Cost at scale: at $0.0090 per memory-grounded answer, 1,000 RAG calls cost $9. A chatbot averaging 5 RAG calls per conversation across 1,000 conversations costs ~$45. The prior 84.8% configuration cost $0.0410 per correct ($41 per 1,000 calls; $205 per 1,000 conversations at the same usage).
Reader-router calibration
```typescript
export const MIN_COST_BEST_CAT_2026_04_28_TABLE = {
  preset: 'min-cost-best-cat-2026-04-28',
  mapping: {
    'temporal-reasoning': 'gpt-4o',            // +11.8 points on TR vs gpt-5-mini
    'single-session-user': 'gpt-4o',           // +4.3 points on SSU
    'single-session-preference': 'gpt-5-mini', // +23.4 points on SSP
    'single-session-assistant': 'gpt-5-mini',  // +1.8 points + cheaper
    'knowledge-update': 'gpt-5-mini',          // +1.5 points + cheaper
    'multi-session': 'gpt-5-mini',             // +3.5 points + cheaper
  },
};
```
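At query time the router resolves to a single table lookup after one cheap classifier call. A sketch; the `classify` wrapper stands in for the ~$0.000138 gpt-5-mini classifier call and is not a real export:

```typescript
// Sketch only: `classify` is assumed to be the gpt-5-mini classifier call
// described above, returning one of the six category keys.
type CategoryKey = keyof (typeof MIN_COST_BEST_CAT_2026_04_28_TABLE)['mapping'];

async function pickReader(
  question: string,
  classify: (q: string) => Promise<CategoryKey>,
): Promise<string> {
  const category = await classify(question);
  return MIN_COST_BEST_CAT_2026_04_28_TABLE.mapping[category];
}
```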
Per-category at the 85.6% headline:
| Category | Tier 3 PR + RR | Canonical + RR | Δ |
|---|---|---|---|
| single-session-assistant (n=56) | 100.0% | 98.2% | -1.8 points (within range) |
| single-session-user (n=70) | 91.4% | 94.3% | +2.9 points |
| knowledge-update (n=78) | 88.5% | 91.0% | +2.5 points |
| single-session-preference (n=30) | 86.7% | 86.7% | tied |
| temporal-reasoning (n=133) | 82.0% | 84.2% | +2.2 points |
| multi-session (n=133) | 75.2% | 74.4% | -0.8 points (within range) |
15 adjacent configurations all regressed
Each of the following single-variable variants was tested against the 85.6% baseline. None lifted the aggregate score.
| Probe | Result | Δ vs baseline |
|---|---|---|
| `--reader-top-k 30` | 81.5% Phase A | -3.7 points |
| `--hyde` | 83.3% Phase A | -1.9 points |
| `--rerank-candidate-multiplier 5` | 75.9% Phase A | -9.3 points |
| `--retrieval-config-router minimize-cost-augmented` | 77.8% Phase A | -7.4 points |
| `--policy-router-preset balanced` | 74.1% Phase A | -11.1 points |
| `--policy-router-preset maximize-accuracy` | 83.3% Phase A | -1.9 points |
| `text-embedding-3-large` | 83.4% Phase B | -2.2 points at 20× slower latency |
| `--om-classifier-model gpt-4o` | 84.0% Phase B | -1.6 points at +44% cost |
| `--rerank-model rerank-v4.0-pro` | 84.6% Phase B | -1.0 points; 5/6 categories regress |
| `--reader-router min-cost-best-cat-gpt5-tr-2026-04-29` | 83.2% Phase B | -2.4 points; TR drops 84.2% → 80.5% |
Fifteen variants tested across Phase A and Phase B; fifteen regressions. The 85.6% configuration is a local optimum in the tested parameter space.
Part 2: LongMemEval-M at the gpt-4o reader
LongMemEval-M has 1.5M tokens of conversation per question and roughly 500 sessions per haystack. It exceeds every production LLM context window, which forces evaluation through retrieval rather than prompt-stuffing.
What LongMemEval is, and what M means
LongMemEval is an academic memory benchmark introduced in "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory" by Wu et al., published at ICLR 2025. The dataset, evaluation harness, and rubric are open source at github.com/xiaowu0162/LongMemEval. The 12 paper authors are academic researchers, none affiliated with a memory-library vendor.
The benchmark ships two variants by haystack scale:
| Variant | Tokens per haystack | Sessions per haystack | Fits in production context window? |
|---|---|---|---|
| S | ~115K | ~50 | Yes. Every modern long-context LLM fits this. GPT-4o is 128K, Claude Opus is 200K, Gemini 3 Pro is 1M, GPT-5 is 400K |
| M | ~1.5M | ~500 | No. Exceeds every production context window |
The S-to-M jump is a category change rather than a 13× scaling exercise. At S scale a memory architecture competes against the option of dumping the full conversation into the context window. Mastra's full-context baseline at gpt-4o is 60.20%, and their Observational Memory configuration at the same model is 84.23%; the 24-point lift partly reflects token compression rather than memory architecture, because the OM config fits in fewer tokens and the reader has less to process. Penfield Labs makes the same point in their April 2026 LOCOMO audit: when the corpus fits in the context window, the benchmark is partly measuring context-window management.
At M scale the corpus does not fit in context. Retrieval is the only path, and the benchmark measures whether retrieval can find the relevant chunks in roughly 25,000 candidates across 500 sessions.
The vendor landscape on LongMemEval, audited 2026-04-29
The table below covers every memory library or platform with a public LongMemEval claim found in research pages, blog posts, GitHub repos, and peer-reviewed papers as of 2026-04-29.
| Vendor | License | Their published S number | Their published M number |
|---|---|---|---|
| Mem0 v3 / Mem0 OS | Apache-2.0 | 92-93.4% | not published |
| Mastra Observational Memory | Apache-2.0 | 84.23-94.87% | not published |
| Hindsight (vectorize.io) | open repo | 91.4% | not published |
| Neutrally | public production system | 89.4% (LongMemEval repo issue submission, gpt-4o judge) | not published |
| Zep / Graphiti | Apache-2.0 | 71.2% (independently reproduced at 63.8%) | not published |
| EmergenceMem | open Python | 79-86% | not published |
| Supermemory | open | 81.6-99% | not published |
| MemMachine | open repo | 93% | not published |
| Memoria | open | 88.78% | not published |
| agentmemory (JordanMcCann) | MIT | 96.2% (no methodology) | not published |
| Backboard | open | 93.4% | not published |
| ByteRover | closed | 92.8% | not published; explicitly notes "M scales beyond any context window" |
| Letta (formerly MemGPT) | Apache-2.0 | not published on LongMemEval | not published |
| Cognee | Apache-2.0 | not published on LongMemEval | not published |
| AgentBrain | closed-source SaaS | not published | 71.7% (Test 0; requires hosted Brain endpoint to reproduce) |
| agentos-bench (this work) | Apache-2.0 | 85.6% | 70.2% |
The full per-vendor audit is at packages/agentos-bench/docs/COMPETITOR_METHODOLOGY_AUDIT_2026-04-24.md.
Three operational barriers to publishing on M
- Context window. S (115K tokens) fits in every modern long-context LLM. M (1.5M tokens) exceeds every production context window. Architectures that rely on prompt-stuffing or compression-then-stuffing score lower on M than on S.
- Dataset loading. `longmemeval_m.json` is 2.7 GB. Node's V8 engine has a max-string-length cap that rejects `fs.readFile` on a file of that size. The streaming fix is `chain([createReadStream, parser(), streamArray()])` from `stream-json` + `stream-chain`, routed by a file-size probe at >1 GB (see the sketch after this list). STAGE_J_BLOCKED_2026-04-25.md records the workaround.
- Run cost. A memory-augmented full-context M run consumes ~750M input tokens at GPT-4o-128K pricing, roughly $1,250 per run. Retrieval-augmented M runs are $5–$15.
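A minimal version of that streaming path, assuming the dataset file is one top-level JSON array of question objects. The size-based routing shown is a sketch of the idea, not the harness's exact code:

```typescript
import { createReadStream, readFileSync, statSync } from 'node:fs';
import { chain } from 'stream-chain';
import { parser } from 'stream-json';
import { streamArray } from 'stream-json/streamers/StreamArray';

const GIB = 1024 ** 3;

// Files past V8's max-string length can't go through readFileSync + JSON.parse,
// so route by size: small files take the simple path, big ones stream.
async function loadDataset(path: string): Promise<unknown[]> {
  if (statSync(path).size <= 1 * GIB) {
    return JSON.parse(readFileSync(path, 'utf8'));
  }
  const items: unknown[] = [];
  const pipeline = chain([createReadStream(path), parser(), streamArray()]);
  pipeline.on('data', ({ value }) => items.push(value)); // one parsed element at a time
  await new Promise<void>((resolve, reject) => {
    pipeline.on('end', resolve);
    pipeline.on('error', reject);
  });
  return items;
}
```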
What the LongMemEval paper reports on M
Wu et al., Table 3 of arXiv:2410.10813, reports academic-baseline configurations on LongMemEval-M. The strongest configuration in the paper:
72.0% on LongMemEval-M with GPT-4o + Stella V5 retriever + Value=Round + K=V+fact + Top-10.
The same Table 3 reports several other GPT-4o configurations: Round Top-5 K=V+fact at 65.7%, Session Top-5 K=V+fact at 71.4%, Session Top-10 K=V+fact at 70.0%. At Top-5 retrieval, the configuration AgentOS uses, the paper's GPT-4o results span 65.7% (round-level) to 71.4% (session-level); the 72.0% at round-level Top-10 is the strongest across all GPT-4o configurations.
Where we land
| System | Accuracy | 95% range | License | Source |
|---|---|---|---|---|
| LongMemEval paper, strongest GPT-4o result | 72.0% (round, Top-10) / 71.4% (session, Top-5) | not published | open repo | Wu et al., ICLR 2025, Table 3 |
| AgentBrain | 71.7% (Test 0) | not published | closed-source SaaS | github.com/AgentBrainHQ |
| 🚀 AgentOS (sem-embed + reader-router + top-K=5) | 70.2% | published in run JSONs | Apache-2.0 | agentos-bench |
| Mem0 v3, Mastra OM, Hindsight, Zep, EmergenceMem, Supermemory, MemMachine, Memoria, agentmemory, Backboard, ByteRover, Letta, Cognee | not published | not published | various | report S only |
AgentOS at 70.2% is competitive with the strongest published M results in the LongMemEval paper. The paper's strongest GPT-4o result is 72.0% at round-level Top-10; at matched Top-5 retrieval the paper's results span 65.7% (round) to 71.4% (session). The closest published external number is AgentBrain's 71.7% from their closed-source SaaS, which requires access to a hosted endpoint to reproduce. agentos-bench publishes per-case run JSONs and a one-line CLI reproduction at github.com/framersai/agentos-bench.
Step-by-step: 30.6% to 70.2%
Each row below is a single configuration change against the prior row. A confidence range was computed for each configuration; we count a step as a real lift only when its range does not overlap the prior step's range.
| Date | Configuration | Aggregate | Lift |
|---|---|---|---|
| 2026-04-25 | Tier 1 canonical (CharHash, top-K=20) | 30.6% | baseline |
| 2026-04-26 | M-tuned (HyDE + top-K=50 + rerank-mult=5, CharHash) | 45.4% | +14.8 points |
| 2026-04-29 | M-tuned + sem-embed + reader-router (top-K=50) | 57.6% | +12.2 points |
| 2026-04-29 | M-tuned + sem-embed + reader-router + top-K=5 | 70.2% | +12.6 points |
Each row's confidence range is disjoint from the prior row's. Cost per correct dropped from $0.1348 to $0.0078, a 17× reduction.
The change that produced the headline: `--reader-top-k 5`
The prior 57.6% configuration ran with `--reader-top-k 50`. The LongMemEval paper's strongest M configuration (Table 3) uses top-5. With `--reader-top-k 5` and every other parameter held constant:
| Metric | Top-K=50 | Top-K=5 | Δ |
|---|---|---|---|
| Aggregate accuracy | 57.6% | 70.2% | +12.6 points; ranges disjoint |
| Cost per correct | $0.0505 | $0.0078 | -$0.0427 per correct |
| Avg latency | 264,933 ms | 83,711 ms | -181,222 ms |
M cost at scale: at $0.0078 per correct over a 1.5M-token haystack, 1,000 RAG calls cost $7.80. The prior top-K=50 configuration cost $0.0505 per correct, or $50.50 per 1,000 calls.
A LongMemEval-M haystack contains ~1.5M tokens spread across 500 sessions, producing ~25,000 candidate chunks. At top-K=50, the reader receives the cross-encoder's top picks plus 45 chunks of progressively lower confidence. Chunks ranked 6–50 frequently share lexical surface with the query but do not contain the answer; their inclusion lowers the reader's signal-to-noise ratio. Liu et al. (2024), "Lost in the Middle" reports the same shape of failure at the long-context-LLM level.
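Mechanically, the flag is nothing exotic: it is the final cut applied after the cross-encoder has ordered the candidates. A sketch, not the harness internals:

```typescript
type Scored = { text: string; rerankScore: number };

// The ~25,000 candidate chunks are already narrowed by BM25 + dense retrieval
// and ordered by the cross-encoder; --reader-top-k only controls the final cut.
function buildReaderContext(ranked: Scored[], readerTopK: number): string {
  // Ranks 6–50 often match the query lexically without containing the answer,
  // so a smaller cut raises the reader's signal-to-noise ratio.
  return ranked
    .slice(0, readerTopK)
    .map((c, i) => `[${i + 1}] ${c.text}`)
    .join('\n\n');
}
```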
Per-category at the 70.2% M headline
| Category | Top-K=50 | Top-K=5 | Δ |
|---|---|---|---|
| temporal-reasoning (n=133) | 42.1% | 66.2% | +24.1 points |
| single-session-preference (n=30) | 40.0% | 63.3% | +23.3 points |
| multi-session (n=133) | 29.3% | 48.9% | +19.6 points |
| knowledge-update (n=78) | 76.9% | 78.2% | +1.3 points |
| single-session-assistant (n=56) | 96.4% | 96.4% | tied |
| single-session-user (n=70) | 95.7% | 91.4% | -4.3 points (within range) |
Four one-knob probes all regressed on M
Each variant was tested as a single-variable change on top of the 70.2% configuration.
| Probe | Aggregate | Δ | Verdict |
|---|---|---|---|
| `--reader-top-k 3` | 65.2% | -5.0 points; ranges disjoint | refuted |
| `--hyde off` | 69.2% | -1.0 points; tied within range | marginal |
| `--rerank-candidate-multiplier 10` | 60.0% | -10.2 points; ranges disjoint | catastrophically refuted |
| `--two-call-reader` (Chain-of-Note) | 58.6% | -11.6 points; ranges disjoint | refuted |
Top-K=5 with HyDE on and rerank-multiplier 5 is the local optimum in the tested parameter space.
Part 3: Per-category retrieval failure at M scale
Per-category accuracy at the 30.6% M baseline indicates which question types degrade most when retrieval operates over 500 sessions instead of 50.
| Category | n | M accuracy | S baseline | Δ at M scale |
|---|---|---|---|---|
| single-session-user | 70 | 60.0% | 97.1% | -37.1 points |
| single-session-assistant | 56 | 50.0% | 89.3% | -39.3 points |
| knowledge-update | 78 | 50.0% | 86.8% | -36.8 points |
| multi-session | 133 | 18.0% | 61.7% | -43.7 points |
| temporal-reasoning | 133 | 12.8% | 70.2% | -57.4 points |
| single-session-preference | 30 | 10.0% | 63.3% | -53.3 points |
| Aggregate | 500 | 30.6% | 76.6% | -46.0 points |
Multi-session and temporal-reasoning together account for 53% of all M cases and show the largest S→M deltas (-43.7 points and -57.4 points respectively). Across 500 candidate sessions instead of 50, the relevant session frequently does not make the top-20 under CharHash + BM25 + Cohere rerank.
The cumulative +39.6-point lift from baseline to 70.2% came from three independent axes: M-tuned retrieval flags (+14.8 points), semantic embedder switch (+12.2 points, folded with reader router), and --reader-top-k 5 (+12.6 points). Multi-session moved from 18% to 48.9% (+30.9 points); temporal-reasoning from 12.8% to 66.2% (+53.4 points).
Mastra's gpt-5-mini configuration scores above their gpt-4o configuration
Mastra publishes 84.23% on LongMemEval-S with gpt-4o as the reader, and 94.87% with gpt-5-mini as the reader. The architectural reason for this ordering is that the gpt-5-mini configuration moves reasoning upstream of the reader, into ingest-time observers and reflectors.
In Mastra's gpt-4o-only configuration, the reader handles retrieval-context parsing, cross-session reasoning, and answer generation in a single query-time LLM call.
In Mastra's gpt-5-mini + Observational Memory configuration, the pipeline does additional ingest-time work. A gemini-2.5-flash "observer" runs over each session at ingest and extracts structured observation logs. A second gemini-2.5-flash "reflector" synthesizes the observations into long-term cross-session insights. At query time, the gpt-5-mini reader answers from the pre-distilled observation log plus reflections rather than raw chunks. The architecture is documented on Mastra's Observational Memory research page and described in VentureBeat's coverage.
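Schematically, the pattern looks like the sketch below. This is our reading of the public description; none of the names or prompts are Mastra's actual code:

```typescript
// Schematic of the observer/reflector pattern as publicly described;
// names and prompts here are our illustration, not Mastra's code.
type LlmCall = (prompt: string) => Promise<string>;

async function ingestSession(turns: string[], observer: LlmCall): Promise<string> {
  // Ingest-time "observer" (gemini-2.5-flash in Mastra's description):
  // compress one session into a structured observation log.
  return observer(`Extract dated, structured observations:\n${turns.join('\n')}`);
}

async function reflectAcrossSessions(logs: string[], reflector: LlmCall): Promise<string> {
  // Ingest-time "reflector": synthesize cross-session insights from the logs.
  return reflector(`Synthesize long-term insights:\n${logs.join('\n---\n')}`);
}

// Query time: the gpt-5-mini reader answers from logs + reflections rather
// than raw chunks, so most reasoning already happened upstream of the reader.
```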
The 94.87% is excluded from the matched-reader tables in this post for two reasons:
- AgentOS at the same stack (`gpt-5-mini` reader + `gemini-2.5-flash` observer) on LongMemEval-S Phase A produced 70.4%, a 24-point gap from the published headline. The methodology disclosed on Mastra's research page does not contain enough detail for direct reproduction.
- No confidence range is published on the 94.87%. Mastra's 84.23% `gpt-4o` headline falls inside the AgentOS 95% range; the two configurations are statistically tied at this resolution.
The matched-reader comparison at the top of this post uses gpt-4o on both sides.
Part 4: Reproducibility issues in the published vendor record
This part documents the methodology checks the agentos-bench harness applies before publishing any number, and the gaps those checks expose in other vendors' published numbers.
LOCOMO answer-key and judge error rates
Penfield Labs (April 2026) audited LOCOMO and reported 99 errors in 1,540 answer-key entries (6.4% ground-truth error rate), and a 62.81% false-positive rate on LOCOMO's default LLM judge against an intentionally-wrong topically-adjacent answer set.
Implications:
- LOCOMO scores above 93.6% include benefit from answer-key errors.
- LOCOMO score differences below ~6 points are within the judge false-positive band.
For context, Northcutt et al. (NeurIPS 2021) found a 3.3% label-error rate sufficient to destabilize benchmark rankings.
LongMemEval-S overlap with current context windows
LongMemEval-S uses 115K tokens of conversation per question. GPT-4o (128K), Claude 3.5 (200K), Gemini 1.5 Pro (1M), and GPT-5 (400K) all hold the full S corpus in a single prompt.
Mastra's full-context baseline at gpt-4o is 60.20%; their Observational Memory configuration at the same model is 84.23%. The 24-point delta partly reflects token-level compression rather than retrieval-architecture quality.
The M variant exceeds every production context window, removing this confound.
Cross-vendor reimplementation discrepancies
In May 2025, Mem0 published a research paper positioning their product as state-of-the-art on LOCOMO. Their comparison table scored Zep at 65.99%. Zep responded, reran the evaluation with their own configuration, and reported 75.14% ±0.17 for Zep. Zep attributed the gap to Mem0 running Zep with sequential search instead of concurrent search.
Zep's self-reported LongMemEval-S number is 71.2% at gpt-4o, from their SOTA blog post. An independent reproduction at arXiv:2512.13564 measured Zep at 63.8%, a 7.4-point gap.
Notable methodology-disclosure findings
- EmergenceMem "Simple Fast" hardcodes `top_k=42` in retrieval.
- Mastra's research page publishes 84.23% at `gpt-4o`; the observer and reflector models in the same configuration are `gemini-2.5-flash` (cross-provider).
- Mem0's research page reports 92.0% on LongMemEval; their research-2 page reports 93.4% on the same benchmark.
- MemPalace published 100% claims on LongMemEval (retrieval recall@5, not end-to-end QA) and LOCOMO (`top_k=50` over Claude Sonnet, exceeding the corpus and reducing the test to context-window QA). Documented in HackerNoon's post-mortem.
What competitors actually publish on 12 transparency axes
| Transparency axis | Mem0 | Mastra | Supermemory | Zep | Emergence | Letta | MemPalace | AgentOS |
|---|---|---|---|---|---|---|---|---|
| Aggregate accuracy | yes | yes | yes | yes | yes | partial | yes | yes |
| 95% confidence range on headline | no | no | no | partial | no | no | no | yes |
| Per-category 95% range | no | no | no | no | no | no | no | yes |
| Reader model disclosed | no | yes | partial | yes | yes | no | no | yes |
| Observer / ingest model disclosed | no | yes | no | yes | yes | no | no | yes |
| USD cost per correct | no | no | no | no | no | no | no | yes |
| Latency avg / p50 / p95 | no | no | no | partial | median only | no | no | yes |
| Per-category breakdown | no | yes | yes | yes | yes | partial | no | yes |
| Open-source benchmark runner | yes | partial | yes | partial | yes | no | partial | yes |
| Per-case run JSONs at fixed seed | no | no | no | no | no | no | no | yes |
| Judge-adversarial probe | no | no | no | no | no | no | no | yes |
| Matched-reader cross-vendor table | no | no | partial | partial | yes | no | no | yes |
Judge FPR comparison
| Benchmark | AgentOS judge FPR | LOCOMO default judge FPR (Penfield audit) |
|---|---|---|
| LongMemEval-S | 1% [0%, 3%] | not measured |
| LongMemEval-M | 2% [0%, 5%] | not measured |
| LOCOMO | 0% [0%, 0%] | 62.81% |
LOCOMO's default gpt-4o-mini judge measures at 62.81% FPR on the Penfield adversarial set. agentos-bench LOCOMO runs use gpt-4o-2024-08-06 with rubric 2026-04-18.1, which measures at 0% FPR on the same adversarial set.
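The probe itself has a simple shape. A sketch, with `judgeAccepts` standing in for one rubric-guided judge call; it is not a real agentos-bench export:

```typescript
// Sketch of a judge false-positive probe; judgeAccepts stands in for one
// rubric-guided judge call and is not a real agentos-bench function.
type AdversarialCase = { question: string; goldAnswer: string; wrongAnswer: string };

async function judgeFalsePositiveRate(
  cases: AdversarialCase[],
  judgeAccepts: (q: string, gold: string, candidate: string) => Promise<boolean>,
): Promise<number> {
  let accepted = 0;
  for (const c of cases) {
    // Every candidate here is wrong by construction (topically adjacent to
    // the gold answer), so any acceptance is a judge false positive.
    if (await judgeAccepts(c.question, c.goldAnswer, c.wrongAnswer)) accepted++;
  }
  return accepted / cases.length;
}
```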
Part 5: Reproducing both headlines
OpenAI and Cohere API keys are required.
LongMemEval-S 85.6% headline
```bash
git clone https://github.com/framersai/agentos-bench
cd agentos-bench
pnpm install && pnpm build

# Set OPENAI_API_KEY and COHERE_API_KEY in your environment
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-s \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000
```
LongMemEval-M 70.2% headline
```bash
NODE_OPTIONS="--max-old-space-size=8192" pnpm exec tsx src/cli.ts run longmemeval-m \
  --reader gpt-4o \
  --memory full-cognitive --replay ingest \
  --hybrid-retrieval --rerank cohere --rerank-candidate-multiplier 5 \
  --reader-top-k 5 \
  --hyde \
  --embedder-model text-embedding-3-small \
  --reader-router min-cost-best-cat-2026-04-28 \
  --concurrency 5 \
  --bootstrap-resamples 10000
```
Both runs emit per-case run JSONs under seed=42. Cross-run comparison is possible against the leaderboard at packages/agentos-bench/results/LEADERBOARD.md.
Architecture
The AgentOS memory decomposition follows the CoALA framework (Sumers et al., 2023): explicit memory partitions and a decision-making module selecting a strategy per query. The MemoryRouter corresponds to the CoALA memory module; the ReaderRouter corresponds to the decision module.
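In interface terms, a sketch of that decomposition. The shapes are illustrative, not the exported AgentOS types:

```typescript
// Sketch of the CoALA-style split; interface shapes are illustrative,
// not the exported AgentOS types.
interface MemoryRouter {
  // CoALA memory module: owns the partitions and answers retrieval queries.
  ingest(sessionId: string, text: string): Promise<void>;
  retrieve(query: string, topK: number): Promise<{ text: string; score: number }[]>;
}

interface ReaderRouter {
  // CoALA decision module: picks a strategy (here, a reader model and
  // retrieval budget) per query.
  route(query: string): Promise<{ model: string; topK: number }>;
}
```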
The closest comparable architecture in the published record is Letta (formerly MemGPT, Packer et al., 2023), which models the LLM as a virtual operating system with paged memory. Letta has not published a LongMemEval number under the post-MemGPT branding.
Eight cognitive-memory mechanisms underlie the architecture (Ebbinghaus decay, reconsolidation, retrieval-induced forgetting, feeling-of-knowing, gist extraction, schema encoding, source confidence decay, emotion regulation) and are documented with primary-source citations in Cognitive Memory for AI Agents.
Remaining headroom
Multi-session is the lowest per-category score on both variants. On M it measures at 48.9% (up from 29.3% under top-K=50), against a per-category ceiling of ~96% on SSA/SSU. On S it measures at 74.4% against the 85.6% aggregate. The MS category requires bridge queries across distinct sessions; pure retrieval-broadening does not close the gap.
Two candidate v2 mechanisms are queued:
- Stage E: Hindsight 4-network typed-observer, adding a typed-graph signal orthogonal to BM25 + dense + Cohere rerank. Architecture follows Hindsight (vectorize.io, 2025).
- K=V+fact key augmentation (Wu et al., Table 3 configuration): index sessions by raw content and by extracted facts, with dual-key vector lookup (sketched below). The Phase B `--rerank-candidate-multiplier 10` ablation regressed -10.2 points on the same retrieval-heavy categories K=V+fact would affect, suggesting bounded expected lift.
Methodology disclosures
What's apples-to-apples in the comparisons above:
- Same `gpt-4o` reader as Mastra OM gpt-4o, Supermemory gpt-4o, EmergenceMem.
- Same benchmark dataset (LongMemEval-S, 500 cases; LongMemEval-M, 500 cases).
- Same judge harness (`gpt-4o-2024-08-06` with rubric `2026-04-18.1`); judge false-positive rate 1% on S, 2% on M, 0% on LOCOMO.
- 95% confidence ranges at 10,000 resamples (sketched below); most vendors don't publish ranges at all.
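For reference, the shape of the percentile bootstrap behind those ranges. A generic sketch over per-case 0/1 correctness, not the harness code (the harness also seeds its RNG, elided here):

```typescript
// Generic percentile-bootstrap sketch over per-case correctness;
// not the agentos-bench implementation (seeded RNG elided).
function bootstrapRange(correct: boolean[], resamples = 10_000): [number, number] {
  const n = correct.length;
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let hits = 0;
    for (let i = 0; i < n; i++) {
      if (correct[Math.floor(Math.random() * n)]) hits++; // resample with replacement
    }
    means.push(hits / n);
  }
  means.sort((a, b) => a - b);
  // 95% range: 2.5th and 97.5th percentiles of the resampled accuracies.
  return [means[Math.floor(0.025 * resamples)], means[Math.floor(0.975 * resamples)]];
}
```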
What isn't, with caveats:
- Cost and latency comparisons against Mastra, Supermemory, and EmergenceMem aren't directly measurable, because those vendors don't publish $/correct or per-case latency. The cost and latency wins above are AgentOS-internal versus prior AgentOS configurations.
- Mastra's 94.87% headline uses `gpt-5-mini` + `gemini-2.5-flash` observer. We can't reproduce it from their public methodology page, so it sits outside our matched-reader table.
- Mem0 v3's 93.4% is a managed-platform number with no published confidence range, no judge disclosure, and no reader disclosure. Their own State of AI Agent Memory 2026 post reports 66.9% on LOCOMO for the production stack.
- Hindsight's 91.4% uses `gemini-3-pro` as reader. Supermemory's 85.2% uses `gemini-3-pro` as reader. Both are cross-provider, so they sit outside the matched-`gpt-4o` table.
- Managed-platform numbers (Mastra, Mem0 v3, agentmemory) run on infrastructure with platform-specific optimizations that aren't necessarily portable.
Evaluating memory libraries
Three open-source benchmark harnesses cover the LongMemEval / LOCOMO space:
- agentos-bench: LongMemEval-S/M, LOCOMO, BEAM, and eight cognitive-mechanism micro-benchmarks. 95% confidence ranges, judge-adversarial probes, per-stage retention metric, per-case run JSONs at `--seed 42`.
- Supermemory memorybench: LoCoMo, LongMemEval, ConvoMem against Supermemory, Mem0, and Zep with multi-judge support.
- Mem0 memory-benchmarks: LOCOMO and LongMemEval against Mem0 Cloud and OSS.
Reproducible memory benchmarks require a published seed, configuration, and per-case run JSONs alongside the headline number.
Further reading
- Full benchmarks reference: canonical comparison tables, methodology disclosure matrix, LOCOMO judge-FPR data.
- Cognitive Memory for AI Agents: Beyond RAG: the eight cognitive-memory mechanisms behind AgentOS, with primary-source citations.
- agentos-bench v1 evaluation matrix: per-cell run JSONs.
- docs.agentos.sh/blog: engineering writeups including M-series intermediate stages (45.4%, 57.6%), Stage L/I negative findings, ingest-router executor design, and memory archive rehydration.
Built by Manic Agency LLC / Frame.dev. AgentOS and agentos-bench are open source under Apache-2.0. GitHub · npm · Discord