The Episodic Memory Schema: Toward a Formal Specification for Temporally Grounded Human Experience Datasets

Mpalo Research Team

Memory Systems · Dataset Design · Cognitive Science · AI Infrastructure

The field of AI memory infrastructure has matured rapidly, yet the datasets used to train, benchmark, and evaluate memory systems remain structurally inadequate for the problem they purport to address. Existing large-scale conversational corpora — including LMSYS-Chat-1M, ShareGPT, and the datasets underlying current memory system benchmarks — model human-AI interaction as sequences of transactional exchanges rather than as temporally extended experiential continuity. This paper argues that this structural inadequacy is not incidental but symptomatic of a foundational category error: current datasets treat memory as a retrieval problem over facts, when the empirical and philosophical evidence establishes that human memory is a temporal reconstruction problem over experience. We introduce the Episodic Memory Schema (EMS), a formal field specification for datasets capable of supporting genuine episodic memory infrastructure. EMS defines seven primitive field categories — temporal anchoring, experiential content, affective valence, personal relevance gradient, contextual embedding, consolidation state, and decay trajectory — and provides the theoretical justification for each. We situate EMS within the growing body of related work on episodic memory benchmarks and architectural representations, argue that its specific contribution as a mandatory training-data field specification is distinct from existing benchmark schemas and architectural proposals, and assess existing datasets against EMS criteria. Our central finding is that no existing AI-oriented dataset achieves minimum EMS compliance. We propose schema-first publication as the appropriate near-term contribution: establishing the correct problem formulation before any dataset of sufficient scale can exist, thereby providing a shared specification language for a generation of episodic memory research.

Keywords: episodic memory, dataset schema, temporal grounding, AI memory infrastructure, conversational corpora, autobiographical memory, experiential continuity, memory systems


1. Introduction

A strange thing happened in the development of AI memory infrastructure: the field began building systems before it had adequately specified what memory is. The systems that emerged reflect this gap. They are technically sophisticated in some respects and philosophically naive in others, and the naivety is structural — built into the datasets on which these systems are trained and the benchmarks against which they are evaluated.

The core problem can be stated simply. Human memory is not a database of facts. It is a temporally extended fabric of experience — events, emotions, contexts, and personal meaning — woven together by the continuous thread of a particular person's life. When an AI memory system retrieves “the user prefers tea to coffee,” it has recalled a fact. It has not remembered an experience. The distinction matters because the products being built with memory infrastructure are not query engines over user preferences. They are agents expected to understand a user's history, anticipate their current context, and provide continuity across the temporal arc of a human life. For these products, fact recall is necessary but far from sufficient.

The datasets on which current memory systems are trained cannot support this ambition. LMSYS-Chat-1M contains approximately one million conversations collected over five months. ShareGPT contains anonymized ChatGPT conversations published voluntarily by users. Multi-Session Chat, one of the few datasets designed specifically for extended human-AI interaction, contains synthetic dialogues generated without temporal grounding. The memory benchmarks built on top of these datasets — LoCoMo, LongMemEval — represent genuine technical advances, but they measure performance on tasks derived from the structure of conversational artifacts rather than from the structure of human episodic memory. The gap between what these benchmarks measure and what episodic memory systems actually need to do has begun to attract attention from multiple research groups (Huet et al., 2025; Memory Bear AI, 2026; MemBench, 2025). This paper's contribution is not to be the first to notice the gap, but to provide the first formal specification of what a dataset must contain to close it.

The contribution of this paper is the Episodic Memory Schema (EMS): a mandatory field-level specification for training datasets intended to support genuine episodic memory infrastructure. EMS defines the primitive field categories that an episodic memory dataset must contain as first-class fields, provides theoretical justification for each primitive grounded in cognitive science, neuroscience, and the philosophy of memory, and assesses existing datasets against EMS criteria. Existing work in this space — including benchmark schemas (EpBench), architectural representations (Memory Bear AI's Emotion Memory Unit), and cognitive architecture frameworks (ACT-R, Soar, CoALA) — provides overlapping components but does not frame its contribution as a training-data specification requiring all seven primitives simultaneously as mandatory fields. This framing distinction is where EMS makes its claim.

A preliminary note on scope is warranted. EMS is concerned specifically with memory of personal experience — what cognitive scientists call episodic memory, following Tulving's foundational distinction (Tulving, 1972). It is not a schema for semantic memory, procedural memory, or working memory. These are distinct cognitive systems with distinct computational requirements, and conflating them has been a source of systematic confusion in the field.

2. What human memory actually is: the cognitive science baseline

Any specification for episodic memory datasets must begin with an account of episodic memory as an empirical phenomenon. The cognitive science literature on this question is rich, well-established, and largely ignored by the AI memory systems community.

Endel Tulving introduced the distinction between episodic and semantic memory in 1972, proposing that episodic memory — memory for personally experienced events — is a distinct system from semantic memory — memory for general world knowledge (Tulving, 1972). The concept of autonoetic consciousness — the phenomenological sense of mental time travel, of re-experiencing a past event from one's own perspective — was introduced in a subsequent paper (Tulving, 1985). On this account, episodic memory is not simply the storage of information about past events. It is the subjective capacity to project oneself into a past temporal context.

Conway and Pleydell-Pearce's self-memory system model (2000) proposes that autobiographical memory is organized hierarchically: at the apex, lifetime periods (school years, a period of employment, a relationship); below these, general events (repeated events or extended episodes); and at the base, event-specific knowledge — the granular, sensory-perceptual details of particular moments. Memory retrieval traverses this hierarchy in both directions, with the conceptual self exerting top-down control over what is retrieved and how it is reconstructed.

The reconstructive nature of episodic memory is among its most important properties. Memory is not reproductive — it does not replay stored recordings. It is reconstructive: each retrieval actively rebuilds the memory from fragments, schemas, and current context (Bartlett, 1932; Schacter, 2001). The same event, retrieved on different occasions, may be reconstructed differently. New information can alter the representation of past events. The meaning of a memory is not fixed at encoding but evolves as the remembering subject evolves.

Affective valence is constitutive of episodic memory, not incidental to it. The amygdala modulates hippocampal encoding of emotionally significant events through a beta-adrenergic mechanism, explaining the enhanced consolidation and durability of emotionally charged memories (McGaugh, 2004). Any schema for episodic memory that omits emotional significance as a first-class field is structurally incomplete: the encoding, consolidation, and retrieval dynamics of a memory are partly determined by its affective properties at the time of experience.

Temporal structure is the most distinctive feature of episodic memory relative to other memory systems. Episodic memories are not merely tagged with timestamps. They are ordered along a personal timeline, and their meaning is partially constituted by their temporal relations to other memories. Friedman (1993) reviews the major theoretical frameworks for memory of time and identifies core distinctions between processes that reconstruct temporal position from contextual cues and processes that rely on direct temporal distance estimation. None of these are captured by a simple timestamp field.

Personal relevance — the degree to which a memory is connected to the remembering subject's goals, values, identity, and ongoing concerns — is another constitutive dimension. Conway's self-memory system model proposes that the conceptual self acts as a control structure on memory retrieval: memories consistent with the self-concept are more accessible; memories that threaten the self-concept may be actively suppressed or reconstructed to reduce dissonance (Conway & Pleydell-Pearce, 2000).

Consolidation and decay operate differently in episodic memory than in other cognitive domains. The standard model of systems consolidation holds that newly formed episodic memories initially depend on the hippocampus for retrieval but are gradually transferred to cortical networks over weeks to years (Frankland & Bontempi, 2005). Consolidation is accompanied by a change in representational form: from rich, contextually specific, episodic representations to more schematized, semantic, gist-based representations. The episodic specificity of a memory decreases over time; what remains is meaning rather than detail.

This empirical picture presents an immediate challenge to any AI system that attempts to implement episodic memory through vector similarity search over text. Vectors capture semantic similarity between linguistic representations; they do not capture temporal position, affective valence, personal relevance gradients, consolidation state, or decay trajectory. These are structural properties of the phenomenon, not implementation details that can be engineered around. A memory system that lacks mechanisms for representing and operating over these dimensions is not, in any epistemically meaningful sense, an episodic memory system. It is a semantically indexed fact store with a temporal metadata field.
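The contrast can be made concrete with a small sketch. The Python function below scores a candidate memory by combining semantic similarity with three of the dimensions identified above as structurally missing from vector search: temporal decay, affective arousal, and personal relevance. Every name, weight, and the exponential-decay form is an illustrative assumption, not a description of any existing system.

```python
import math

def episodic_score(semantic_sim: float, days_since_encoding: float,
                   arousal: float, relevance: float,
                   half_life_days: float = 90.0) -> float:
    """Combine semantic similarity with episodic dimensions that a plain
    vector index does not represent. The weights and the exponential
    decay form are illustrative placeholders."""
    recency = math.exp(-days_since_encoding * math.log(2) / half_life_days)
    affect = 1.0 + arousal                    # arousal in [0, 1] boosts accessibility
    relevance_weight = 0.5 + 0.5 * relevance  # relevance in [0, 1]
    return semantic_sim * recency * affect * relevance_weight

# Two memories with identical semantic similarity to a query diverge
# sharply once the episodic dimensions are taken into account.
recent_charged = episodic_score(0.8, days_since_encoding=3, arousal=0.9, relevance=1.0)
old_neutral = episodic_score(0.8, days_since_encoding=365, arousal=0.0, relevance=0.2)
assert recent_charged > old_neutral
```

A pure similarity index would return these two memories in arbitrary order; the point of the sketch is that the tie can only be broken by fields a conversational corpus does not contain.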

3. A taxonomy of existing datasets and their structural limitations

The datasets currently serving as foundations for AI memory infrastructure fall into four categories. This section characterizes each and identifies the specific ways in which it falls short of the requirements established in Section 2. These characterizations concern structural properties, not quality: these datasets were not designed to support episodic memory systems, and the critique is of the field for treating them as sufficient, not of the datasets themselves for being what they are.

3.1 Large-scale conversational corpora

LMSYS-Chat-1M (Zheng et al., 2023) contains approximately one million conversations collected from the Chatbot Arena platform over approximately five months (April to August 2023). It is the largest publicly available dataset of real human-AI conversations. The dataset tracked 210,000 unique IP addresses during collection, but these identifiers are stripped from the public release, leaving no mechanism for cross-conversation user identity.

Each conversation is an isolated artifact: there is no persistent user identity across conversations, no temporal relationship between conversations from the same user, and no information about the broader experiential context underlying any individual exchange. The dataset captures what people say in individual sessions; it does not capture what people remember across sessions, how their concerns evolve, or the experiential continuity connecting one conversation to the next.

ShareGPT, the voluntarily published collection of ChatGPT conversations, carries the additional limitation of selection bias. Research on ShareGPT's composition confirms that voluntary publication introduces systematic distortion toward conversations users found worth sharing — interactions deemed successful, unusual, or entertaining (arXiv:2512.17843). WildChat (Zhao et al., 2024) partially addresses the identity problem by retaining hashed IP addresses across 204,736 unique users, enabling cross-session linkage as a research tool. Its structural limitation relative to episodic adequacy remains: conversations are still not connected by explicit temporal, affective, or personal-relevance relationships.

The common failure of this category: these datasets treat each conversation as a self-contained unit. Episodic memory, by definition, spans multiple events across time. A dataset in which each observation is a session cannot support training of systems that model memory across sessions.

3.2 Multi-session synthetic dialogue datasets

Multi-Session Chat (Xu et al., 2022) contains a training set of approximately 5,000 persona-consistent episodes across three to four sessions, with a validation and test set extending to five sessions. Personas are defined by flat attribute lists specifying preferences and background facts. The dataset was designed to study persona maintenance in long-horizon dialogue, which creates a partial alignment with episodic memory concerns.

The synthetic generation process introduces critical artifacts. Personas are defined by static fact lists rather than by the temporal arc of a life. There is no mechanism for fact evolution — a persona's stated preferences do not change across sessions as they would in real human experience. Temporal structure between sessions is arbitrary. The generation process is unidirectional: synthetic personas are the source of content, not real people whose experiential history could be recovered. A persona that states “I like hiking” across five sessions is demonstrating fact consistency, not episodic memory.

3.3 Memory system benchmarks

LoCoMo (Maharana et al., 2024) extends multi-session dialogue to longer horizons and introduces structured question-answering tasks across five subcategories: single-hop, multi-hop, temporal reasoning, commonsense, and adversarial/unanswerable. It also includes event graph summarization and multimodal dialogue generation. This breadth represents a genuine advance over conversational corpora for memory evaluation purposes.

The limitation of LoCoMo against episodic criteria is more specific than a simple fact-retrieval critique. Even its temporal reasoning subcategory operates over events within a synthetic conversational timeline rather than over the temporal structure of personal experience as studied in cognitive science — order, duration, distance, and subjective recency, as described in Section 2. The benchmark was constructed to evaluate what language models can do with long conversational histories, not to evaluate whether a system models the cognitive structure of episodic memory. This is a meaningful distinction even where the surface tasks overlap.

LongMemEval (Wu et al., 2024; ICLR 2025) evaluates five core abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention, across 500 questions and up to 1.5 million tokens of context. Like LoCoMo, it represents a serious technical contribution while sharing the same structural gap: its evaluation tasks are derived from the structure of synthetic conversational history rather than from the empirically established structure of human episodic memory.

An independent evaluation by Vectorize.io — the company behind Hindsight, a competing memory product — measured Mem0 at 49.0% and Zep at 63.8% on LongMemEval (arXiv:2603.04814). We note the evaluator's commercial interest; readers should consult the original arXiv paper for methodology details. The scores are widely cited as evidence of architectural difference between the two systems. What they cannot be cited as evidence of is episodic memory capability, because the benchmark was not designed to measure it.

3.4 Neuroscientific and psychological datasets

A fourth category exists in the cognitive science and neuropsychology literature: datasets of human autobiographical memory reports collected under controlled conditions. The Autobiographical Memory Interview (Kopelman et al., 1990) and the Autobiographical Memory Test (Williams & Broadbent, 1986) are established instruments for eliciting episodic memory content.

These datasets are episodically rich in precisely the ways that conversational corpora are not: they contain temporal structure, affective information, and reports from individuals with well-documented personal histories. Their limitations for AI memory infrastructure are different: they are small by the standards of machine learning, collected in highly controlled research contexts, and not designed to capture memory's dynamic evolution over real-time naturalistic experience.

The cognitive science datasets establish what episodic memory data should look like. The AI conversational datasets establish what AI interaction data actually looks like. The gap between them is the space the Episodic Memory Schema is designed to address.

4. The category error at the foundation of current AI memory training

The limitations surveyed in Section 3 reflect a category error that runs deeper than individual dataset design choices: the conflation of fact storage and retrieval with episodic memory.

Existing AI memory systems extract facts from conversations: “user likes hiking,” “user has two cats,” “user works in finance.” These facts are stored in some form — a vector database, a knowledge graph, a structured record. When the user next interacts with the system, relevant facts are retrieved and injected into the prompt. This is useful. It is not episodic memory.

The error becomes visible when we ask what is lost in the extraction process. When a user describes a difficult period at work — anxiety, a conflict, a resolution six months later — an extraction-based system learns that the user had a conflict. It does not learn the temporal arc of the experience: the anxiety, the duration, the resolution, and what the resolution meant in the context of everything else that was happening. It does not learn that this memory is affectively charged in a way that makes it more likely to surface in moments of current stress. It does not learn the personal significance of the event within the narrative of this particular person's life. These are not retrievable from extracted facts. They are properties of the experience itself.

The category error has theoretical roots in an underexamined assumption: that what is worth storing in a memory system is the propositionally expressible content of human communication. This assumption is understandable — propositions are extractable by language models, encodable in vectors, and retrievable by semantic similarity. It is also incorrect as a model of human memory. Hubert Dreyfus, drawing on Heidegger and Merleau-Ponty, argued that skilled human comportment cannot be represented as a store of explicit rules or facts (Dreyfus, 1972). Memory, on this account, is not a store of explicitly encoded experiences but a modification of the background against which new experiences are understood.

The evidence against the propositional assumption is accumulating in the user experience of current AI memory products: users consistently report that systems feel like they remember facts about the user but do not feel like they understand the user. The phenomenological distinction between knowing facts about someone and understanding someone tracks a real difference in the representational structure of what is stored.

The category error has a specific consequence for dataset design: if the task is defined as fact extraction and retrieval, then the appropriate datasets are conversational corpora. If the task is defined as episodic memory over personal experience, the appropriate datasets must represent the temporal, affective, contextual, and identity-relevant structure of experience. These are different problems requiring different data. The field has not yet made this transition.

5. Related work

The observation that AI memory benchmarks measure fact retrieval rather than genuine episodic memory has been made independently by several groups. EpBench (Huet et al., ICLR 2025) provides a formal episodic memory schema for LLM benchmarking with temporal and spatial contexts, entity tracking, and event structure, and explicitly criticizes existing benchmarks for following a retrieval-oriented approach. MemBench (ACL 2025 Findings) categorizes LoCoMo and LongMemEval as testing only “Factual Memory” and calls for broader evaluation criteria. The SORT paper (NeurIPS workshop) and MEMTRACK (NeurIPS 2025 Workshop) make overlapping arguments. This convergence is significant: multiple independent groups have identified the same limitation from different angles. EMS's contribution is not to repeat this critique but to derive from it a formal dataset specification.

Memory Bear AI's Emotion Memory Unit (EMU, arXiv:2603.22306) is the most structurally proximate prior work. EMU defines a formal representation including affective valence as a valence-arousal vector, temporal anchoring, salience, decay and forgetting functions, and consolidation. This covers at least four of the EMS's seven primitives in a mathematically specified architectural framework. The distinction between EMU and EMS is one of contribution type: EMU is an architectural representation used at inference time; EMS is a training data specification defining what must be collected as fields in every training record. A system could be trained on EMS-compliant data and implement EMU at inference, or implement EMU without EMS-compliant training data. They are complementary, not competing.

ZenBrain (April 2026) implements Ebbinghaus forgetting curves, sleep-time consolidation, and emotional valence tagging as architectural components. It shares the same distinction from EMS: it is an architectural proposal rather than a dataset specification.

In cognitive architectures, ACT-R has modeled memory activation, decay, and retrieval dynamics for decades (Anderson et al., 2004). Soar's episodic memory module stores temporal snapshots with formal schemas. The CoALA framework (Sumers et al., 2023) formalizes episodic memory as a typed component for language agents. These works establish the cognitive legitimacy of the EMS primitives but operate at the level of system architecture rather than training data requirements.

The EMS's specific contribution within this landscape is the framing of all seven primitives as mandatory fields in training datasets rather than as architectural components, benchmark dimensions, or cognitive simulation mechanisms. The claim is not that no prior work has touched these dimensions, but that no prior work has specified them as a complete, co-required set of training data fields with operational collection requirements. That specification gap is what EMS addresses.

6. The Episodic Memory Schema: formal specification

The Episodic Memory Schema defines the minimum set of primitive field categories required for a dataset to support training and evaluation of genuine episodic memory systems. The schema is prescriptive rather than descriptive — it specifies what episodic memory data should contain, not what existing data does contain.

EMS is organized around seven primitive field categories. Each primitive is defined by three properties: its semantic role in the representation of episodic memory, its theoretical justification, and its operational specification for dataset construction.

A dataset achieves EMS compliance at Level 1 if it contains all seven primitives in any form. It achieves Level 2 compliance if each primitive meets the operational specification defined in Section 7. It achieves Level 3 compliance if the primitives are represented in a way that enables the computational operations described in Section 9: temporal traversal, affective weighting, relevance-gradient retrieval, and consolidation-aware decay.
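Level 1 compliance is a mechanical presence check, which can be sketched as follows. The field names are hypothetical: EMS mandates the seven categories, not any particular key naming.

```python
# Hypothetical field names for the seven EMS primitives; the schema
# defines the categories, not these specific keys.
EMS_PRIMITIVES = {
    "temporal_anchoring",      # P1
    "experiential_content",    # P2
    "affective_valence",       # P3
    "personal_relevance",      # P4
    "contextual_embedding",    # P5
    "consolidation_state",     # P6
    "decay_trajectory",        # P7
}

def level1_compliant(record: dict) -> bool:
    """Level 1: every primitive present in some form as a field."""
    return EMS_PRIMITIVES <= record.keys()

def missing_primitives(record: dict) -> set:
    """The primitives a record lacks, for compliance reporting."""
    return EMS_PRIMITIVES - record.keys()

# A typical conversational-corpus record (text plus timestamp) fails:
chat_record = {"experiential_content": "...", "temporal_anchoring": "2023-05-01"}
assert not level1_compliant(chat_record)
assert len(missing_primitives(chat_record)) == 5
```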

No existing AI-oriented dataset known to the authors achieves Level 1 EMS compliance. This is the central empirical finding of this paper.

6.1 The seven primitives

P1 — Temporal Anchoring

The representation of when an experience occurred, including its absolute temporal position on the user's personal timeline, its relative temporal relations to other memories, and its subjective temporal distance from the moment of recall.

P2 — Experiential Content

The representation of what occurred, including not only propositional content but phenomenological structure: the scene, the agents, the actions, the sequence, the sensory-perceptual context to whatever degree it is representable.

P3 — Affective Valence

The representation of the emotional character of the experience at the time of encoding — including valence (positive/negative), arousal level, and specific emotional categories — and its emotional resonance at the time of recall, which may differ from encoding valence.

P4 — Personal Relevance Gradient

The representation of the degree to which this experience is connected to the remembering subject's ongoing concerns, values, identity narrative, and long-term goals. Distinct from affective valence: an experience may be emotionally neutral but highly personally relevant, or emotionally charged but personally peripheral.

P5 — Contextual Embedding

The representation of the broader context within which the experience occurred — the life period, the social relationships active at the time, the concurrent events in other domains of the person's life that give this experience its meaning.

P6 — Consolidation State

The representation of how this memory has been processed over time — whether it remains vivid and episodically specific, has been partially abstracted into a more general representation, has been updated by subsequent experience, or has been integrated into a broader narrative.

P7 — Decay Trajectory

The representation of how this memory's accessibility, specificity, and affective charge have changed over time — not simply whether the information is still retrievable, but how its representational form has evolved.
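One possible concretization of the seven primitives as a single training record is sketched below as a Python dataclass. All field names, types, and scale choices are illustrative assumptions; the binding requirements are the operational specifications in Section 7, not this layout.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple

@dataclass
class EpisodicMemoryRecord:
    # P1 — Temporal Anchoring
    encoded_on: date                          # absolute, day-level minimum
    relations: List[Tuple[str, str]]          # e.g. ("before", other_memory_id)
    subjective_distance: Optional[float]      # proxy for felt recency at recall

    # P2 — Experiential Content
    content: str                              # first-person event description
    scene: str                                # agents, setting, sequence
    is_unique_episode: bool                   # unique episode vs. generic script

    # P3 — Affective Valence
    valence_at_encoding: str                  # positive / negative / neutral
    arousal: float                            # e.g. normalized to [0, 1]
    valence_at_recalls: List[str] = field(default_factory=list)

    # P4 — Personal Relevance Gradient
    relevance: int = 0                        # three-point scale minimum
    life_themes: List[str] = field(default_factory=list)

    # P5 — Contextual Embedding
    life_period: str = ""
    active_relationships: List[str] = field(default_factory=list)
    concurrent_events: List[str] = field(default_factory=list)

    # P6 — Consolidation State
    consolidation: str = "recent/episodic"    # four-point scale
    updated_by: List[str] = field(default_factory=list)

    # P7 — Decay Trajectory
    samples: List[Tuple[date, float]] = field(default_factory=list)  # (when, specificity)

# An invented example record, for illustration only.
rec = EpisodicMemoryRecord(
    encoded_on=date(2024, 3, 9),
    relations=[("after", "mem_0412")],
    subjective_distance=2.1,
    content="Told my manager I was leaving.",
    scene="One-on-one meeting, end of day, third-floor office.",
    is_unique_episode=True,
    valence_at_encoding="negative",
    arousal=0.8,
)
```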

7. Primitive field definitions and theoretical justification

7.1 Temporal Anchoring (P1)

Temporal information in episodic memory is not stored as a simple timestamp. Friedman (1993) reviews the major theoretical frameworks for memory of time and identifies fundamental distinctions between processes that reconstruct temporal position from contextual cues and those that rely on temporal distance estimation. Order memory, duration memory, and subjective recency are separable empirical phenomena. Research on the telescoping effect (Rubin & Baddeley, 1989) demonstrates that the subjective recency of significant events is systematically distorted relative to clock time. A memory system that operates only on objective timestamps will not replicate the retrieval dynamics of human memory.

Operational Specification

A compliant dataset must contain (a) absolute temporal anchoring at day-level resolution minimum, (b) explicit representation of temporal relations between at least some event pairs (before, after, concurrent, overlapping), and (c) either direct measurement or a proxy measure of subjective temporal distance from the time of recall.
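A minimal sketch of requirements (b) and (c), assuming events are anchored as date intervals. The coarse relation labels follow the spec's vocabulary; the logarithmic subjective-distance proxy is one invented possibility, loosely motivated by the compressive distortions in the telescoping literature cited above.

```python
from datetime import date
import math

def temporal_relation(a_start: date, a_end: date,
                      b_start: date, b_end: date) -> str:
    """Classify the relation between two anchored events (requirement (b)).
    A coarse, Allen-style classification assuming interval endpoints are known."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_start == b_start and a_end == b_end:
        return "concurrent"
    return "overlapping"

def subjective_distance_proxy(days_ago: int) -> float:
    """One invented proxy for requirement (c): logarithmic compression, so
    remote events sit subjectively closer than clock time implies."""
    return math.log1p(days_ago)

assert temporal_relation(date(2023, 1, 1), date(2023, 1, 2),
                         date(2023, 3, 1), date(2023, 3, 1)) == "before"
assert subjective_distance_proxy(365) < 365
```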

7.2 Experiential Content (P2)

Evans (1982) and subsequent work in singular thought distinguish between knowing that a proposition is true and demonstratively knowing a particular — knowing this event as an indexical, perspective-dependent particular. Episodic memory represents particulars of this kind: not merely that an event of type E occurred at time T, but that this event occurred, experienced from one's own perspective. Schacter and Addis (2007) establish that episodic specificity — the degree to which a memory represents the particular perceptual, spatial, temporal, and emotional context of a unique event rather than a generic script — is a measurable and neurally significant property. A schema that captures only propositional content discards the dimension along which this variation occurs.

Operational Specification

A compliant dataset must contain (a) at least one first-person perspective marker per event, (b) explicit representation of scene-level context beyond propositional facts, and (c) a specificity annotation indicating whether the content represents a unique episode or a generic script.

7.3 Affective Valence (P3)

McGaugh (2004) demonstrates that the amygdala enhances hippocampal encoding of emotionally arousing events through a beta-adrenergic mechanism. Affective charge predicts memory durability and accessibility: highly charged memories are retrieved more readily and appear in more response contexts. It is also well established that the emotional character of a recalled memory at the time of retrieval can differ systematically from its emotional character at the time of encoding — a distinction between hot and cold cognition with a long theoretical history (Abelson, 1963; Zelazo & Müller, 2002) that is relevant to how memory systems should model re-accessed emotional content (Levine & Pizarro, 2004). A system that stores only encoding-time valence cannot represent this dynamic.

Operational Specification

A compliant dataset must contain (a) affective valence annotation at encoding on a minimum categorical scale (positive/negative/neutral), (b) arousal level annotation, and (c) where the same memory is accessed multiple times, valence annotation at each access point.
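The three requirements can be sketched as an annotation object that carries encoding-time valence and arousal alongside one valence record per subsequent access (requirement (c)). The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

@dataclass
class AffectiveAnnotation:
    valence_at_encoding: str                  # (a): positive / negative / neutral
    arousal: float                            # (b): e.g. normalized to [0, 1]
    access_valences: List[Tuple[date, str]] = field(default_factory=list)  # (c)

    def record_access(self, when: date, valence: str) -> None:
        """Annotate valence at a recall event, per requirement (c)."""
        self.access_valences.append((when, valence))

    def valence_shifted(self) -> bool:
        """True if any recall-time valence differs from encoding valence —
        the dynamic a single stored valence cannot represent."""
        return any(v != self.valence_at_encoding for _, v in self.access_valences)

ann = AffectiveAnnotation("negative", arousal=0.8)
ann.record_access(date(2024, 6, 1), "negative")
ann.record_access(date(2025, 6, 1), "neutral")   # reappraisal over time
assert ann.valence_shifted()
```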

7.4 Personal Relevance Gradient (P4)

Conway and Pleydell-Pearce's self-memory system model (2000) proposes that autobiographical memory is actively organized around the conceptual self, which shapes what is retrieved and how it is reconstructed. Personal relevance is not equivalent to emotional significance. A career decision may be recalled with equanimity but carry high personal relevance because it connects to ongoing concerns about professional identity. These two dimensions are empirically dissociable (Rubin et al., 2003) and computationally distinct. Among the EMS primitives, personal relevance as a required training-data field has the least coverage in prior work and represents EMS's most novel individual contribution.

Operational Specification

A compliant dataset must contain (a) an annotation of the degree to which each memory connects to identified long-term goals and values on at least a three-point scale, (b) linkage annotations connecting episodic memories to life themes, and (c) longitudinal tracking of relevance changes across multiple recall sessions.

7.5 Contextual Embedding (P5)

Brewer (1986) established that personal memories are embedded in coherent schemas of situations, activities, and life periods. A memory of a conversation cannot be understood without the relational context in which it occurred. A professional decision cannot be understood without the career context and concurrent circumstances that shaped it. For AI memory systems, contextual embedding enables reasoning about memory coherence — recognition that two apparently unrelated memories belong to the same life period, or that a current situation resembles a past period in ways that make specific earlier memories relevant.

Operational Specification

A compliant dataset must contain (a) life period annotation linking each memory to a labeled period in the remembering subject's life, (b) relational context annotation identifying social relationships active at the time, and (c) concurrent event annotation identifying the domain-level context of other major ongoing circumstances.
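A possible shape for the P5 fields, again with illustrative rather than prescribed names; the coherence check at the end shows the kind of reasoning contextual embedding enables:

```python
from dataclasses import dataclass, field

@dataclass
class ContextualEmbedding:
    """P5 fields: life period, relational context, and concurrent circumstances."""
    life_period: str                                            # e.g. "first year at new job"
    active_relationships: list[str] = field(default_factory=list)
    concurrent_events: list[str] = field(default_factory=list)  # e.g. ["house move"]

def same_life_period(a: ContextualEmbedding, b: ContextualEmbedding) -> bool:
    """Coherence check: do two memories belong to the same labeled life period?"""
    return a.life_period == b.life_period
```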

7.6 Consolidation State (P6)

Systems consolidation theory (Squire & Alvarez, 1995; Frankland & Bontempi, 2005) holds that episodic memories transition over time from hippocampally dependent, contextually specific representations to cortically distributed, semantically schematized representations. This transition changes what a memory is good for: a freshly encoded episodic memory supports specific contextual recall; a fully consolidated semantic representation supports schema-based generalization. An AI memory system that treats all memories as equivalent regardless of consolidation state misrepresents the system it is supposed to model.

Operational Specification

A compliant dataset must contain (a) a consolidation state annotation on at least a four-point scale (recent/episodic, partially consolidated, largely semantic, fully integrated), (b) the age of the memory in days from encoding to collection, and (c) annotation where a memory has been updated by subsequent experience.
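The four-point scale lends itself to an ordered enumeration. The sketch below is one possible encoding of the P6 fields; the type names are our assumptions:

```python
from dataclasses import dataclass
from enum import IntEnum

class ConsolidationState(IntEnum):
    """Four-point scale from the P6 operational specification.

    IntEnum makes states comparable, so 'less consolidated than'
    is expressible directly.
    """
    RECENT_EPISODIC = 1
    PARTIALLY_CONSOLIDATED = 2
    LARGELY_SEMANTIC = 3
    FULLY_INTEGRATED = 4

@dataclass
class ConsolidationAnnotation:
    state: ConsolidationState
    age_days: int                       # days from encoding to collection
    updated_by_later_experience: bool   # revised by subsequent experience?
```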

7.7 Decay Trajectory (P7)

Ebbinghaus (1885) demonstrated that memory accessibility follows a characteristic exponential decay curve, with the steepest loss in the hours and days immediately following encoding. Different memory systems have different decay functions, and emotionally significant memories decay more slowly than neutral ones — a finding that gives P3 direct implications for P7. A memory encoded three days ago and a memory encoded three years ago should not be treated as equivalent, however similar their content may be to a current query.

Operational Specification

A compliant dataset must contain (a) multi-point sampling of the same memory across time, (b) specificity change annotation tracking the degree to which episodic detail has been lost to semantic consolidation, and (c) affective charge trajectory tracking whether emotional intensity has changed between sampling points.
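The Ebbinghaus-style decay underlying P7 can be sketched as a simple exponential retention function. The parameterization below — a single "stability" constant, with larger values for emotionally charged memories — is an illustrative assumption, not a claim about the correct forgetting model:

```python
import math

def retention(age_days: float, stability_days: float) -> float:
    """Exponential forgetting in the Ebbinghaus tradition: R(t) = exp(-t / S).

    `stability_days` (S) is an illustrative free parameter. Per the P3/P7
    link, emotionally significant memories would be modeled with a larger S
    (slower decay) than neutral ones.
    """
    return math.exp(-age_days / stability_days)

# A neutral memory (S = 10 days) vs. an emotionally charged one (S = 60 days),
# both sampled 30 days after encoding:
neutral_30d = retention(30, stability_days=10)
charged_30d = retention(30, stability_days=60)
```

Multi-point sampling (requirement (a)) is what would let a dataset constrain S empirically rather than leave it as a free parameter.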

8. Adequacy criteria and the EMS compliance assessment

The following table applies EMS compliance criteria to the datasets surveyed in Section 3. Partial credit is awarded where a dataset contains a degraded or approximated version of a required primitive.

Dataset               | P1      | P2      | P3      | P4      | P5      | P6   | P7   | Level
LMSYS-Chat-1M         | Partial | Partial | None    | None    | None    | None | None | < 1
ShareGPT              | Partial | Partial | Partial | None    | None    | None | None | < 1
WildChat              | Partial | Partial | None    | None    | None    | None | None | < 1
MSC (Xu et al., 2022) | Partial | Partial | None    | Partial | None    | None | None | < 1
LoCoMo                | Partial | Partial | None    | None    | None    | None | None | < 1
LongMemEval           | Partial | Partial | None    | None    | None    | None | None | < 1
AMI / CAMI            | Full    | Partial | Partial | Partial | Partial | None | None | 1 (partial)

No AI-oriented dataset achieves Level 1 compliance. The autobiographical memory instruments come closest, but they lack the longitudinal multi-point sampling required for P7 and the consolidation state annotation required for P6. The partial credit for P1 across conversational datasets reflects the presence of conversation-level timestamps, which provide a degraded approximation of temporal anchoring but not the temporal structure between memories that P1 requires.

The pattern of absences is revealing. Affective valence (P3), personal relevance gradient (P4), contextual embedding (P5), consolidation state (P6), and decay trajectory (P7) are systematically absent across all AI-oriented datasets. Their absence follows directly from the category error identified in Section 4: if the problem is defined as fact retrieval, there is no reason to collect affective, relevance, or decay data. The absence is not accidental — it is structural.

9. Implications for memory system design and evaluation

9.1 Temporal traversal as a first-class operation

A Level 3 EMS-compliant dataset supports temporal traversal: the ability to navigate a user's experiential timeline forward and backward from any point. Temporal traversal is distinct from semantic retrieval. Semantic retrieval returns the memories most similar in content to a query. Temporal traversal returns the memories most temporally adjacent, most likely to form a coherent narrative arc, or most relevant given a temporal reasoning question. This enables the class of queries that current systems handle poorly: “What was going on for this user during the period when they were dealing with X?” “How has the user's thinking about Z changed over time?”
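The contrast with semantic retrieval can be made concrete: traversal indexes memories by time, not by embedding distance. The sketch below assumes a hypothetical `Memory` record and a timeline kept sorted by encoding time:

```python
from bisect import bisect_left
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Memory:
    encoded_at: datetime
    content: str

def temporal_window(timeline: list[Memory], anchor: datetime,
                    radius: timedelta) -> list[Memory]:
    """Return memories within `radius` of `anchor`, in chronological order.

    `timeline` must be sorted by `encoded_at`; bisection finds the window
    edges without scanning the whole history.
    """
    times = [m.encoded_at for m in timeline]
    lo = bisect_left(times, anchor - radius)
    hi = bisect_left(times, anchor + radius)
    return timeline[lo:hi]
```

A "what was going on during the period when the user was dealing with X" query would first resolve X to an anchor date, then traverse the window — no content similarity is involved.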

9.2 Affective weighting in retrieval

P3 data enables affective weighting in retrieval: the ability to surface memories based on their affective charge, their relevance to current affective state, or the affective dynamics of a user's experiential history. Affective weighting also has safety implications. A system with access to affective valence data can identify patterns that may signal distress — increases in negative valence, sustained high-arousal states, the intrusion of highly charged memories into neutral contexts. Whether and how to use this information is a design and ethics question outside the scope of this paper; the schema does not prescribe an answer. But a system without affective data cannot even ask the question.
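One minimal form of affective weighting is a linear blend of semantic similarity and valence congruence. The scoring function and its `alpha` weight below are illustrative assumptions, not a prescribed retrieval rule:

```python
def affective_score(semantic_sim: float, memory_valence: float,
                    query_valence: float, alpha: float = 0.7) -> float:
    """Blend semantic similarity with affective congruence.

    Valences are on [-1, 1]; congruence is 1.0 when memory and query
    valence match exactly and falls off linearly with their distance.
    `alpha` (the weight on semantic similarity) is an illustrative choice.
    """
    congruence = 1.0 - abs(memory_valence - query_valence) / 2.0
    return alpha * semantic_sim + (1.0 - alpha) * congruence
```

Under this scheme, two memories with identical semantic similarity rank differently when one is affectively congruent with the query and the other is not — exactly the behavior a system without P3 data cannot exhibit.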

9.3 Consolidation-aware decay and update

P6 and P7 data enable consolidation-aware retrieval: a system that treats recent episodic memories differently from consolidated semantic representations, that updates memories when new information changes their meaning, and that decays episodic specificity over time in a way that mirrors natural consolidation. A user who interacted with an AI system five years ago should not have their memories from that period retrieved with the same representational form as memories from last week.
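As a sketch of what "treating recent and consolidated memories differently" could mean operationally, the function below maps the four-point consolidation state to a weight on episodic detail at retrieval; the linear mapping is an assumption for illustration:

```python
def episodic_weight(consolidation_level: int) -> float:
    """Weight on episodic (contextually specific) detail at retrieval.

    `consolidation_level` uses the P6 four-point scale:
    1 = recent/episodic ... 4 = fully integrated. Fully consolidated
    memories contribute schema-level content rather than contextual
    specifics, so their episodic weight goes to zero. The linear
    mapping is an illustrative assumption.
    """
    if not 1 <= consolidation_level <= 4:
        raise ValueError("consolidation level must be in 1..4")
    return (4 - consolidation_level) / 3.0
```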

9.4 Evaluation beyond fact retrieval

The EMS implies an expanded benchmark suite. A complete evaluation of an episodic memory system should additionally measure: temporal reasoning accuracy (questions about the temporal structure of a user's history); affective coherence (whether memories surfaced in response to emotionally charged queries are affectively appropriate); relevance gradient calibration (whether the system assigns higher retrieval priority to memories with higher personal relevance, independent of semantic similarity); consolidation tracking (whether the system appropriately updates representations as time passes); and decay fidelity (whether the system's memory accessibility function approximates a plausible decay trajectory). None of these can be evaluated without EMS-compliant data. Their absence from current benchmark suites is a direct consequence of the absence of the training data required to construct them.

10. Objections and replies

Objection 1: The EMS requires data that cannot be collected at scale without violating user privacy.

EMS compliance is a schema specification, not a collection protocol. Several approaches are consistent with EMS requirements and rigorous privacy protection: explicit opt-in collection where users understand and consent to the value exchange; behavioral signal inference (affective valence and personal relevance can be approximated from message length, response latency, and re-visitation patterns without direct self-report); and synthetic data augmentation from established autobiographical memory research instruments. The privacy objection identifies a genuine constraint on collection methodology. It does not identify a flaw in the schema.

Objection 2: Personal relevance and consolidation state cannot be reliably measured.

Measurement difficulty does not imply theoretical incorrectness. Personal relevance and consolidation state are real properties of episodic memory with well-established empirical correlates; their difficulty of measurement reflects the genuine complexity of the phenomenon. The standards of reliability appropriate for dataset construction are lower than those for clinical assessment. A dataset that captures noisy but systematically valid signals is more episodically adequate than one that omits these dimensions entirely.

Objection 3: Existing systems achieve useful memory functionality without EMS-compliant data.

This is true of the products being built today. Current AI memory products support agents that remember facts about users, and this is genuinely useful. The products of the next five to ten years — agents expected to provide genuine continuity and respond to the full temporal arc of a human life — require capabilities that EMS-compliant training data enables. EMS compliance is unnecessary for current products; it will be necessary for future ones.

Objection 4: Much of this already exists in prior work.

As detailed in Section 5, several of EMS's individual primitives appear in prior work. What does not exist prior to this paper is a specification of all seven as mandatory co-required fields in training datasets with operational collection requirements. EpBench defines evaluation dimensions; EMU defines architectural representation; ACT-R defines simulation mechanisms. None of them define what must be collected as training data fields. That specification gap is EMS's contribution.

11. Scope and limitations

EMS is a schema for episodic memory training data. It is not a complete specification for AI memory system architecture, which requires additional decisions about encoding, storage, retrieval, and update mechanisms outside the scope of dataset design. EMS is designed for personal episodic memory and does not address semantic memory, procedural memory, or prospective memory. These are distinct cognitive systems; a complete AI memory infrastructure will need schemas for each.

EMS assumes the target system is designed to model human-like episodic memory. It is not applicable to systems with fundamentally different architectural commitments. The operational specifications reflect the current state of the cognitive science literature and may require updating as that literature develops.

The compliance table in Section 8 reflects the structural properties of datasets as described in their published documentation. We have not independently re-run experiments on these datasets. EMS compliance should be understood as an assessment of published design, not a re-evaluation of experimental results.

12. Conclusion

The field of AI memory infrastructure is confronting a problem it has not yet fully formulated. The systems being built today are useful. They are not, in the relevant sense, episodic memory systems. They are fact stores with temporal metadata — sophisticated, valuable, and architecturally inadequate for the products that will define the next decade of AI applications.

Multiple groups are working on pieces of this problem. EpBench provides better evaluation criteria. EMU provides a richer architectural representation. MemBench has documented the limits of existing fact-retrieval benchmarks. What is missing from this productive moment is a shared specification of what training data must contain. Different systems will be compared, improved, and deployed. Without a common data specification, those comparisons will remain structurally incommensurable.

The seven EMS primitives — temporal anchoring, experiential content, affective valence, personal relevance gradient, contextual embedding, consolidation state, and decay trajectory — are offered as the minimum structural requirements for a training dataset to support genuine episodic memory infrastructure. They are grounded in the cognitive science of what episodic memory actually is. They are co-required: a dataset that contains six of seven is not episodically adequate in the same sense that a dataset containing all seven is. And they are operationally specified: for each primitive, we provide collection requirements that make the schema constructible rather than merely desirable.

The field's current benchmark competition is being conducted on the wrong objective. The question is not which system can extract and retrieve more facts from synthetic multi-session dialogues. The question is which system can be trained to represent and reason over the temporal, affective, and identity-relevant structure of human experience. That question cannot be asked or answered without the data structure this paper specifies. The architecture and evaluation questions are already being worked on. The training data specification has been missing. This paper provides it.


References
  • Abelson, R. P. (1963). Computer simulation of "hot" cognition. In S. S. Tomkins & S. Messick (Eds.), Computer simulation of personality. Wiley.
  • Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111(4), 1036–1060.
  • Bartlett, F. C. (1932). Remembering: A study in experimental and social psychology. Cambridge University Press.
  • Brewer, W. F. (1986). What is autobiographical memory? In D. C. Rubin (Ed.), Autobiographical memory (pp. 25–49). Cambridge University Press.
  • Conway, M. A., & Pleydell-Pearce, C. W. (2000). The construction of autobiographical memories in the self-memory system. Psychological Review, 107(2), 261–288.
  • Dreyfus, H. L. (1972). What computers can't do: A critique of artificial reason. Harper & Row.
  • Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot.
  • Evans, G. (1982). The varieties of reference. Oxford University Press.
  • Frankland, P. W., & Bontempi, B. (2005). The organization of recent and remote memories. Nature Reviews Neuroscience, 6(2), 119–130.
  • Friedman, W. J. (1993). Memory for the time of past events. Psychological Bulletin, 113(1), 44–66.
  • Huet, A., et al. (2025). EpBench: Benchmarking episodic memory in large language models. ICLR 2025.
  • Johnson, M. (1987). The body in the mind: The bodily basis of meaning, imagination, and reason. University of Chicago Press.
  • Kopelman, M. D., Wilson, B. A., & Baddeley, A. D. (1990). The Autobiographical Memory Interview. Journal of Clinical and Experimental Neuropsychology, 12(5), 724–744.
  • Levine, L. J., & Pizarro, D. A. (2004). Emotion and memory research: A grumpy overview. Social Cognition, 22(5), 530–554.
  • Maharana, A., et al. (2024). Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753.
  • McGaugh, J. L. (2004). The amygdala modulates the consolidation of memories of emotionally arousing experiences. Annual Review of Neuroscience, 27, 1–28.
  • Memory Bear AI. (2026). Emotion Memory Unit (EMU): A neuroscience-grounded memory representation for AI agents. arXiv preprint arXiv:2603.22306.
  • Rubin, D. C., & Baddeley, A. D. (1989). Telescoping is not time compression: A model of the dating of autobiographical events. Memory & Cognition, 17(6), 653–661.
  • Rubin, D. C., Schrauf, R. W., & Greenberg, D. L. (2003). Belief and recollection of autobiographical memories. Memory & Cognition, 31(6), 887–901.
  • Schacter, D. L. (2001). The seven sins of memory: How the mind forgets and remembers. Houghton Mifflin.
  • Schacter, D. L., & Addis, D. R. (2007). The cognitive neuroscience of constructive memory. Philosophical Transactions of the Royal Society B, 362(1481), 773–786.
  • Squire, L. R., & Alvarez, P. (1995). Retrograde amnesia and memory consolidation: A neurobiological perspective. Current Opinion in Neurobiology, 5(2), 169–177.
  • Sumers, T., Yao, S., Narasimhan, K., & Griffiths, T. L. (2023). Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427.
  • Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of memory (pp. 381–403). Academic Press.
  • Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1), 1–12.
  • Vectorize.io. (2026). Mem0 vs Zep (Graphiti): AI agent memory compared. Retrieved from arXiv:2603.04814. Note: Vectorize.io produces Hindsight, a competing memory product.
  • Williams, J. M. G., & Broadbent, K. (1986). Autobiographical memory in suicide attempters. Journal of Abnormal Psychology, 95(2), 144–149.
  • Wu, X., et al. (2024). LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813. (ICLR 2025)
  • Xu, J., et al. (2022). Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of ACL 2022 (pp. 5180–5197).
  • Zelazo, P. D., & Müller, U. (2002). Executive function in typical and atypical development. In U. Goswami (Ed.), Handbook of childhood cognitive development. Blackwell.
  • Zhao, W., et al. (2024). WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470.
  • Zheng, L., et al. (2023). LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. arXiv preprint arXiv:2309.11998.