The Synthetic Online Conversations (SOC) Dataset

Publicly available conversational data suffers from a persistent structural problem: the corpora that exist are either too small, too domain-specific, or scraped from sources that carry obvious ethical and privacy costs. Crowdsourced alternatives address some of these concerns, but they don't scale, and the resulting exchanges tend to be formulaic in ways that strip out the texture of real conversation. We built the Synthetic Online Conversations (SOC) dataset to address this gap: a large-scale, realistic collection of synthetic dialogues intended for training and evaluating NLP models. SOC was first released in August 2025 as SOC-2508. Since then the pipeline has gone through several rounds of iteration, and we're now regenerating the entire dataset from scratch, starting with a fresh persona bank, SPB-2602. This post describes the current pipeline end-to-end, including the architectural choices that distinguish it from prior work and the limitations we're actively working to resolve.

Why This Exists

The fundamental problem with synthetic conversation is not that it’s artificial — it’s that most generation pipelines skip the scaffolding that makes conversations feel grounded. Two people talking don’t start from nowhere; they have a shared history, a concrete reason to be talking at this particular moment, and dozens of small contextual pressures operating in the background. Generating the words without generating that context produces exchanges that are fluent but hollow. Generating the context first — and letting the words follow from it — is the approach we take here.

How It’s Built

The core idea is to build conversations bottom-up: from people, to situations, to chats, with each stage grounding the next. The pipeline proceeds as follows:

text
Seed data
  → Persona generation     (iterative + diversity resets)
  → Experience generation  (pairing + relationship + situation + trigger)
  → Chat generation        (multi-turn, media tags, timestamp pacing)

The pipeline takes heavy inspiration from ConvoGen, a multi-agent framework by Gody et al. that grounds conversations in generated experiences (each experience bundles personas, a relationship, a situation, a topic, and a conversation starter) and uses iterative sampling from a dynamically updated few-shot hub to prevent templatic repetition at scale. We adopt that experience-first architecture and the iterative sampling idea wholesale, but we depart from ConvoGen in three ways. First, we bake timestamps and multi-message turns directly into the generation loop rather than annotating them afterward. Second, we extend the turn format with typed media attachments (images, audio, video, stickers) to reflect how online conversations actually flow. Third, we add a rolling summarization memory to manage long-context drift. Each of those choices plays out in the stages below.

The Synthetic Persona Bank (SPB-2602)

Every conversation starts with the people having it. This release ships with a fresh set of character personas, published separately as SPB-2602.

We generate personas using Kimi-K2-0905, Kimi-K2.5, and GLM-5, with prompts that push toward psychological realism and, crucially, show-don't-tell narrative writing. The target population spans a wide range of occupations, personalities, life circumstances, and socioeconomic backgrounds, while deliberately aiming for the statistical average rather than the dramatic exception. A custom StatsEngine samples a world region, locale-appropriate subregion, name, and age for each call, with the region feeding into the prompt so that cultural details remain coherent rather than generic. Age is sampled with a mean of 25 and σ=5, intentionally skewed toward younger adults to match the downstream use case of online conversations; broader age coverage is planned for future releases.
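To make the sampling step concrete, here is a minimal sketch of the kind of demographic draw StatsEngine performs. The mean of 25 and σ=5 come from the post; the region table, the clamp bounds (16–45), and all names are illustrative assumptions, not the production implementation.

```python
import random

# Illustrative stand-in for the StatsEngine sampling step; the real
# region/subregion tables are far larger.
REGIONS = {
    "East Asia": ["Wuhan, Hubei Province, China", "Yantai, Shandong, China"],
    "South Asia": ["Peshawar, Khyber Pakhtunkhwa, Pakistan"],
}

def sample_demographics(rng: random.Random) -> dict:
    region = rng.choice(list(REGIONS))
    subregion = rng.choice(REGIONS[region])
    # Gaussian with mean 25 and sigma 5, resampled until it lands in a
    # plausible adult range (the 16-45 clamp is our assumption).
    while not (16 <= (age := rng.gauss(25, 5)) <= 45):
        pass
    return {"region": region, "subregion": subregion, "age": round(age)}

sample = sample_demographics(random.Random(0))
```

The sampled region then feeds into the persona prompt, which is what keeps names, locations, and cultural details mutually consistent.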

Preventing style collapse is an active concern. Each generation call receives a small set of examples as do-not-imitate-too-closely references drawn from a rolling window capped at 20 — seeded with hand-written personas at warmup, then mixing those seeds with recent outputs as generation proceeds. Older generations age out naturally, keeping references fresh without losing accumulated variety. The resulting records are structured markdown-in-XML character cards covering: Basic Information, Physical & Lifestyle, Personality Overview, Core Traits, Emotional Profile, Relationships, Values, Motivations & Fears, Behavioral Patterns, Communication Style, Example Messages, and a Summary. A representative record looks like this:

json
{
  "persona_text": "<character>\\n**Basic Information**\\n**Name:** Yong (杨勇)\\n**Age:** 18\\n**Location:** Wuhan, Hubei Province, China\\n**Pronouns:** He/him\\n\\n**Physical & Lifestyle**\\nYong is a lean, average-height young man who hasn't quite grown into his features yet...\\n\\n**Personality Overview**\\nYong is at the exact point where he knows who he was...\\n</character>",
  "meta": {
    "iteration": 42,
    "model": "moonshotai/Kimi-K2-Instruct-0905",
    "region": "East Asia",
    "subregion": "Wuhan, Hubei Province, China",
    "name": "Yong",
    "age": 18,
    "time_taken": 12.4
  }
}
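The rolling few-shot window described above can be sketched with a bounded deque. The cap of 20 comes from the post; the sample size per call and the class name are illustrative assumptions.

```python
import random
from collections import deque

WINDOW_CAP = 20  # rolling window capped at 20, per the pipeline description

class FewShotHub:
    def __init__(self, seed_personas):
        # Seeded with hand-written personas at warmup; as generation
        # proceeds, new outputs push the oldest entries out.
        self.window = deque(seed_personas, maxlen=WINDOW_CAP)

    def sample_references(self, rng, k=4):
        # Do-not-imitate-too-closely references for the next call
        # (k=4 is an assumption).
        return rng.sample(list(self.window), min(k, len(self.window)))

    def add(self, persona_text):
        self.window.append(persona_text)  # oldest entry ages out at capacity

hub = FewShotHub([f"seed persona {i}" for i in range(5)])
for i in range(30):
    hub.add(f"generated persona {i}")
refs = hub.sample_references(random.Random(0))
```

Because the deque is bounded, the hand-written seeds age out naturally once enough generations have accumulated, which is exactly the freshness behavior the pipeline relies on.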

Experiences — Context Before Conversation

Personas alone aren’t enough. The experience step answers a more specific question: why are these two people talking right now?

Each experience pairs two personas and synthesizes their relationship, the platform and situation, a message cadence, a concrete trigger for why the conversation starts today, a topic roadmap, and a handful of plausible background interruptions (a mother knocking, a phone battery dying) woven in mid-conversation to simulate real-life friction. Instant events are not guaranteed: at each turn the dynamic prompt has a 5% chance of incorporating one, so most turns pass without any interruption. Pairing is 50% region-matched and 50% fully random, balancing local coherence with cross-cultural variety. The generation pool is actively monitored, with diversity hints injected when needed, forcing a freeform generation, for instance, if the recent pool is overwhelmingly semi-structured. Experience generation uses the same rolling few-shot window as persona generation, keeping the pool fresh as batch size grows. Here is an excerpt from a generated experience:

xml
<experience>
Ahmed and Wei have been online friends for eight months after meeting in a Discord server
for mobile device repair enthusiasts...

It's Saturday evening in Peshawar, Sunday morning in Yantai. Ahmed is at the shop alone,
Yasir having left early for a family function. Wei is at his desk, headphones in, staring
at an IELTS writing prompt he's rewritten four times...

Conversation style: semi-structured
Message cadence: async (hours between replies, sometimes faster when both happen to be online)

The conversation will likely cover the following topics:
- Life updates: Wei's IELTS prep stress, Ahmed's family obligations and the slow shop week (8 turns) ← PRIMARY
- Future anxieties: Both circling around the question of whether their current paths lead somewhere... (6 turns)
- Small comforts: A funny moment from the tech Discord, a video game recommendation... (4 turns)

Initial state: life updates; 8; future anxieties

Possible instant events:
- Ahmed's cousin Yasir returns to the shop and makes a comment that Ahmed doesn't want to translate
- Wei's phone battery dies mid-reply, forcing him to find his charger and reconsider what he was typing
- A message from Ahmed's mother arrives in WhatsApp, visible in his notifications but unopened
- Wei's roommate starts cooking something that smells good, briefly pulling his attention away
</experience>
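The pairing and interruption logic is simple enough to sketch directly. The 50/50 region-matched split and the 5% per-turn instant-event chance come from the description above; the function names, persona shape, and fallback behavior are illustrative assumptions.

```python
import random

def pair_personas(personas, rng):
    a = rng.choice(personas)
    if rng.random() < 0.5:
        # Region-matched pairing: prefer a partner from the same world region.
        pool = [p for p in personas if p is not a and p["region"] == a["region"]]
    else:
        pool = [p for p in personas if p is not a]
    pool = pool or [p for p in personas if p is not a]  # fall back if no regional match
    return a, rng.choice(pool)

def maybe_instant_event(events, rng):
    # Each turn has a 5% chance of weaving in one background interruption.
    return rng.choice(events) if rng.random() < 0.05 else None

personas = [
    {"name": "p0", "region": "East Asia"},
    {"name": "p1", "region": "East Asia"},
    {"name": "p2", "region": "South Asia"},
    {"name": "p3", "region": "Europe"},
]
a, b = pair_personas(personas, random.Random(1))
rng = random.Random(0)
hits = sum(maybe_instant_event(["knock"], rng) is not None for _ in range(10_000))
```

Run over many turns, roughly one in twenty incorporates an interruption, which matches the intuition that most turns pass without friction.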

Chat Generation

Grounded in an experience, an LLM writes the conversation turn by turn. A running internal state tracks which topic is active, how many turns remain in that slot, and what comes next, allowing for smooth topic transitions — and a rolling summarization memory compresses older messages to keep the active window manageable.
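The rolling summarization memory can be sketched roughly as follows. The turn budgets, the class name, and the `summarize()` stub (which stands in for an LLM call) are all illustrative assumptions; only the compress-older-messages behavior comes from the post.

```python
MAX_ACTIVE_TURNS = 12  # assumed budget before compression kicks in
KEEP_RECENT = 6        # assumed number of turns kept verbatim

def summarize(summary: str, old_turns: list[str]) -> str:
    # Placeholder for an LLM summarization call.
    return summary + " | " + f"{len(old_turns)} older turns compressed"

class RollingMemory:
    def __init__(self):
        self.summary = "(start)"
        self.turns: list[str] = []

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > MAX_ACTIVE_TURNS:
            # Fold everything except the most recent turns into the summary.
            old, self.turns = self.turns[:-KEEP_RECENT], self.turns[-KEEP_RECENT:]
            self.summary = summarize(self.summary, old)

    def context(self) -> str:
        # What the chat model actually sees: running summary + recent turns.
        return "\n".join([self.summary, *self.turns])

mem = RollingMemory()
for i in range(20):
    mem.add_turn(f"turn {i}")
```

The point of the design is that the active window stays bounded no matter how long the conversation runs, at the cost of lossy recall of early turns.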

One key failure mode in SOC-2508 was what we call knowledge dumping: personas would burn through their most emotionally significant material in the first few exchanges, producing conversations that felt artificially deep too early. The topic tracker directly addresses this by anchoring the model to the current slot; it can’t advance until that turn budget is spent. Depth has to be earned through the structure we’ve defined.

Each turn can include 1–3 messages with a type attribute for image, audio, video, or sticker content. Every message carries time and date attributes (t="HH:MM" and d="DD.MM"); the model calculates gaps natively, reflecting async rhythms and long absences based on the experience's defined cadence. This replaces the explicit delay tags used in SOC-2508, making pacing an emergent property of the conversation rather than an annotation layered on top of it. Freeform conversations run to the full turn limit; structured and semi-structured conversations may end early via the <predefined_topics_exhausted/> tag, but only once all topic slots are spent. This replaces a suppression heuristic from SOC-2508 that required constant programmatic override. A representative chat excerpt illustrates how these mechanisms interact:

xml
<turn>
<state>
Journalism question; 8 turns remaining; HSE entrance exams
</state>
<message t="22:51" d="15.03" type="text">hse journalism</message>
<message t="22:54" d="15.03" type="text">everyone says the budget spots are basically lottery</message>
</turn>
...
<turn>
<state>
HSE entrance exams; 5 turns remaining; Ekaterina's twenties reflection
</state>
<message t="00:19" d="16.03" type="text">therapy speak or not. thats something i needed to hear i think</message>
<message t="00:26" d="16.03" type="text">anyway. if were being realistic about hse. what would you actually do in my position</message>
</turn>
...
<turn>
<state>
Family context; 1 turn remaining; Ekaterina's twenties reflection completed
</state>
<message t="02:44" d="16.03" type="text">you're going to spend years studying something. might as well be the one that lives in your head already</message>
</turn>
...
<turn>
<state>
Family context; 0 turns remaining; Ekaterina's twenties reflection completed
<predefined_topics_exhausted/>
</state>
<message t="03:02" d="16.03" type="text">https://youtu.be/kL8n9jVxrWQ</message>
<message t="03:05" d="16.03" type="text">anyway. its late. you should sleep. thanks for. idk. answering honestly i guess</message>
</turn>

Notice the timestamps: 22:51 on the 15th, then 03:05 the following morning. That pacing emerged entirely from the experience scaffold and the model working through its topic slots — no annotation, no post-processing. The turn countdown is visible decrementing across the excerpt, and <predefined_topics_exhausted/> lands exactly when the counter hits zero.
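The topic tracker behind those state lines can be sketched as a small slot machine over (topic, turn budget) pairs. The slot data below is taken from the example experience earlier in the post; the class and method names are ours, not the production code.

```python
class TopicTracker:
    def __init__(self, slots):
        self.slots = list(slots)  # [(topic, turns_remaining), ...]
        self.index = 0

    def state(self):
        # Mirrors the <state> line: current topic, budget, and what comes next.
        topic, remaining = self.slots[self.index]
        nxt = self.slots[self.index + 1][0] if self.index + 1 < len(self.slots) else None
        return topic, remaining, nxt

    def consume_turn(self):
        topic, remaining = self.slots[self.index]
        self.slots[self.index] = (topic, remaining - 1)
        if remaining - 1 == 0 and self.index + 1 < len(self.slots):
            self.index += 1  # advance only once the current budget is spent

    def exhausted(self):
        # <predefined_topics_exhausted/> fires only when every slot is done.
        return all(r == 0 for _, r in self.slots)

tracker = TopicTracker([
    ("life updates", 8),
    ("future anxieties", 6),
    ("small comforts", 4),
])
for _ in range(18):  # 8 + 6 + 4 turns
    tracker.consume_turn()
```

Because the tracker refuses to advance mid-budget, the model cannot skip ahead to emotionally heavy material, which is the structural fix for the knowledge-dumping failure described above.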

Known Limitations

We want to be clear about what this dataset does not yet do well. SPB-2602 personas are intentionally weighted toward younger adults; broader demographic coverage is planned for future releases. Despite diversity injection, the model can still subtly converge on certain personality archetypes across iterations — most visibly in how characters handle stress and coping. Specific links, places, institutions, and proper nouns within conversations may be fictional or confabulated, since we apply no factual grounding. Finally, the model still reaches for 3+ messages per turn more often than slower-paced experiences call for, producing a mild but consistent pacing artifact. These are the problems we know about; there are almost certainly others.

Directions for Future Work

Scale is the immediate priority; a larger batch is running through the improved pipeline now. The harder problem is evaluation. Spot-checking by a single reviewer doesn't scale, and surface impressions miss a lot: demographic balance, conversational naturalness, topic adherence. We conjecture that a combination of automated metrics and targeted human evaluation frameworks could give us a much cleaner signal here, though designing those metrics in a way that tracks what actually matters, rather than what's easy to measure, is itself an open problem. We're actively thinking about what a principled evaluation suite for synthetic dialogue looks like, and we'd welcome collaboration on it. If you have ideas, reach out.

Conversations generated so far are viewable in the SOC Visualizer. Both SOC-2602 and SPB-2602 are released under CC BY 4.0. Generation scripts, seed samples, and prompt templates live on the dev branch for now — a tagged release will follow.

bibtex
@misc{marcodsn_2026_SOC2602,
  title  = {Synthetic Online Conversations},
  author = {Marco De Santis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/datasets/marcodsn/SOC-2602},
}

@misc{marcodsn_2026_SPB2602,
  title  = {Synthetic Persona Bank},
  author = {Marco De Santis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/datasets/marcodsn/SPB-2602},
}