The Synthetic Online Conversations (SOC) Dataset

This is a re-introduction of the Synthetic Online Conversations (SOC) dataset, an ongoing effort to build a large-scale, realistic collection of synthetic dialogues for training and evaluating NLP models. SOC was first released in August 2025 as SOC-2508. Since then the pipeline has gone through several rounds of iteration, and we're now re-generating the entire dataset from scratch, starting with a fresh persona bank, SPB-2602. This post covers both what's new and how the system works end-to-end.

Why This Exists

Publicly available conversational data has a real problem: what’s out there is either small, domain-specific, or scraped from places that raise obvious ethical and privacy concerns. Crowdsourced alternatives don’t scale, and workers tend to produce formulaic exchanges that flatten out the messiness of real conversation.

Synthetic generation is the obvious path forward — but only if you’re careful about how you do it. Most synthetic conversational datasets skip the scaffolding that makes real conversations feel grounded. Two people don’t just talk; they have a history, a reason for talking right now, and a thousand small things happening around them. SOC tries to capture that.

How It’s Built

The core idea is to build conversations bottom-up — from people, to situations, to chats. Each stage grounds the next.

```plaintext
Seed data
  → Persona generation     (iterative + diversity resets)
  → Experience generation  (pairing + relationship + situation + trigger)
  → Chat generation        (multi-turn, media tags, timestamp pacing)
```

The pipeline takes heavy inspiration from ConvoGen, a multi-agent framework by Gody et al. that grounds conversations in generated experiences — each experience bundles personas, a relationship, a situation, a topic, and a conversation starter — and uses iterative sampling from a dynamically updated few-shot hub to prevent templatic repetition at scale.

SOC adopts that experience-first architecture and the iterative sampling idea, but departs from ConvoGen in three ways: it uses a single LLM writing both sides rather than separate AutoGen agents, it bakes timestamps and multi-message turns directly into the generation loop rather than annotating them afterward, and it adds a rolling summarization memory to manage long-context drift. Each of those choices plays out in the stages below.

The Synthetic Persona Bank (SPB-2602)

Every conversation starts with the people having it. This release ships with a fresh set of character personas, published separately as SPB-2602.

Personas are generated using Kimi-K2-0905, Kimi-K2.5, and GLM-5. Prompts push toward psychological realism and show-don’t-tell narrative writing, with a spread of occupations, personalities, life circumstances, and socioeconomic backgrounds — while still aiming for the statistical average rather than the dramatic exception.

A custom StatsEngine samples a world region, locale-appropriate subregion, name, and age for each call. Age is sampled with a mean of 25 and σ=5, intentionally skewed toward younger adults to match the downstream use case of online conversations; broader age coverage is planned for future releases. The region feeds into the prompt to keep cultural details coherent rather than generic.
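As a rough sketch, the age draw could look like the following. The mean and σ come from the post; the resample-until-valid clamp and the [18, 60] bounds are assumptions for illustration:

```python
import random

def sample_age(mean=25.0, sigma=5.0, lo=18, hi=60):
    """Draw an age from N(mean, sigma), re-sampling until it lands in [lo, hi].

    mean/sigma match the post; the [18, 60] clamp is an assumption.
    """
    while True:
        age = round(random.gauss(mean, sigma))
        if lo <= age <= hi:
            return age
```

With these parameters most draws land in the late teens to early thirties, which is exactly the skew described above.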

Preventing style collapse is an active concern. Each generation call receives a small set of examples as do-not-imitate-too-closely references. The pool is a rolling window capped at 20 — seeded with hand-written personas at warmup, then mixing those seeds with recent outputs as generation proceeds. Older generations naturally age out, keeping references fresh without losing accumulated variety.
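A minimal sketch of that rolling window, assuming a plain deque with the 20-item cap mentioned above (the sample size k and the class API are illustrative, not the actual SOC code):

```python
import random
from collections import deque

class FewShotPool:
    """Rolling window of do-not-imitate references, capped at 20 per the post.

    Seeded with hand-written examples at warmup; recent generations push
    older entries out, so references stay fresh.
    """
    def __init__(self, seeds, cap=20):
        self.pool = deque(seeds, maxlen=cap)

    def sample(self, k=4):
        # k is an illustrative choice; the post only says "a small set"
        return random.sample(list(self.pool), min(k, len(self.pool)))

    def add(self, example):
        self.pool.append(example)  # oldest entry ages out automatically
```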

The resulting records are structured markdown-in-XML character cards covering: Basic Information, Physical & Lifestyle, Personality Overview, Core Traits, Emotional Profile, Relationships, Values, Motivations & Fears, Behavioral Patterns, Communication Style, Example Messages, and a Summary.

```json
{
  "persona_text": "<character>\n**Basic Information**\n**Name:** Yong (杨勇)\n**Age:** 18\n**Location:** Wuhan, Hubei Province, China\n**Pronouns:** He/him\n\n**Physical & Lifestyle**\nYong is a lean, average-height young man who hasn't quite grown into his features yet...\n\n**Personality Overview**\nYong is at the exact point where he knows who he was...\n</character>",
  "meta": {
    "iteration": 42,
    "model": "moonshotai/Kimi-K2-Instruct-0905",
    "region": "East Asia",
    "subregion": "Wuhan, Hubei Province, China",
    "name": "Yong",
    "age": 18,
    "time_taken": 12.4
  }
}
```

Experiences — Context Before Conversation

Personas alone aren’t enough. The experience step answers a more specific question: why are these two people talking right now?

Each experience pairs two personas and synthesizes their relationship, the platform and situation, a message cadence, a concrete trigger for why the conversation starts today, a topic roadmap, and a handful of plausible background interruptions (a mother knocking, a phone battery dying) woven in mid-conversation to simulate real-life friction. Instant events are not guaranteed: at each turn the dynamic prompt has a 5% chance of incorporating one, so most turns pass without any interruption.
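The per-turn roll is simple to sketch. The 5% probability comes from the post; the function shape is hypothetical:

```python
import random

def maybe_inject_event(events, p=0.05, rng=random):
    """With probability p (5% per the post), pick one pre-generated
    instant event to weave into this turn's prompt; otherwise None."""
    if events and rng.random() < p:
        return rng.choice(events)
    return None
```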

Pairing is 50% region-matched and 50% fully random, balancing local coherence with cross-cultural variety. The generation pool is also actively monitored, with diversity hints injected when needed — forcing a freeform generation, for instance, if the recent pool is overwhelmingly semi-structured.
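The pairing step could be sketched like this, assuming each persona record exposes a region field as in the meta block shown earlier (the fallback when a region has no second persona is an assumption):

```python
import random

def pair_personas(personas, region_match_p=0.5):
    """Pick two personas: half the time from the same region, otherwise
    fully at random, per the 50/50 split described in the post."""
    first = random.choice(personas)
    if random.random() < region_match_p:
        same = [p for p in personas if p["region"] == first["region"] and p is not first]
        if same:  # assumed fallback: go random if the region has no partner
            return first, random.choice(same)
    others = [p for p in personas if p is not first]
    return first, random.choice(others)
```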

Experience generation uses the same rolling few-shot window as persona generation, keeping the pool fresh as batch size grows.

Here’s an excerpt from a generated experience:

```xml
<experience>
Ahmed and Wei have been online friends for eight months after meeting in a Discord server
for mobile device repair enthusiasts...

It's Saturday evening in Peshawar, Sunday morning in Yantai. Ahmed is at the shop alone,
Yasir having left early for a family function. Wei is at his desk, headphones in, staring
at an IELTS writing prompt he's rewritten four times...

Conversation style: semi-structured
Message cadence: async (hours between replies, sometimes faster when both happen to be online)

The conversation will likely cover the following topics:
- Life updates: Wei's IELTS prep stress, Ahmed's family obligations and the slow shop week (8 turns) ← PRIMARY
- Future anxieties: Both circling around the question of whether their current paths lead somewhere... (6 turns)
- Small comforts: A funny moment from the tech Discord, a video game recommendation... (4 turns)

Initial state: life updates; 8; future anxieties

Possible instant events:
- Ahmed's cousin Yasir returns to the shop and makes a comment that Ahmed doesn't want to translate
- Wei's phone battery dies mid-reply, forcing him to find his charger and reconsider what he was typing
- A message from Ahmed's mother arrives in WhatsApp, visible in his notifications but unopened
- Wei's roommate starts cooking something that smells good, briefly pulling his attention away
</experience>
```

Chat Generation

Grounded in an experience, an LLM writes the conversation turn by turn. A running internal state tracks which topic is active, how many turns remain in that slot, and what comes next, allowing for smooth topic transitions. A rolling summarization memory compresses older messages to keep the active window manageable.
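A rolling summarization memory of this kind might look like the following sketch; the window size and the summarize callable (standing in for an LLM call) are illustrative, not the actual SOC implementation:

```python
def build_context(messages, summarize, window=12):
    """Keep the last `window` messages verbatim and compress everything
    older into a running summary. `summarize` is a stand-in for an LLM
    call; window=12 is an illustrative choice, not from the post."""
    if len(messages) <= window:
        return None, messages
    summary = summarize(messages[:-window])
    return summary, messages[-window:]
```

The prompt for each new turn would then carry the summary plus the recent window, keeping the active context bounded no matter how long the chat runs.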

Preventing knowledge dumping. A key failure mode in SOC-2508 was personas burning through their most emotionally significant material in the first few exchanges — conversations that felt artificially deep too early. The topic tracker directly addresses this: the model is anchored to the current slot and can’t advance until that turn budget is spent. Depth has to be earned.
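Here is a sketch of a tracker that enforces the turn budget and emits state lines in the same "topic; N turns remaining; next topic" shape seen in the excerpt further down (the class name and API are hypothetical):

```python
class TopicTracker:
    """Tracks the active topic slot and its remaining turn budget.
    The model can't advance until the current budget is spent."""
    def __init__(self, slots):
        # slots: list of (topic, turn_budget), e.g. from the experience roadmap
        self.slots = list(slots)
        self.index = 0
        self.remaining = self.slots[0][1]

    def state(self):
        topic = self.slots[self.index][0]
        nxt = self.slots[self.index + 1][0] if self.index + 1 < len(self.slots) else "completed"
        return f"{topic}; {self.remaining} turns remaining; {nxt}"

    def tick(self):
        self.remaining -= 1
        if self.remaining <= 0 and self.index + 1 < len(self.slots):
            self.index += 1
            self.remaining = self.slots[self.index][1]

    def exhausted(self):
        return self.index == len(self.slots) - 1 and self.remaining <= 0
```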

Multi-message turns and native pacing. Each turn can include 1–3 messages with a type attribute for image, audio, video, or sticker content. Every message carries time and date attributes (t="HH:MM" and d="DD.MM"); the model calculates gaps natively, reflecting async rhythms and long absences based on the experience's defined cadence. This replaces the explicit delay tags used in SOC-2508, making pacing an emergent property of the conversation rather than an annotation layered on top of it.
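Since pacing lives entirely in the t and d attributes, downstream consumers can recover inter-message gaps with a little datetime arithmetic. A sketch (the year is an assumption, since d omits it):

```python
from datetime import datetime

def gap_minutes(t1, d1, t2, d2, year=2026):
    """Minutes between two messages given their t="HH:MM" / d="DD.MM"
    attributes. Assumes both fall in the same (supplied) year; day
    rollovers are encoded in the d attribute."""
    fmt = "%d.%m.%Y %H:%M"
    a = datetime.strptime(f"{d1}.{year} {t1}", fmt)
    b = datetime.strptime(f"{d2}.{year} {t2}", fmt)
    return (b - a).total_seconds() / 60
```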

Conversation endings. Freeform conversations run to the full turn limit. Structured and semi-structured conversations may end early via the <predefined_topics_exhausted/> tag, but only once all topic slots are spent. This replaces a suppression heuristic from SOC-2508 that needed constant programmatic override.

```xml
<turn>
<state>
Journalism question; 8 turns remaining; HSE entrance exams
</state>
<message t="22:51" d="15.03" type="text">hse journalism</message>
<message t="22:54" d="15.03" type="text">everyone says the budget spots are basically lottery</message>
</turn>
...
<turn>
<state>
HSE entrance exams; 5 turns remaining; Ekaterina's twenties reflection
</state>
<message t="00:19" d="16.03" type="text">therapy speak or not. thats something i needed to hear i think</message>
<message t="00:26" d="16.03" type="text">anyway. if were being realistic about hse. what would you actually do in my position</message>
</turn>
...
<turn>
<state>
Family context; 1 turn remaining; Ekaterina's twenties reflection completed
</state>
<message t="02:44" d="16.03" type="text">you're going to spend years studying something. might as well be the one that lives in your head already</message>
</turn>
...
<turn>
<state>
Family context; 0 turns remaining; Ekaterina's twenties reflection completed
<predefined_topics_exhausted/>
</state>
<message t="03:02" d="16.03" type="text">https://youtu.be/kL8n9jVxrWQ</message>
<message t="03:05" d="16.03" type="text">anyway. its late. you should sleep. thanks for. idk. answering honestly i guess</message>
</turn>
```

Notice the timestamps: 22:51 on the 15th, then 03:05 the following morning. That came entirely from the experience scaffold and the model working through its topic slots. Also notice the state tracker: the turn countdown visibly decrements across the excerpt, and <predefined_topics_exhausted/> lands exactly when it hits zero.

Known Limitations

Age skew. SPB-2602 personas are intentionally weighted toward younger adults; broader coverage is planned.

Archetype drift. Despite diversity injection, the model can still subtly converge on certain personality types across iterations — most visibly in how characters handle stress and coping.

No factual grounding. Specific links, places, institutions, and proper nouns may be fictional or confabulated.

Multi-message pacing. The model still reaches for 3+ messages per turn more often than slower-paced experiences call for.

What’s Next

Scale is the immediate priority — a larger batch is running through the improved pipeline now. The more pressing problem is evaluation. Spot-checking by a single reviewer doesn’t scale, and surface impressions miss a lot: demographic balance, conversational naturalness, topic adherence. The next phase will focus on building more systematic tools — automated metrics, human evaluation frameworks, or both. If you have ideas or want to collaborate on this, reach out.

Conversations generated so far are viewable in the SOC Visualizer. Both SOC-2602 and SPB-2602 are released under CC BY 4.0. Generation scripts, seed samples, and prompt templates are on the dev branch for now — a tagged release will follow.

If you dig in and find things to improve, or interesting ways to use this, I’d love to hear from you.

```bibtex
@misc{marcodsn_2026_SOC2602,
  title  = {Synthetic Online Conversations},
  author = {Marco De Santis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/datasets/marcodsn/SOC-2602},
}

@misc{marcodsn_2026_SPB2602,
  title  = {Synthetic Persona Bank},
  author = {Marco De Santis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/datasets/marcodsn/SPB-2602},
}
```