Does anyone have experience with, or could explain, the process of how reasoning datasets like QuixAI/dolphin-r1 are synthesized? I would highly appreciate it!
I constantly process datasets for training bots, and that data looks incredibly noisy to me. I would not use it to train anything; it needs to be cleaned up first. Unless your bot can hit at least a 1 percent loss before the 10,000th step, the data is too noisy, and unless you want to build more of the same junk that gets passed off as AI, you shouldn't use data that looks like that.

The tip I will share with you: clean your data. Data preparation is everything. Your vocab file is the key to everything. People in this industry undervalue it, but the truth is it's the Rosetta stone. Put all of your time and effort into a pristine vocab and you set the foundation for pristine data. Your bot does not need all of that stray syntax and markup. It's just noise. Noise = failure. Failure = a bot that can't tell you how many R's are in the word strawberry.
I’m not sure whether these will be sufficient data for reasoning, but here are some resources.
Thank you! Could you please share a bit more on data cleaning?