Does anyone have experience with, or could explain, the process of how reasoning datasets like QuixAI/dolphin-r1 are synthesized? I would highly appreciate it!
I constantly process datasets for training bots, and that data looks incredibly noisy to me. I would not use it to train anything; it needs to be cleaned up first. Unless your bot can hit at least a 1 percent loss before the 10,000th step, the data is too noisy, and unless you want to build more of the same junk that gets passed off as AI, you shouldn't use data that looks like that.

The tip I will share with you: clean your data. Data preparation is everything. Your vocab file is the key to everything. People in this industry undervalue it, but the truth is it's the Rosetta stone. Put all of your time and effort into a pristine vocab and you set the foundation for pristine data. Your bot does not need all of that stray syntax and markup. It's just noise. Noise = failure. Failure = a bot that can't tell you how many R's are in the word strawberry.
I’m not sure whether these will be sufficient data for reasoning, but here are some resources.
Thank you! Could you please share a bit more on data cleaning?