Handling a large amount of data where not all of it is present at any moment #13815
Unanswered · paniabhisek asked this question in Q&A · Replies: 0 comments
We want to train on a large amount of data, but all of it cannot be downloaded at once, so we plan to download it chunk by chunk.
However, when running megatron_gpt_pretraining.py, I need to list the data files in the config file. Does it read the data files sequentially and send them to training one by one? If I list file names near the end of the list that are not present yet, will it work?
Is there a better way to do this?