I am trying to use the recently published tensorflow_datasets API to train a Keras model on the Open Images dataset. The dataset is about 570 GB in size. I downloaded the data with the following code:
import tensorflow_datasets as tfds
import tensorflow as tf
open_images_dataset = tfds.image.OpenImagesV4()
open_images_dataset.download_and_prepare(download_dir="/notebooks/dataset/")
After the download completed, the connection to my Jupyter notebook was somehow interrupted, but the extraction seems to have finished as well; at least every downloaded file has a counterpart in the "extracted" folder. However, I am now unable to access the downloaded data:
tfds.load(name="open_images_v4", data_dir="/notebooks/open_images_dataset/extracted/", download=False)
This only gives the following error:
AssertionError: Dataset open_images_v4: could not find data in /notebooks/open_images_dataset/extracted/. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
When I call download_and_prepare() again, it only re-downloads the whole dataset.
Am I missing something?
After the download, the folder under "extracted" contains 18 .tar.gz files.
Answer
This is with tensorflow-datasets 1.0.1 and tensorflow 2.0.
The folder hierarchy should be like this:
/notebooks/open_images_dataset/extracted/open_images_v4/0.1.0
All datasets are versioned, so once the prepared files sit in that folder, the data can be loaded like this:
ds = tfds.load('open_images_v4', data_dir='/notebooks/open_images_dataset/extracted', download=False)
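Before calling tfds.load, it can help to confirm the versioned folder actually exists. A minimal sketch, using only the standard library and assuming the `<data_dir>/<dataset_name>/<version>` layout described above (the helper name is my own, not part of tfds):

```python
import os
import tempfile

def expected_dataset_dir(data_dir, name="open_images_v4", version="0.1.0"):
    """Return the directory where tfds.load will look for prepared files."""
    return os.path.join(data_dir, name, version)

# Simulate the layout in a temporary directory to show the check.
with tempfile.TemporaryDirectory() as data_dir:
    target = expected_dataset_dir(data_dir)
    os.makedirs(target)
    # tfds treats the dataset as prepared when the TFRecord shards and
    # dataset_info.json live in this versioned folder.
    print(os.path.isdir(target))  # True
```

If the check fails, moving the prepared files into that versioned subfolder (rather than leaving them directly under "extracted") should resolve the AssertionError.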
I didn't have open_images_v4 data. I put cifar10 data into a folder named open_images_v4 to check what folder structure tensorflow_datasets was expecting.