I am trying to use the recently published tensorflow_datasets API to train a Keras model on the Open Images dataset. The dataset is about 570 GB in size. I downloaded the data with the following code:
import tensorflow_datasets as tfds
import tensorflow as tf
open_images_dataset = tfds.image.OpenImagesV4()
open_images_dataset.download_and_prepare(download_dir="/notebooks/dataset/")
After the download completed, the connection to my Jupyter notebook was somehow interrupted, but the extraction seems to have finished as well; at least every downloaded file has a counterpart in the "extracted" folder. However, I am now unable to access the downloaded data:
tfds.load(name="open_images_v4", data_dir="/notebooks/open_images_dataset/extracted/", download=False)
This only gives the following error:
AssertionError: Dataset open_images_v4: could not find data in /notebooks/open_images_dataset/extracted/. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
When I call download_and_prepare() again, it only re-downloads the whole dataset.
Am I missing something?
After the download, the folder under "extracted" contains 18 .tar.gz files.
Answer
This is with tensorflow-datasets 1.0.1 and tensorflow 2.0.
The folder hierarchy should be like this:
/notebooks/open_images_dataset/extracted/open_images_v4/0.1.0
All datasets are versioned, so once the prepared files sit in that folder, the data can be loaded like this:
ds = tfds.load('open_images_v4', data_dir='/notebooks/open_images_dataset/extracted', download=False)
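Before calling tfds.load, it can help to confirm the versioned folder actually exists. A minimal sketch, using only the standard library and assuming the `<data_dir>/<dataset_name>/<version>` layout described above (the helper name is my own, not part of tfds):

```python
import os
import tempfile

def expected_dataset_dir(data_dir, name="open_images_v4", version="0.1.0"):
    """Return the directory where tfds.load will look for prepared files."""
    return os.path.join(data_dir, name, version)

# Simulate the layout in a temporary directory to show the check.
with tempfile.TemporaryDirectory() as data_dir:
    target = expected_dataset_dir(data_dir)
    os.makedirs(target)
    # tfds treats the dataset as prepared when the TFRecord shards and
    # dataset_info.json live in this versioned folder.
    print(os.path.isdir(target))  # True
```

If the check fails, moving the prepared files into that versioned subfolder (rather than leaving them directly under "extracted") should resolve the AssertionError.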
I didn't have open_images_v4 data. I put cifar10 data into a folder named open_images_v4 to check what folder structure tensorflow_datasets was expecting.