Fastest way to get the number of files in a directory with over 200,000 files

Shared by 小妞丶哥的裤衩呢.

I have some directories containing test data, typically over 200,000 small (~4k) files per directory.

I am using the following C# code to get the number of files in a directory:

int fileCount = System.IO.Directory.GetFiles(@"C:\SomeDirectory").Length;

This is very, very slow however - are there any alternatives that I can use?

Each folder contains data for one day, and we will have around 18 months of directories (~550 directories). I am also very interested in performance enhancements people have found by reworking flat directory structures to more nested ones.

Recommended answer

Not using the System.IO.Directory namespace, there isn't. You'll have to find a way of querying the directory that doesn't involve creating a massive list of files.
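If .NET 4.0 or later is available, one partial mitigation does stay within the framework: Directory.EnumerateFiles streams names lazily instead of building the array that GetFiles returns. A minimal sketch (the FileCounter wrapper is illustrative, not from the original answer):

```csharp
using System.IO;
using System.Linq;

static class FileCounter
{
    // EnumerateFiles yields file names one at a time instead of
    // materializing the full array that GetFiles returns, so memory
    // stays flat even with 200,000+ entries. The directory is still
    // walked once, so this is a mitigation, not a different algorithm.
    public static int CountFiles(string directory)
    {
        return Directory.EnumerateFiles(directory).Count();
    }
}
```

Calling FileCounter.CountFiles(@"C:\SomeDirectory") is still bounded by how fast the file system can enumerate the directory, but it avoids allocating a 200,000-element string array up front.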

This seems like a bit of an oversight from Microsoft; the Win32 APIs have always had functions that can count files in a directory.
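For reference, this is roughly what counting through the Win32 enumeration API looks like from C# via P/Invoke. FindFirstFile/FindNextFile/FindClose are the real kernel32 functions; the wrapper class and its error handling are an illustrative sketch, and note that Win32 enumerates entries rather than exposing a ready-made count:

```csharp
using System;
using System.Runtime.InteropServices;

static class Win32FileCount
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    private struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll")]
    private static extern bool FindClose(IntPtr hFindFile);

    // Walks the directory once, counting entries without ever building
    // a managed string array. Subdirectories (including "." and "..")
    // are excluded by the attribute check.
    public static int CountFiles(string directory)
    {
        WIN32_FIND_DATA data;
        IntPtr handle = FindFirstFile(System.IO.Path.Combine(directory, "*"), out data);
        if (handle == new IntPtr(-1))   // INVALID_HANDLE_VALUE
            return 0;
        int count = 0;
        try
        {
            const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;
            do
            {
                if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) == 0)
                    count++;
            } while (FindNextFile(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
        return count;
    }
}
```

This is Windows-only, of course, since it binds directly to kernel32.dll.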

You may also want to consider splitting up your directory. How you manage a 200,000-file directory is beyond me :-)

Update:

John Saunders raises a good point in the comments. We already know that (general purpose) file systems are not well equipped to handle this level of storage. One thing that is equipped to handle huge numbers of small "files" is a database.

If you can identify a key for each (containing, for example, date, hour and customer number), these files should be injected into a database. The 4K record size and 108 million rows (200,000 rows/day * 30 days/month * 18 months) should be easily handled by most professional databases. I know that DB2/z would chew on that for breakfast.
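As a sketch of what such a table might look like (all table and column names here are hypothetical, not from the answer):

```sql
-- Hypothetical schema; column names are illustrative.
create table test_files (
    directory_name varchar(255) not null,  -- logical grouping, e.g. one day's data
    file_name      varchar(255) not null,
    customer_no    integer      not null,
    created_at     timestamp    not null,
    payload        blob(4096)   not null,  -- the ~4K file contents
    primary key (directory_name, file_name)
);
```

The leading primary-key column doubles as the index on directory_name that the count query at the end of the answer relies on.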

Then, when you need some test data extracted to files, you have a script/program which just extracts the relevant records onto the file system. Then run your tests to successful completion and delete the files.
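A sketch of that extraction step, with the database query replaced by a plain enumerable so the shape of the program is visible without committing to a specific data-access API (TestDataExtractor and its signature are hypothetical):

```csharp
using System.Collections.Generic;
using System.IO;

static class TestDataExtractor
{
    // Writes each (fileName, contents) record into targetDirectory and
    // returns the number of files written. In the real program the
    // records would come from the database query; taking any enumerable
    // keeps the sketch testable without a DB connection.
    public static int ExtractToFiles(string targetDirectory,
                                     IEnumerable<KeyValuePair<string, byte[]>> records)
    {
        Directory.CreateDirectory(targetDirectory);
        int written = 0;
        foreach (var record in records)
        {
            File.WriteAllBytes(Path.Combine(targetDirectory, record.Key), record.Value);
            written++;
        }
        return written;
    }
}
```

After the tests pass, deleting targetDirectory recursively cleans everything up, since the database remains the system of record.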

That should make your specific problem quite easy to do:

select count(*) from test_files where directory_name = '/SomeDirectory'

assuming you have an index on directory_name, of course.
