Gzip compressed files in Hadoop

Shared by 老仙草.

I am new to hadoop and trying to process wikipedia dump. It's a 6.7 GB gzip compressed xml file. I read that hadoop supports gzip compressed files but can only be processed by mapper on a single job as only one mapper can decompress it. This seems to put a limitation on the processing. Is there an alternative? like decompressing and splitting the xml file into multiple chunks and recompressing them with gzip.

I read about this at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.

Recommended answer

A file compressed with the GZIP codec cannot be split because of the way this codec works. A single SPLIT in Hadoop can only be processed by a single mapper; so a single GZIP file can only be processed by a single Mapper.
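
To make this concrete, here is a minimal sketch (my addition, not part of the original answer) of the check that Hadoop's TextInputFormat performs when deciding whether an input file can be split: only codecs that implement SplittableCompressionCodec allow splits, and GzipCodec does not, so a .gz input becomes a single split handled by a single mapper. The input path below is a made-up example.

```java
// Sketch only: mirrors the splittability check in TextInputFormat.isSplitable().
// GzipCodec does not implement SplittableCompressionCodec, so a .gz file is
// treated as one split and therefore handled by one mapper.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Hypothetical path to the wikipedia dump; replace with your own.
        Path input = new Path("/data/enwiki-pages-articles.xml.gz");

        CompressionCodec codec = factory.getCodec(input);
        boolean splittable =
                codec == null || codec instanceof SplittableCompressionCodec;

        System.out.println("codec: "
                + (codec == null ? "none" : codec.getClass().getSimpleName()));
        System.out.println("splittable: " + splittable); // prints false for GzipCodec
    }
}
```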

There are at least three ways of going around that limitation:

1. As a preprocessing step: uncompress the file and recompress using a splittable codec (LZO).
2. As a preprocessing step: uncompress the file, split it into smaller sets and recompress. (See this; a rough sketch also follows this list.)
3. Use this patch for Hadoop (which I wrote) that allows for a way around this: Splittable Gzip.
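
To illustrate the second option, here is a rough, self-contained sketch (my own illustration, not code from the post) that decompresses the dump with java.util.zip and rewrites it as a number of smaller gzip files, each of which Hadoop can then assign to its own mapper. The file name and chunk size are made-up values, and the split happens on line boundaries; a real preprocessing job for the wikipedia dump would need to split on record (</page>) boundaries instead.

```java
// Sketch of the "uncompress, split into smaller sets, recompress" preprocessing step.
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class RechunkGzip {
    public static void main(String[] args) throws IOException {
        // Hypothetical input name; pass the real dump path as the first argument.
        String inPath = args.length > 0 ? args[0] : "enwiki-pages-articles.xml.gz";
        long linesPerChunk = 5_000_000L; // tune so each chunk is a few hundred MB

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(inPath)), StandardCharsets.UTF_8))) {
            int chunk = 0;
            long lines = 0;
            Writer out = newChunkWriter(inPath, chunk);
            for (String line; (line = in.readLine()) != null; ) {
                out.write(line);
                out.write('\n');
                if (++lines % linesPerChunk == 0) {   // start a new gzip part
                    out.close();
                    out = newChunkWriter(inPath, ++chunk);
                }
            }
            out.close();
        }
    }

    // Opens part files named <input>.part-00000.gz, <input>.part-00001.gz, ...
    private static Writer newChunkWriter(String base, int chunk) throws IOException {
        String name = String.format("%s.part-%05d.gz", base, chunk);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), StandardCharsets.UTF_8));
    }
}
```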

HTH
