Gzip compressed files in Hadoop

Shared by 老仙草.

I am new to hadoop and trying to process wikipedia dump. It's a 6.7 GB gzip compressed xml file. I read that hadoop supports gzip compressed files but can only be processed by mapper on a single job as only one mapper can decompress it. This seems to put a limitation on the processing. Is there an alternative? like decompressing and splitting the xml file into multiple chunks and recompressing them with gzip.

I read about this at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.

Recommended answer

A file compressed with the GZIP codec cannot be split because of the way this codec works. A single SPLIT in Hadoop can only be processed by a single mapper; so a single GZIP file can only be processed by a single Mapper.
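
To make this concrete, here is a minimal sketch (my addition, not part of the original answer) of the check that Hadoop's TextInputFormat performs when deciding whether an input file can be split: only codecs that implement SplittableCompressionCodec allow splits, and GzipCodec does not, so a .gz input becomes a single split handled by a single mapper. The input path below is a made-up example.

```java
// Sketch only: mirrors the splittability check in TextInputFormat.isSplitable().
// GzipCodec does not implement SplittableCompressionCodec, so a .gz file is
// treated as one split and therefore handled by one mapper.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplitCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Hypothetical path to the wikipedia dump; replace with your own.
        Path input = new Path("/data/enwiki-pages-articles.xml.gz");

        CompressionCodec codec = factory.getCodec(input);
        boolean splittable =
                codec == null || codec instanceof SplittableCompressionCodec;

        System.out.println("codec: "
                + (codec == null ? "none" : codec.getClass().getSimpleName()));
        System.out.println("splittable: " + splittable); // prints false for GzipCodec
    }
}
```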

There are at least three ways of going around that limitation:

1. As a preprocessing step: uncompress the file and recompress using a splittable codec (LZO).
2. As a preprocessing step: uncompress the file, split it into smaller sets and recompress. (See this; a rough sketch also follows this list.)
3. Use this patch for Hadoop (which I wrote) that allows for a way around this: Splittable Gzip.
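
To illustrate the second option, here is a rough, self-contained sketch (my own illustration, not code from the post) that decompresses the dump with java.util.zip and rewrites it as a number of smaller gzip files, each of which Hadoop can then assign to its own mapper. The file name and chunk size are made-up values, and the split happens on line boundaries; a real preprocessing job for the wikipedia dump would need to split on record (</page>) boundaries instead.

```java
// Sketch of the "uncompress, split into smaller sets, recompress" preprocessing step.
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class RechunkGzip {
    public static void main(String[] args) throws IOException {
        // Hypothetical input name; pass the real dump path as the first argument.
        String inPath = args.length > 0 ? args[0] : "enwiki-pages-articles.xml.gz";
        long linesPerChunk = 5_000_000L; // tune so each chunk is a few hundred MB

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(inPath)), StandardCharsets.UTF_8))) {
            int chunk = 0;
            long lines = 0;
            Writer out = newChunkWriter(inPath, chunk);
            for (String line; (line = in.readLine()) != null; ) {
                out.write(line);
                out.write('\n');
                if (++lines % linesPerChunk == 0) {   // start a new gzip part
                    out.close();
                    out = newChunkWriter(inPath, ++chunk);
                }
            }
            out.close();
        }
    }

    // Opens part files named <input>.part-00000.gz, <input>.part-00001.gz, ...
    private static Writer newChunkWriter(String base, int chunk) throws IOException {
        String name = String.format("%s.part-%05d.gz", base, chunk);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), StandardCharsets.UTF_8));
    }
}
```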

HTH
