CSV parsing options with .NET

Shared by user 偷走你的满目星河.


I'm looking at my delimited-file (e.g. CSV, tab-separated, etc.) parsing options on the MS stack in general, and .NET specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.


So my options appear to be:

1. Regex.Split
2. TextFieldParser
3. OLEDB CSV Parser


I have two criteria I must meet. First, given the following file which contains two logical rows of data (and five physical rows altogether):

101, Bob, "Keeps his house ""clean"".
Needs to work on laundry."
102, Amy, "Brilliant.
Driven.
Diligent."


The parsed results must yield two logical "rows," consisting of three strings (or columns) each. The third column of each row must preserve its newlines! Said differently, the parser must recognize when a line is "continuing" onto the next physical row because of an "unclosed" text qualifier.


The second criterion is that the delimiter and text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse:

var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";


A proper parsing of string "first" would be:

This
Is,A,Record
That "Cannot", they say,
_
_
be
rightly
parsed
at all

The '_' simply means that a blank was captured - I don't want a literal underbar to appear.


One important assumption can be made about the flat-files to be parsed: there will be a fixed number of columns per file.




Regex.Split


First, many responders comment that regex "is not the best way" to achieve the goal. I did, however, find a commenter who offered an excellent CSV regex:

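// Requires: using System.Text.RegularExpressions; Dump() is LINQPad's output helper.
// Split on commas that are followed by an even number of quotes, i.e. commas outside a quoted field.
var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
Regex.Split(first, regex).Dump();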


The results, applied to string "first," are quite wonderful:

"This"
"Is,A,Record"
"That ""Cannot"", they say,"
""
_
"be"
rightly
"parsed"
at all


It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-process step. Otherwise, this approach can be used to parse both sample strings "first" and "second," provided the regex is modified for tilde and pipe symbols accordingly. Excellent!
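The tilde-and-pipe modification can be sketched by building the same pattern from the two characters. This is only a minimal sketch under stated assumptions: `DelimitedRegex`, `BuildSplitPattern`, `SplitRecord` and `Unqualify` are hypothetical names (not part of .NET), the delimiter and qualifier are single characters, the qualifier is escaped by doubling it, and the qualifier is not `]` (it is placed inside a character class). It also performs the quote clean-up as a post-process step.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: the lookahead regex above, parameterised by
// delimiter and text qualifier.
public static class DelimitedRegex
{
    // Matches a delimiter that is followed by an even number of qualifiers,
    // i.e. a delimiter that sits outside any qualified field.
    public static string BuildSplitPattern(char delimiter, char qualifier)
    {
        string d = Regex.Escape(delimiter.ToString());
        string q = Regex.Escape(qualifier.ToString());
        return $"{d}(?=(?:[^{q}]*{q}[^{q}]*{q})*(?![^{q}]*{q}))";
    }

    // Splits one logical record and cleans up each field.
    public static string[] SplitRecord(string record, char delimiter, char qualifier)
    {
        string q = qualifier.ToString();
        return Regex.Split(record, BuildSplitPattern(delimiter, qualifier))
                    .Select(field => Unqualify(field, q))
                    .ToArray();
    }

    // Post-process step: strip the enclosing qualifiers and collapse doubled ones.
    private static string Unqualify(string field, string q)
    {
        bool qualified = field.Length >= 2
            && field.StartsWith(q, StringComparison.Ordinal)
            && field.EndsWith(q, StringComparison.Ordinal);
        return qualified
            ? field.Substring(1, field.Length - 2).Replace(q + q, q)
            : field;
    }
}
```

With this sketch, `DelimitedRegex.SplitRecord(first, ',', '"')` and `DelimitedRegex.SplitRecord(second, '|', '~')` would be the two calls. Note that the lookahead rescans the rest of the string for every delimiter, so the cost grows quadratically with record length.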


But the real problem pertains to the multi-line criterion. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical rows to read to complete the logical row, unless I've got a regex / state machine.


So this becomes a "chicken-and-egg" problem. My best option would be to read the entire file into memory as one giant string, and let the regex sort out the multiple lines (I didn't check if the above regex could handle that). If I've got a 10 gig file, this could be a bit precarious.
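One possible way out of the chicken-and-egg problem, short of loading the whole file: finding the end of a logical row only requires knowing whether the number of text qualifiers read so far is even. The following is a minimal sketch, not a tested solution. `LogicalRowReader` and `ReadLogicalRows` are hypothetical names, and it assumes the qualifier only ever appears as an enclosing qualifier or as a doubled escape, so that parity is reliable. Because `ReadLine` strips line endings, the sketch rejoins physical lines with "\n".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: groups physical lines into logical rows.
public static class LogicalRowReader
{
    // Lazily yields logical rows: physical lines are appended until the
    // running count of qualifiers is even, meaning no field is still open.
    public static IEnumerable<string> ReadLogicalRows(TextReader reader, char qualifier)
    {
        var row = new StringBuilder();
        bool insideQualifiedField = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A continuation line: restore the newline that ReadLine removed.
            if (insideQualifiedField) row.Append('\n');
            row.Append(line);

            // An odd number of qualifiers on this line toggles the open/closed state.
            if (line.Count(c => c == qualifier) % 2 == 1)
                insideQualifiedField = !insideQualifiedField;

            if (!insideQualifiedField)
            {
                yield return row.ToString();
                row.Clear();
            }
        }
        // Unterminated qualifier at end of input: hand back what was read.
        if (row.Length > 0) yield return row.ToString();
    }
}
```

Only one logical row is held in memory at a time, so file size matters less than the length of the longest row. Each row returned could then be handed to a field-splitting regex.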


TextFieldParser


Three lines of code will make the problem with this option apparent:

var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
reader.Delimiters = new string[] { @"|" };
reader.HasFieldsEnclosedInQuotes = true;


The Delimiters configuration looks good. However, "HasFieldsEnclosedInQuotes" is "game over." I'm stunned that the delimiters are arbitrarily configurable, yet the only text qualifier on offer is the quotation mark. Remember, I need configurability over the text qualifier. So unless someone knows a TextFieldParser configuration trick, this is game over.



OLEDB


A colleague tells me this option has two major failings. First, it has terrible performance for large (e.g. 10 gig) files. Second, so I'm told, it guesses the data types of input data rather than letting you specify them. Not good.



So I'd like to know the facts I got wrong (if any), and the other options that I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary text qualifier. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).




Did you try searching for an already-existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.
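For a sense of what a dedicated parser has to do, here is a minimal sketch of a single-pass, character-level state machine that covers both criteria from the question. It is not the implementation or API of the parser mentioned above or of any existing library: `DelimitedParser` and `Parse` are hypothetical names. It assumes single-character delimiter and qualifier, a qualifier escaped by doubling, and it skips blank lines and drops "\r" outside qualified fields.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical single-pass parser with configurable delimiter and qualifier.
public static class DelimitedParser
{
    public static IEnumerable<string[]> Parse(TextReader reader, char delimiter, char qualifier)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQualified = false;   // currently inside a qualified field
        bool rowHasData = false;    // something was read since the last row ended
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (inQualified)
            {
                if (c != qualifier)
                {
                    field.Append(c);                 // newlines are kept as data
                }
                else if (reader.Peek() == qualifier)
                {
                    reader.Read();                   // doubled qualifier: keep one
                    field.Append(qualifier);
                }
                else
                {
                    inQualified = false;             // closing qualifier
                }
            }
            else if (c == qualifier)
            {
                inQualified = true;
                rowHasData = true;
            }
            else if (c == delimiter)
            {
                fields.Add(field.ToString());
                field.Clear();
                rowHasData = true;
            }
            else if (c == '\n')
            {
                // End of a logical row (blank lines are skipped).
                if (rowHasData)
                {
                    fields.Add(field.ToString());
                    yield return fields.ToArray();
                }
                fields.Clear();
                field.Clear();
                rowHasData = false;
            }
            else if (c != '\r')
            {
                field.Append(c);
                rowHasData = true;
            }
        }
        // Last row when the input does not end with a newline.
        if (rowHasData)
        {
            fields.Add(field.ToString());
            yield return fields.ToArray();
        }
    }
}
```

Because it reads one character at a time from a `TextReader`, a sketch like this never needs more than the current row in memory. Whitespace around an unqualified field is kept as data, so the " Bob" field in the sample file would come back with its leading space.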


