How does RegexOptions.Compiled work?

Shared by user 支离破碎的童话.

What is going on behind the scenes when you mark a regular expression as one to be compiled? How does this compare with, and differ from, a cached regular expression?

Using this information, how do you determine when the cost of computation is negligible compared to the performance increase?

Solution

RegexOptions.Compiled instructs the regular expression engine to compile the regular expression into IL using lightweight code generation (LCG). This compilation happens during construction of the object and slows construction down heavily. In return, matches using the regular expression are faster.

If you do not specify this flag, your regular expression is considered "interpreted".
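As a minimal sketch of the difference, the snippet below builds the same pattern both ways and checks that the two engines agree. The flag changes how the pattern is executed, not what it matches. The class and method names (CompiledVsInterpreted, SameResult) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative, not from the original answer.
public static class CompiledVsInterpreted
{
    // Returns true when both engines agree on whether `input` matches `pattern`.
    public static bool SameResult(string pattern, string input)
    {
        var interpreted = new Regex(pattern);                      // default: interpreted
        var compiled = new Regex(pattern, RegexOptions.Compiled);  // compiled to IL at construction
        return interpreted.IsMatch(input) == compiled.IsMatch(input);
    }

    public static void Main()
    {
        foreach (var input in new[] { "8378373", "3873783z" })
        {
            Console.WriteLine("{0}: engines agree = {1}", input, SameResult(@"^\d+$", input));
        }
    }
}
```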

Take this example:

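// Namespaces needed by the code below; both methods are assumed to live inside a class such as Program.
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static void TimeAction(string description, int times, Action func)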
{
    // warmup
    func();

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < times; i++)
    {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

static void Main(string[] args)
{
    var simple = @"^\d+$";
    var medium = @"^((to|from)\W)?(?<url>http://[\w.:]+)/questions/(?<questionId>\d+)(/(\w|-)*)?(/(?<answerId>\d+))?";
    var complex = @"^(([^<>()[\]\\.,;:\s@""]+"
      + @"(\.[^<>()[\]\\.,;:\s@""]+)*)|("".+""))@"
      + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
      + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
      + @"[a-zA-Z]{2,}))$";


    string[] numbers = new string[] {"1","two", "8378373", "38737", "3873783z"};
    string[] emails = new string[] { "sam@sam.com", "sss@s", "sjg@ddd.com.au.au", "onelongemail@oneverylongemail.com" };

    foreach (var item in new[] {
        new {Pattern = simple, Matches = numbers, Name = "Simple number match"},
        new {Pattern = medium, Matches = emails, Name = "Simple email match"},
        new {Pattern = complex, Matches = emails, Name = "Complex email match"}
    })
    {
        int i = 0;
        Regex regex;

        TimeAction(item.Name + " interpreted uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        i = 0;
        TimeAction(item.Name + " compiled uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern, RegexOptions.Compiled);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern);
        i = 0;
        TimeAction(item.Name + " prepared interpreted match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern, RegexOptions.Compiled);
        i = 0;
        TimeAction(item.Name + " prepared compiled match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

    }
}

It performs 4 tests on 3 different regular expressions. First it tests a single one-off match (compiled vs. interpreted). Second it tests repeated matches that reuse the same Regex object.

The results on my machine (compiled in release, no debugger attached):

1,000 single matches (construct the Regex, match, and discard it)

Type        | Platform | Trivial Number | Simple Email Check | Complex Email Check
----------------------------------------------------------------------------------
Interpreted | x86      |    4 ms        |    26 ms           |    31 ms
Interpreted | x64      |    5 ms        |    29 ms           |    35 ms
Compiled    | x86      |  913 ms        |  3775 ms           |  4487 ms
Compiled    | x64      | 3300 ms        | 21985 ms           | 22793 ms

1,000,000 matches - reusing the Regex object

Type        | Platform | Trivial Number | Simple Email Check | Complex Email Check
----------------------------------------------------------------------------------
Interpreted | x86      |  422 ms        |   461 ms           |  2122 ms
Interpreted | x64      |  436 ms        |   463 ms           |  2167 ms
Compiled    | x86      |  279 ms        |   166 ms           |  1268 ms
Compiled    | x64      |  281 ms        |   176 ms           |  1180 ms

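These results show that compiled regular expressions can be up to about 60% faster when you reuse the Regex object (461 ms down to 166 ms for the simple email check). However, construction is far slower: between roughly 150 and 750 times slower in the runs above, that is, two to nearly three orders of magnitude.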
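To answer the "when is the cost negligible" part of the question, one rough approach is to divide the extra construction cost by the time saved per match. The sketch below does that arithmetic with the 32-bit simple email figures from the tables above. The BreakEven class is a hypothetical helper, and it assumes the single match included in each construction timing is small enough to ignore.

```csharp
using System;

// Hypothetical helper for back-of-the-envelope estimates; not part of .NET.
public static class BreakEven
{
    // Number of matches after which a compiled Regex has paid back its extra
    // construction cost. Timings are totals in milliseconds over the given
    // number of constructions or matches.
    public static double Matches(
        double constructInterpretedMs, double constructCompiledMs, int constructions,
        double matchInterpretedMs, double matchCompiledMs, int matches)
    {
        double extraCostPerConstruction =
            (constructCompiledMs - constructInterpretedMs) / constructions;
        double savingPerMatch = (matchInterpretedMs - matchCompiledMs) / matches;

        // If compiled matching is not faster, the cost is never recovered.
        if (savingPerMatch <= 0) return double.PositiveInfinity;
        return extraCostPerConstruction / savingPerMatch;
    }

    public static void Main()
    {
        // 32-bit rows, simple email column: 26 ms vs 3775 ms per 1,000
        // constructions, 461 ms vs 166 ms per 1,000,000 matches.
        double n = Matches(26, 3775, 1000, 461, 166, 1000000);
        Console.WriteLine("Break-even after about {0:N0} matches", n);
    }
}
```

With those inputs the break-even point lands somewhere around twelve to thirteen thousand matches per constructed Regex. Measure on your own patterns and runtime before relying on a figure like this.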

It also shows that the x64 version of .NET can be roughly 3.5 to 6 times slower than the 32-bit version when it comes to compiling regular expressions.

The recommendation is to use the compiled version in either of these cases:

- You do not care about object initialization cost and need the extra performance boost. (Note that we are talking about fractions of a millisecond here.)
- You care a little about initialization cost, but you reuse the Regex object so many times that the cost is recovered over your application's life cycle.

Spanner in the works, the Regex cache

The regular expression engine contains an LRU cache which holds the last 15 regular expressions that were tested using the static methods on the Regex class.

For example, Regex.Replace, Regex.Match, and the other static helpers all use the Regex cache.

The size of the cache can be increased by setting Regex.CacheSize. It accepts changes in size at any time during your application's life cycle.

New regular expressions are only added to the cache by the static helpers on the Regex class. If you construct a Regex object yourself, the cache is checked (an existing entry is reused and bumped), but the regular expression you construct is not added to the cache.
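A small sketch of adjusting the cache, using the real Regex.CacheSize property. The EnsureCacheSize helper and the value 50 are arbitrary choices for illustration, not recommendations.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the real Regex.CacheSize property.
public static class CacheSizeDemo
{
    // Raises the static-helper cache size if it is below `wanted`
    // and returns the resulting size. Never shrinks the cache.
    public static int EnsureCacheSize(int wanted)
    {
        if (Regex.CacheSize < wanted)
        {
            Regex.CacheSize = wanted;
        }
        return Regex.CacheSize;
    }

    public static void Main()
    {
        Console.WriteLine("Cache size before: {0}", Regex.CacheSize);
        Console.WriteLine("Cache size after:  {0}", EnsureCacheSize(50));

        // Static helpers such as Regex.IsMatch are the calls that populate the cache.
        Console.WriteLine(Regex.IsMatch("10000", @"^\d+$"));
    }
}
```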

This cache is a trivial LRU cache, implemented using a simple doubly linked list. If you increase its size to 5000 and call the static helpers with 5000 different patterns, every regular expression construction will crawl the 5000 entries to see whether the pattern was previously cached. There is a lock around the check, so the check can reduce parallelism and introduce thread blocking.

The default size is set quite low to protect you from cases like this, though in some situations you may have no choice but to increase it.
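To make the cost model concrete, here is a toy LRU cache built the way the text describes: a linked list, a linear scan, and a single lock. It is a hypothetical model for illustration only (TinyLruCache is not the BCL source), and holding the lock while the factory runs is a simplification.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a list-based LRU cache; not the actual .NET implementation.
public sealed class TinyLruCache<TValue>
{
    private readonly LinkedList<KeyValuePair<string, TValue>> _entries =
        new LinkedList<KeyValuePair<string, TValue>>();
    private readonly object _gate = new object();
    private readonly int _capacity;

    public TinyLruCache(int capacity) { _capacity = capacity; }

    public TValue GetOrAdd(string key, Func<string, TValue> factory)
    {
        lock (_gate) // every caller serializes here
        {
            // Linear walk: the cost grows with the number of cached entries.
            for (var node = _entries.First; node != null; node = node.Next)
            {
                if (node.Value.Key == key)
                {
                    _entries.Remove(node);
                    _entries.AddFirst(node); // bump to most recently used
                    return node.Value.Value;
                }
            }

            TValue created = factory(key); // the expensive step on a miss
            _entries.AddFirst(new KeyValuePair<string, TValue>(key, created));
            if (_entries.Count > _capacity)
            {
                _entries.RemoveLast(); // evict the least recently used entry
            }
            return created;
        }
    }
}

public static class LruDemo
{
    // Runs a fixed access sequence against a 2-entry cache and
    // returns how many times the factory had to run.
    public static int CountBuilds()
    {
        int builds = 0;
        var cache = new TinyLruCache<string>(2);
        Func<string, string> build = p => { builds++; return "built:" + p; };

        cache.GetOrAdd("a", build); // miss
        cache.GetOrAdd("b", build); // miss
        cache.GetOrAdd("a", build); // hit, "a" becomes most recent
        cache.GetOrAdd("c", build); // miss, evicts "b"
        cache.GetOrAdd("b", build); // miss again, rebuilt
        return builds;
    }

    public static void Main()
    {
        Console.WriteLine("Factory calls: {0}", CountBuilds()); // prints "Factory calls: 4"
    }
}
```

If the factory were a compiled Regex construction, each of those misses would pay the full compilation cost while other threads wait on the lock.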

My strong recommendation is to never pass the RegexOptions.Compiled option to a static helper.

For example:

// WARNING: bad code
Regex.IsMatch("10000", @"\d+", RegexOptions.Compiled);

The reason is that you run a serious risk of missing the LRU cache, which triggers a very expensive compile. Additionally, you have no idea what the libraries you depend on are doing, so you have little ability to control or predict the best possible cache size.
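One common alternative to passing RegexOptions.Compiled to a static helper is to hold a single compiled instance in a static readonly field, so compilation happens once and never depends on the cache. The sketch below assumes an anchored digits-only pattern; the NumberChecker class and IsAllDigits method are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical example class; the pattern is an assumption chosen for illustration.
public static class NumberChecker
{
    // Compiled once, when the type is first used, then reused for every call.
    private static readonly Regex Digits = new Regex(@"^\d+$", RegexOptions.Compiled);

    public static bool IsAllDigits(string input)
    {
        return Digits.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsAllDigits("10000"));     // True
        Console.WriteLine(IsAllDigits("3873783z"));  // False
    }
}
```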

See also: BCL team blog

Note: this is relevant for .NET 2.0 and .NET 4.0. Some changes were expected in 4.5 that may cause this to be revised.
