Algorithm to find the most common substrings in a string

Shared by user 少年玩心不玩命

Is there any algorithm that can be used to find the most common phrases (or substrings) in a string? For example, the following string would have "hello world" as its most common two-word phrase:

hello world, this is hello world. hello world repeats three times in this string!

In the string above, the most common substring (after the empty string, which occurs an unlimited number of times) would be the space character.

Is there any way to generate a list of common substrings in this string, from most common to least common?

Recommended answer

This task is similar to the Nussinov algorithm, and actually even simpler, since we do not allow any gaps, insertions or mismatches in the alignment.

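For a string A of length N (indexed 0 .. N-1), define a table F[-1 .. N-1, -1 .. N-1] and fill it in using the following rules:

  // F[i,j] = length of the common run ending at positions i and j
  for i = 0 to N-1
    for j = 0 to N-1
      if i != j              // skip the diagonal (trivial self-match)
        {
          if A[i] == A[j]
            F[i,j] = F[i-1,j-1] + 1;   // extend the run along the diagonal
          else
            F[i,j] = 0;
        }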

For example, for BAOBAB:
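The table for this example can be reproduced with a short Python sketch. It assumes the example string is BAOBAB and that indices are 0-based; the function name `self_match_table` is made up for illustration and is not from the original answer.

```python
def self_match_table(a):
    """Fill the self-match table F for string a, following the rules above."""
    n = len(a)
    # One extra row and column of zeros stand in for index -1:
    # f[i + 1][j + 1] holds F[i, j] from the text.
    f = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            if i != j and a[i] == a[j]:
                f[i + 1][j + 1] = f[i][j] + 1  # F[i-1, j-1] + 1
    # Drop the padding so the result is indexed like F[0 .. N-1, 0 .. N-1].
    return [row[1:] for row in f[1:]]


table = self_match_table("BAOBAB")
for ch, row in zip("BAOBAB", table):
    print(ch, *row)
```

The largest entry is 2, at row 1, column 4 (and its mirror image at row 4, column 1): the two occurrences of "BA" end at indices 1 and 4.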

This runs in O(n^2) time. The largest values in the table now point to the end positions of the longest self-matching substrings (i is the end of one occurrence, j the end of the other). The array is assumed to be zero-initialized at the start. I added the condition to exclude the diagonal, which is the longest but probably not an interesting self-match.
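As a rough illustration of reading the result off the largest cell, the sketch below returns the longest repeated substring together with the two end positions. The name `longest_repeat` is hypothetical, and keeping only the previous row of the table is a memory-saving choice of this sketch, not something the answer prescribes.

```python
def longest_repeat(a):
    """Return (substring, end_i, end_j) for the largest cell of the table."""
    n = len(a)
    best_len, best_i, best_j = 0, -1, -1
    # prev is row i-1 of F, shifted by one: prev[j] holds F[i-1, j-1].
    prev = [0] * (n + 1)
    for i in range(n):
        cur = [0] * (n + 1)
        for j in range(n):
            if i != j and a[i] == a[j]:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best_len:
                    best_len, best_i, best_j = cur[j + 1], i, j
        prev = cur
    # The match ends at best_i, so it starts best_len - 1 characters earlier.
    return a[best_i - best_len + 1 : best_i + 1], best_i, best_j


print(longest_repeat("BAOBAB"))
```

For a string with no repeated character, the sketch returns an empty string with both positions set to -1.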

On further thought, this table is symmetric about the diagonal, so it is enough to compute only half of it. Also, the array is zero-initialized, so assigning zero is redundant. That leaves:

  for i = 0 to N-1
    for j = i + 1 to N-1         // upper triangle only
      if A[i] == A[j]
        F[i,j] = F[i-1,j-1] + 1;
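A minimal Python sketch of the half-table variant, again with 0-based indices and a hypothetical function name (`upper_self_match_table`):

```python
def upper_self_match_table(a):
    """Fill only the cells with i < j; everything else stays zero."""
    n = len(a)
    # Row 0 and column 0 are zero padding that plays the role of index -1.
    f = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] == a[j]:
                f[i + 1][j + 1] = f[i][j] + 1
    return [row[1:] for row in f[1:]]


upper = upper_self_match_table("BAOBAB")
print(upper[1][4], upper[4][1])
```

Each pair of positions is now compared once, so the loop performs N(N-1)/2 character comparisons instead of N(N-1), and every match appears in exactly one cell rather than two.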

This is shorter but potentially harder to understand. The computed table contains all matches, short and long. You can add further filtering as needed.

In the next step, you need to recover the strings by following each nonzero cell up and to the left along its diagonal. During this step it is also trivial to use a hash map to count the number of self-matches for the same string. With a normal string and a reasonable minimum length, only a small number of table cells will be processed through this map.
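The recovery and counting step could look roughly like the sketch below. The function name `common_substrings` and the `min_len` parameter are assumptions made for illustration. Each upper-triangle cell represents one pair of occurrences, so a substring occurring k times (overlaps included) is seen k(k-1)/2 times; converting that pair count back to k is a detail added by this sketch, not part of the original answer.

```python
from collections import Counter
from math import isqrt


def common_substrings(a, min_len=2):
    """List (substring, occurrences) for repeated substrings, most common first."""
    n = len(a)
    f = [[0] * (n + 1) for _ in range(n + 1)]  # zero padding plays index -1
    pairs = Counter()  # substring -> number of matching (i, j) pairs
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] == a[j]:
                run = f[i][j] + 1
                f[i + 1][j + 1] = run
                # Every suffix of the run ending at i is a match on its own.
                for length in range(min_len, run + 1):
                    pairs[a[i - length + 1 : i + 1]] += 1
    result = []
    for s, p in pairs.items():
        k = (1 + isqrt(1 + 8 * p)) // 2  # solve k * (k - 1) / 2 == p
        result.append((s, k))
    # Most common first; ties go to the longer string, then alphabetically.
    result.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return result


text = "hello world, this is hello world. hello world repeats three times in this string!"
for substring, count in common_substrings(text, min_len=11)[:3]:
    print(repr(substring), count)
```

Raising `min_len` is the filtering mentioned above: it keeps short, uninteresting matches such as single spaces out of the map. Substrings that occur only once never appear in the result, because they produce no nonzero cell.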

I think that using a hash map directly actually requires O(n^3) time, since at the end of each lookup the key strings must still be compared for equality, and that comparison is probably O(n).
