Adding exceptions to a Levenshtein-distance-like algorithm

Shared by user: Death、-宁

I'm trying to compute how similar two sequences of up to 6 variables are. Currently I'm using a collections.Counter to return the frequency of the differing variables as my edit distance.

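By default, the distance for editing one variable (add/subtract/change) is either 1 or 0. I'd like to change that distance depending on the variable and on the value I set for that variable.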

So I can say certain variables are similar to other variables, and provide a value for how similar they are. I also want to say certain variables are worth less or more distance than usual.

Here is my previous post as context: Modify Levenshtein-Distance to ignore order

For example:

# 'c' and 'k' are quite similar, so their distance from each other is 0.5 instead of 1
# (a tuple key is used because a list is unhashable and cannot be a dict key)
>>> groups = {('c', 'k'): 0.5}

# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}

>>> distance('woke', 'woc')
0.8

Explanation:

woke
k -> c = 1
woce
-e = 1
woc
Distance = 2

# With exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8

How can I achieve this? Would it be possible to implement it with this Counter algorithm?

Current code (thank you David Eisenstat)

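import collections

def distance(s1, s2):
    # Count each item: +1 for every occurrence in s1, -1 for every occurrence in s2.
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    # Mismatched items pair up as substitutions; the difference in length
    # is covered by insertions or deletions.
    return (sum(abs(diff) for diff in cnt.values()) // 2 +
            (abs(sum(cnt.values())) + 1) // 2)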

Recommended answer

I ended up dividing the process into a few stages, then iterating through the strings for each stage. I'm not sure if it's as efficient as it could be, but it works.

Summing up what I was trying to achieve (in relation to Edit-distance algorithms)

Distance from one letter to another is 1, e.g. change j -> k = 1.
0 means no difference at all, e.g. change j -> j = 0.
Similar letters can be worth less than 1 (specified by me), e.g. c and k sound the same, therefore c, k = 0.5 and change c -> k = 0.5.
Certain letters could be worth more or less (specified by me), e.g. x is uncommon so I want it to have more weight: x = 1.4, so change x -> k = 1.4.
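The four rules above can be read as a single per-change cost function. The following is only a minimal sketch: the names `SIMILAR`, `WEIGHTS` and `change_cost` are hypothetical, and taking the larger of the two letter weights when neither rule about similarity applies is an assumption, since the answer only gives the x -> k case.

```python
# Hypothetical lookup tables (names and layout are assumptions for illustration).
SIMILAR = {frozenset("ck"): 0.5}  # unordered pairs of similar letters
WEIGHTS = {"x": 1.4}              # letters worth more or less than the default 1


def change_cost(a, b, similar=SIMILAR, weights=WEIGHTS):
    """Cost of changing letter a into letter b under the rules listed above."""
    if a == b:
        return 0.0  # no difference at all
    pair = frozenset((a, b))
    if pair in similar:
        return similar[pair]  # similar letters cost less than 1
    # Assumption: use the larger weight of the two letters, defaulting to 1.
    return max(weights.get(a, 1.0), weights.get(b, 1.0))
```

With these tables, `change_cost('c', 'k')` and `change_cost('k', 'c')` give the same value because the pair is stored as an unordered `frozenset`.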

Created 2 dictionaries, 1 for similar letters, 1 for exceptions

Populate Counter: iterate through both strings.
Match similar items: iterate string1 and, for each item in the similar dict, iterate string2 looking for an item that is also in the similar dict.
Update Counter: remove similar items.
Find Distance: add up absolute frequencies, account for difference in string length.
Include exceptions distance: account for exception values based on frequency of letters.
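The answer does not include its code, so the stages above can only be sketched under assumptions. In the version below, the function name `weighted_distance`, the tuple-keyed `similar` dict, and the rule for pairing leftover letters (heaviest with heaviest, charging the larger weight) are all hypothetical choices, not the author's implementation.

```python
import collections


def weighted_distance(s1, s2, similar, exceptions):
    """Order-insensitive edit distance with similar pairs and per-letter weights."""
    # Populate Counter: positive = extra in s1, negative = extra in s2.
    cnt = collections.Counter(s1)
    cnt.subtract(s2)

    total = 0.0
    # Match similar items and update the counter by removing matched pairs.
    for (a, b), cost in similar.items():
        for x, y in ((a, b), (b, a)):
            n = min(cnt[x], -cnt[y])  # x left over in s1, y left over in s2
            if n > 0:
                cnt[x] -= n
                cnt[y] += n
                total += n * cost

    # Find distance, including exceptions: collect the weights of leftovers.
    extra, missing = [], []
    for ch, n in cnt.items():
        weight = exceptions.get(ch, 1.0)
        (extra if n > 0 else missing).extend([weight] * abs(n))
    extra.sort(reverse=True)
    missing.sort(reverse=True)

    # Assumption: leftovers pair up as substitutions at the larger weight;
    # whatever remains unpaired is an insertion or deletion at its own weight.
    paired = min(len(extra), len(missing))
    total += sum(max(a, b) for a, b in zip(extra, missing))
    total += sum(extra[paired:]) + sum(missing[paired:])
    return total


groups = {('c', 'k'): 0.5}
exceptions = {'e': 0.3, 'x': 1.5}
print(round(weighted_distance('woke', 'woc', groups, exceptions), 2))  # 0.8
```

With empty `similar` and `exceptions` dicts the function falls back to the unweighted count, so 'woke' and 'woc' are 2 apart, matching the explanation in the question.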