Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient way to separate individual English words from such compound strings?

Answer

I'm not sure how much time you have, or how often you need to do this (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the trie yourself from a good dictionary source, and weight the nodes on full words to give yourself a good-quality reference mechanism.
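A minimal sketch of such a weighted trie might look like the following; the word list and frequency weights here are toy assumptions for illustration, and in practice you would load them from your dictionary source:

```python
# A weighted trie: a node's weight > 0 marks the end of a full
# dictionary word, so prefix lookups can report every word that
# starts a given compound string, along with its weight.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.weight = 0      # > 0 marks a full dictionary word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, weight=1):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.weight = weight

    def prefixes(self, text):
        """Yield (prefix, weight) for every dictionary word that
        starts the given text."""
        node = self.root
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                return
            if node.weight > 0:
                yield text[:i + 1], node.weight

trie = Trie()
for word, freq in [("spice", 10), ("spic", 2), ("jet", 8)]:
    trie.insert(word, freq)

print(list(trie.prefixes("spicejet")))
# -> [('spic', 2), ('spice', 10)]
```

A single pass over the compound string surfaces every candidate first word at once, which is what makes the backtracking strategy below cheap.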

Just brainstorming here, but if you know your dataset consists primarily of two- or three-word compounds, you could probably get away with multiple trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results score poorly, and backtracking to 'Spice' and 'Jet', where the two lookups together yield a good combined result.
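That backtracking idea amounts to trying every dictionary word that starts the string and keeping the split with the best total weight. A memoized sketch, using an assumed toy weight table in place of the real trie lookups:

```python
# Split a compound string into dictionary words, keeping the
# segmentation whose words carry the highest combined weight.
# WEIGHTS is a toy stand-in for real trie lookups.

WEIGHTS = {"spice": 10, "spic": 2, "jet": 8, "e": 0}

def best_split(text, memo=None):
    """Return (score, words) for the highest-scoring segmentation,
    or (float('-inf'), []) if text cannot be fully segmented."""
    if memo is None:
        memo = {}
    if text == "":
        return (0, [])
    if text in memo:
        return memo[text]
    best = (float("-inf"), [])
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in WEIGHTS:
            score, rest = best_split(text[i:], memo)
            candidate = (WEIGHTS[head] + score, [head] + rest)
            if candidate[0] > best[0]:
                best = candidate
    memo[text] = best
    return best

print(best_split("spicejet"))
# -> (18, ['spice', 'jet'])
```

The low-weight entries ('spic', 'e') lose to the 'spice' + 'jet' split exactly as described above: a poor-scoring partial match is abandoned in favor of the higher combined score.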

Also, I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the', 'un', or 'in' and weighting them accordingly.
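One hedged way to read that suggestion: count how often each short prefix opens a word in the dictionary itself, and treat the most common ones as affixes to down-weight rather than standalone words. The word list and length limits below are assumptions for illustration:

```python
# Frequency analysis over short prefixes of the dictionary: prefixes
# shared by many words (e.g. 'un') are likely affixes, not words,
# so they can be filtered or down-weighted when scoring splits.
from collections import Counter

words = ["unhappy", "undo", "unfit", "input", "insert", "jet", "spice"]

prefix_counts = Counter()
for w in words:
    for n in (2, 3):                # arbitrary prefix-length limit
        if len(w) > n:
            prefix_counts[w[:n]] += 1

# Treat any prefix shared by 3+ words as a probable affix.
common = [p for p, c in prefix_counts.most_common() if c >= 3]
print(common)
# -> ['un']
```

The threshold (here 3) would be tuned, or made dynamic, against the real 100,000-word dictionary.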

Sounds like a fun problem, good luck!
