Algorithm to find the keywords of a text

Given a set of texts (might be books, articles, documents, etc.) how would you find relevant keywords for each text? Common sense suggests to:

- split words
- exclude common words (also called stop words, like "a", "to", "for", "in")
- count word frequencies (a short sketch of these first three steps follows this list)
- give a score to each word, with a formula that takes into account the frequency of each word in the document and in other documents, the number of words of the document, and the total number of words of all documents
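
A minimal sketch of the splitting, stop-word filtering, and counting steps in Python; the stop-word set is only an illustrative subset, and the regex tokenizer is an assumption, not part of the original question:

    import re
    from collections import Counter

    # Illustrative subset; a real stop-word list would be much larger.
    STOP_WORDS = {"a", "to", "for", "in", "the", "of", "and", "is"}

    def word_frequencies(text):
        # Split into lowercase words, drop stop words, count frequencies.
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(w for w in words if w not in STOP_WORDS)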

The question is: which is a good formula to do that?

Recommended answer

Here is the one I have developed.

For each word calculate this ratio:

(frequency of word in this text) * (total number of words in all texts)
-----------------------------------------------------------------------
  (number of words in this text) * (frequency of word in all texts)

Keywords are those words whose ratio is in the highest 20% (for this document).
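
As a sketch, the ratio and the 20% cut-off might be computed like this, reusing the word_frequencies helper above; the function name and the Counter-based interface are my own assumptions:

    def keywords(doc_counts, corpus_counts, top_fraction=0.2):
        # doc_counts: Counter over one text; corpus_counts: Counter over all texts.
        doc_words = sum(doc_counts.values())     # number of words in this text
        all_words = sum(corpus_counts.values())  # total number of words in all texts
        scores = {
            word: (freq * all_words) / (doc_words * corpus_counts[word])
            for word, freq in doc_counts.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        cutoff = max(1, int(len(ranked) * top_fraction))
        return ranked[:cutoff]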

Ankerl also proposed a formula of his own:

tanh(curVal/curWords*200) - 5*tanh((allVal-curVal)/(allWords-curWords)*200)

where:

- curVal: how often the scored word is present in the text to be analyzed
- curWords: the total number of words in the text to be analyzed
- allVal: how often the scored word is present in the indexed data set
- allWords: the total number of words in the indexed data set
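
A direct transcription of this formula, under the assumption that it is evaluated per word with the four counts above (the function name is mine):

    from math import tanh

    def ankerl_score(cur_val, cur_words, all_val, all_words):
        # curVal/curWords: counts for the text being analyzed;
        # allVal/allWords: counts for the whole indexed data set.
        # Assumes the indexed set is larger than the current text,
        # so allWords - curWords is nonzero.
        return (tanh(cur_val / cur_words * 200)
                - 5 * tanh((all_val - cur_val) / (all_words - cur_words) * 200))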

Both algorithms work pretty well, and results often coincide. Do you know any way to do it better?
