Is there a hashing algorithm that is tolerant of minor differences?

I'm doing some web crawling type stuff where I'm looking for certain terms in webpages and finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something like md5 can be foiled by simply putting the current date and time on the page.

Are there any hashing algorithms that work for something like this?

Recommended answer

A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content-defined chunking as a way to split up the document.
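To make the shingling idea concrete, here is a minimal sketch. It assumes word-level shingles of length 3 and Jaccard similarity as the comparison; the function names and the choice of `k` are illustrative, not something the answer specifies.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def page_similarity(old_text, new_text, k=3):
    """Score in [0, 1]; values near 1 mean the page text is largely unchanged."""
    return jaccard(shingles(old_text, k), shingles(new_text, k))
```

A page whose only change is a timestamp loses just the few shingles that overlap the timestamp, so it still scores high, whereas an MD5 digest would change completely. What counts as a "major change" is a threshold you would have to pick for your own pages.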

I read a paper a few years back about using Bloom filters for similarity detection: "Using Bloom Filters to Refine Web Search Results". It's an interesting idea, but I never got around to experimenting with it.
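As a rough illustration of the Bloom filter idea (not necessarily the method used in that paper), the sketch below inserts a page's word shingles into a small Bloom filter and compares two filters by how many set bits they share. The filter size, the number of hash functions, and the function names are all assumptions made for this example.

```python
import hashlib
import re

NUM_BITS = 1024   # filter size in bits; an arbitrary choice for this sketch
NUM_HASHES = 3    # hash functions per shingle; also arbitrary

def bloom_signature(text, k=3):
    """Build a Bloom filter (stored as an int bitmask) over the k-word shingles of text."""
    words = re.findall(r"\w+", text.lower())
    bits = 0
    for i in range(max(len(words) - k + 1, 0)):
        shingle = " ".join(words[i:i + k])
        for seed in range(NUM_HASHES):
            # Derive several hash functions by prefixing a seed to the shingle.
            digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return bits

def bit_overlap(sig_a, sig_b):
    """Fraction of set bits shared by two signatures (Jaccard over bit positions)."""
    union = bin(sig_a | sig_b).count("1")
    if union == 0:
        return 1.0
    return bin(sig_a & sig_b).count("1") / union
```

The appeal is that you only need to cache a fixed-size signature per page rather than the full shingle set. The trade-off is that hash collisions make the score approximate, so the filter has to be sized for the length of the pages you crawl.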
