How to split text without spaces into a list of words?

Input: tableapplechairtablecupboard ... many more words

What would be an efficient algorithm to split such text to the list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

First thing that comes to mind is to go through all possible words (starting with the first letter), find the longest word possible, and continue from position=word_position+len(word).

P.S. We have a list of all possible words. The word "cupboard" can be "cup" and "board"; select the longest. Language: Python, but the main thing is the algorithm itself.
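
A minimal sketch of that greedy longest-match idea (assuming a dictionary set and a known maximum word length; all names here are illustrative, not from the original post):

def greedy_split(s, dictionary, max_word_len):
    """Greedy longest-match segmentation: repeatedly take the longest
    dictionary word starting at the current position."""
    out = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, as the question suggests.
        for j in range(min(len(s), i + max_word_len), i, -1):
            if s[i:j] in dictionary:
                out.append(s[i:j])
                i = j
                break
        else:
            return None  # no dictionary word starts here; the greedy choice dead-ends
    return out

As the answer below argues, this greedy strategy can paint itself into a corner on real text, which is why a cost-based search is preferable.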

Recommended answer

A naive algorithm won't give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-world text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)
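
As a hedged illustration of that remark (not the answer's code): charging a flat cost of 1 per word makes the dynamic program below minimize the total number of words, i.e. prefer the longest pieces.

# Illustrative variant: a flat per-word cost minimizes the word count.
wordcost = dict((k, 1) for k in words)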

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is, the word with rank n in the list of words has probability roughly 1/(n log N), where N is the number of words in the dictionary.
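
As a rough worked example of the resulting costs (using the 125k-word dictionary mentioned below, so N = 125000 and log N ≈ 11.7): the rank-1 word gets cost log(1 × 11.7) ≈ 2.5, while the rank-10000 word gets cost log(10000 × 11.7) ≈ 11.7, so rare words are penalized much more heavily than common ones.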

Once you have fixed the model, you can use dynamic programming to infer the positions of the spaces. The most likely sentence is the one that maximizes the product of the probabilities of the individual words, and it is easy to compute with dynamic programming. Instead of using the probability directly, we use a cost defined as the logarithm of the inverse of the probability, which avoids the numerical underflow that a long product of small probabilities would cause.

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
# words-by-frequency.txt: one word per line, sorted most frequent first.
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        # 9e999 acts as +infinity: substrings not in the dictionary are never chosen.
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

You can use it like this:

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor. After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see, it is essentially flawless. The most important part is to make sure your word list was trained on a corpus similar to what you will actually encounter; otherwise the results will be very bad.

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
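
As a hedged sketch of that idea (using a trie over reversed words rather than a full suffix tree; all names here are illustrative):

def build_reversed_trie(words):
    """Trie over reversed words, so candidates ending at position i can be
    enumerated by walking backwards through s."""
    root = {}
    for w in words:
        node = root
        for ch in reversed(w):
            node = node.setdefault(ch, {})
        node["$"] = True  # a complete word ends at this node
    return root

def candidate_lengths(s, i, trie):
    """Yield each k such that s[i-k:i] is a dictionary word, so best_match(i)
    only scores actual words instead of probing every length up to maxword."""
    node = trie
    for k in range(1, i + 1):
        node = node.get(s[i - k])
        if node is None:
            return
        if "$" in node:
            yield k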

If you need to process a very large consecutive string, it would be reasonable to split the string to avoid excessive memory usage. For example, you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will almost certainly have no effect on the quality.
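
A rough sketch of that chunking scheme (block and margin sizes follow the numbers above, but the single trailing margin is a simplification of the two-sided margin described there):

def infer_spaces_chunked(s, block=10000, margin=1000):
    """Segment s in blocks, re-segmenting the trailing margin of each block
    as part of the next one so boundary effects stay inside the margin."""
    out = []
    i = 0
    while i < len(s):
        chunk = s[i:i + block + margin]
        words = infer_spaces(chunk).split()
        if i + len(chunk) < len(s):
            # Keep only words that end inside the block proper; words in the
            # margin may straddle the artificial cut and will be redone.
            keep, consumed = [], 0
            for w in words:
                if consumed + len(w) > block:
                    break
                keep.append(w)
                consumed += len(w)
            words, i = keep, i + consumed
        else:
            i = len(s)
        out.extend(words)
    return " ".join(out)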