


I came across the following programming interview problem:


An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)


Note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram


• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)


• your function is expected to print the result in less than 2 seconds

Example

Input
text: "aaaab a0a baaab c"
ngramLength: 3

Output
aaa



For the input presented above the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1


If I have only one hour to solve the problem and I choose C, is it a good idea to implement a hash table to count the n-gram frequencies in that amount of time? The C standard library has no hash table implementation.


If yes, I was thinking of implementing a hash table using separate chaining with ordered linked lists. But writing those data structures eats into the time I have to solve the problem.
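To give a feel for how much code the hash-table route takes, here is a minimal sketch of counting n-grams with separate chaining. It is not the poster's design: the chains are unordered rather than ordered, and the bucket count (`NBUCKETS`), the FNV-1a hash, and the name `most_frequent_ngram_hashed` are all assumptions made for illustration. "Lexicographically smallest" is taken to mean plain byte order, as compared by `memcmp`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536u /* hypothetical size; power of two so we can mask */

struct node {
    const char *gram;   /* points into the input text, not NUL-terminated */
    size_t count;
    struct node *next;
};

/* FNV-1a over n bytes (picked for brevity; an assumption of this sketch). */
static uint32_t hash_ngram(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Copies the most frequent n-gram into out (at least n + 1 bytes) and
   returns its frequency; returns 0 if there is none or memory runs out. */
size_t most_frequent_ngram_hashed(const char *text, size_t n, char *out)
{
    assert(text != NULL && out != NULL);
    out[0] = '\0';
    size_t len = strlen(text);
    if (n == 0 || len < n)
        return 0;

    struct node **buckets = calloc(NBUCKETS, sizeof *buckets);
    if (buckets == NULL)
        return 0;

    const char *best = NULL;
    size_t best_count = 0;
    size_t run = 0; /* length of the current run of non-space characters */
    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)text[i])) {
            run = 0;
            continue;
        }
        if (++run < n)
            continue;
        const char *g = text + i + 1 - n; /* n-gram ending at position i */
        struct node **slot = &buckets[hash_ngram(g, n) & (NBUCKETS - 1)];
        struct node *p = *slot;
        while (p != NULL && memcmp(p->gram, g, n) != 0)
            p = p->next;
        if (p == NULL) { /* first occurrence: push a new node on the chain */
            p = malloc(sizeof *p);
            if (p == NULL) {
                best_count = 0;
                break;
            }
            p->gram = g;
            p->count = 0;
            p->next = *slot;
            *slot = p;
        }
        p->count++;
        /* Track the running maximum; on ties prefer the smaller n-gram. */
        if (p->count > best_count ||
            (p->count == best_count && memcmp(p->gram, best, n) < 0)) {
            best = p->gram;
            best_count = p->count;
        }
    }

    if (best_count > 0) {
        memcpy(out, best, n);
        out[n] = '\0';
    }
    for (size_t b = 0; b < NBUCKETS; b++) {
        struct node *p = buckets[b];
        while (p != NULL) {
            struct node *next = p->next;
            free(p);
            p = next;
        }
    }
    free(buckets);
    return best_count;
}
```

Calling `most_frequent_ngram_hashed(text, 3, out)` with a buffer of at least 4 bytes and then `puts(out)` would print the result. Even with unordered chains this is a fair amount of code to write and debug inside an hour.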





If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.


This should execute fast enough, and there is no need to code any fancy data structures.
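A minimal sketch of the sort-based approach described above might look like the following. The function names, the file-scope `g_ngram_len` variable (needed because the `qsort` comparator takes no context argument), and the choice of byte order via `memcmp` as the "dictionary" order are assumptions of this sketch, not part of the problem statement.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* n-gram length used by the comparator (not thread-safe; fine for a sketch). */
static size_t g_ngram_len;

/* Compare two array elements, each a pointer to the start of an n-gram. */
static int cmp_ngram(const void *a, const void *b)
{
    const char *const *pa = a;
    const char *const *pb = b;
    return memcmp(*pa, *pb, g_ngram_len);
}

/* Copies the most frequent n-gram into out (at least n + 1 bytes) and
   returns its frequency, or 0 if no word is at least n characters long. */
size_t most_frequent_ngram(const char *text, size_t n, char *out)
{
    assert(text != NULL && out != NULL);
    out[0] = '\0';
    size_t len = strlen(text);
    if (n == 0 || len < n)
        return 0;

    /* At most len - n + 1 positions can end an n-gram. */
    const char **starts = malloc((len - n + 1) * sizeof *starts);
    if (starts == NULL)
        return 0;

    /* Record the start of every n-gram that lies entirely inside one word. */
    size_t count = 0;
    size_t run = 0; /* length of the current run of non-space characters */
    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)text[i]))
            run = 0;
        else if (++run >= n)
            starts[count++] = text + i + 1 - n;
    }

    g_ngram_len = n;
    qsort(starts, count, sizeof *starts, cmp_ngram);

    /* Equal n-grams are now adjacent. Scan each group and keep the largest;
       '>' rather than '>=' keeps the smallest n-gram on ties. */
    size_t best = 0;
    for (size_t i = 0; i < count; ) {
        size_t j = i + 1;
        while (j < count && memcmp(starts[i], starts[j], n) == 0)
            j++;
        if (j - i > best) {
            best = j - i;
            memcpy(out, starts[i], n);
            out[n] = '\0';
        }
        i = j;
    }
    free(starts);
    return best;
}

/* Wrapper with the signature the problem asks for: prints to stdout. */
void print_most_frequent_ngram(const char *text, int ngramLength)
{
    if (ngramLength <= 0)
        return;
    char *out = malloc((size_t)ngramLength + 1);
    if (out == NULL)
        return;
    if (most_frequent_ngram(text, (size_t)ngramLength, out) > 0)
        puts(out);
    free(out);
}
```

For the example input, `print_most_frequent_ngram("aaaab a0a baaab c", 3)` prints `aaa`. With at most 250,000 characters there are at most about 250,000 pointers to sort, so the sort needs on the order of m log m comparisons for m n-grams, each comparison touching at most n bytes.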


