N-grams: finding the one that is most frequent among all the words

Shared by user 我的孤單誰會懂.

I came across the following programming interview problem:

Challenge 1: N-grams

An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and an n-gram length, your task is to:

• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)

Note that your function will receive the following arguments:

• text
    ○ which is a string containing words separated by whitespaces
• ngramLength
    ○ which is an integer value giving the length of the n-gram

Data constraints

• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)

Efficiency constraints

• your function is expected to print the result in less than 2 seconds

Example

Input

text: "aaaab a0a baaab c"
ngramLength: 3

Output

aaa

Explanation

For the input presented above the 3-grams sorted by frequency are:

• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1

If I have only one hour to solve the problem and I choose to solve it in C, is it a good idea to implement a hash table to count the frequency of the n-grams in that amount of time? I ask because the C standard library has no hash table implementation...

If yes, I was thinking of implementing a hash table using separate chaining with ordered linked lists. But writing those data structures eats into the time available to solve the problem...

Is that the fastest option possible?

Thank you!
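For reference, the hash-table idea raised in the question could look roughly like the sketch below. It is only an illustration under several assumptions: the names `most_frequent_ngram_hashed`, `hash_ngram`, `struct entry` and the table size `NBUCKETS` are hypothetical, the chains are unordered rather than ordered (a simplification of what the question proposes), the hash is a 32-bit FNV-1a, and n-grams are compared case-sensitively in byte order.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536u /* hypothetical fixed table size */

struct entry {
    const char *gram;   /* points into the text; not NUL-terminated */
    size_t count;
    struct entry *next; /* next entry in the same bucket (unordered chain) */
};

/* 32-bit FNV-1a hash over the n bytes of an n-gram. */
static uint32_t hash_ngram(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Copies the most frequent n-gram into out (which needs n + 1 bytes) and
   returns its frequency, or 0 if no word has at least n characters.
   If memory runs out midway, the result covers only the text seen so far. */
size_t most_frequent_ngram_hashed(const char *text, size_t n, char *out)
{
    struct entry **buckets;
    const char *best_gram = NULL;
    size_t best = 0, run = 0;

    out[0] = '\0';
    if (n == 0)
        return 0;
    buckets = calloc(NBUCKETS, sizeof *buckets);
    if (buckets == NULL)
        return 0;

    for (size_t i = 0; text[i] != '\0'; i++) {
        /* run = number of consecutive non-space characters ending at i */
        run = isspace((unsigned char)text[i]) ? 0 : run + 1;
        if (run < n)
            continue;

        const char *g = text + i + 1 - n; /* n-gram ending at position i */
        struct entry **slot = &buckets[hash_ngram(g, n) % NBUCKETS];
        struct entry *e = *slot;
        while (e != NULL && memcmp(e->gram, g, n) != 0)
            e = e->next;
        if (e == NULL) { /* first occurrence: push a new entry on the chain */
            e = malloc(sizeof *e);
            if (e == NULL)
                break;
            e->gram = g;
            e->count = 0;
            e->next = *slot;
            *slot = e;
        }
        e->count++;

        /* Track the maximum as we go; on ties keep the smaller n-gram. */
        if (e->count > best ||
            (e->count == best && memcmp(e->gram, best_gram, n) < 0)) {
            best = e->count;
            best_gram = e->gram;
        }
    }

    if (best_gram != NULL) {
        memcpy(out, best_gram, n);
        out[n] = '\0';
    }
    for (size_t b = 0; b < NBUCKETS; b++) {
        for (struct entry *e = buckets[b], *next; e != NULL; e = next) {
            next = e->next;
            free(e);
        }
    }
    free(buckets);
    return best;
}
```

Even in this stripped-down form the table needs a hash function, chain handling and cleanup, which gives a feel for how much of a one-hour budget it would consume.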

Recommended answer

If implementation efficiency is what matters and you are using C, I would initialize an array of pointers to the starts of n-grams in the string, use qsort to sort the pointers according to the n-gram that they are part of, and then loop over that sorted array and figure out counts.

This should execute fast enough, and there is no need to code any fancy data structures.
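A minimal sketch of the pointer-array-plus-`qsort` idea follows. It is one possible reading of the answer, not a definitive implementation: the function names `most_frequent_ngram` and `print_most_frequent_ngram`, the file-scope variable `g_ngram_len` and the output-buffer convention are all assumptions made for illustration. It also assumes that n-grams are case-sensitive and that "lexicographically smallest" means plain byte order as given by `memcmp`.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort() passes no context to its comparator, so the n-gram length is kept
   in a file-scope variable (an assumption of this sketch; not thread-safe). */
static size_t g_ngram_len;

/* Compares two array elements, each a pointer to the start of an n-gram. */
static int cmp_ngram(const void *a, const void *b)
{
    const char *pa = *(const char *const *)a;
    const char *pb = *(const char *const *)b;
    return memcmp(pa, pb, g_ngram_len);
}

/* Copies the most frequent n-gram into out (which needs n + 1 bytes) and
   returns its frequency, or 0 if no word has at least n characters. */
size_t most_frequent_ngram(const char *text, size_t n, char *out)
{
    size_t len = strlen(text), count = 0, best = 0;
    const char **starts;

    out[0] = '\0';
    if (n == 0 || len < n)
        return 0;
    starts = malloc(len * sizeof *starts);
    if (starts == NULL)
        return 0;

    /* Record every position where n consecutive non-space characters begin;
       run is the number of consecutive non-space characters ending at i. */
    for (size_t i = 0, run = 0; i < len; i++) {
        run = isspace((unsigned char)text[i]) ? 0 : run + 1;
        if (run >= n)
            starts[count++] = text + i + 1 - n;
    }

    g_ngram_len = n;
    qsort(starts, count, sizeof *starts, cmp_ngram);

    /* Equal n-grams are now adjacent, so each one forms a contiguous run. */
    for (size_t i = 0; i < count; ) {
        size_t j = i + 1;
        while (j < count && memcmp(starts[i], starts[j], n) == 0)
            j++;
        if (j - i > best) { /* strict '>' keeps the smallest n-gram on ties */
            best = j - i;
            memcpy(out, starts[i], n);
            out[n] = '\0';
        }
        i = j;
    }

    free(starts);
    return best;
}

/* Prints the result to stdout, as the problem statement asks. */
void print_most_frequent_ngram(const char *text, int ngramLength)
{
    char *buf;

    if (ngramLength <= 0)
        return;
    buf = malloc((size_t)ngramLength + 1);
    if (buf != NULL && most_frequent_ngram(text, (size_t)ngramLength, buf) > 0)
        puts(buf);
    free(buf);
}
```

With the example input from the question, `print_most_frequent_ngram("aaaab a0a baaab c", 3)` prints `aaa`. Because the array is sorted in ascending order, the first run that reaches the maximum length is also the lexicographically smallest, so the tie-breaking rule needs no extra code.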
