COMP 241: Project 5

Autocomplete

Autocomplete is pervasive in modern applications. As the user types, the program predicts the complete query (typically a word or phrase) that the user intends to type. Autocomplete is most effective when there are a limited number of likely queries. For example, the Internet Movie Database uses it to display the names of movies as the user types; search engines use it to display suggestions as the user enters web search queries; cell phones use it to speed up text input.

In these examples, the application predicts how likely it is that the user is typing each query and presents to the user a list of the top-matching queries, in descending order of likelihood. These likelihoods are determined by historical data, such as box office revenue for movies, frequencies of search queries from other Google users, or the typing history of a cell phone user. For the purposes of this assignment, you will have access to a corpus of English text from which you will deduce how likely it is for certain words to be autocompleted in various ways.

The performance of autocomplete functionality is critical in many systems. For example, consider a search engine (such as Google) which runs an autocomplete application on a server farm. According to one study, the application has only about 50 ms to return a list of suggestions for it to be useful to the user. Moreover, in principle, it must perform this computation for every keystroke typed into the search bar and for every user!

In this assignment, you will implement autocomplete in three phases. First, you will take a corpus of text and count which words appear most frequently. This will be done using a hash table of your creation. Second, you will sort the words in descending order of frequency (so the most common words appear at the beginning of the list). This will be done using the quicksort algorithm. Third, you will create a different hash table which associates every possible word prefix (a partial word that would be typed in by the user) with the word that the prefix should autocomplete to. For instance, the word prefix “th” might be autocompleted in various ways, but the most likely word is “the,” because it appears more frequently in English text than other words that start with “th.”
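To make the three phases concrete, here is a minimal sketch of the overall idea. It is not the required solution: it uses unordered_map and std::sort as stand-ins for the hand-written hash table and quicksort that the assignment asks for. The function names (count_words, sort_by_frequency, build_autocomplete) are hypothetical, and two details are assumptions: ties are broken alphabetically, and a complete word counts as a prefix of itself.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;  // (word, frequency)

// Phase 1: count how often each whitespace-separated word occurs.
std::unordered_map<std::string, int> count_words(const std::string& text) {
    std::unordered_map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++counts[word];  // operator[] creates a zero count on first sight
    }
    return counts;
}

// Phase 2: list the words in descending order of frequency.
// std::sort is a stand-in here; the assignment asks for quicksort.
std::vector<WordCount> sort_by_frequency(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<WordCount> words(counts.begin(), counts.end());
    std::sort(words.begin(), words.end(),
              [](const WordCount& a, const WordCount& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;  // assumption: alphabetical tie-break
              });
    return words;
}

// Phase 3: map every prefix to the most frequent word that starts with it.
// Words arrive most-frequent first, so the first word to claim a prefix wins.
std::unordered_map<std::string, std::string> build_autocomplete(
        const std::vector<WordCount>& sorted_words) {
    std::unordered_map<std::string, std::string> completions;
    for (const WordCount& wc : sorted_words) {
        const std::string& word = wc.first;
        for (std::string::size_type len = 1; len <= word.size(); ++len) {
            // emplace leaves the map unchanged if the prefix is already present
            completions.emplace(word.substr(0, len), word);
        }
    }
    return completions;
}
```

Because the words are processed in descending order of frequency, each prefix only needs to be stored the first time it is seen.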

Purpose

The purpose of this project is twofold. First, you will learn how to implement a hash table and the quicksort algorithm. Second, you will learn about some C++ classes that are useful in the real world, namely list (the C++ built-in doubly-linked list) and unordered_map (the C++ built-in hash table class).

Getting started

Download the starter code and read through it. Download the sample text files as well. Here’s what they mean:

The flow of the program is as follows:

Step 1: Implement the self-designed hash table

The Hashtable.h and Hashtable.cpp files implement a hash table that maps strings to integers. The code is designed to be generic, referring only to “keys” (strings) and values (integers), but will be used in this program to store words and their frequencies.

This hash table uses separate chaining to store entries. Specifically, each key-value pair is maintained in a struct called an entry. The Hashtable class maintains a vector of linked lists of entries. We use the C++ built-in list class which implements a doubly-linked list.
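The layout described above can be sketched roughly as follows. This is an illustration only, not the starter code: the class name ChainedTable, the member names, and the put/get interface are all hypothetical, and std::hash stands in for whatever hash function the assignment specifies.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

// One key-value pair. The field names are assumptions for this sketch.
struct entry {
    std::string key;
    int value;
};

// Hypothetical chained hash table; not the Hashtable class from the starter code.
class ChainedTable {
public:
    // num_buckets must be greater than zero.
    explicit ChainedTable(std::size_t num_buckets) : table(num_buckets) {}

    // Insert a new key, or overwrite the value of an existing one.
    void put(const std::string& key, int value) {
        std::list<entry>& bucket = table[index_of(key)];
        for (entry& e : bucket) {
            if (e.key == key) {
                e.value = value;
                return;
            }
        }
        bucket.push_back(entry{key, value});
    }

    // If the key is present, store its value in out and return true.
    bool get(const std::string& key, int& out) const {
        for (const entry& e : table[index_of(key)]) {
            if (e.key == key) {
                out = e.value;
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::list<entry>> table;  // one linked list (chain) per bucket

    // Map a key to a bucket index. std::hash is a placeholder hash function.
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % table.size();
    }
};
```

Keys that hash to the same bucket simply share a chain, so a lookup only has to walk one short list rather than the whole table.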

I suggest the following steps to write this class:

At this point, you should have a working hash table!

Step 2: Implement reading the corpus text file and creating the hash table

Step 3: Implement quicksort

Step 4: Implement the construction of the autocomplete hash table.

Step 5: Test your code

Testing and submitting

Test your output against mine for the small.txt and nyt.txt examples.

Your output should match mine for both of these test cases.

Upload main.cpp, Hashtable.h, and Hashtable.cpp.

Hints, tips, and miscellaneous

The C++ list class

C++ has a built-in doubly-linked list class called (unsurprisingly) list. You may use it by putting #include <list> at the top of your code. We will be implementing a hash table using chaining, so we’ll use a table that stores linked lists of entries. The reference material is at http://en.cppreference.com/w/cpp/container/list, but I’ll give you the highlights. To introduce what a list can do, we’ll simplify things and use a list of ints:

Common operations:

list<int> mylist;      // start with an empty list of ints
mylist.push_back(5);
mylist.push_back(7);
mylist.push_front(3);  // mylist is now [3, 5, 7]
for (int item : mylist) {
  cout << item;
}

This is a useful idiom that you will use frequently to iterate over a list. However, what if we want to change items within the list as we’re iterating? This won’t work:

for (int item : mylist) {
  item++;
}

The issue is that item is a new variable that is a copy of the actual data from the internal linked list node. However, C++ has a slick way of giving us access directly to the internal data of the list: we can use a reference variable:

for (int& item : mylist) {
  item++;
}

Notice the ampersand in the for loop. This is similar to how pass-by-reference works in functions; it tells C++ that item should not be a copy of the internal integer from the linked list node, but rather it should refer directly to the contents of the list. That way, when we do a ++ operation on item, we’re directly modifying the contents of the list, rather than a copy.
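If you want to see the difference for yourself, the two loops can be wrapped in small functions and compared. The function names here (increment_copies and increment_in_place) are made up for this illustration.

```cpp
#include <list>

// Tries to increment every item, but only changes a loop-local copy.
void increment_copies(std::list<int>& items) {
    for (int item : items) {
        item++;  // modifies the copy, which is discarded after each iteration
    }
}

// Increments every item in place, because item refers to the list's own data.
void increment_in_place(std::list<int>& items) {
    for (int& item : items) {
        item++;
    }
}
```

Calling increment_copies on a list leaves it unchanged, while increment_in_place adds one to every element.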

Remember that your lists will contain entries, and you will be iterating over a list stored inside a vector, so you will want to do something like:

for (entry& p : table[some_index])
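As a fuller illustration of that loop, here is a sketch of a helper that adds one to the value stored for a key in a single bucket. The entry field names (key and value) and the function name increment_in_bucket are assumptions; check the starter code for the real definitions.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Assumed layout of an entry; the starter code's struct may differ.
struct entry {
    std::string key;
    int value;
};

// Adds 1 to the value stored for key in the bucket at some_index.
// Returns false if the key is not in that bucket.
bool increment_in_bucket(std::vector<std::list<entry>>& table,
                         std::size_t some_index,
                         const std::string& key) {
    for (entry& p : table[some_index]) {  // reference: changes reach the list
        if (p.key == key) {
            p.value++;
            return true;
        }
    }
    return false;
}
```

Without the ampersand, p would be a copy of each entry and the increment would be lost, exactly as in the int example above.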