Hi There!

I'm Dan Schlegel, an Associate Professor in the Computer Science Department at SUNY Oswego

Assignment 5 – File Index

In this assignment you will build a simple index for a set of text files. Specifically, you will build what’s known in the information retrieval field as an inverted index – mapping from tokens (“words”) to the set of documents containing those tokens. Implemented using a binary search tree, your program will allow users to retrieve the names of all documents containing a word or group of words, with limited wildcard mechanisms. A sample set of files to test your code on is available here.

Specifically, you will modify the BST class as follows:

  • Instead of nodes containing only a value, they contain both a key and a set of values (use, e.g., HashSet). The key represents the token, and the set of values represents the set of document names. Hint: your BSTNode class will have a definition beginning something like: private class BSTNode<K1 extends Comparable<K1>, V1>;
  • The key of a node is the criterion on which placement in (and retrieval from) the BST will depend;
  • When a key-value pair is inserted into the BST, if a node already exists for the key, you should add the new value to the set of values for the found node;
  • Add a method, get, which searches for a node and returns its values, or, if no such node exists, throws a NoSuchElementException;
  • Add a method, contains, which returns a boolean value corresponding to whether a given key exists in the tree;
  • Add a method, getInRange, which takes two arguments and retrieves all nodes with keys equal to or between the given arguments. The method should return the union of the set’s values from the found nodes. Be sure your method is implemented efficiently (i.e., don’t traverse the whole tree unnecessarily);
  • Add any additional methods you need to support these operations.

Write a class, TreeIndex, which includes:

  • A method returning all of the non-directory files in a directory;
  • A method for reading a file into a String;
  • A method which, given a String, breaks the String into its constituent tokens (“words”) using something like String.split or StringTokenizer. Note that we’ll reserve * as a special symbol, so you should delimit on it (among other symbols, probably). The method should then add the token – document name pairs to the tree;
  • A command processor which:
    • Asks the user which directory to load files from;
    • Allows the user to enter a single word, and returns a list of files containing that word;
    • Allows the user to enter a word ending in a *, indicating that the user would like to retrieve all documents containing the prefix entered (hint: use your getInRange method);
    • Allows the user to enter multiple words (each possibly with *), and returns a list of files which contain all of those words (set intersection on the individual words);
  • Any additional methods/fields you need to support these operations.

Opportunities for Extra Credit

There are several opportunities for extra credit on this assignment. Each of the following will result in your score being multiplied by 1.1. 

  • A word-based search index would not normally be implemented using a BST. Opportunities for an unbalanced tree, and the clunkiness of wildcard searches are only a couple of the reasons for this. One alternative is to use a trie, a kind of prefix tree. After completing the project using a BST, implement the project again using a trie and examine the benefits of each approach. 
  • Allow the user to add and remove files from the index after the tree is built. You’ll likely want to maintain a second index, mapping from each document to the set of tokens contained in that document. More efficient methods are also possible (but more difficult).
  • Create a graphical user interface for your program, including an interface for entering a query and displaying the results of a query allowing the user to view the underlying document.