Wednesday, July 21, 2010

C# Trie

*Update* I have developed a new an improved version of this data structure which you can find here.

For anyone interested, I have also written a Java Trie here.

In a recent project I have ventured into, I had to develop a set of classes in C# for a simple trie which are now released as open source (http://code.google.com/p/typocalypse/source/browse/#hg/Trie). This blog post shall describe how it works and how to use it.

Introduction to tries
Simple tries are a data structure for mapping strings to values such that it enables you to efficiently retrieve all the values whose key strings have a particular prefix. This works by storing all the strings in a tree where each node is a prefix to a number of strings and each edge is a character to append to the prefix in the parent node. The root node is the empty string which is a prefix to all strings.

As you traverse a path from the root downwards, you follow the edges which lead to prefixes of the string you are trying to match. Eventually if such a string exists, a node will be pointing to the value mapped by the string. By traversing the descendants of a particular node, you can retrieve all the stored strings which have that prefix in common and their corresponding value.

Here's a diagram example from wikipedia:
http://upload.wikimedia.org/wikipedia/commons/b/be/Trie_example.svg

Implemented trie
The implemented trie supports adding, removing and character-by-character prefix searching of string-value pairs.

The way it works is by keeping a reference in each node to a value object such that if the reference is null then the node does not point to a value object and hence the prefix is not a complete string. The type of the values is generic, any referencable type is allowed. Each node points to its children via a Dictionary object which maps characters to nodes, thereby allowing a fast search by following the characters.

The following is a description of each class:

TrieNode
This is a non-public class which represents the nodes of the trie, that is, the prefixes. However in this implementation, the prefix itself is not stored, only the last character of the prefix called the Key is stored in order to know which character was followed to access this node. It also stores the value (generic type), a pointer to the parent node and a dictionary which maps characters to children nodes. If this node terminates a string then it's value field points to an object, otherwise it is null.

PrefixMatcher
This is a non-public class which is dedicated to allow the user to search the trie for a particular string or strings with a particular prefix. It was designed for enabling the user to search for the values of said strings on a character by character basis rather than providing the whole string immediately so that it can be used to search for user input as the user types in the query. This requires the maintenance of a state so that it is known which prefix has been entered thus far as well as which node that prefix is found in so as to enable an efficient search. For this reason the search feature was kept in a class of its own. Each time the trie is modified with the removal of a string, the current search is cancelled and must be redone from first character so as to avoid the user from searching non-existent strings.

IPrefixMatcher
This is a public interface which abstracts the methods of the PrefixMatcher class so that the user is provided with an abstraction rather than a concrete class in order to search.

Trie
This is the main class which provides the features of the trie data structure. Searching is done through the public property "Matcher" which returns an instance of IPrefixMatcher. When a string is removed, the current search is cleared so as to avoid the user from searching non-existent strings.

Examples
Here is an example of how to use the implemented trie:
//Create trie
Trie < string > trie = new Trie < string > ();

//Add some key-value pairs to the trie
trie.Put("James", "112");
trie.Put("Jake", "222");
trie.Put("Fred", "326");

//Search the trie
trie.Matcher.NextMatch('J'); //Prefix thus far: "J"
trie.Matcher.GetPrefixMatches(); //[112, 222]
trie.Matcher.IsExactMatch(); //false
trie.Matcher.NextMatch('a');
trie.Matcher.NextMatch('m'); //Prefix thus far: "Jam"
trie.Matcher.GetPrefixMatches(); //[112]
trie.Matcher.NextMatch('e');
trie.Matcher.NextMatch('s'); //Prefix thus far: "James"
trie.Matcher.IsExactMatch(); //true
trie.Matcher.GetExactMatch(); //112

//Remove a string-value pair
trie.Remove("James");

Link to classes
http://code.google.com/p/typocalypse/source/browse/#hg/Trie

15 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. This example code does not seem to work, since the library classes require that the template parameter for Trie be a reference type...

    ReplyDelete
  3. Thanks for the heads up Eric! Fixed.

    ReplyDelete
  4. Hi, link to the class is broken.

    ReplyDelete
    Replies
    1. Are you sure naren? It works fine from my end. You're supposed to be taken to the directory containing the class files. It's more than one class.

      Delete
  5. Hi, I am having problems with the Tree. If I have those names in the Tree: {Carlos, Carla, Cesar} and i search for Cz, the method is returning the entire Tree! how do I procede?
    Pedro

    ReplyDelete
    Replies
    1. Hi Pedro, I think you're confused by the fact that the Trie object will ignore letters that do not match any existing strings. What's happening is that the 'C' is matched with Carlos, Carla, and Cesar but when you use 'z' as a next letter then the Trie will ignore it and remain focused on the 'C' you entered. It is up to you to check what the NextMatch method returns. If it returns false then that means that the letter you're trying to search by doesn't match any strings.

      Delete
  6. Following the above example I'm only getting the values printed out. Any suggestions how to print the keys associated with the values of GetPrefixMatches();

    Adam

    Thanks

    ReplyDelete
    Replies
    1. I can't believe that I forgot to add a method to get all the strings which share a prefix. I'll add one as soon as I can and release a new version.

      Delete
    2. Late reply, but here is the new version: http://geekyisawesome.blogspot.com/2015/04/new-and-improved-c-trie.html

      Delete
  7. how can i do merge sort on atrie

    ReplyDelete
    Replies
    1. I'm not sure what you mean Ayman. A trie isn't meant to be used for sorting. Maybe you can do a depth-first traversal and visit the children in character order in order to get all the strings in sorted order?

      Delete
  8. Replies
    1. Hello Ayman. Last time I tried this code it worked well so I'm not sure why you're getting this error. Have you tried the new version of the code at
      http://geekyisawesome.blogspot.com/2015/04/new-and-improved-c-trie.html
      ?

      Delete