Java Industry News
Sets and Lists and Maps - Oh My!
Sets and Lists and Maps - Oh My!
Jan. 1, 2000 12:00 AM
(May 29, 2003) - Reusable objects for storing collections of data - Vector and HashTable - have been available to Java developers since the language's first release. In Java 1.2, Sun created the collections framework to more completely represent the types of data structures that application programmers require, while offering a set of base abstractions that allow for a separation between collection interface and the implementation. The result is called the collections framework, which represents three key interfaces - set, list, and map. While games can be played to make any implementation of any interfaces work a particular situation, choosing the right implementation can have a positive impact on performance and ease of development.
There are a number of advantages to using the collections framework. First and foremost, the most common data structures, algorithms for manipulating the data in those structures, and the science behind of all of it have been thoroughly thought out by the people who developed Java, so there is no need to write them yourself. When you can take a working LinkedHashMap for granted, programming effort has already been greatly reduced. Because of the organization of the collections framework, when the interfaces are used correctly, switching between collections implementations as an applications needs evolves becomes a trivial task.
The most important distinction to make is the differences between the interfaces. A set is a collection of elements where all elements are unique. An element is defined as unique if every element in the set fails the equals() test for that element. A list is an ordered collection of elements where the uniqueness rule is not applied, and duplicates of elements are allowed. The map interface offers a collection of data based on key-value pairs. The key, which in most cases can be any object, is what is used to look up the value. While maps represent mappings rather than simple collections, methods within the map interface allow you to extract the set of keys or values and allow them to be manipulated as collections.
If you take a look at the Javadoc for the collections interface you'll see that while the super interface of all collections specific a number of methods for modifying the collection - add, remove, addAll etc. - not all of these methods are supported for all implementations. The implementation may additionally throw an UnsupportedOperatoinException if that is the case. The implementations contained within Java's collections framework are complete and offer the full range of function one will need to effectively use that collection without fear of such an exception cropping up.
Collection Implementations - Sets
There are three concrete set implementations as part of Java 1.4: TreeSet, HashSet and LinkedHashSet.
The TreeSet is stored as a balanced tree, and guarantees that when it is iterated over it will return the elements of the set in ascending order according to the natural order of the elements (a-z for strings) or according to the comparator offered at the construction of the set. If the elements contained in the set implement the comparable interface, the element's compareTo method will dictate the ordering of the elements. One further note about TreeSet: TreeSet deviates from the standard uniqueness test for elements in that it uses the comparison method as a test for uniqueness rather than the equals method. If the comparison method returns 0, the element is consider a duplicate of another element in the TreeSet and will not be added. This means that the compare method requires some extra forethought.
The hashCode method had been core to Java Object superclass since the beginning. It is the means by which you can define an object as unique, as a function of any of the data that the object represents. HashSet takes advantage of that method to ensure that the elements of the set are unique. Unless hashCode is overridden, the hashCode of an object will be an integer representation of its location in memory. This is an important concept to keep in mind when building HashSets of objects that you have defined. If you fail to override the hashCode method your set may contain multiple copies of objects that essentially represent the same data because the set sees them as having unique hash codes. The string object overrides the hashCode method and creates a 'unique' integer representation of the string based on the content of the string, so if you try and add the word "carp" to the HashSet twice only one will be added.
One drawback of the HashSet is the order in which elements are returned when you iterate over them. Specifically that iteration order will be unpredictable, and there are times when you need to be able to predict the order of the elements of the collection, and natural ordering might not fit the bill. LinkedHashSet offers a solution to that problem by maintaining an ordered list in the background along with the HashSet. When a LinkedHashSet it iterated over the results will be returned in the same order in which they were added to the set, while still maintaining uniqueness.
Collection Implementations - Lists
There are two concrete list implementations included in Java 1.4: LinkedList and ArrayList.
The LinkedList is a sequential collection of elements which is constructed, effectively, by having each element point to its next neighbor. The benefits for this type of list come when you need to add elements to the middle of the list. That operation takes only a few reference manipulations to accomplish. The downside of such a set up is that trying to access data at random points in the list, since the only way to do that is to scan the list from the beginning and count until you get to the right point.
The ArrayList is also a sequential collection of elements; however it is backed by an array. Retrieving a element from an random position in an ArrayList is very efficient operation, compared to the linked list, but adding new elements to anywhere but the end of the list can take some time. This is due to the internal operations which need to adjust the size of the list and reposition its elements in order to accomplish the add.
Collection Implementations - Maps
There are five concrete map implementations included with Java 1.4: TreeMap, HashMap, LinkedHashMap, IdentityHashMap, and WeakHashMap.
As mentioned before, maps offer storage for a key/value pair and not just a simple collection. The implementation prefixes - Tree, Hash and LinkedHash - when applied to a map, refer to the properties of the set of keys for that map. Just as a TreeSet keeps the collection of elements sorted according to the rules mentioned above, a TreeMap keeps the keys ordered according to those same rules. When you iterate through the keySet of a TreeMap, the keys will be returned according to their natural order.
A HashMap is a map that insures the uniqueness of the keys by evaluating the hashCode method for each object that it stores and creating a new entry in the map if the key does not already exist there, or updating the value there currently if the entry does exist.
LinkedHashMap operates identically to HashMap, except it will return the list of keys in the same order that they where added when the set of keys is iterated over.
If you need strict reference equality, and more than likely you won't, you'd use an IndentityHashmMap which uses the == operator to see if the key already exists in the map.
If you don't know what a weak reference is, you won't need to use a WeakHashMap. Even if you do, there is a good chance you still won't need to use it. A WeakHashMap keeps keys as weak references that are allowed to be reclaimed by the garbage collector if needed.
Do You Really Need a Tree?
Probably not. The most common justification for a tree is the eventual need to display sorted data to a user. The drawback of keeping the tree is the amount of data reorganization that needs to happen whenever the collection is updated in order to keep the data sorted in memory. If the only need for sorted data is when it is displayed, you're probably best off keeping the data in a hash (set or map) and building the tree just when the data needs to be displayed. Fortunately, the collections framework makes this easy to accomplish. Create your hash and populate, then when you need to display your data build a new tree, TreeSet sorteddata = new TreeSet(hash). Then used the resulting tree for displaying to the user.
Define Your Collections in Terms of Their Interfaces.
Rather than defining HashSet set = new HashSet(), you should define Set set = new HashSet(). This will allow you to change the implementation to suit the changing needs of you code. For example, you might find yourself wanting to retrieve the data in the same order it was entered. By defining the interface, all you need to do is change your declaration to Set set = new LinkedHashSet() and you get the rest for free.
You Probably Need a HashMap.
Really. HashMap will cover most of the situations that you run into. The lists and sets are all useful and definitely have their place, but more often than not you'll find yourself looking to query the data of a particular element, and HashMap is the way to go.