DocuBurst: Visualizing Document Content using Language Structure
Christopher Collins, Sheelagh Carpendale, and Gerald Penn

Abstract
DocuBurst is the first visualization of document content which takes advantage of the human-created structure in lexical databases. We use an accepted design paradigm to generate visualizations which improve the usability and utility of WordNet as the backbone for document content visualization. A radial, space-filling layout of hyponymy (IS-A relation) is presented with interactive techniques of zoom, filter, and details-on-demand for the task of document visualization. The techniques can be generalized to multiple documents.

Collins, Christopher; Carpendale, Sheelagh; and Penn, Gerald. DocuBurst: Visualizing Document Content using Language Structure. Computer Graphics Forum (Proceedings of Eurographics/IEEE-VGTC Symposium on Visualization (EuroVis '09)), 28(3): pp. 1039-1046, June, 2009. Available in PDF.

EuroVis presentation

Download these slides as a PDF.

Media Coverage

[DocuBurst featured in Marti Hearst's wonderful book, Search User Interfaces]

[DocuBurst featured in the Toronto Star!]

[DocuBurst on 'information aesthetics' blog]

Interview with Margaux Watt of CBC Radio One Manitoba's "Up To Speed", 21 Feb, 2008.

A feature story on DocuBurst aired on FairChild TV "Media Focus" (cable 36 in Toronto), Friday, March 14, 2008!


Project Overview

‘What is this document about?’ is a common question when navigating large document databases. Overviews of document content have been an active area of research in information visualization for many years. Most reported work does not use human-annotated linguistic structure in the visualization; it provides detail on topic content, but not a consistent view that can be compared across documents. We provide a visualization of document content based on the human-annotated IS-A noun hierarchy of WordNet, embedded in the multi-view visualization system WordNet Explorer. DocuBurst uses the IS-A relation in WordNet to cluster related terms and to propagate counts to more general concepts. For example, if the relation "robin IS-A bird" occurs in WordNet, then occurrences of "robin" are also counted toward "bird". In this way, more specific terms contribute to the visual significance of general themes.
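The count propagation described above can be sketched in a few lines of Python. This is a minimal illustration and not the DocuBurst implementation: the `IS_A` table is a hypothetical toy hierarchy rather than real WordNet data, and it assumes each term has at most one parent, whereas WordNet organizes word senses into synsets that may have several hypernyms.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> parent), standing in for WordNet's IS-A relation.
IS_A = {
    "robin": "bird",
    "sparrow": "bird",
    "bird": "animal",
    "dog": "animal",
}

def propagate_counts(word_counts, is_a):
    """Add each word's count to the word itself and to every ancestor above it."""
    totals = defaultdict(int)
    for word, count in word_counts.items():
        node = word
        while node is not None:       # walk up the IS-A chain until the root
            totals[node] += count
            node = is_a.get(node)     # None once there is no more general concept
    return dict(totals)

# Three occurrences of "robin" also count toward "bird" and "animal".
cumulative = propagate_counts({"robin": 3, "dog": 2, "bird": 1}, IS_A)
```

With these toy counts, "bird" accumulates its own occurrence plus the three for "robin", and "animal" accumulates every occurrence in the document.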

The combined structure of WordNet hyponymy and document lexical content is visualized in WordNet Explorer as the DocuBurst visualization, which uses a radial, space-filling layout technique. The root node is shown as a circle. Every other node is assigned a sector of an annulus whose angular width is a portion of its parent node’s angular width. Angular width can be either (a) proportional to the number of leaves in the subtree rooted at that node (leaf count) or (b) proportional to the number of word occurrences counted for synsets in the subtree rooted at that node (word count). The width of each annulus is maximized so that all visible graph elements fit within the display space (on initial display with neutral zoom factor).
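The two angular-width options can be illustrated with a small sketch. It is written under stated assumptions and is not taken from the DocuBurst source: the tree is a hypothetical parent-to-children dictionary, angles are in degrees, and each parent's full sector is divided among its children in proportion to their weights. The function names `leaf_count` and `radial_layout` are invented for this example.

```python
def leaf_count(tree, node):
    """Number of leaves in the subtree rooted at node (a leaf counts as 1)."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(tree, child) for child in children)

def radial_layout(tree, root, weight, full_circle=360.0):
    """Return {node: (depth, start_angle, angular_width)}, angles in degrees."""
    placed = {}

    def place(node, depth, start, width):
        placed[node] = (depth, start, width)
        children = tree.get(node, [])
        total = sum(weight(child) for child in children)
        if total == 0:
            return  # no weight to divide the sector by; children are not placed
        angle = start
        for child in children:
            # Each child receives a share of the parent's sector.
            child_width = width * weight(child) / total
            place(child, depth + 1, angle, child_width)
            angle += child_width

    place(root, 0, 0.0, full_circle)
    return placed

# Hypothetical toy tree (parent -> children) and cumulative word counts.
TREE = {"animal": ["bird", "dog"], "bird": ["robin", "sparrow"]}
CUMULATIVE = {"animal": 6, "bird": 4, "dog": 2, "robin": 3, "sparrow": 0}

by_leaves = radial_layout(TREE, "animal", lambda n: leaf_count(TREE, n))  # option (a)
by_words = radial_layout(TREE, "animal", lambda n: CUMULATIVE.get(n, 0))  # option (b)
```

In leaf-count mode every leaf gets an equal slice, so the shape of the hierarchy stays the same from one document to the next. In word-count mode, terms that do not occur (here "sparrow") collapse to zero width, so the layout itself reflects the document.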

Document content is visualized through the transparency of the fill colour of the nodes: highly coloured nodes have many occurrences, and almost transparent nodes have few. Gray is also used to distinguish nodes with zero occurrence counts. Words and senses that are more prominent in the document of interest stand out easily against a more transparent context.
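One way to picture this encoding is a function from an occurrence count to an RGBA fill. The sketch below is an assumption for illustration only: the page does not state the actual scaling DocuBurst uses, so the linear mapping, the base colour, the gray value, and the `min_alpha` floor are all hypothetical choices.

```python
def node_fill(count, max_count, base_rgb=(0, 128, 0), min_alpha=0.1):
    """Map an occurrence count to an (r, g, b, alpha) fill colour.

    Assumed scheme: gray for zero counts, otherwise a linear ramp from
    min_alpha (rare) to 1.0 (as frequent as the most frequent node).
    """
    if count <= 0 or max_count <= 0:
        return (128, 128, 128, 1.0)   # gray marks nodes that never occur
    alpha = min_alpha + (1.0 - min_alpha) * (count / max_count)
    return (*base_rgb, min(alpha, 1.0))

rare = node_fill(1, 10)      # faint fill
common = node_fill(10, 10)   # fully opaque fill
absent = node_fill(0, 10)    # gray
```

A nonzero floor such as `min_alpha` keeps rarely occurring nodes distinguishable from the background, which is one reason to treat zero counts as a separate gray case.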

To use DocuBurst, a user loads a document of interest into the visualization and chooses a WordNet node at which to root it. Below, "idea" was chosen as the root, so the occurrences of concepts that fall under "idea" appear in the visualization. The gold-coloured nodes indicate results of a search for nodes beginning with 'pl'.

The visualization can be used to drill down into the loaded document. A paragraph browser with a fish-eye lens, at the right of the DocuBurst display, shows which paragraphs in the document, from beginning to end (top to bottom), contain the selected node. Clicking any paragraph number shows the full text of that paragraph in the details window, with occurrences of the selected node highlighted in the text.
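The lookup behind the paragraph browser can be approximated as follows. This is a simplified sketch, not the DocuBurst code: it assumes paragraphs are separated by blank lines, matches the given word forms as whole words regardless of case, and marks occurrences with square brackets in place of visual highlighting. The function name `paragraphs_containing` is hypothetical, and a fuller version would first map words to WordNet senses.

```python
import re

def paragraphs_containing(text, terms):
    """Return [(paragraph_number, marked_text)] for paragraphs mentioning any term."""
    if not terms:
        return []
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Whole-word, case-insensitive match on any of the given word forms.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for number, paragraph in enumerate(paragraphs, start=1):
        if pattern.search(paragraph):
            marked = pattern.sub(lambda m: "[" + m.group(0) + "]", paragraph)
            hits.append((number, marked))
    return hits

SAMPLE = "A plan is an idea.\n\nNothing here.\n\nPlans change; the plan stays."
hits = paragraphs_containing(SAMPLE, ["plan"])
```

Paragraph numbers start at 1 and follow document order, matching the top-to-bottom ordering of the browser.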

Ongoing work includes extending DocuBurst to multi-document comparison, developing an ambient display version of DocuBurst to accept RSS feeds, and planning a study of the effectiveness of DocuBurst for information retrieval.

This work was created with the excellent prefuse information visualization toolkit and the Java WordNet Library.


Source Code

The code for displaying and interacting with radial, space-filling trees in prefuse is open source and available for download. It is distributed as a zip file that can be imported into Eclipse. It depends on the prefuse information visualization toolkit and, unfortunately, is minimally documented at this time:


v1.6, Updated 7 July, 2009

demo code


 

Acknowledgements



 
