Developing and Evaluating a Document Visualization System for Information Management

Department of Computer Science
University of Toronto




Abstract

Documents have discourse structures that are common to documents of a particular type. We believe that document structure can be exploited in knowledge management so that information may be assessed quickly and displayed intelligently. This thesis describes a system that automatically extracts information from scientific articles and patent claims based on document structure. Based on the results of 101 documents, evaluation shows 72% accuracy for patents and 50% accuracy for scientific articles. To assess overall user satisfaction with the system, we use a usability method adapted from human-computer interaction. This novel method is described and shows clear advantages over existing summarization evaluation methods. Based on the 4 most and least accurate documents from the accuracy evaluation, our system scored a user satisfaction level of 2.7/5 for patents and 3.9/5 for scientific articles. We also discuss extending our usability evaluation to other natural language systems.


System Architecture

The major components of this system are preprocessing, segmentation, merging, and interface generation. First, relevant sections are extracted in the preprocessing stage. Extracted sections are passed into the segmentation module, where each sentence is assigned a category based on pre-defined hand-coded patterns. The segmentation module has a lexical analysis component commonly found in information extraction (IE) systems. Our system relies most heavily on scenario pattern matching, and it also draws on parts of partial syntactic and discourse analysis. Our patterns are hand-coded rather than machine-learned because the IE community generally agrees that hand-coded patterns and rules are more effective than machine-learned ones (MUC-7). Categorized sentences are passed into the merging module, where similar sentences are merged and common phrases are removed. The output is a set of point-form phrases or sentences in a template format. Finally, the interface component takes the textual template and transforms it into a graphical layout. Figure 1 shows the interaction of these components.

Figure 1: System Components
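As a rough illustration of the segmentation step, the sketch below assigns each sentence the first category whose hand-coded pattern matches, and falls back to OTHER. The cue patterns here are hypothetical and were invented for this example; the system's actual patterns are not listed on this page, and the real module also uses lexical, syntactic, and discourse analysis that this sketch omits.

```python
import re

# Hypothetical cue patterns, invented for illustration only.
# The category names come from the template slots; the regular
# expressions are assumptions, not the system's real patterns.
PATTERNS = [
    ("PROBLEM", re.compile(r"\b(problem|difficult|lack of)\b", re.I)),
    ("SOLUTION", re.compile(r"\bwe (propose|present|introduce)\b", re.I)),
    ("CLAIMS", re.compile(r"\b(results? show|outperforms?|advantage)\b", re.I)),
    ("EXTENSIONS", re.compile(r"\bfuture (work|directions?)\b", re.I)),
]


def segment(sentences):
    """Label each sentence with the first matching category, else OTHER."""
    labelled = []
    for sentence in sentences:
        category = "OTHER"
        for name, pattern in PATTERNS:
            if pattern.search(sentence):
                category = name
                break
        labelled.append((category, sentence))
    return labelled


if __name__ == "__main__":
    sample = [
        "We propose a new layout algorithm.",
        "Future work includes larger collections.",
        "The sky is blue.",
    ]
    for category, sentence in segment(sample):
        print(category, "|", sentence)
```

Because the first matching pattern wins, the order of the pattern list acts as a priority ranking; a real rule set would need to decide that ordering deliberately.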

Both scientific articles and patent claims propose unique solutions, with certain assumptions and claims, to various goals. We use these common goal-oriented elements to structure our templates. In reorganizing the information from text documents, we have created a canonical template with the slots shown in Table 1. This table also includes a ``dummy'' category, OTHER, which we have created to temporarily hold information that is irrelevant for our system. This category and its information do not appear in the final interface to the users.

Slot Name    Abbreviation  Description
GOAL         G             high-level goals, domain goals
PROBLEM      P             immediate goals
RELATED      R             cited or discussed related works
SOLUTION     S             overall proposed approach
SOFTGOAL     F             desired benefits of the area
ASSUMPTIONS  A             assumptions, definitions, beliefs, trends
METHOD       M             particular algorithms, steps, devices
CLAIMS       C             results, claims, advantages, limitations
EXTENSIONS   Z             extensions, future directions
OTHER        X             other information: e.g. background, evidence

Table 1: Template Slots
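One way to picture the canonical template is as a mapping from slot names to lists of extracted phrases. The sketch below is a minimal illustration under that assumption: the slot names and abbreviations are copied from Table 1, while the function names and the exact-duplicate check (a simplified stand-in for the merging step) are hypothetical.

```python
# Slot names and abbreviations copied from Table 1.
SLOTS = {
    "GOAL": "G", "PROBLEM": "P", "RELATED": "R", "SOLUTION": "S",
    "SOFTGOAL": "F", "ASSUMPTIONS": "A", "METHOD": "M",
    "CLAIMS": "C", "EXTENSIONS": "Z", "OTHER": "X",
}


def build_template(labelled_sentences):
    """Group (category, sentence) pairs into slots.

    Exact duplicates are dropped; this is only a simplified stand-in
    for merging similar sentences and removing common phrases.
    """
    template = {name: [] for name in SLOTS}
    for category, sentence in labelled_sentences:
        if sentence not in template[category]:
            template[category].append(sentence)
    return template


def visible_slots(template):
    """Return the slots shown to users: everything except OTHER."""
    return {name: items for name, items in template.items() if name != "OTHER"}


if __name__ == "__main__":
    template = build_template([
        ("GOAL", "support analysts"),
        ("GOAL", "support analysts"),
        ("OTHER", "background remark"),
    ])
    print(visible_slots(template)["GOAL"])
```

Keeping OTHER in the template but filtering it out at display time matches the description above: the category holds information temporarily and never reaches the interface.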

The corresponding graphical representation for the generic template is shown in Figure 2. Each node in the graph corresponds to a slot in the template, except for the GOAL and SOFTGOAL slots, which may have zero or more nodes in the graph.

Figure 2: Graphical View of Document Contents

The interface produces the graphical model of the template and shows textual details on the side when a node is clicked on. Figure 3 shows a sample use of the system prototype, with the title of the paper placed at the top of the screen.

Figure 3: Sample System Use

In Figure 3, the EXTENSIONS node is highlighted to show that it has been clicked on by the user. The details in the corresponding slot of this template appear on the right-hand side of the screen. Note that the graphical model in this figure does not have any GOAL nodes and has two SOFTGOAL nodes. As mentioned above, the number of nodes for GOAL and SOFTGOAL varies depending on what is written in the original document and what is found by the algorithm. The interface maintains the structural view of the paper while displaying details on demand.
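The mapping from template to graph can be sketched as follows. This is an illustration only: it assumes one node per extracted GOAL or SOFTGOAL item (the page says only that these slots may have zero or more nodes), and the node dictionaries and function names are hypothetical.

```python
# Slots that may produce zero or more nodes, as described above.
MULTI_NODE_SLOTS = {"GOAL", "SOFTGOAL"}


def template_to_nodes(template):
    """Turn a slot -> phrases mapping into a list of graph nodes.

    ASSUMPTION: each GOAL or SOFTGOAL phrase becomes its own node.
    Every other visible slot yields exactly one node; OTHER is hidden.
    """
    nodes = []
    for slot, items in template.items():
        if slot == "OTHER":
            continue
        if slot in MULTI_NODE_SLOTS:
            for index, item in enumerate(items, start=1):
                nodes.append({"id": f"{slot}-{index}", "slot": slot, "details": [item]})
        else:
            nodes.append({"id": slot, "slot": slot, "details": list(items)})
    return nodes


def details_for(nodes, node_id):
    """Return the text shown beside the graph when a node is clicked."""
    for node in nodes:
        if node["id"] == node_id:
            return node["details"]
    return []


if __name__ == "__main__":
    template = {
        "GOAL": [],
        "SOFTGOAL": ["faster review", "less reading"],
        "EXTENSIONS": ["apply to other document types"],
        "OTHER": ["background remark"],
    }
    nodes = template_to_nodes(template)
    print([node["id"] for node in nodes])
    print(details_for(nodes, "EXTENSIONS"))
```

With an empty GOAL slot and two SOFTGOAL phrases, the sample produces no GOAL node and two SOFTGOAL nodes, the same shape as the graph described for Figure 3.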

Our visionary system builds concepts across documents in large document collections. The graphical model with GOAL and PROBLEM nodes would make up an ontology of research areas, as shown in Figure 4.

Figure 4: Our Visionary Ontology

This ontology serves as the overview of research problems to tackle. Upon clicking on a particular node, the system zooms in to the structure of the node. By clicking on any one of the nodes on the screen, details appear on demand. A mock-up of this interface is shown in Figure 5. In this way, the ontology provides an overview of research problems. Zooming into a particular problem gives a further breakdown of that problem and the approaches taken towards it. The details of a particular solution are then displayed when a SOLUTION node has been clicked on. Our interface follows the visual information-seeking mantra (Shneiderman): ``Overview first, zoom and filter, then details-on-demand''.

Figure 5: Our Visionary System


Demo of Current Prototype

Note: This is an early version of the system... the graphical layout is not as nice as it should be.

First, to illustrate our goal-oriented graphical representation model, we take the paper titled ``Cubes Marching to the Beat of a Different Drum'' as an example. (The full paper and the extracted sections are both available for viewing.)

Graphical Interface
Using System's Segmentation Results
After Manual Verification

STILL NEED TO EXPLAIN THE ERRORS MADE BY THE SYSTEM, THE NUMBER OF FIXES MADE MANUALLY, AND THE NOTICEABLE DIFFERENCES BETWEEN THE TWO GRAPHICAL INTERFACES OF THE SAME DOCUMENT.


Accuracy Evaluation

Here, we give a brief overview of the accuracy of the segmentation phase of our system. Two views are presented: Table 2 shows the accuracy by domain and Table 3 shows the accuracy by category (template slot). These results are based on 93 documents -- 47 patent claims and 46 scientific articles.

Domain                             Nc   Nm   Ns  Precision  Recall
Pat 1: colour toy                 236  319  324     73.981  72.839
Pat 2: education mathematics      349  429  430     81.351  81.162
Pat 3: design blouse              254  386  400     65.803  63.500
Pat 4: modern chair               219  326  311     67.177  70.418
Sci 1: women and language         202  534  527     37.827  38.330
Sci 2: children-related HCI       428  769  791     55.656  55.656
Sci 3: interface-related HCI      260  507  502     51.282  51.792
Sci 4: computational linguistics  314  636  640     49.371  49.062

Table 2: Segmentation Results By Domain

Category       Nc    Nm    Ns  Precision  Recall
GOAL          232   242   252     95.867  92.063
PROBLEM       133   216   261     61.574  50.957
RELATED         6    17   113     35.294   5.309
SOLUTION       95   110   119     86.363  79.831
SOFTGOAL      202   329   300     61.398  67.333
ASSUMPTIONS     9    27   109     33.333   8.256
METHOD        782  1516  1031     51.583  75.848
CLAIMS        387   656   826     58.993  46.852
EXTENSIONS      3     4    42     75.000   7.142
OTHER         413   789   872     52.344  47.362
Total        2262  3906  3925     57.911  57.631

Table 3: Segmentation Results By Category

The highest-scoring domains are ``education mathematics'' for patents, with an F-score of 81.3, and ``children-related HCI'' for scientific articles, with an F-score of 55.7. These two are consistent with each other, because the documents in both domains discuss children and science education (including mathematics). The lowest-scoring domains are ``design blouse'' for patents, with an F-score of 64.6, and ``women and language'' for scientific articles, with an F-score of 38.1. Examining the results of each document in ``design blouse'', we see that the SOFTGOAL category performed very poorly in most cases (i.e., below 50%), while the PROBLEM, METHOD, and CLAIMS categories fluctuated in performance. For scientific articles, we expected the ``women and language'' domain to score the lowest because its style of writing was not very scientific -- many of the articles were narrative and contained many inline quotes.

On average across all the categories, F-score is 57.8 for just over 3900 units. In the order of highest to lowest for precision, the categories are:

GOAL, SOLUTION, EXTENSIONS, PROBLEM, SOFTGOAL, CLAIMS, OTHER, METHOD, RELATED, ASSUMPTIONS

In the order of highest to lowest for recall, the categories are:

GOAL, SOLUTION, METHOD, SOFTGOAL, PROBLEM, OTHER, CLAIMS, ASSUMPTIONS, EXTENSIONS, RELATED

Furthermore, patents consistently score better than scientific articles, which was also true in our pilot study. On average, the patents scored about 72% and scientific articles scored about 50% in this experiment. Again, we believe that these results are attributable to the well-structured writing that patents exhibit.
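The precision, recall, and F-score figures can be recomputed from the counts in Tables 2 and 3. The sketch below assumes that Nc is the number of correctly categorized units, Nm the number of units the system assigned, and Ns the number of units in the manually verified standard, and that the F-score is the harmonic mean of precision and recall. These definitions are not spelled out on this page; they are assumptions that agree with the printed figures for the rows used here.

```python
def precision_recall_f(nc, nm, ns):
    """Return (precision, recall, F-score) as percentages.

    ASSUMPTION: precision = Nc / Nm, recall = Nc / Ns, and the
    F-score is the harmonic mean of the two.
    """
    precision = 100.0 * nc / nm
    recall = 100.0 * nc / ns
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score


# (Nc, Nm, Ns) for a few rows, copied from Tables 2 and 3.
ROWS = {
    "Pat 2: education mathematics": (349, 429, 430),
    "Pat 3: design blouse": (254, 386, 400),
    "Sci 1: women and language": (202, 534, 527),
    "Total": (2262, 3906, 3925),
}

if __name__ == "__main__":
    for name, counts in ROWS.items():
        p, r, f = precision_recall_f(*counts)
        print(f"{name}: P={p:.1f} R={r:.1f} F={f:.1f}")
```

Sorting the Table 3 rows by the first or second returned value gives the precision and recall orderings listed above.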

Readers interested in seeing the results of each document are referred to Appendix C.


Usability Evaluation

We adapted the heuristic evaluation method to measure the usability of our system. The newly developed heuristic principles are:

  1. Conciseness
  2. Retention
  3. Coherence
  4. Consistency
  5. Informativeness
  6. Comprehensibility
  7. Fit for Audience
  8. Fit for Purpose

Our experiment used a question-answering task modeled after the reviewing task of a conference referee. In this way, evaluators acted as reviewers using only the system output. To keep the workload of the evaluators to a minimum, we limited the task to three questions:

  1. What is the problem addressed by this work? Does it describe why the problem is significant?
  2. Does the work present the approach taken to solve the problem targeted? Is the design or implementation of a system described in terms of key ideas of the approach?
  3. What are the contributions of this work? Are the benefits and limitations clear? Are the results positive or negative?

The files used for the pilot study and the real evaluation are listed in Tables 4 and 5. To assess the usability of the system, only the graphical prototype was made available to the evaluators. Only after understanding the procedure and carrying out the question-answering task were the evaluators allowed access to the original full-length document and the extracted sections used by the system. They were then asked to evaluate the quality of the system's output according to the 8 heuristic principles.

Files                              Excerpts            Interface
Future Directions for HCI          Extracted Sections  Graphical Layout
Educational Treasure Hunt Game     Extracted Sections  Graphical Layout

Table 4: Files Used in the Pilot Usability Evaluation

Files                                                            Excerpts            Interface
Playing structure and modules therefor                           Extracted Sections  Graphical Layout
Method for remediation based on knowledge and/or functionality   Extracted Sections  Graphical Layout
Counting on Frank: Postmortem of an Edutainment Product          Extracted Sections  Graphical Layout
Children as Our Technology Design Partners                       Extracted Sections  Graphical Layout

Table 5: Files Used in the Real Usability Evaluation

Evaluators took 1.5 to 2 hours each to complete a session of 2 to 4 documents (two in the pilot study and four in the real evaluation). Evaluators were asked to type their comments in an email and summarize their results on a 5-point scale. Principles that could not be assessed in the session were marked as non-applicable, which corresponds to a score of 0. A summary of the scores for each principle is presented in Table 6 (this table only shows the scores of the real evaluation).

Principle          P1  P2  Total P's  S1  S2  Total S's  Total
Conciseness        11  11         22  12  16         28     50
Retention          11   9         20  17  14         31     51
Coherence          10   9         19  14  17         31     50
Consistency        16  11         27  17  18         35     52
Informativeness    11   9         20  15  14         29     49
Comprehensibility   4   8         12  17  15         32     44
Fit for Audience   13  12         25  16  16         32     57
Fit for Purpose    15  13         28  17  17         34     62

Table 6: Summary of Usability Results by Principles
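The per-principle totals in Table 6 can be turned into mean satisfaction scores on the 5-point scale. The sketch below uses the patent and scientific-article totals from Table 6 and assumes two documents of each type (per Table 5). The number of evaluators is not stated on this page; the value 4 is an assumption, chosen because it reproduces the 2.7/5 and 3.9/5 figures quoted in the abstract.

```python
# "Total P's" and "Total S's" columns from Table 6, one entry per principle.
PATENT_TOTALS = [22, 20, 19, 27, 20, 12, 25, 28]
SCIENCE_TOTALS = [28, 31, 31, 35, 29, 32, 32, 34]

DOCS_PER_TYPE = 2  # two patents and two scientific articles (Table 5)
EVALUATORS = 4     # ASSUMPTION: not stated on this page


def mean_score(totals, docs=DOCS_PER_TYPE, evaluators=EVALUATORS):
    """Average rating per principle, per document, per evaluator.

    Non-applicable ratings were scored 0 and are already included
    in the totals, so they pull the mean down.
    """
    ratings = len(totals) * docs * evaluators
    return sum(totals) / ratings


if __name__ == "__main__":
    print(f"patents: {mean_score(PATENT_TOTALS):.1f}/5")
    print(f"scientific articles: {mean_score(SCIENCE_TOTALS):.1f}/5")
```

A different evaluator count would scale both means by the same factor, so the gap between patents and scientific articles does not depend on this assumption.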

All of this data can be summarized in two points:

Comments from evaluators were consistent with these results: ``If the correct information was pulled out consistently, this system would be very useful'' and ``I really liked the graphical representation of the parts of the document and the fact that a user can click and see the summarized data on the right'' pointed to the usefulness of the system, while ``Sometimes the info was dead on and other times I couldn't decipher it'' and ``The summary really lack in comprehensibility'' pointed to problems of comprehensibility. In general, the patents did worse than the scientific articles, which the evaluators also noted. A general problem across all the documents was caused by the inaccuracy of the segmentation phase -- sentences that are miscategorized are either put into the wrong category or left out as irrelevant. An evaluator wrote that ``I found myself switching back and forth between the method and the [claims] to figure out what was going on'', which indicated that the user needed extra effort to process the document. Another evaluator supported this comment: ``Information is not always where it should be. To identify the main purpose of the document for example I had to look at the claims and the methods and do some reasoning to understand what is the exact problem captured by the system.'' This problem needs to be addressed by improving the segmentation phase.


Thesis

These are some papers written during the course of the project and the thesis itself. Papers are listed chronologically. Everything is in .ps.gz format.
  1. thesis proposal titled: Automatic Document Structuring to Support Analysts (May 2001)
  2. thesis (latest) draft titled: Developing and Evaluating a Document Visualization System for Information Management (January 2002)
  3. thesis in parts: