WebCat Documentation

3.4 Domain Knowledge-base

One of the key properties of WebCat is that the domain of the sources is known ahead of time. Once the domain is chosen, we either rely on an existing ontology or build an ontology of the domain ourselves. We build the ontology from all available terms, at the finest granularity possible. For example, in the case of clothing catalogs, suppose one catalog names the first branch above the leaves shirts, while another names it casual shirts and places shirts one level above. We then keep the lowest level possible, so our hierarchy is shirts -> casual shirts. Keeping the finest level helps ensure high retrieval accuracy.
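The merging rule in the shirts example can be sketched as follows. This is only an illustration, assuming each catalog contributes its category paths as lists of terms ordered from the top level down; the nested-dictionary representation and the function names build_ontology and render are hypothetical and are not part of WebCat.

```python
def add_path(ontology, path):
    """Insert one top-down category path into a nested-dict ontology."""
    node = ontology
    for term in path:
        # Reuse the node if the term is already known, otherwise create it.
        node = node.setdefault(term, {})
    return ontology


def build_ontology(paths):
    """Merge the category paths of all catalogs into a single hierarchy."""
    ontology = {}
    for path in paths:
        add_path(ontology, path)
    return ontology


def render(ontology, prefix=()):
    """List every chain that ends at a finest-level term, e.g. 'a -> b'."""
    lines = []
    for term in sorted(ontology):
        chain = prefix + (term,)
        if ontology[term]:
            lines.extend(render(ontology[term], chain))
        else:
            lines.append(" -> ".join(chain))
    return lines


if __name__ == "__main__":
    catalog_a = [["shirts"]]                   # shirts sits directly above the leaves
    catalog_b = [["shirts", "casual shirts"]]  # casual shirts sits below shirts
    for line in render(build_ontology(catalog_a + catalog_b)):
        print(line)
```

Because the coarser path from the first catalog is a prefix of the finer path from the second, the merged hierarchy keeps the finer level without duplicating shirts.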

The ontology is not fixed; it is updated whenever necessary. A report is generated after each integration. Using this report, we can check whether the ontology still accommodates all the terms encountered, or whether additional terms need to be added.
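One way such a check could work is sketched below. The contents of the actual report are not described here, so this assumes the report boils down to a list of catalog terms that have no counterpart in the ontology; the function name missing_terms and the case-insensitive comparison are assumptions made for illustration.

```python
def missing_terms(ontology_terms, catalog_terms):
    """Return the catalog terms that the ontology does not yet cover."""
    # Compare case-insensitively and ignore surrounding whitespace (assumption).
    known = {term.strip().lower() for term in ontology_terms}
    seen = {term.strip().lower() for term in catalog_terms}
    return sorted(seen - known)


if __name__ == "__main__":
    ontology = ["shirts", "casual shirts"]
    new_catalog = ["Shirts", "dress shirts"]
    print(missing_terms(ontology, new_catalog))
```

An empty result would suggest that the ontology still accommodates the new catalog; any term that is listed is a candidate for addition.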

The ontology is stored in a knowledge base, which is implemented using the description file format.

3.5 Integrator

As the name suggests, the Integrator assimilates a new catalog into the virtual catalog. The Integrator reads a description file, combines the new data with the existing data, and builds a new version of the virtual catalog.

The Integrator maps the structure of the new catalog to the structure of the virtual catalog. In most cases, however, only the branches have to be mapped, which amounts to no more than a few tens of records.
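The following sketch illustrates why mapping only the branches can be enough: each leaf is placed under the virtual-catalog branch to which its own branch maps. The data layout (items as dictionaries with a "branch" key, the virtual catalog as a dictionary of lists) and the function name integrate are hypothetical; the description file format itself is not modeled here.

```python
def integrate(virtual_catalog, new_items, branch_map):
    """Place each new item under the virtual-catalog branch its branch maps to.

    Returns the sorted list of source branches that have no mapping yet,
    so that they can be reviewed after the integration.
    """
    unmapped = set()
    for item in new_items:
        target = branch_map.get(item["branch"])
        if target is None:
            unmapped.add(item["branch"])
            continue
        virtual_catalog.setdefault(target, []).append(item)
    return sorted(unmapped)


if __name__ == "__main__":
    virtual = {"shirts/casual shirts": []}
    # One mapping record per branch, regardless of how many leaves it holds.
    branch_map = {"Casual Tops": "shirts/casual shirts"}
    items = [
        {"label": "oxford shirt", "branch": "Casual Tops"},
        {"label": "polo shirt", "branch": "Casual Tops"},
        {"label": "wool scarf", "branch": "Accessories"},
    ]
    print(integrate(virtual, items, branch_map))
    print(len(virtual["shirts/casual shirts"]))
```

The mapping table grows with the number of branches, not with the number of leaves, which is consistent with it staying small.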

3.6 Query Engine

The Query Engine combines an SQL query engine (SQL-QE) with an additional component, the Query Filter (QF), which performs full-text search on specified attributes.

We need the Query Filter because most of the catalog data is semi-structured: some attributes may or may not be present, even within the same catalog. Therefore, we use the SQL-QE to query attributes that we know always exist, such as labels, catalog names, and prices. We use the Query Filter to "filter" on additional attributes such as size, color, and any other description from the same clothing catalog.

To explain how the Query Filter works, consider the size attribute. No two catalogs use the same range of sizes. Furthermore, size often comes in combination with color. Consequently, instead of using a separate attribute for each size, we store all size values in a single multi-valued attribute. When a query requests a specific size, we first retrieve the size attribute using the SQL engine, and then filter the result by checking each value in the multi-valued size field.
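The two-step evaluation described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the WebCat implementation: the table name items, its columns, the comma-separated encoding of the multi-valued size field, and the function names are all hypothetical.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory catalog table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (label TEXT, catalog TEXT, price REAL, size TEXT)"
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?, ?)",
        [
            ("oxford shirt", "CatalogA", 35.0, "S, M, L"),
            ("polo shirt", "CatalogB", 28.0, "38, 40, 42"),
            ("silk shirt", "CatalogA", 90.0, "M, L"),
            ("tee", "CatalogB", 12.0, None),  # size attribute is absent
        ],
    )
    return conn


def query_by_size(conn, max_price, wanted_size):
    """Step 1: SQL on an always-present attribute. Step 2: filter on size."""
    rows = conn.execute(
        "SELECT label, catalog, price, size FROM items "
        "WHERE price <= ? ORDER BY label",
        (max_price,),
    ).fetchall()
    wanted = wanted_size.strip().lower()
    result = []
    for label, catalog, price, size in rows:
        # Split the multi-valued field and check each value in turn.
        values = [v.strip().lower() for v in (size or "").split(",")]
        if wanted in values:
            result.append((label, catalog, price))
    return result


if __name__ == "__main__":
    print(query_by_size(demo_connection(), 50, "M"))
```

Keeping the sizes in one field means that catalogs with different size ranges (letters in one, numbers in another) share the same schema, and rows without a size are simply filtered out in the second step.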

Once the query answer is computed, it is formatted as a table that shows all the specified details and the URL of each leaf (see Apparel WebCat). The user can then navigate through the answer and pick a leaf for closer examination.