Thesauri and Vocabulary Control - Principles and Practice

Thesaurus principles and practice

By Leonard D. Will

This paper was originally presented at a workshop, "Thesauri for museum documentation", held at the Science Museum, London, on 24th February 1992. The proceedings of the workshop have been published by the MDA (formerly the Museum Documentation Association and now known as the Collections Trust), http://www.collectionstrust.org.uk/.

Contents

1. Why do we need a thesaurus?

2. A limited list of indexing terms

3. Hierarchical relationships

4. Related terms

5. Definitions and scope notes

6. Form of the thesaurus

7. Special factors relating to museum objects

o Abstract terms and disciplines

o Singular or plural terms

o Parts and wholes

o Polyhierarchies

8. Use of a thesaurus when cataloguing

9. Use and modification of existing thesauri

10. Thesaurus maintenance

11. What sort of fields is a thesaurus appropriate for?

12. Other subject retrieval techniques

o Classification schemes

o Free text

13. Bibliography on thesaurus construction and use

1 Why do we need a thesaurus?

One of the reasons for documenting our collections is that we wish to be able to find objects of a particular kind. We may ask "What thermometers do we have in the collection?", "What arrowheads?", "What frocks?", "What whales?" or "What textile machinery?"

The simple answer is that we give each item a "name", and then we can create a file of index cards, or a computer file, in which we can search for these names and expect to find all the appropriate items. This is the concept of the simple name field in the MDA data structure. It is straightforward at first, and seems intuitive, but once you have documentation which has been built up over time, perhaps by many different people, problems creep in unless there are rules and guidelines to maintain consistency.

The word thesaurus is a rather fancy name, which has acquired a certain mystique, because it is often bandied about as something necessary for effective information retrieval, but something which sounds as though it will involve a lot of work. I have often heard curators say "That's all very well if you have the time and resources, but I have this great backlog of cataloguing to do, and I would never get through the half of it if I had to spend time setting up anything as complicated as a thesaurus. What I need is a simple list of names which I can use to index my objects."

My main purpose in this paper is to make three points:

· A simple name list without some rules will rapidly become a mess.

· Only three simple rules are needed; using them will make life easier for you, not harder.

· So long as you stick to these rules, you can take an existing thesaurus and adapt it to your needs; you are not limited to using the terms which are listed in it already, and you are not obliged to use more detail than you need.

What are these rules?

1. Use a limited list of indexing terms, but plenty of entry terms [non-preferred synonyms]
-- link these with USE and USE FOR (UF) relationships.

2. Structure terms of the same type into hierarchies
-- link these with BROADER TERM/NARROWER TERM (BT/NT) relationships.

3. Remind users of other terms to consider
-- link these with RELATED TERM/RELATED TERM (RT/RT) relationships.

I shall consider each of these rules in turn.

2 A limited list of indexing terms

A major purpose of a thesaurus is to match the terms brought to the system by an enquirer with the terms used by the indexer. Whenever there are alternative names for a type of item, we have to choose one to use for indexing, and provide an entry under each of the others saying what the preferred term is. If we index all full-length ladies' garments as dresses, then someone who searches for frocks must be told that they should look for dresses instead.

This is no problem if the two words are really synonyms, and even if they do differ slightly in meaning it may still be preferable to choose one and index everything under that. I do not know the difference between dresses and frocks, but I am fairly sure that someone searching a modern clothing collection who was interested in the one would also want to see what had been indexed under the other. We normally do this by linking the terms with the relationships USE and USE FOR, thus:

Dresses	USE FOR	Frocks
Frocks	USE	Dresses

This may be shown in a printed list, or it may be held in a computer system, which can make the substitution automatically. If an indexer assigns the term Frocks, the computer will change it to Dresses, and if someone searches for Frocks the computer will search for Dresses instead, so that the same items will be retrieved whichever term is used. A friendly computer will explain what it is doing, so that the user is not puzzled by being given items with terms different from those asked for.
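The automatic substitution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `USE`, the functions `resolve` and `use_for`, and the wording of the message are all hypothetical, and a real system would hold the entry vocabulary in its database rather than in a dictionary.

```python
# Hypothetical entry vocabulary: non-preferred term -> preferred term (USE).
USE = {
    "frocks": "dresses",
    "prams": "perambulators",
    "baby carriages": "perambulators",
}


def resolve(term):
    """Return (preferred_term, note); note explains any substitution made."""
    key = term.strip().lower()
    preferred = USE.get(key, key)
    if preferred != key:
        return preferred, f"'{key}' is not an indexing term; using '{preferred}' instead"
    return preferred, None


def use_for(preferred):
    """Derive the reciprocal USE FOR list from the USE table."""
    return sorted(t for t, p in USE.items() if p == preferred)


term, note = resolve("Frocks")
print(term)
print(note)                      # the "friendly" explanation shown to the user
print(use_for("perambulators"))
```

The same `resolve` step would be applied both when an indexer assigns a term and when an enquirer searches, which is what guarantees that the two meet on the same indexing term.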

USE and USE FOR relationships are thus used between synonyms or pairs of terms which are so nearly the same that they do not need to be distinguished in the context of a particular collection. Other examples might be:

Cloaks	USE	Capes
Capes	USE FOR	Cloaks

Nuclear energy	USE	Nuclear power
Nuclear power	USE FOR	Nuclear energy

Baby carriages	USE	Perambulators
Perambulators	USE FOR	Baby carriages

Perambulators	USE FOR	Prams
Prams	USE	Perambulators

If we name objects, we want to be as specific as possible. If we have worked hard to discern subtle distinctions in nature, type or style, we certainly want to record these. The point is that the thesaurus is not the place to do this. Detailed description of an object is the job of the catalogue record; the job of the thesaurus, and the index which is built by allocating thesaurus terms to objects, is to provide useful access points by which that record can be retrieved.

USE and USE FOR relationships can also be used to group similar items together, because too much specificity is as bad as too little. If we have a small clothing collection, containing ten jackets, it is more useful to give them all the index term jackets than to create many specific categories. Anyone searching our catalogue will then be able to search on the single term jackets and see a list of the ten items, each with a description of exactly what kind of jacket it is, as follows:

Jackets:

1. Anorak in green cotton, England, 1985.

2. Tweed sports jacket, Hawick, Scotland.

3. Silk bolero with floral embroidery, Spanish, 1930.

If we used all the possible specific names, each of which would have only one or two items in it, such as blazers, dinner jackets, boleros, donkey jackets, anoraks, flying jackets, sports jackets, and so on, enquirers would have to search the catalogue under each name in turn in order to find all the jackets in the collection, and they would never be sure that there was not a kind of jacket that they had overlooked.

To help enquirers who approach the system by one of these terms, we therefore create the references:

Blazers

USE

Jackets

Dinner jackets

USE

Jackets

and so on.

3 Hierarchical relationships

If we have a hundred jackets, a list under a single term will be too long to look through easily, and we should use the more specific terms. In that case, we have to make sure that a user will know what terms there are. We do this by writing a list of them under the general heading, thus:

Jackets
NT	Anoraks
	Blazers
	Boleros
	Dinner jackets
	Donkey jackets
	Flying jackets
	Kagouls
	Sports jackets

We could just invert terms and rely on the alphabet to bring them together, in a list such as

Jackets, dinner
Jackets, donkey
Jackets, flying
Jackets, sports

but this is unreliable and subject to the vagaries of the language, which does not always describe a specific type of item by an adjective preceding the generic name. We have to accommodate types of jacket which have their own distinctive names such as Anoraks or Blazers.

In both the above cases, it is important that the terms which are linked are of the same type. That is to say that any narrower term must be a specific case of the broader term, and able to inherit its characteristics. (The developers of Object Oriented Programming have recently discovered this idea, which has been known to the worlds of information science and biological taxonomy for a very long time.) Thus if we say that Blazers is a narrower term of Jackets, we mean that every blazer is, whatever else it may be, inherently a jacket, and that it has the characteristics which define a jacket.

Mice can properly be said to be a narrower term of Rodents, because all mice are inherently rodents, but it is not correct to list Mice as a narrower term of Pests, because some mice, such as laboratory mice and pet mice, are not pests. The idea is to have relationships in the thesaurus which are always true, irrespective of context. In the same way, it would not be correct to list Buses as a narrower term of Diesel-engined vehicles, although many of them are; if we have a diesel-engined bus in our collection, we should show this by giving it the two terms Buses and Diesel-engined vehicles.

Broader and narrower terms: hierarchical relationships
· Relationships must be independent of context
· Terms must represent the same type of entity
Correct:
	Mice	BT	Rodents
	Rodents	NT	Mice
	Shoes	BT	Footwear
	Footwear	NT	Shoes
Incorrect (depends on context):
	Mice	BT	Pests
	Pests	NT	Mice
Incorrect (different types of entity):
	Shoes	BT	Shoemaking
	Shoemaking	NT	Shoes

Good computer software should allow you to search for "Jackets and all its narrower terms" as a single operation, so that it will not be necessary to type in all the possibilities if you want to do a generic search:

Diagram: expanding a search to include narrower terms
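As a rough sketch of what such a generic search involves, the following Python collects a term together with all of its narrower terms, at any depth, and then retrieves every object indexed by any of them. The table `NT`, the index `INDEX` and the object numbers in it are invented for illustration, as are the function names.

```python
# Hypothetical NT table: term -> list of immediate narrower terms.
NT = {
    "outerwear": ["coats", "jackets", "rainwear"],
    "coats": ["raincoats"],
    "rainwear": ["raincoats"],
    "jackets": ["anoraks", "blazers", "dinner jackets"],
}

# Hypothetical index: object number -> set of indexing terms.
INDEX = {
    "1985.12": {"anoraks"},
    "1930.4": {"blazers"},
    "1960.7": {"raincoats"},
    "1972.3": {"shirts"},
}


def expand(term):
    """Return the term together with all of its narrower terms, at any depth."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:      # a term may be reached by more than one route
            found.add(current)
            stack.extend(NT.get(current, []))
    return found


def generic_search(term):
    """Object numbers indexed by the term or by any of its narrower terms."""
    wanted = expand(term)
    return sorted(obj for obj, terms in INDEX.items() if terms & wanted)


print(generic_search("jackets"))
print(generic_search("outerwear"))
```

Because the expansion is driven entirely by the NT relationships, it gives sensible results only if those relationships are always true, which is the reason for insisting on context-independent hierarchies.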

4 Related terms

If we restrict the hierarchical relationship to true specific/generic relationships, we need another mechanism to draw attention to other terms which an indexer and a searcher should consider. These are RELATED TERMS of the starting term. Related terms may be of several kinds:

1. Objects and the discipline in which they are studied, such as Animals and Zoology.

2. Processes and their products, such as Weaving and Cloth.

3. Tools and the processes in which they are used, such as Paint brushes and Painting.

It is also possible to use the RELATED TERM relationship between terms which are of the same kind, not hierarchically related, but where someone looking for one ought also to consider searching under the other, e.g. Beds RT Bedding; Quilts RT Feathers; Floors RT Floor coverings.

5 Definitions and scope notes

A thesaurus is not a dictionary, and it does not normally contain authoritative definitions of the terms which it lists. It could perfectly well do this, but a lot more work would be required to develop it in this way. In an automated system, however, the thesaurus would be a logical place to record information which is common to all objects to which a term might be applied, for example notes on the history and origin of Anoraks or the identifying characteristics and lifestyle of Mice (or perhaps Mus musculus in a taxonomic thesaurus).

Where there is any doubt about the meaning of a term, or the types of objects which it is to represent, a SCOPE NOTE (SN) is attached to it. For example,

Fruit
SN	distinguish from Fruits as an anatomical term
BT	Foods
Preserves
SN	includes jams
Neonates
SN	covers children up to the age of about 4 weeks; includes premature infants

6 Form of the thesaurus

A list based on these relationships can be arranged in various ways; alphabetical and hierarchical sequences are usually required, and thesaurus software is generally designed to give both forms of output from a single input. A typical simple thesaurus of a few clothing terms is shown in Tables 1 and 2.

Table 1: Sample thesaurus - hierarchical sequence

knitwear
> cardigans
> pullovers
outerwear
> blouses
> cardigans
> coats
> > raincoats
> dresses
> jackets
> > anoraks
> > blazers
> > dinner jackets
> > donkey jackets
> > reefer jackets
> leggings
> pullovers
> rainwear
> > raincoats
> shawls
> shirts
> skirts
> suits
> trousers
> > jeans
> > shorts
> > slacks

Table 2: Sample thesaurus - alphabetical sequence

anoraks
BT	jackets
blazers
BT	jackets
blouses
UF	smocks
BT	outerwear
breeches
USE	trousers
capes
USE	coats
cardigans
SN	knitted jackets with front opening
BT	knitwear
	outerwear
cloaks
USE	coats
coats
UF	capes
	cloaks
	overcoats
BT	outerwear
NT	raincoats
dinner jackets
BT	jackets
donkey jackets
BT	jackets
dresses
UF	frocks
BT	outerwear
duffel jackets
USE	reefer jackets
frocks
USE	dresses
jackets
BT	outerwear
NT	anoraks
	blazers
	dinner jackets
	donkey jackets
	reefer jackets
jeans
BT	trousers
jumpers
USE	pullovers
knitwear
NT	cardigans
	pullovers
leggings
BT	outerwear
outerwear
NT	blouses
	cardigans
	coats
	dresses
	jackets
	leggings
	pullovers
	rainwear
	shawls
	shirts
	skirts
	suits
	trousers
overcoats
USE	coats
pullovers
UF	jumpers
	sweaters
BT	knitwear
	outerwear
raincoats
BT	coats
	rainwear
rainwear
BT	outerwear
NT	raincoats
reefer jackets
UF	duffel jackets
BT	jackets
shawls
UF	wraps (clothing)
BT	outerwear
shirts
BT	outerwear
shorts
BT	trousers
skirts
BT	outerwear
slacks
BT	trousers
smocks
USE	blouses
suits
BT	outerwear
sweaters
USE	pullovers
trousers
UF	breeches
BT	outerwear
NT	jeans
	shorts
	slacks
wraps (clothing)
USE	shawls
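To illustrate how both sequences can be produced from a single input, here is a minimal Python sketch that stores only the BT relationships for a few of the terms in Tables 1 and 2 and derives everything else. The function names are my own invention, the layout merely imitates the tables above, and the sketch assumes that the hierarchy contains no loops.

```python
# Single input: each preferred term with its broader terms (a subset of Table 2).
BT = {
    "knitwear": [],
    "outerwear": [],
    "cardigans": ["knitwear", "outerwear"],
    "coats": ["outerwear"],
    "rainwear": ["outerwear"],
    "raincoats": ["coats", "rainwear"],
}


def narrower(term):
    """Derive NT as the reciprocal of BT, in alphabetical order."""
    return sorted(t for t, broader in BT.items() if term in broader)


def hierarchical(term=None, depth=0):
    """Lines of the hierarchical sequence, with one '> ' per level of indentation."""
    if term is None:
        roots = sorted(t for t, broader in BT.items() if not broader)
        return [line for root in roots for line in hierarchical(root, 0)]
    lines = ["> " * depth + term]
    for child in narrower(term):
        lines.extend(hierarchical(child, depth + 1))
    return lines


def alphabetical():
    """Lines of the alphabetical sequence: each term followed by its BT and NT."""
    lines = []
    for term in sorted(BT):
        lines.append(term)
        for label, values in (("BT", sorted(BT[term])), ("NT", narrower(term))):
            for i, value in enumerate(values):
                # the label is printed once; continuation lines are indented
                lines.append((label if i == 0 else "") + "\t" + value)
    return lines


print("\n".join(hierarchical()))
print()
print("\n".join(alphabetical()))
```

Terms with more than one broader term, such as raincoats, simply appear under each of them in the hierarchical output, as they do in Table 1.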

7 Special factors relating to museum objects

7.1 Abstract terms and disciplines

Many thesauri have been created with the intention of being used to index documentary material, and thus they include many terms which relate to abstract concepts, disciplines and areas of discussion, as well as the names of concrete objects which are of primary interest to museums. We have to be careful to be consistent in how we use these terms. The most straightforward way is to concentrate first on what objects actually are - spades are Spades and should be given this term, rather than the area in which they are used, whether it is gardening or gravedigging.

You may well wish to allocate abstract and discipline terms to objects too, so that you can retrieve all the objects to do with Dentistry, Laundry, Warfare or Food preparation. These terms can also be included in the thesaurus, so long as they are not given hierarchical relationships to names of objects. They should be given RT relationships to an appropriate level of object terms.

Some thesauri, such as ROOT [published by the British Standards Institution in 1981], interfile terms of different types in their hierarchical display. Indentation in such cases does not necessarily indicate a BT/NT relationship. The relationships are shown in ROOT's alphabetical sequence, and it is unfortunate that they are not distinguished in the hierarchical one.

Because these abstract terms do not describe what the object is, they could be put into a field in the catalogue record labelled concept or subject, distinct from the field containing terms which name the object. I do not think that such a distinction will generally be helpful to users, however, and there seems to be no disadvantage in putting both types of term into a single field so that they can easily be searched as alternatives or in combination. Such a field would not be correctly called name and I therefore prefer to call it simply indexing terms or subject indexing terms.

7.2 Singular or plural terms

There has been much discussion on whether thesaurus terms should be expressed in the singular or the plural. I believe that the difficulty arises from different views of what is being done when a term is assigned to an object record. If a cataloguer thinks that (s)he is naming the object in hand, (s)he will naturally use the singular: "This is a clock". If (s)he is assigning the object to a category of similar objects, the thought will be "This belongs in the category of clocks". An enquirer will normally ask for a category, so the latter form will be more natural and logical.

The point is not a trivial one, because as discussed in section 2 above there is a conceptual difference between naming or describing an object and grouping it with others so that it can be found. Both are essential steps, but an information retrieval thesaurus is primarily concerned with grouping.

Singular or plural terms?
The cataloguer thinks: "This is a clock".
The enquirer asks: "What clocks do you have?"
Prefer plural terms because:
· We should design the catalogue to fit the way the user thinks.
· Clocks is the name of a category, including many types, so plural is more logical.

The British Standard for thesaurus construction [which has served as a basis for and has been superseded by ISO 25964] recommends that plural terms should be used, except for a few well-defined cases, and my view is that this practice should be followed. Unfortunately, there are many records in museum collections which have been given singular "object names", and the work of changing these to plurals in a move to a thesaurus structure may be so great as to require some compromise.

7.3 Parts and wholes

The British Standard recommends that when indexing parts or components, separate terms should be assigned for the component and for the object of which it forms part, so that aircraft engines would be indexed by the two terms Aircraft and Engines. This causes problems in a museum collection, however, because items indexed in this way would be retrieved in a search for Aircraft, when only whole aircraft were being sought. It therefore seems preferable to use a term such as Aircraft components. A particular engine may well be an aircraft component, but it is not an aircraft. Similarly a timer from a cooker can be indexed by the terms Timers and Cooker components, and a handle broken from a vase might be indexed as Handles and Vase fragments. There needs to be local agreement on how this approach is to be applied to a particular collection.

In the thesaurus, BT/NT relationships can be used for parts and wholes in only four special cases: parts of the body, places, disciplines and hierarchical social structures.

7.4 Polyhierarchies

As shown in the sample thesaurus above, a term can have several broader terms, if it belongs to several broader categories. The thesaurus is then said to be polyhierarchical. Cardigans, for example, are simultaneously Knitwear and Outerwear, and should be retrieved whenever either of these categories is being searched for.

With a polyhierarchical thesaurus it would take more space to repeat full hierarchies under each of several broader terms in a printed version, but this can be overcome by using references, as ROOT does. There is no difficulty in displaying polyhierarchies in a computerised version of a thesaurus.

8 Use of a thesaurus when cataloguing

A thesaurus is an essential tool which must be at hand when indexing a collection of objects, whether by writing catalogue cards by hand or by entering details directly into a computer. The general principles to be followed are:

1. Consider whether a searcher will be able to retrieve the item by a combination of the terms you allocate.

2. Use as many terms as are needed to provide required access points.

3. If you allocate a specific term, do not also allocate that term's broader terms.

4. Make sure that you include terms to express what the object is, irrespective of what it might have been used for.
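The third principle lends itself to an automatic check. The sketch below, which assumes a hypothetical `BT` table and invented function names, reports any assigned term which is already implied by a more specific term in the same record.

```python
# Hypothetical BT table: term -> list of immediate broader terms.
BT = {
    "anoraks": ["jackets"],
    "jackets": ["outerwear"],
    "raincoats": ["coats", "rainwear"],
    "coats": ["outerwear"],
    "rainwear": ["outerwear"],
}


def ancestors(term):
    """All broader terms of a term, at any level."""
    found = set()
    stack = list(BT.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(BT.get(current, []))
    return found


def redundant_terms(assigned):
    """Assigned terms which are broader terms of another assigned term."""
    assigned = set(assigned)
    redundant = set()
    for term in assigned:
        redundant |= ancestors(term) & assigned
    return redundant


print(sorted(redundant_terms(["anoraks", "jackets", "outerwear"])))
print(sorted(redundant_terms(["anoraks", "raincoats"])))
```

A cataloguing program might use such a check to warn the cataloguer rather than to delete terms silently, since the decision is properly an editorial one.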

If you have a computerised thesaurus, with good software, this can give you a lot of direct help. Ideally it should provide pop-up windows displaying thesaurus terms which the cataloguer can choose from and then "paste" directly into the catalogue record without re-typing. It should be possible to browse around the thesaurus, following its chain of relationships or displaying tree structures, without having to exit the current catalogue record, and non-preferred terms should automatically be replaced by their preferred equivalents. A cataloguer should be able to "force" new terms onto the thesaurus, flagged for review later by the thesaurus editor. When editing thesaurus relationships, reciprocals should be maintained automatically, and it should not be possible to create inconsistent structures.
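As an indication of what maintaining reciprocals and refusing inconsistent structures might mean in practice, here is a small Python sketch. The class `Thesaurus` and its methods are hypothetical, and the only inconsistency it guards against is a loop in the hierarchy; real software would check a good deal more.

```python
class Thesaurus:
    """Minimal sketch: BT/NT links kept reciprocal, loops refused."""

    def __init__(self):
        self.bt = {}   # term -> set of broader terms
        self.nt = {}   # term -> set of narrower terms

    def _all_broader(self, term):
        """Every term above `term` in the hierarchy, at any level."""
        found, stack = set(), [term]
        while stack:
            for parent in self.bt.get(stack.pop(), ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    def add_bt(self, term, broader):
        """Record `term BT broader` and its reciprocal `broader NT term`."""
        if term == broader or term in self._all_broader(broader):
            raise ValueError(
                f"{broader!r} cannot be a broader term of {term!r}: "
                "this would create a loop"
            )
        self.bt.setdefault(term, set()).add(broader)
        self.nt.setdefault(broader, set()).add(term)

    def remove_bt(self, term, broader):
        """Remove a BT link and its reciprocal together."""
        self.bt.get(term, set()).discard(broader)
        self.nt.get(broader, set()).discard(term)


t = Thesaurus()
t.add_bt("mice", "rodents")
t.add_bt("rodents", "mammals")
print(t.nt["rodents"])          # the reciprocal was created automatically
try:
    t.add_bt("mammals", "mice") # would make mice their own broader term
except ValueError as error:
    print(error)
```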

9 Use and modification of existing thesauri

As there are many thesauri in existence already, it is worth considering seriously whether one of these can be used before embarking on the job of creating a new one for a particular museum or collection. So long as the general principles are followed, you should be able to expand a thesaurus to give you more detail if you need it, or truncate some sections at a high level if they contain more detail than your collections justify. So long as the relationships are universally true, it should be possible to combine sections of thesauri developed by different museums and thus avoid duplication of work.

Even when using an authoritative thesaurus, some care is needed, and I have mentioned some limitations of ROOT and AAT in 7.1 and 7.4 above. It is still much easier to base your work on something like these than to build your own from scratch, unless you have a very specialised collection.

10 Thesaurus maintenance

Someone has to be responsible for this. New terms can be suggested, and temporarily "forced" into the thesaurus by cataloguers as they catalogue objects, but someone has to review these terms regularly and either accept them and build them into the thesaurus structure, or else decide that they are not appropriate for use as indexing terms. In that case they should generally be retained as non-preferred terms with USE references to the preferred terms, so that people who seek them will not be frustrated. An encouraging thought is that once the initial work of setting up the thesaurus has been done, the number of new terms to be assessed each week should decrease, and many systems have operated successfully in the past with printed thesauri, which are quite difficult to keep up to date.
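The review cycle described above might be modelled along the following lines. This is a simplified sketch with hypothetical names (`Vocabulary`, `force`, `accept`, `reject`); it records only whether a term is preferred, non-preferred or awaiting review, and leaves out the work of placing an accepted term in the hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Vocabulary:
    preferred: set = field(default_factory=set)
    use: dict = field(default_factory=dict)        # non-preferred -> preferred
    candidates: set = field(default_factory=set)   # forced terms awaiting review

    def force(self, term):
        """A cataloguer adds a new term provisionally, flagged for review."""
        if term not in self.preferred and term not in self.use:
            self.candidates.add(term)

    def accept(self, term):
        """The editor accepts a candidate as an indexing term."""
        self.candidates.remove(term)
        self.preferred.add(term)

    def reject(self, term, use_instead):
        """The editor keeps a candidate as an entry term with a USE reference."""
        if use_instead not in self.preferred:
            raise ValueError(f"{use_instead!r} is not a preferred term")
        self.candidates.remove(term)
        self.use[term] = use_instead


vocabulary = Vocabulary(preferred={"jackets", "pullovers"})
vocabulary.force("kagouls")
vocabulary.force("jerseys")
vocabulary.accept("kagouls")
vocabulary.reject("jerseys", "pullovers")
print(sorted(vocabulary.preferred), vocabulary.use, vocabulary.candidates)
```

Note that a rejected term is not thrown away: it remains in the vocabulary as an entry point leading to the preferred term.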

11 What sort of fields is a thesaurus appropriate for?

A thesaurus is not a panacea which will meet all subject retrieval needs. It is particularly appropriate for fields which have a hierarchical structure, such as names of objects, subjects, places, materials and disciplines, and it might also be used for styles and periods. A thesaurus proper would not normally be used for names of people and organisations, but a similar tool, called an authority file, is usually used for these. The difference is that while an authority file has preferred and non-preferred relationships, it does not have hierarchies.

[Authority files and thesauri are two examples of a generalised data structure which can allow the indication of any type of relationship between two entries, and modern computer software should allow different types of relationship to be included if needed.]
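One way to picture this generalised structure is as a store of typed links, in which each type of relationship has a named reciprocal and each file permits only certain types. The Python below is a sketch under my own assumptions: the class `RelationStore`, the table `RECIPROCAL` and the choice of example entries are illustrative and do not describe any particular software.

```python
# Each relationship type and the type of its reciprocal.
RECIPROCAL = {"USE": "UF", "UF": "USE", "BT": "NT", "NT": "BT", "RT": "RT"}


class RelationStore:
    """Entries linked by typed relationships; only `allowed` types are accepted."""

    def __init__(self, allowed):
        self.allowed = set(allowed)
        self.links = set()   # (source, relationship, target)

    def add(self, source, relationship, target):
        if relationship not in self.allowed:
            raise ValueError(f"{relationship} is not used in this file")
        self.links.add((source, relationship, target))
        self.links.add((target, RECIPROCAL[relationship], source))

    def targets(self, source, relationship):
        return sorted(t for s, r, t in self.links if s == source and r == relationship)


# A thesaurus allows all the relationship types; an authority file only USE/UF.
thesaurus = RelationStore(["USE", "UF", "BT", "NT", "RT"])
authority_file = RelationStore(["USE", "UF"])

thesaurus.add("Mice", "BT", "Rodents")
thesaurus.add("Beds", "RT", "Bedding")
authority_file.add("Museum Documentation Association", "USE", "Collections Trust")

print(thesaurus.targets("Rodents", "NT"))
print(thesaurus.targets("Bedding", "RT"))
print(authority_file.targets("Collections Trust", "UF"))
```

Adding a further type of relationship then amounts to adding an entry to the table of reciprocals and allowing it in the files which need it.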

12 Other subject retrieval techniques

A thesaurus is an essential component for reliable information retrieval, but it can usefully be complemented by two other types of subject retrieval mechanism.

12.1 Classification schemes

While a thesaurus inherently contains a classification of terms in its hierarchical relationships, it is intended for specific retrieval, and it is often useful to have another way of grouping objects. This may relate to administrative distribution of responsibility for "collections" within a museum, or to subdivisions of these collections into groups which depend on local emphasis. It is also often necessary to be able to print a list of objects arranged by subject in a way which differs from the alphabetical order of thesaurus terms. Each subject group may be expressed as a compound phrase, and given a classification number or code to make sorting possible.

12.2 Free text

It is highly desirable to be able to search for specific words or phrases which occur in object descriptions. These may identify individual items by unique words such as trade names which do not occur often enough to justify inclusion in the thesaurus. A computer system may "invert" some or all fields of the record, i.e. make all the words in them available for searching through a free-text index, or it may be possible to scan records by reading them sequentially while looking for particular words. The latter process is fairly slow, but is a useful way of refining a search once an initial group has been selected by using thesaurus terms.
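The two approaches can be contrasted in a short Python sketch. The record numbers, the dictionary `RECORDS` and the function names are hypothetical; the descriptions are those of the jackets listed in section 2.

```python
import re
from collections import defaultdict

# Hypothetical records: record number -> free-text description.
RECORDS = {
    1: "Anorak in green cotton, England, 1985.",
    2: "Tweed sports jacket, Hawick, Scotland.",
    3: "Silk bolero with floral embroidery, Spanish, 1930.",
}


def words(text):
    """Split a description into lower-case words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(records):
    """Inverted file: each word -> the set of records in which it occurs."""
    index = defaultdict(set)
    for number, text in records.items():
        for word in words(text):
            index[word].add(number)
    return index


def scan(records, candidates, phrase):
    """Sequential scan of an already selected group of records for a phrase."""
    phrase = phrase.lower()
    return sorted(n for n in candidates if phrase in records[n].lower())


index = build_inverted_index(RECORDS)
print(sorted(index.get("tweed", set())))            # direct look-up, no reading of records
print(scan(RECORDS, [2, 3], "floral embroidery"))   # slower, but can match whole phrases
```

The inverted index answers single-word queries without reading any record, while the scan reads every candidate in turn, which is why it is best kept for refining a group that has already been selected.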
