Wednesday 9 May 2007

Definitive

Subject Define

My day job mostly concerns handling data. One of the things that always strikes me as odd is that the definitions of that data are either very bad or completely missing. It is simply assumed that everyone who should know about the data already knows about it. It is this assumption that leads to misunderstandings and misinterpretations of data across various departments. When headquarters needs to reconcile the data, it runs the risk of comparing apples with oranges. This is where good and up-to-date descriptions of that data, or meta data, are useful. It is also where I seem to spend most of my time. Here are some guidelines that are the result of roughly a decade of data modelling and writing definitions.

The first rule is that a definition should be a complete and valid sentence. Just as your teacher told you in school when he explained how to write up answers for a test, this makes it easier to read. So don't do the following when defining Incoming Mail:
something you get sent.
But please consider writing it more like:
Incoming Mail is Mail you get sent via the postal services.
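The first rule can be checked roughly by machine. The sketch below is only a heuristic and not a grammar check: the function name and the two tests it applies (the definition starts with the term being defined and ends with a full stop) are assumptions made for this illustration.

```python
def looks_like_full_sentence(term: str, definition: str) -> bool:
    """Rough check: the definition names its term and ends with a full stop."""
    text = definition.strip()
    starts_with_term = text.startswith(term + " ")
    ends_properly = text.endswith(".")
    return starts_with_term and ends_properly


bad = "something you get sent."
good = "Incoming Mail is Mail you get sent via the postal services."

print(looks_like_full_sentence("Incoming Mail", bad))   # prints False
print(looks_like_full_sentence("Incoming Mail", good))  # prints True
```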
The components of meta data obviously vary with its usage, but for a normal business person, the following components should be described:
  1. Subject Areas
  2. Descriptors
  3. Measures
  4. Dimensions with their Value Lists
  5. Relationships with other Subject Areas

1. Subject Areas

The data modelling world uses the term Entity and the object modelling world uses Class or Object, but in essence it is a specific subject in the area of the business about which there is a need to administer things. If something is good to know, but there is no business need to administer it, then there is no need to describe its meta data.

A Subject Area should have a definition, because then everyone who has to deliver data for it knows what to deliver. The definition acts as a filter for the Subject Area: every piece of data that fits the definition should become part of that Subject Area; all others should be rejected.
Think of the out tray on your desk for all mail that has been processed but not yet archived. The mail out tray can be seen as a Subject Area and its definition would then be something like:
Out Tray Mail is Mail that is processed and that should be archived.
If you now receive a mail and you want to put it in the out tray, then you could run it by the definition of the Out Tray Mail and check if it matches. It probably doesn't, because you still need to process it. So you put it somewhere else. Once you've processed it, you make a mark on it to that effect and then it passes the definition of the Out Tray Mail. So now it can be part of Out Tray Mail.

This contrived example shows the importance of a good definition on a Subject Area level. But how do you define a Subject Area? It is probably easiest if you think of sets and subsets. In the above example there is a set of things that should be in the Out Tray. It's called Out Tray Mail. And it is most likely that there is a set of things that are unprocessed Mails. Let us call it In Tray Mail. Both of these sets are part of the larger set that we can call Mail.
To write a good definition for a Subject Area, it is best to start by referring to a well-known term that is a more generic concept of the Subject Area. Then all you need to do is write the conditions under which the well-known term is the Subject Area. The well-known term can be seen as a set of data of which the Subject Area is a subset. In the above example, the well-known term is Mail and the Subject Area is Out Tray Mail. The condition that makes a Mail an Out Tray Mail is the fact that it has been processed and is ready for archiving. That is what narrows the set of all Mails down to Out Tray Mails, and that is what has been put in the definition.

To recap:
Subject Area is a <more generic concept> that <conditions that make it Subject Area>.
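The set-and-subset idea can be sketched in a few lines of Python. This is only an illustration, not part of any real system: the Mail class, its processed and archived fields, and the function names are all assumptions made up for the example, as is the reading of "should be archived" as "not archived yet". The definition becomes a condition that filters the more generic set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mail:
    """The more generic concept; the fields are hypothetical."""
    subject: str
    processed: bool = False
    archived: bool = False


def is_out_tray_mail(mail: Mail) -> bool:
    """Out Tray Mail is Mail that is processed and that should be archived."""
    return mail.processed and not mail.archived


def out_tray(all_mail: List[Mail]) -> List[Mail]:
    """Keep only the Mail that fits the definition; reject the rest."""
    return [m for m in all_mail if is_out_tray_mail(m)]


mails = [
    Mail("Invoice", processed=True),
    Mail("Complaint"),  # still needs processing
    Mail("Old contract", processed=True, archived=True),
]
print([m.subject for m in out_tray(mails)])  # prints ['Invoice']
```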

2. Descriptors

Descriptors are things that describe the Subject Area. It's as simple as that. But how to go on describing or defining the descriptors? As with the Subject Area definition, you could refer to a more generic concept. The problem here is that there are only a few concepts that fit the bill. And the following list comes to mind:
  • Name,
  • Text,
  • Date,
  • Time,
  • Timestamp and
  • Number
but that's basically it. So we need something more to make the definition more descriptive. There is only one reason that you want a descriptor data item on a subject area and that is because it tells you something about that subject area. In fact the descriptor has a specific role to fill for the subject area, so it should be part of the definition.

To elaborate on our Out Tray Mail example: a typical descriptor would be the timestamp at which the Mail was processed and put into the out tray. This descriptor simply describes the date and time at which the Mail landed in the out tray. So a definition could be:
Processed Timestamp is a Timestamp that records the date and time at which the Mail was processed.
This refers to a generic concept (Timestamp) and relates it to the subject area by explaining the role the data item plays for the subject area.

So the definition of a descriptor could be formulated thus:
A descriptor is a <[Name|Text|Date|Time|Timestamp|Number]> <role it plays for the Subject Area>.
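The descriptor template can be captured in a small structure that refuses unknown generic concepts and renders the definition sentence. This is a minimal sketch; the class name DescriptorDefinition and its fields are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass

# The generic concepts listed above for descriptors.
GENERIC_CONCEPTS = {"Name", "Text", "Date", "Time", "Timestamp", "Number"}


@dataclass(frozen=True)
class DescriptorDefinition:
    name: str     # the data item, e.g. "Processed Timestamp"
    concept: str  # must be one of GENERIC_CONCEPTS
    role: str     # the role it plays for the Subject Area

    def __post_init__(self):
        if self.concept not in GENERIC_CONCEPTS:
            raise ValueError(f"{self.concept!r} is not a known generic concept")

    def sentence(self) -> str:
        """Render the definition as one complete sentence."""
        return f"{self.name} is a {self.concept} {self.role}."


processed = DescriptorDefinition(
    name="Processed Timestamp",
    concept="Timestamp",
    role="that records the date and time at which the Mail was processed",
)
print(processed.sentence())
```

Keeping the role as a separate field makes it hard to write a definition without stating how the data item relates to its Subject Area.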

3. Measures

Measure definitions are structurally more or less the same as descriptor definitions. Only the generic concepts differ. Off the top of my head I can think of the following list:
  • Amount, for which you should also note its unit of measure,
  • Ratio,
  • Percentage and
  • Promilage
which should basically cover it.
The rest of the format of the definition is the same: describe what is measured of the Subject Area. So, for example, if you wanted to describe the price of a product you would get something like:
Stock Price is an Amount denoted in US Dollars that denotes the price of the Product when it is in stock.
This refers to the generic concept Amount and its unit of measure, US Dollars. The subject area here is called Product. It is also possible that the unit of measure is stored in another data item to which you can refer in your definition:
Stock Price is an Amount in the currency of Stock Price Currency that denotes the price of the Product when it is in stock.
Here the unit of measure is held in another data item, presumably in the same Subject Area.
So the definition of a Measure could be described thus:
A <measure> is a <[Amount with Unit of Measure|Ratio|Percentage|Promilage]> <role it plays for the Subject Area>.
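The two ways of stating the unit of measure, fixed in the definition or held in another data item, can be sketched as follows. The Product class, its field names, and the "USD" default are assumptions for illustration only.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Product:
    name: str
    stock_price: Decimal                        # an Amount
    stock_price_currency: Optional[str] = None  # unit held in another data item


def describe_stock_price(product: Product, default_unit: str = "USD") -> str:
    """Render the Amount together with its unit of measure.

    Falls back to the fixed unit when the record holds no currency of its own.
    """
    unit = product.stock_price_currency or default_unit
    return f"{product.stock_price} {unit}"


print(describe_stock_price(Product("Widget", Decimal("19.99"))))         # prints 19.99 USD
print(describe_stock_price(Product("Gadget", Decimal("15.00"), "EUR")))  # prints 15.00 EUR
```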

4. Dimensions with their Value Lists

This covers those data items whose value range is expressed by a list. This can be a simple yes / no list or a more complicated list. The definition should state the condition that makes you choose one of the items in that list. Please make sure that the values in the list are mutually exclusive so that it is always clear what option to choose. Also, if the dimension is mandatory, please make sure that the values are collectively exhaustive.
An example dimension for the Incoming Mail could be Mail Processed Indicator:
Mail Processed Indicator distinguishes Incoming Mails between those that have been processed and those that have not been processed. Possible domain values are:
  • Unprocessed Mail
  • Processed Mail
This is a simple indicator example. For larger sets this would still hold:
Mail Processing Urgency Type distinguishes Incoming Mails on how urgent they need to be processed. Possible domain values are:
  • Before Yesterday
  • As Soon As Possible
  • Within a week
  • Within a month
  • No urgency whatsoever
Here the last value had to be put in to make sure all possibilities are covered.
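Whether a value list is mutually exclusive and collectively exhaustive can be tested once each value has an explicit condition. In the sketch below the conditions, the days_left input, and all thresholds are invented for the example; the source only names the values. The check passes when every sample input matches exactly one value.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical conditions: days_left is the number of days until the Mail
# must be handled, or None when there is no deadline at all.
CONDITIONS: Dict[str, Callable[[Optional[int]], bool]] = {
    "Before Yesterday": lambda d: d is not None and d < 0,
    "As Soon As Possible": lambda d: d is not None and 0 <= d <= 1,
    "Within a week": lambda d: d is not None and 1 < d <= 7,
    "Within a month": lambda d: d is not None and 7 < d <= 31,
    "No urgency whatsoever": lambda d: d is None or d > 31,
}


def matching_values(days_left: Optional[int]) -> List[str]:
    """Return every domain value whose condition holds for this input."""
    return [name for name, cond in CONDITIONS.items() if cond(days_left)]


def classify(days_left: Optional[int]) -> str:
    """Return the single matching value, or fail if the list is not clean."""
    matches = matching_values(days_left)
    if len(matches) != 1:
        raise ValueError(f"{days_left!r} matches {len(matches)} values: {matches}")
    return matches[0]


# Exactly one match per sample: no overlaps (exclusive), no gaps (exhaustive).
samples = [None] + list(range(-10, 100))
print(all(len(matching_values(d)) == 1 for d in samples))  # prints True
print(classify(3))  # prints Within a week
```

Removing the last entry from CONDITIONS makes the check fail for inputs with no deadline, which is the gap the catch-all value is there to close.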

5. Relationships with other Subject Areas

To link subject areas together, you simply put in the name or ID of the subject area you link to. Defining such a link is very similar to the way Dimensions are defined. The definition should explain why to choose one specific entry of the referred Subject Area over another. It should of course also explain the role this link fulfills for this Subject Area.

Let's say that Mail is always from a Customer [1]; then the definition would run something like:
Customer is the Customer who signed the Mail.
This definition tells us that the data item Customer is a link to the Subject Area Customer and that the role it plays for the Mail is that it identifies the customer who signed the Mail. This is also how you would go about finding the Customer in your filing system: you'd look for the signature on the mail and then look that up in your Customer file. If you feel that this definition doesn't tell you anything about who or what a Customer is, then you're right. This is part of the definition of the Subject Area describing Customer. There is no need to repeat that information here.

The general form of this type of definition is:
<Data Item> <[is|identifies]> a <Subject Area> <how to select the right instance of the referred Subject Area> <role it plays for its Subject Area>.
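The lookup described for the Customer link, reading the signature and finding it in the Customer file, could look like this in code. Everything here is a made-up illustration: the classes, the CUSTOMER_FILE dictionary keyed by signature, and the sample names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Customer:
    customer_id: int
    name: str


@dataclass
class Mail:
    subject: str
    signature: str
    customer_id: Optional[int] = None  # the link to the Customer Subject Area


# Hypothetical Customer file, indexed by the signature found on the Mail.
CUSTOMER_FILE: Dict[str, Customer] = {
    "J. Smith": Customer(1, "J. Smith"),
    "A. Jones": Customer(2, "A. Jones"),
}


def link_customer(mail: Mail) -> Mail:
    """Customer is the Customer who signed the Mail."""
    customer = CUSTOMER_FILE.get(mail.signature)
    if customer is None:
        raise LookupError(f"No Customer found for signature {mail.signature!r}")
    mail.customer_id = customer.customer_id
    return mail


print(link_customer(Mail("Order enquiry", "A. Jones")).customer_id)  # prints 2
```

The definition tells you which instance to select (the one who signed), and the code does no more than follow it.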


Conclusions

For the definitions of data items it is important that you are able to put the relationship to the Subject Area in the definition. If you cannot put it in, then that data item does not belong to the Subject Area. This is exactly what happened with the definition of the Processed Timestamp above. As you've noticed, it refers to Mail and not to Out Tray Mail, because at the time of processing the mail is not yet an Out Tray Mail. It is only at the moment when the processing is done that the Mail becomes an Out Tray Mail. This is a bit of a borderline case, but writing the definition made me question where the data item should be.

Of course you could go all wild and analytical on it by applying all kinds of relational calculus to the data item to prove that it is in first, second, third and Boyce-Codd normal form, but looking at the definition, and specifically at the relationship the data item has with the subject area, will create a 3NF-compliant data model for you.



1 Of course this is a contrived example. Most likely Mail will come from customers and suppliers, but for simplicity's sake, we'll pretend that it's only about Customers.