June 15, 2002 Database SIG Meeting Notes

June 15, 2002 - Database S.I.G. Meeting Notes

By John Ashmead

Summary of PACS Database SIG Meeting of 6/15/2002:

This is from my notes scribbled during the meeting plus jumbled recollections thereafter plus stuff that I should have mentioned so will now pretend that I did. Please feel free to note additions/corrections & so on. Wayne suggested we keep the group going as email or whatnot over the summer; Linda (back from South Dakota) has suggested setting up a Forum to do this. So please feel particularly free to respond to this & we will find some way to repost to all. (We == Linda mostly).

We started on Chapter Nine of Hernandez Database Design for Mere Mortals, "Field Specifications".

We discussed why you want to specify what fields you are using and spent some time on character & numeric fields. We'll probably have to pick up at the start of the chapter come the next meeting in September. We also covered a bit of XML. Four copies of ".Net Developer" were raffled off. Several of the raffle winners admitted, sometimes after an inspection of the manual, that in all candor they were not going to get to this Microsoftus Gigantosaurus & passed so at the end of the day we had only one more person than we had manuals. We applauded EO & Wendell for their candor in this matter! And thanks to Linda for supplying these!

Details of points made, questions, & so forth:

1.) Why do you want to specify the fields you are doing to use as exactly as possible?

Because if you don't, your database will turn around & haunt you for the rest of yours & its days. The more time you spend up front designing the less time you spend later debugging.

2.) Why is it nice if the database supplies user defined types?

User defined types are types such as "phone" that you can make up. You might set "phone" to be "10 digit character field" for instance. Or set "age" to be "integer going from 0 to 200". Then whenever any field takes a phone or an age, you set it to the "user defined type" rather than spelling it out.

This makes it much easier to avoid inconsistencies and much, much easier to change the types, i.e. when you find you have to go from 5 digit zip to 9 digit zip codes.

Unfortunately, only higher end databases, i.e. Informix or Oracle, supply this. For the lower end databases you just have to do it the same way everywhere in the database. Which is where having explicit field specs can be useful...

(Sometimes it is helpful to actually create a table or two defining the types themselves so you can print out a database report listing them. Of course you need enough types for this to be worthwhile, perhaps 20 or 30 is enough.)

3.) What is the tape measure rule?

It is a good idea to limit the range of variables to be only things that make sense. I.e. age should run from 0 to 130.

I found with medical databases real data tends to spread out a bit more than expected so have the "tape measure rule": give them a 1/2 inch to grow. Don't make the clothes so they fit so tightly you can't have a heavy meal.

If they say the number of items can only go to 10, make it 15 or 20. Most of the data entry errors in practice are either a) really bad (extra zeros) or b) too subtle to catch by range checks anyway. So reasonably generous range checks are reasonably effective in catching errors while not making you change the range every week.

4.) Question from the floor: if I have an effective date for a record sometime in the future how do I log it?

Lots of answers from divers hands.

a) don't put it in until it is time

b) make sure all your queries/sorts check for the effective date & don't include future dates.

c) make a new record every time you change the effective date; then make sure your sorts always take the most recent record (for that person or whatever) younger than tomorrow.

5.) What humorous problem can putting a date in the future cause?

A patient billing system of mine was set to send a bill every time anything happened on the account: insurance paid, patient paid, 30 days had passed without one of those first two events, and so on. Whether it saw a payment without a followup bill "covering" the payment it sent a bill. (The billing manager wanted to make sure patients always knew where their account was.)

One week it sent (or tried to, it was stopped by a human) a patient 5 bills. Turned out the system was acting "correctly". The patient's previous payment had been accidentally set to 10 years to the future. Every day the system said, have I send this lass a bill since the last time I got a payment. No. OK, I'll send another.

I changed the data entry form to not let dates be in the future.

6.) What is a BLOB?

A BLOB is a Binary Large OBject: basically any chunk of data like a map, an audio file (i.e. MP3) or a video file.

Informix came up with something called SLOBS (Smart Large OBjects). Then marketing found out what the engineers were calling them & made them stop. A shame.

7.) What do you need to support BLOBs?

You need a way to put them into the database (a put method) and a way to get them back (a get method) and a way to destroy them (a delete method). You need an indexing method to make lookups fast enough.

It is nice to have equality and comparison methods as well. For instance, you might want to have a "west of" or "south of" comparison test for a mapping database. Or you might want a "looks like" or "sounds like" methods for a video or an audio field.

I remember a few years ago telling a photographic database to look for sheep like stuff & it found lots of sheep stuff, except it also found a photo of a bowl of oranges. Because they were round? I don't know.

8.) How does XML relate to relational databases?

XML is tree structured: everything is part of one enormous tree. XML files have to be 1) parsed into their trees and then 2) broken out into parts suitable for different tables to get them to fit into a relational database.

9.) Do people like trees or tables better?

They like to look at trees, but tables are easier to eat off.

More seriously, as Barry pointed out, people understand tree structures more intuitively, so if you can put something into a tree structure it will be easier for people to understand. But the tables used in relational databases can be taken apart & put together in an infinite number of ways, making them far more flexible.

You can think of incoming XML as large complicated food molecules and the process of breaking the trees down into their component tables as like the process of digesting complex carbohydrates and turning them into simple sugars. And we'll stop with this metaphor right now.

10.) Why are zip codes annoying?

Zip codes can be 5 or 9 digits, but if you try to sort them numerically then the 9 and 5 digit numbers sort the "wrong" way, with all the 5 digit coming before all the 9 digit. Some web sites think you have erred if you use a 9 digit zip code (I have automatic form fill turned on so my 9 digit zip gets added automatically.)

Further, if you need Canadian codes, they are partly alphabetic and are six characters long. And in Switzerland the codes aren't even the last thing in the address, they come earlier.

11.) What are serial id's?

Serial id's are usually sequential numbers handed out by the database & guaranteed by it to be unique. Usually they are indexed. You can leave gaps in them. You can go 1, 2, 3, 7, 9, 10, 1001, and so on. And it doesn't make sense to add, subtract, multiply, divide or do anything else to these numbers.

12.) How do you reassure a frugal office manager that we aren't "wasting numbers"?

When I was handing out patient numbers in a medical database I would sometimes have a case where the new patient's record was never finished (not enough data or whatever) or a patient was deleted. Then a patient sequence number would be "wasted". The commendably frugal office manager became concerned we might run out of patient numbers if we kept wasting them.

For her peace of mind (& mine) I computed when we would run out. Since the patient sequence numbers could go up to 2 billion & we had only 90 thousand patients, I worked out it was sometime after the year 3000. (I worked out the exact year, day, & minute too: no point doing calculations halfway, but I do not remember them.)

In honor of this, we have agreed in database SIG that anything that happens in year 3000 or later is "not a problem". (Then come year 3000 they will be waking us up from cold sleep & going "boy did you foul up!")

13.) Why are Dewey Decimal Numbers not "really" numbers?

Because you can't add them: you can't add the Dewey Decimal Number for Science Fiction to the one for Mystery and get the DD# for Horror.

I've never heard of a database that lets you turn off addition & subtraction for a numeric field by the way, which may be a reason to use characters to model Dewey Decimal Numbers and stuff like them.

14.) What is the difference between a null string, an empty string, and a string containing a blank?

This is a very tricky question. Null is usually used to mean "I don't know"; an empty string means "I know but there is nothing there." For instance, I have no middle initial. If there is a field for a middle initial in a form I'm filling out it should start out with a "null" value for my middle initial and then change to an empty value when I fill it in with nothing. This is a bit too subtle for most databases to get right.

In the C language there is a clear difference on the inside. When you first create a string it has no storage associated with it and an address of zero. It is a null string. Then you give it a location, say location one million. Now it is not a null string, but it has nothing there. It is not null, it is empty. Then you put a blank at location one million. Now it is not null and it is not empty, since there is a blank, which is a real character, in it. But to a human being it will *look* empty. Then you replace the blank with a letter. Now it is not null, not empty, and a human looking at it will be able to see that there is something there.

Hope this helps. I used another example as well, but that example was so lame I am not going to repeat it here.

15.) What is the difference between a middle initial and a one letter middle name?

When I was a production manager at a publishing house we had a doctor writing for a journal whose middle name was the single letter "J". He got extremely bent out of shape -- practically J-shaped -- if we printed his name as "First J. Last": it was to be "First J Last". But the software couldn't cope. It saw a single letter in the middle name position & insisted that had to be an abbreviation. Finally we just delegated an editor to remove the offending period with an Exacto knife. There is always a way! (But sometimes the way isn't pretty.)

16.) How best to handle middle initials?

As usual, "it depends". You can include them as part of the first name, give them a separate field, give them several separate fields (in case they have several middle names: which happens a lot), leave the middle name out, ....

Probably the most common solution is to give a single middle name field but leave enough space in it for multiple middle names if that happens. Just put a blank between the multiples, i.e. "Jamieson", "J. J.", "O'Malloy".

17.) When do you include secret sort fields?

When you are really desperate. Sometimes the only way to make records sort correctly is to create a field that is built to have the "right" sort order. Don't show the field on the screen; just build it every time you save the record.

18.) How big can a number be?

Someone mentioned one database where you can set to any size: i.e. 100000 digits. This may waste space.

Mathematica, mathematics software, will let you use "infinite precision": numbers can be set to grow to be as long as they need to be (or as RAM permits).

IN SUM:

OK, we didn't really get far into the chapter. But we did cover a lot of material and had a good time doing it. Next time, Linda will be again heading the SIG up.

PACS Main Page . . . Database SIG Main Page . . . Curriculum 2002