I have always relied on realistic test data every time I develop my databases. My analyst hat and being an application designer/developer make things easy to improve my way of gathering a set of realistic data for use in validating my understanding of certain things that I do in connection with application development.
Having a good set of realistic data always helped greatly to the success of the applications I designed and developed. It allows me to have a better understanding of the business aspect of the realm that is the subject of any software project I have had. However, I realized that it is not fairly easy to come up with a good chunk of realistic test data for application development use, and probably this is also true to others out there. Probably for one, we are not the intrinsic experts on most fields that we deal with when we design and develop applications.
One, can’t just make up a set of realistic data randomly and instantly at that. You need to understand the vast array of reasons, rules, components, variables, constraints, linkages, and circumstances to confidently fill out blanks. You need full understanding and time to understand it and develop it. And you need to produce it with as much variations as possible so refactoring in the future is reduced to a minimum. Though this is no easy task, it is achievable to a certain degree.
Two, you need to be very familiar with the business aspect of the territory you are about to deal with. I say it is difficult as most of us are not the natural experts in the fields we are assigned to but are given assignments to understand decades old, highly specialized and fully refined business processes and we have to come up with something within an unreasonable amount of time and usually with a deadline. For example, to build an application for hospital management, one needs to fully absorb how things go through in hospitals. I don’t have any idea how operations are in hospitals, so when someone ask me if I can build a system for one, I easily say that I still have to figure that out.
So in this article, I want to share my workflow on how I develop or acquire test data for use in any of my application development efforts.
Knowing The Business
I don’t know how to over emphasize that this is the hardest part. Unless we are part of the business, things can go a lot harder than we think. I always allot plenty of time discovering the business. You can immerse yourself, know the lingo, mingle with a systems analyst who happens to know the business, hire a tutor (consultant), or work with someone pretty much knowledgeable to get you up to speed. Get as much materials and persons to fill you in. Always remember, no one knows everything, not even an insider. And there are a lot of ways to look at something. Document and record conversations if allowed. I am assuming though that we all have the capability to absorb all these.
In most of the efforts I had, I’d say for 20-30% or bit more, this phase occupies most of our effort, time and resources. When designing database, this is where I get to know my core tables and what shape and form they will evolve eventually. But this is when the first instances of test data gets identified and created.
In my current area of expertise (I currently am hooked on developing solutions for the academic sector), knowing the business where I specialize has already taken me more than a decade and still I have yet to fully cover the entire array of entities, functions, circumstances and issues one can find in a school environment. And there is still so much to learn while things are constantly evolving in how schools operate. Their cycle of existence is just as dynamic if not more as other types of organizations. It gets better as you deal not only with one school but dozens.
Once you get into a certain comfort level where you can virtually say something about the business you need to go up a notch.
Simulate and Accumulate
Try to go over the business various cycles and processes. This time, simulate and accumulate data while you go at it. The amount of varying data you get depends on how good you are in simulating the business. In a hospital management system, how does the business start? What goes through after you start? When and how do things end? For example, business starts the moment a patient walks in the hospital. The hospital gets the name of the patient and some pertinent information like telephone number, address, medical status, etc. When he is ready to get some medical attention, the patient then gets attended to by doctors. Then the hospital needs to know who attended to the patient, what were their findings, what were their recommendations? Will the patient be confined? Where? Then what happens after, when is the patient sent home? When will he settle his bills? Does the hospital business ends when the patient is sent home?
While you go over the processes, you get indicators of possible sources of data to help you fully understand the business, in this particular example, hospital management (see italics in the previous paragraphs), fill them in with actual data:
telephone number: 999-9999
address: #1, malacanang street, cebu, philippines
medical status: psychologically unstable
doctor who attended to the patient: dr. arrovo, md
findings: medical status confirmed to be true
recommendations: for treatment
place of confinement: psychiatric ward
when is the patient sent home: June 30 2016
Speculate
After you get the initial chunk of data, challenge everything you got. Ask some more. Have it looked by others. In most likelihood, the data chunk you have will change shape after this exercise.
- Do I need only one telephone number? Do I need to know if the number is permanent or temporary?
- Do I need to get only one address?
- Do I only need to keep track of the doctors who took care of the patient? How about nurses? How about non-medical personnel? Do I need to know what each of them did?
Why do we need to speculate? Why can’t we just follow what is stated in the customer’s requirements sheet. You see, in most cases, most customers don’t know what they want. They only realize they need something else other than what they initially thought after we are done with everything. In most cases, we already have deployed the final version of the application. And speculation work reduces the need to refactor so soon.
Of course, we don’t have to provide what is not asked of us. But it is very important that we also cover our bases. We don’t want surprises after we have delivered. What is good about knowing further is that you shield yourself from unwanted surprises near or at the end. Refactoring so soon means you have failed to some degree. And the usual recourse out there is that development teams go for the agreed requirements sheet to shield them from additional customer’s request. And when the customer refuses to provide new funding for these requests, the applications we have developed suffer. In the end, no one wins. We might have delivered what is asked of us, but will our applications be as effective?
If we fail to fully ask possible questions, it might make things very difficult for us to adjust easily. For example, if we stopped asking or failed to ask if we only need to keep track of doctors, our database and application design would be different than if we had asked questions like the ones presented above. We could have allotted just a single field for the doctor’s name only to realize later that we need more.
The art of speculating isn’t easy though. It takes a while to develop your keen sense of the unknown but once you have it, you can’t get rid of it.
Re-Use
Another source of realistic data that might be available for us is data from old production systems. In case we get lucky that this one is available, it might one of the best sources of data that you can get out there. However, do take note that the reason why we are in here is to replace the old system. This means that it is possible that the data you will be getting from this source might be lacking of key pieces of information that will be essential to our efforts in building a new system. Worst, the data you get is messed up. It is very important that you have the ability to know when the data you use has captured most of the things you need to aid you in coming up with better designs.
So how do we use existing data from the system we are about to replace? My preferred approach is to still know the business, do some initial simulation and speculation so I can produce an initial structure of my database. Then the data build up part would come from essentially migrating old data to the new structure. This would almost require one to write a data migration program to accomplish the process. However, in my opinion, this is one of the most rewarding luxuries if we can get an excellent set of old production data to aid us in coming up with a new system. You are in for a rude awakening if it happens that what you have is a messy, problematic set of old data (I’d probably refuse to consider it as ‘old production data’). You might as well throw this thing and start from scratch.
By undergoing through this exercise, you gain the following experience:
- You would see how old data fits in your new database structure.
- You will encounter fields in your new structure not being filled with data from the old. Then you can probably attempt to fill them based on your interpretation of the new system.
- You will also encounter data from the old that doesn’t seem to fit in the new. This happens when you failed to consider and include the same information in the new structure or you have misrepresented it in the new structure.
It would be an awesome feeling when you have finally mapped the old with the new structure. Having a realistic volume of a realistic set of data for use in application development is surely a very welcome treat and would drastically speed up a lot of things.
Generate
Today, there are equally effective tools that can help one produce a set of realistic data, and produce it in any volume we desire. You may also opt to write the data generator on your own. The hardest and crucial part is coming up with a data generation plan that can produce a realistic set of data. Of course, you still need to understand the business and produce the initial database structure. Then you figure out how to generate data by coming up with a plan. The actual generation part will be a no-brainer. You just let the tool does its job and voila, you have instant data.
So this is all about it folks. If you have a different workflow, you might want to share so I can improve mine further. 🙂