Data Migration

The sting in the tail of any development project

Probably one of the most overlooked aspects of any IT project is data migration. You have designed and built a new and improved business system, citing improvements in efficiency, reporting, reliability and so forth. The new data model is a work of art, this is something the development team can be proud of. All you need to do now is get the data from the legacy system or systems, and a myriad well loved and maintained spreadsheets that came into being by frustrated end users creating them to overcome constraints in the old system.

IT is one of the few industries where we work with intangibles. While a proof of concept can convey ideas, screen layouts and wireframes can give stakeholders an idea of how something is shaping up, just how do you communicate the effort required to migrate the data. Why is it that the data migration piece is left until the end of the project, and then the delivery date starts to slip and slip while the enormity of the task starts to dawn on the poor encumbent burdened with the task of migrating the data.

 

If you are reading this from an informed stand point, where these pitfalls do not befall your projects I would encourage you to read on as you might pick up some gem, or new approach you had not considered. Failing that, you might be able to post constructive comments to add value to the article. If, however you are reading this while being in the situation depicted earlier, then despair not. Help is at hand and I would hope that you could draw from this article the ammunition you need to pitch a reasoned argument to the stakeholders and confidently and knowledgably take control of the situation. Should you be at the planning phase of a new development, or IT systems merge and have not given the data migration piece the respect it deserves then this is your Damascus moment.

Right up front you should have decided whether the data migration is worth taking on. If you are replacing a system with a comparatively small number of records, say from a CRM database, the complexity of the new data model may make the migration prohibitively expensive, and it may be cheaper to hire some data entry clerks and hand tap the data into the new system via the new GUI prior to "go live".

Also, you should consider how the legacy systems will be replaced. Is this a big bang, or are you parallel running? In the case of big bang, how can you revert to using the old system without data loss should the need arise? Is this a viable option? In the case of parallel running, which is essentially one big bang migration followed by frequent loads of "changed" data, known as delta loads (delta meaning change). What triggers these delta loads? Is it a batch process, or does it happen in real time? There are of course pros and cons to each approach, and this is a decision that only you and your organisation can make.

Firstly let us state the objectives of the data migration. Usually it goes something like this :
 

"To migrate relevant data from the legacy data stores where possible, and to create exception reports in the event of records which failed to migrate. To provide metrics on the success and failure rates of the migration attempt"

and in the event of parallel running

"To migrate the data from the legacy data stores at a point in time, and then make subsequent delta loads"

Usually the big bang approach always wins, as the immediate costs of the extra coding and testing effort to successfully implement the parallel running is deemed to high. I say usually. There might be some logical business reason where parallel running is imperative, such as in a public sector or banking organisation where data corruption would have legal consequences.

So the next question is, or at least should be, what percentage of correctly migrated data constitues an acceptable level? By making compromises here, timescales can be shortened. This is your greatest tool so use it wisely. If 90% can be delivered at a relatively low cost, then how much more effort is it to automate the remaining 10%. Is it worth it? Back to the data entry clerk scenario again.

Care should be taken to quantify the amount of effort from the customer required to meet these deadlines. After all, it is their data and they should take responsibility for it. Resource will need to be provided to clean existing records, monitor progress with the migration and sample migrated records to check for inconsistencies. Steps can and should be taken to begin the cleansing process way before any development work gets underway. Typical sources of bad data would include :

  • Duplicate records
  • Bad data, eg incorrectly formed addresses, bank account numbers etc.
  • Conflicting data
  • Implied data. Ie Special characters only known to the end users are added to fields to give special meaning to a record, usually to cope with business needs that have arisen since the system was implemented.

While some degree of automation can be used to cleanse this bad data, it is again to be within agreed confidence levels taking into account the risk of getting it wrong. More importantly it requires a high degree of human input from the customer, the level of which will diminish over time as the quality of the data is improved. Software is available for doing partial address matching, bank account number verification and so forth. The value of this software is not short lived as it can be used in the new system to ensure the data is validated at point of entry. Let's face it, if the data structures were designed correctly and proper validation in place in the legacy system it would be a much easier task, and the lifespan of the system more than likely increased.

Knowing that the quality of the data can be improved by partially automated processes and human interaction, surely the sooner this task is started the better for all concerned. Imagine if data analysis was performed as part of the project analysis phase, and work had been going on in background to clean up the data while the development work was being carried out, how much simpler the migration task would be. This is why the migration piece should never be left to the last minute. Any argument that you need to know where you are migrating to before you can even start looking at this is erroneous. To coin the phrase "garbage in, garbage out". If the legacy data is of a poor quality, then in the worst case the new system will not accept it, or if it does it will exist in the new system and be just as bad so what value have you really added to the business?

So what are the stages of a data migration?

The industry recognised acronym for this type of work is ETL (Extract, Transform, Load). So the data is extracted from the legacy system or systems, is transformed to fit the new system and is then loaded into the new system. I think this acronym doesnt fit the model in this case. It should be :

ECCL

  • Extract
  • Cleanse
  • Conform
  • Load

Before we do any migrating we would want to build in some meta data. This meta data will describe the migration, the migration steps, outcomes and metrics which will be very called upon for reporting to the business at a later stage.

The Migration Meta Data

The Migration attempt meta data

Extract

The first stage is where an extract of the data is taken from the legacy system, and given a date and time stamp. This is the data we will attempt to migrate in this iteration. Any migration effort should be able to go through multiple iterations to cater for power outages and delta loads for example. On the next iteration, a new extract is performed and given a new date and time stamp. This step is important. It is the first time in the process when the data becomes your problem. What would happen if the new system had incorrect data with potentially costly implications, and yet in the legacy system the data was correct? Could the data have been corrected in the legacy system AFTER the migration? You get the point. This stage should have 100 percent success as it is literally a bulk copy of all the data in the legacy system, with the key constraints removed but the key data included. Typically the extract tables have the same name as the legacy database and table, prefixed with EXT to denote it is an extract table. Eg EXT_DBCRM_Customers represents an extract table called Customers from the DBCRM database. It should have the same structure as the legacy table, but with all constraints removed. Addtionally a date and time stamp should be added to the end of the table to denote when the extract occured.

At this point it would be useful to point out that some meta data would be useful, particularly as we are dealing with multiple data stores, and multiple iterations. In particular we want to record the event, any useful metric information and of course date and time information.

Cleanse

As mentioned before, ideally the data should be cleansed at source to keep the number of records appearing in the exceptions reports to a minimum. That does not mean however that this data is to be trusted, far from it. The data will still need to go through strict validation rules before it proceeds to the next stage, where it is trusted enought to be conformed for loading into the new system.

This validation is the heaviest part of the migration process, and you will find that for each migration run you do (there will be many as you are developing and tuning the migration code) that machine time has a significant part to play. Often it is the case that the migration test server is an elderly machine nearing end of life, whereas this is really a false economy. There will innevitably be many migration "dress rehearsals" before the big performance, so cost cutting here really is a false economy, while the whole BI team twiddle their thumbs waiting for the end result of the most recent migration attempt. Also it may well be worth putting the effort into producing a "cut down" version of the legacy data store for the first phase of development. This obviously gives you a fairly blinkered view of any data quality issues, and any migration strategy that works for the same hundred or so rows each time will get a nasty surpise when the whole data store is pushed through. The legacy data may be in such a state of disarray that producing a cut down version would be cost prohibitive in itself, and a nice fast server with a few terrabytes of disk space would be the cheaper option. I said terrabytes because for each iteration, you are taking a snapshot of the legacy data, loading it into validation (cleansing) and conforming stages so you can monitor the effects of code changes across multiple iterations. This all eats space, and lots of it so there is no point in scrimping here.

The extracting, cleansing, confirming processes will require to run scripts to do the work. The order these scripts are run in will be of importance and also  the outcome of each script. This takes us back to our meta data requirements. There needs to be some meta data to descripe the scripts, record the outcomes and govern the sequence in which the scripts are executed.

Conform

So now our data is cleansed, and trusted we are ready to conform the data. This is possibly the most complicated part of the process. As part of the analysis phase of the migration you should have conducted a gap analysis the results of which will be invaluable at this point. The structure of the data is currently still as per the old schema, and it is in this phase that we start to think of the new structure and start to prepare our data for loading into the new system.

The steps to be conducted in this phase include :

  • Conforming structure ie splitting records out into child tables where necessary
  • Converting key values ie using the List of values
  • Deriving data to fill gaps

List of values

This is one of the most useful tools in any data migration. Every look up table in the old system is mapped to the look up tables in the new system. This should alert you to the fact that look up data should be loaded into the new system before the migration takes place. This is the starting point database and should be prepared along with the list of values before the migration takes place. The list of values should live in the staging database and is used to conform key lookup data as the data passes through this stage.

Taking the customer details scenario a step further to illustrate the point, lets say that each customer record (which also holds the address information) has a look up value, to denote the type of address. The values that can be stored are :

Legacy address types 

The NEW_KEY field denotes the new value in the new database. For information the new Address_Type table could look like

So in the new database the address type is numeric, and not an alpha character. Also there is no address type for "Home". Using the list of values table for all lookups in this way greatly simplifies the effort to conform the data and prevents lots on unneccessary hard coding in the conform scripts. Obviously any records extracted from the legacy system that contain values that are not in the list of values are sent to the exceptions report and this can then be used in the next iteration to improved the success rate of the migration.

Load

Once the data has been through cleansing and conforming it is ready to be loaded into the new system. At this point in the proceedings there are mainly three things to consider.

  • The order in which to load the data
  • Surrogate keys
  • Exceptions

The order the data is loaded will be governed by the procedure meta data, so this is not of concern. It should be noted though that the order in which the procedures will run must be given thought, and a look at the new schema should tell you all you need to know so I will not labour the point here.

Surrogate keys

For every data item that lives in the old schema, there should be a key value. In this case it will be the id of the customer record. There will be a customer record created in the new system and this in turn will have its own id, probably not the same one. So we need a way of tying the two keys together, and this is where surrogate keys come in.

Consider the following table structure

This tells us that the customer id 57 in the old system is now represented by the key value 10 in the new system. The surrogate key 1 represents this relationship.

By representing the relationship between the legacy and new records in this way it allows us to tie back to the old system and answer (or defend yourself against) the inevitable question of "This customer was right in the old system, it can't have migrated across properly". You can see what data was extracted, how it was cleansed, conformed and loaded and the hard facts speak for themselves. 

 

User login

Password lost?