Purpose:
To establish the Data Profile of the different data sources
Now that you have identified the different data sources, you can start to analyze them to establish their Data Profile.
The Data Profile is a collection of information on the data content, structure, and quality that will be stored in the
Data Migration Specification.
The detailed steps for establishing the Data Profile are as follows:
The first step in data profiling is to gather the metadata describing the data sources. This may include source
programs, dictionary or repository descriptions, relational catalog information, previous project documentation, and
anything else available that could shed light on the meaning of the data. If the system using the data was developed
using the RUP, you can use the Data Model, the Use-Cases,
and the Use-Case Realizations as sources to understand how data is used by
the system. Interviewing the original developers, if they are available, or the database administrator who manages the
data can also be useful.
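Where the source is a relational database, the catalog information mentioned above can be gathered programmatically. The sketch below is only an illustration: it assumes a PostgreSQL source reachable through the psycopg2 driver, and the database name, user, and host are hypothetical.

    import psycopg2  # assumed driver; any DB-API-compliant driver works similarly

    # Hypothetical connection details for a legacy relational source.
    conn = psycopg2.connect(dbname="legacy_db", user="profiler", host="localhost")

    # Pull basic catalog metadata: every column, its type, and its nullability.
    with conn.cursor() as cur:
        cur.execute("""
            SELECT table_name, column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_schema = 'public'
            ORDER BY table_name, ordinal_position
        """)
        for table, column, data_type, nullable in cur.fetchall():
            print(f"{table}.{column}: {data_type} (nullable={nullable})")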
However, documentation (other than information that is automatically maintained as part of the system or is reverse
engineered) should be considered suspect. It was valid at some point but typically decays in correctness over time.
Legacy systems are often thinly documented at their creation, and documentation often doesn't keep up with the changes
made along the way. Even if it is not current, existing metadata is often the only information that is available about
data sources and data semantics. The profiling process will expose the dissonance between metadata and the real data
and fill in the most important parts of the missing information.
The second step in data profiling is to develop a map of the data sources. This map shows how data fields are stored
and establishes rules for dealing with redefines and repeating groups of data within the data structures.
If the data source is relational, the map can be extracted directly from the database schema. Because these structures
are enforced by the DBMS, there is no need to question their validity.
If the data source is anything other than relational, you need to use the metadata in conjunction with the data to
derive a normalized form of the data. You especially need to pay attention to "overloaded" attributes. "Overloading" is
the practice of storing multiple facts in the same attribute.
When this profiling step is completed, you can perform a sample or full extraction of the data sources, in normalized
form, to take the data profiling process further. This extraction is usually done with the extraction scripts of the
migration components, because it is also a good way to test them.
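As an illustration of the mapping rules, the sketch below splits a hypothetical overloaded attribute (a CUST-REF field assumed to pack a two-character region code and an account number into one value) and flattens an equally hypothetical repeating group of phone numbers into normalized rows during extraction.

    # Minimal sketch of normalizing one legacy record during extraction.
    # The record layout (CUST-REF and the PHONES repeating group) is hypothetical.
    def normalize_record(raw: dict) -> tuple[dict, list[dict]]:
        # Split the overloaded CUST-REF attribute into its two separate facts.
        cust_ref = raw["CUST-REF"]
        customer = {
            "region_code": cust_ref[:2],
            "account_number": cust_ref[2:],
            "name": raw["NAME"].strip(),
        }
        # Flatten the repeating group into one row per occurrence.
        phones = [
            {"account_number": customer["account_number"], "phone": p}
            for p in raw["PHONES"]
            if p  # skip unused occurrences
        ]
        return customer, phones

    customer, phones = normalize_record(
        {"CUST-REF": "EU001234", "NAME": "Smith  ", "PHONES": ["555-0100", "", ""]}
    )
    print(customer)
    print(phones)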
The third step in data profiling is to determine the content, domain, and quality of the data in each attribute and to
establish the semantics behind each attribute. It is important to perform this operation with the actual source data,
as the documented metadata may be incorrect.
This operation gives you the opportunity to identify:
- Attributes documented for one use but used for another
- Attributes documented but not used
- Inconsistencies between the data content of an attribute and its semantic meaning
- Attribute cardinality, which reveals dead attributes (those containing just one value); a profiling sketch follows this list
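A minimal profiling pass over the extracted data can surface several of the points above. The sketch below assumes the normalized extract is available as a pandas DataFrame and reports, for each attribute, the null rate, the number of distinct values, and whether the attribute is dead; the sample table and its values are invented for the example.

    import pandas as pd

    def profile_attributes(df: pd.DataFrame) -> pd.DataFrame:
        # Build one row of profiling results per source attribute.
        rows = []
        for col in df.columns:
            distinct = df[col].nunique(dropna=True)
            rows.append({
                "attribute": col,
                "null_rate": df[col].isna().mean(),
                "distinct_values": distinct,
                "dead": distinct <= 1,  # one value at most: candidate for exclusion
                "sample_values": df[col].dropna().unique()[:5].tolist(),
            })
        return pd.DataFrame(rows)

    # Hypothetical extract of a legacy customer table.
    extract = pd.DataFrame({
        "account_number": ["001234", "001235", "001236"],
        "status": ["A", "A", "A"],                          # dead attribute
        "birth_date": ["1970-02-30", None, "1985-06-12"],   # invalid and missing values
    })
    print(profile_attributes(extract))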
Legacy and even relational systems commonly use "de-normalization" and data duplication in an attempt to improve
performance. They also frequently lack primary and foreign key support. That means you must analyze the source tables
to identify the functional dependencies between attributes and to find primary and foreign key candidates.
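One way to look for key candidates and functional dependencies is to test them directly against the extracted data. The sketch below, again assuming a pandas DataFrame with invented sample values, checks whether a column combination is populated and unique enough to serve as a key, and whether one attribute functionally determines another (a sign of de-normalization).

    import pandas as pd

    def is_candidate_key(df: pd.DataFrame, cols: list[str]) -> bool:
        # A candidate key must be fully populated and unique across all rows.
        subset = df[cols]
        return not subset.isna().any().any() and not subset.duplicated().any()

    def determines(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
        # lhs -> rhs holds if every lhs value maps to exactly one rhs value.
        return bool((df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all())

    # Hypothetical extract with a de-normalized region name.
    extract = pd.DataFrame({
        "account_number": ["001234", "001235", "001236"],
        "region_code": ["EU", "EU", "US"],
        "region_name": ["Europe", "Europe", "United States"],
    })
    print(is_candidate_key(extract, ["account_number"]))       # primary key candidate
    print(determines(extract, "region_code", "region_name"))   # region_name is redundant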
When the attribute profiling is finished, the results need to be reviewed at two levels. The first level is to decide
whether or not to migrate the attribute. You may choose not to migrate an attribute if it contains no useful
information or if the data is of such poor quality that it cannot be migrated without corrupting the target. The second
level is to determine whether the attribute needs to be scrubbed during the migration.
When quality problems are discovered during profiling, you need to perform some data cleansing: removing or
amending data that is incorrect, duplicated, improperly formatted, or incomplete. This operation is usually called data
scrubbing.
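A simple scrubbing pass typically combines these corrections. The sketch below, using another invented pandas extract, drops exact duplicates, normalizes formatting, coerces invalid dates, and flags incomplete rows for review instead of silently discarding them.

    import pandas as pd

    def scrub(df: pd.DataFrame) -> pd.DataFrame:
        cleaned = df.copy()
        # Remove exact duplicate rows carried over from the legacy source.
        cleaned = cleaned.drop_duplicates()
        # Amend improperly formatted values: trim whitespace, normalize case.
        cleaned["name"] = cleaned["name"].str.strip().str.title()
        # Coerce dates; invalid values become NaT rather than corrupting the target.
        cleaned["birth_date"] = pd.to_datetime(cleaned["birth_date"], errors="coerce")
        # Flag incomplete records for review rather than dropping them silently.
        cleaned["needs_review"] = cleaned.isna().any(axis=1)
        return cleaned

    extract = pd.DataFrame({
        "name": ["  smith ", "  smith ", "JONES"],
        "birth_date": ["1970-02-30", "1970-02-30", "1985-06-12"],
    })
    print(scrub(extract))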