Task: Specify Data Migration
This task describes how to migrate a legacy data source to a target database.
Disciplines: Analysis & Design
Purpose
  • The purpose of this task is to specify the data migration by defining its scope, its data profile, and the mapping between the data sources and the target database.
Steps
Define Migration Scope
Purpose: To define the scope of the migration

The first step in planning a migration of data to a new database is to define the scope of the migration. Specific elements of this include:

• Identifying the sources of the data to be migrated and their location.
• Identifying the systems that use the current data sources and the systems that will be using the new database.
• Identifying data not currently held electronically and how it will be entered into the new system.
• Determining whether the source data is to be retired after migration. If the source data is to continue being used, then maintaining the data in both its original source and in the new database can become an issue.

In addition, you should identify the relevant subset of the source data. Data migration projects do not happen independently. Rather, they are spawned from other development efforts, such as projects to evolve or replace a legacy system. The volume of historical data that needs to be migrated is derived from the needs of these development efforts, because it is usually not valuable to migrate all the data of the existing system. The following considerations can help determine that scope:

• Historical data from the old system may not be convertible into the new system's format. For example, it may no longer have the same meaning because of data structure changes.
• Historical data from prior years may only need to be kept at a summary level (see the sketch after this list).
• There may be mandatory requirements, such as tax regulations requiring detailed historical information to be retained for a specific period of time, though not necessarily converted.
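Where history is kept only at a summary level, the summarization can be folded into the migration itself. The following is a minimal Python sketch, assuming hypothetical SQLite files legacy.db and target.db, a hypothetical orders table, and an invented 2020-01-01 cutoff; none of these names come from this task:

    import sqlite3

    # Sketch: carry pre-cutoff history into the target at a yearly summary
    # level only. All table, column, and file names here are hypothetical.
    src = sqlite3.connect("legacy.db")
    tgt = sqlite3.connect("target.db")

    tgt.execute("""CREATE TABLE IF NOT EXISTS order_history_summary (
                       year INTEGER, customer_id INTEGER, total REAL)""")

    rows = src.execute("""
        SELECT CAST(strftime('%Y', order_date) AS INTEGER), customer_id, SUM(amount)
        FROM orders
        WHERE order_date < '2020-01-01'   -- detailed rows before the cutoff
        GROUP BY 1, 2""")
    tgt.executemany("INSERT INTO order_history_summary VALUES (?, ?, ?)", rows)
    tgt.commit()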

Ideally, the completion of data migration should immediately precede the start of production running of the new system. However, when the data is very volatile or very large, or when the validation of the migration results is a lengthy process, amendments to the migrated data may be necessary. This is especially true where manual data has to be collected, verified, and then entered into the automated system; in these circumstances, data capture may take many months.

In such cases, you need to plan and define the maintenance procedures to keep the migrated data up-to-date in any interim period between the completion of the data migration and the initiation of production running. These procedures need to strike a balance between economy (they will not normally be in use for very long) and accuracy (they must not introduce errors into the converted data). They should also incorporate controls and audit trails as applicable.

Understand Data Sources Through Data Profiling
Purpose: To establish the Data Profile of the different data sources

Now that you have identified the different data sources, you can start to analyze them to establish their Data Profile. The Data Profile is a collection of information on data content, structure, and quality; it is recorded in the Data Migration Specification.

The detailed steps for establishing the Data Profile are as follows:

Collect Information

The first step in data profiling is to gather the metadata describing the data sources. This may include source programs, dictionary or repository descriptions, relational catalog information, previous project documentation, and anything else available that could shed light on the meaning of the data. If the system using the data was developed using the RUP, you can use the Data Model, the Use-Cases, and the Use-Case Realizations as sources to understand how data is used by the system. Interviewing the original developers, if they are available, or the database administrator who manages the data can also be useful.

However, documentation (other than information that is automatically maintained as part of the system or is reverse engineered) should be considered suspect. It was valid at some point but typically decays in correctness over time. Legacy systems are often thinly documented at their creation, and documentation often doesn't keep up with the changes made along the way. Even if it is not current, existing metadata is often the only information that is available about data sources and data semantics. The profiling process will expose the dissonance between metadata and the real data and fill in the most important parts of the missing information.
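For relational sources, much of this metadata can be read directly from the catalog. The following is a minimal Python sketch, assuming a hypothetical SQLite file legacy.db; for other DBMSs the same idea applies through catalog views such as information_schema:

    import sqlite3

    # Sketch: list every table and its column definitions from the catalog.
    conn = sqlite3.connect("legacy.db")
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        print(f"-- {table}")
        # PRAGMA table_info returns (cid, name, type, notnull, default, pk)
        for cid, name, ctype, notnull, default, pk in conn.execute(
                f"PRAGMA table_info({table})"):
            print(f"   {name}: {ctype}, nullable={not notnull}, pk={bool(pk)}")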

Analyze Data Sources

The second step in data profiling is to develop a map of the data sources. This map shows how data fields are stored and establishes rules for dealing with redefines and repeating groups of data within the data structures.

If the data source is relational, the map can be extracted directly from the database schema. Because these structures are enforced by the DBMS, there is no need to question their validity.

If the data source is anything other than relational, you need to use the metadata in conjunction with the data to derive a normalized form of the data. Pay particular attention to "overloaded" attributes: "overloading" is the practice of storing multiple facts in the same attribute.
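As an illustration, the following minimal Python sketch unpacks a hypothetical overloaded attribute in which a single legacy field carries both a region code and a sequence number. The layout is invented; the real unpacking rule has to come from the metadata and from inspecting the data:

    # Sketch: split one overloaded attribute into its two underlying facts.
    # The fixed layout (2-character region, then digits) is hypothetical.
    def split_account_code(raw: str) -> tuple[str, int]:
        region = raw[:2]          # e.g. "NE"
        sequence = int(raw[2:])   # e.g. 10457
        return region, sequence

    print(split_account_code("NE10457"))  # -> ('NE', 10457)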

When this profiling step is completed, you can perform a sample or a full extraction of the data sources, in normalized form, to go further in the data profiling process. Usually, this extraction is done with the extraction scripts of the migration components, because it is also a good way to test them.

Profile Data Source Attributes

The third step in data profiling is to determine the content, domain, and quality of the data in each attribute, and to establish the semantics behind each attribute. It is important to perform this operation with the actual source data, as the documented metadata may be incorrect.

This operation gives you the opportunity to identify:

• Attributes documented for one use but used for another
• Attributes documented but not used
• Inconsistencies between the data content of an attribute and its semantic meaning
• Attribute cardinality, which exposes dead attributes (those containing just one value); see the sketch after this list
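The following is a minimal Python sketch of such attribute profiling, assuming a hypothetical customers table in a SQLite file legacy.db. It counts the distinct values and NULLs of each attribute against the actual data and flags dead attributes:

    import sqlite3
    from collections import Counter

    # Sketch: profile each attribute of one source table.
    conn = sqlite3.connect("legacy.db")
    cursor = conn.execute("SELECT * FROM customers")
    columns = [d[0] for d in cursor.description]
    counters = {col: Counter() for col in columns}
    for row in cursor:
        for col, value in zip(columns, row):
            counters[col][value] += 1

    for col, counts in counters.items():
        distinct, nulls = len(counts), counts.get(None, 0)
        print(f"{col}: {distinct} distinct value(s), {nulls} NULL(s)")
        if distinct == 1:  # dead attribute: only one value ever stored
            print(f"   -> dead attribute, single value {next(iter(counts))!r}")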

Legacy and even relational systems commonly use "de-normalization" and data duplication in an attempt to improve performance. They also frequently lack primary and foreign key support. This means that you must analyze the source tables to identify the functional dependencies between attributes and to find primary and foreign key candidates.
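Candidate keys can be tested directly against the data. The following is a minimal Python sketch, again with hypothetical table, column, and file names:

    import sqlite3

    conn = sqlite3.connect("legacy.db")

    # A primary-key candidate must be unique and never NULL.
    def is_key_candidate(table: str, column: str) -> bool:
        total, distinct, nulls = conn.execute(
            f"SELECT COUNT(*), COUNT(DISTINCT {column}), "
            f"       SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
            f"FROM {table}").fetchone()
        return nulls == 0 and distinct == total

    # A foreign-key candidate should have no orphans against the parent table.
    def orphan_count(child: str, fk: str, parent: str, pk: str) -> int:
        return conn.execute(
            f"SELECT COUNT(*) FROM {child} "
            f"WHERE {fk} IS NOT NULL "
            f"  AND {fk} NOT IN (SELECT {pk} FROM {parent})").fetchone()[0]

    print(is_key_candidate("orders", "order_id"))
    print(orphan_count("orders", "customer_id", "customers", "customer_id"))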

When the attribute profiling is finished, it needs to be reviewed at two different levels. The first level is to decide whether or not to migrate the attribute. You may choose not to migrate an attribute if it contains no useful information or if the data is of such poor quality that it cannot be migrated without corrupting the target. The second level is to determine whether the attribute needs to be scrubbed during the migration.

When quality problems have been discovered during the profiling, you need to perform some data cleansing: removing or amending data that is incorrect, duplicated, improperly formatted, or incomplete. This operation is usually called data scrubbing.
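Scrubbing rules tend to be small and mechanical once profiling has identified them. The following is a minimal Python sketch; the rules shown (trimming names, stripping phone-number formatting, defaulting a blank country) are hypothetical examples, not rules prescribed by this task:

    import re

    # Sketch: one scrubbing pass over a single record.
    def scrub_customer(record: dict) -> dict:
        out = dict(record)
        out["name"] = out["name"].strip()
        out["phone"] = re.sub(r"[^\d+]", "", out["phone"])  # keep digits and '+'
        if not out.get("country"):
            out["country"] = "US"  # documented default, not a silent guess
        return out

    print(scrub_customer(
        {"name": "  Ada Lovelace ", "phone": "(555) 010-4477", "country": ""}))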

Define the Mapping Between Data Sources and Target Database
Purpose: To describe the mapping between the source database elements and the corresponding target database elements

The two main inputs to this step are the Data Profile elaborated in the previous step and the Physical Data Model of the target database.

A popular misconception about data mapping is that it can be performed against logical data models. However, since the source data is in a physical format and must be migrated into the new physical data model, it is more practical to map between these physical forms than to deal with indirect mappings via a logical data model.

Important considerations to address (both illustrated in the sketch after this list) are:

• If there are multiple sources for the same target attribute, how conflicts will be resolved.
• If there is no source for an attribute, how the data will be provided.
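One way to make these decisions explicit is to record the mapping itself as data. The following is a minimal Python sketch with invented source and target names; in practice this information belongs in the Data Mapping Table of the Data Migration Specification:

    # Sketch: the mapping as data, covering the multiple-source and
    # no-source cases called out above. All names are hypothetical.
    MAPPING = [
        # target column    source attribute(s)               rule
        ("customer_name",  ["crm.cust_name"],                "copy"),
        ("email",          ["crm.email", "billing.email"],   "prefer first non-null"),
        ("created_on",     [],                               "default: migration date"),
    ]

    for target, sources, rule in MAPPING:
        print(f"{target:15} <- {sources or '(no source)'} [{rule}]")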

To successfully perform this step, the Database Designer needs to work with a Business Designer who has intimate knowledge of the data to be migrated and with a System Analyst who knows both the source and target systems.

The mapping should be recorded in the Data Migration Specification.

Identify Manual and Automated Data Migration
Purpose: To identify which parts of the data sources can be automatically migrated and which parts should be migrated manually

When you have elaborated the data profiles and the data mapping of the various data sources, you should identify which parts of the data migration can be performed automatically and which parts need to be migrated through manual procedures.

The Guideline: Designing Data Migration Subsystems provides guidance on how to design components that perform automated data migration.

Whenever possible, entry of manually converted data should be performed using the standard data entry processes of the new system. Occasionally, specially built conversion processes are unavoidable. A typical example is the bulk batch entry of data that would normally be entered interactively, but where volumes are too high to make interactive entry practical.
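The following is a minimal Python sketch of such a bulk batch load, assuming a hypothetical CSV file of manually captured records and a hypothetical suppliers table; in practice the load should reuse the new system's own validation rules rather than bypass them:

    import csv
    import sqlite3

    # Sketch: bulk entry of records that would normally be keyed in one by one.
    conn = sqlite3.connect("target.db")
    conn.execute("CREATE TABLE IF NOT EXISTS suppliers (code TEXT, name TEXT)")

    with open("manually_captured_suppliers.csv", newline="") as f:
        rows = [(r["code"], r["name"]) for r in csv.DictReader(f)]

    conn.executemany("INSERT INTO suppliers (code, name) VALUES (?, ?)", rows)
    conn.commit()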

The manually migrated data can be identified in the Data Mapping Table contained in the Data Migration Specification.