Purpose:
To establish the Data Profile of the different data sources
Now that you have identified the different data sources, you can start to analyze them to establish their Data Profile.
The Data Profile is a collection of information on the data content, structure, and quality that will be stored in the
Data Migration Specification.
The detailed steps for establishing the Data Profile are as follows:
The first step in data profiling is to gather the metadata describing the data sources. This may include source
programs, dictionary or repository descriptions, relational catalog information, previous project documentation, and
anything else available that could shed light on the meaning of the data. If the system using the data was developed
using the RUP, you can use the Data Model, the Use-Cases,
and the Use-Case Realizations as sources to understand how data is used by
the system. Interviewing the original developers, if they are available, or the database administrator who manages the
data can also be useful.
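Where the source is a relational database, the catalog information mentioned above can be gathered programmatically. The sketch below is only an illustration: it assumes a PostgreSQL source reachable through the psycopg2 driver, and the database name, user, and host are hypothetical.

    import psycopg2  # assumed driver; any DB-API-compliant driver works similarly

    # Hypothetical connection details for a legacy relational source.
    conn = psycopg2.connect(dbname="legacy_db", user="profiler", host="localhost")

    # Pull basic catalog metadata: every column, its type, and its nullability.
    with conn.cursor() as cur:
        cur.execute("""
            SELECT table_name, column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_schema = 'public'
            ORDER BY table_name, ordinal_position
        """)
        for table, column, data_type, nullable in cur.fetchall():
            print(f"{table}.{column}: {data_type} (nullable={nullable})")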
However, documentation (other than information that is automatically maintained as part of the system or is reverse
engineered) should be considered suspect. It was valid at some point but typically decays in correctness over time.
Legacy systems are often thinly documented at their creation, and documentation often doesn't keep up with the changes
made along the way. Even if it is not current, existing metadata is often the only information that is available about
data sources and data semantics. The profiling process will expose the dissonance between metadata and the real data
and fill in the most important parts of the missing information.
The second step in data profiling is to develop a map of the data sources. This map shows how data fields are stored
and establishes rules for dealing with redefines and repeating groups of data within the data structures.
If the data source is relational, the map can be extracted directly from the database schema. Because these structures
are enforced by the DBMS, there is no need to question their validity.
If the data source is anything other than relational, you need to use the metadata in conjunction with the data to
derive a normalized form of the data. You especially need to pay attention to "overloaded" attributes. "Overloading" is
the practice of storing multiple facts in the same attribute.
When this profiling step is completed, you can perform a sample or full extraction of the data sources, in normalized
form, to take the data profiling process further. This extraction is usually done with the extraction scripts of the
migration components, because it is also a good way to test them.
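As an illustration of the mapping rules, the sketch below splits a hypothetical overloaded attribute (a CUST-REF field assumed to pack a two-character region code and an account number into one value) and flattens an equally hypothetical repeating group of phone numbers into normalized rows during extraction.

    # Minimal sketch of normalizing one legacy record during extraction.
    # The record layout (CUST-REF and the PHONES repeating group) is hypothetical.
    def normalize_record(raw: dict) -> tuple[dict, list[dict]]:
        # Split the overloaded CUST-REF attribute into its two separate facts.
        cust_ref = raw["CUST-REF"]
        customer = {
            "region_code": cust_ref[:2],
            "account_number": cust_ref[2:],
            "name": raw["NAME"].strip(),
        }
        # Flatten the repeating group into one row per occurrence.
        phones = [
            {"account_number": customer["account_number"], "phone": p}
            for p in raw["PHONES"]
            if p  # skip unused occurrences
        ]
        return customer, phones

    customer, phones = normalize_record(
        {"CUST-REF": "EU001234", "NAME": "Smith  ", "PHONES": ["555-0100", "", ""]}
    )
    print(customer)
    print(phones)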
The third step in data profiling is to determine the content, domain, and quality of the data in each attribute and to
establish the semantics behind each attribute. It is important to perform this operation with the actual source data,
as the documented metadata may be incorrect.
This operation gives you the opportunity to identify:
- Attributes documented for one use but used for another
- Attributes documented but not used
- Inconsistencies between the data content of an attribute and its semantic meaning
- Attribute cardinality, which reveals dead attributes (those containing just one value); a profiling sketch follows this list
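A minimal profiling pass over the extracted data can surface several of the points above. The sketch below assumes the normalized extract is available as a pandas DataFrame and reports, for each attribute, the null rate, the number of distinct values, and whether the attribute is dead; the sample table and its values are invented for the example.

    import pandas as pd

    def profile_attributes(df: pd.DataFrame) -> pd.DataFrame:
        # Build one row of profiling results per source attribute.
        rows = []
        for col in df.columns:
            distinct = df[col].nunique(dropna=True)
            rows.append({
                "attribute": col,
                "null_rate": df[col].isna().mean(),
                "distinct_values": distinct,
                "dead": distinct <= 1,  # one value at most: candidate for exclusion
                "sample_values": df[col].dropna().unique()[:5].tolist(),
            })
        return pd.DataFrame(rows)

    # Hypothetical extract of a legacy customer table.
    extract = pd.DataFrame({
        "account_number": ["001234", "001235", "001236"],
        "status": ["A", "A", "A"],                          # dead attribute
        "birth_date": ["1970-02-30", None, "1985-06-12"],   # invalid and missing values
    })
    print(profile_attributes(extract))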
Legacy and even relational systems commonly use "de-normalization" and data duplication in an attempt to improve
performance. They also frequently lack primary and foreign key support. That means you must analyze the source tables
to identify the functional dependencies between attributes and to find primary and foreign key candidates.
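One way to look for key candidates and functional dependencies is to test them directly against the extracted data. The sketch below, again assuming a pandas DataFrame with invented sample values, checks whether a column combination is populated and unique enough to serve as a key, and whether one attribute functionally determines another (a sign of de-normalization).

    import pandas as pd

    def is_candidate_key(df: pd.DataFrame, cols: list[str]) -> bool:
        # A candidate key must be fully populated and unique across all rows.
        subset = df[cols]
        return not subset.isna().any().any() and not subset.duplicated().any()

    def determines(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
        # lhs -> rhs holds if every lhs value maps to exactly one rhs value.
        return bool((df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all())

    # Hypothetical extract with a de-normalized region name.
    extract = pd.DataFrame({
        "account_number": ["001234", "001235", "001236"],
        "region_code": ["EU", "EU", "US"],
        "region_name": ["Europe", "Europe", "United States"],
    })
    print(is_candidate_key(extract, ["account_number"]))       # primary key candidate
    print(determines(extract, "region_code", "region_name"))   # region_name is redundant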
When the attribute profiling is finished, the results need to be reviewed at two levels. The first level is to decide
whether or not to migrate the attribute. You may choose not to migrate an attribute if it contains no useful
information or if the data is of such poor quality that it cannot be migrated without corrupting the target. The second
level is to determine whether the attribute needs to be scrubbed during the migration.
When quality problems are discovered during profiling, you need to perform some data cleansing: removing or
amending data that is incorrect, duplicated, improperly formatted, or incomplete. This operation is usually called data
scrubbing.
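A simple scrubbing pass typically combines these corrections. The sketch below, using another invented pandas extract, drops exact duplicates, normalizes formatting, coerces invalid dates, and flags incomplete rows for review instead of silently discarding them.

    import pandas as pd

    def scrub(df: pd.DataFrame) -> pd.DataFrame:
        cleaned = df.copy()
        # Remove exact duplicate rows carried over from the legacy source.
        cleaned = cleaned.drop_duplicates()
        # Amend improperly formatted values: trim whitespace, normalize case.
        cleaned["name"] = cleaned["name"].str.strip().str.title()
        # Coerce dates; invalid values become NaT rather than corrupting the target.
        cleaned["birth_date"] = pd.to_datetime(cleaned["birth_date"], errors="coerce")
        # Flag incomplete records for review rather than dropping them silently.
        cleaned["needs_review"] = cleaned.isna().any(axis=1)
        return cleaned

    extract = pd.DataFrame({
        "name": ["  smith ", "  smith ", "JONES"],
        "birth_date": ["1970-02-30", "1970-02-30", "1985-06-12"],
    })
    print(scrub(extract))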