METHODOLOGIES

Migrate-DB Methodology

LF Surrogate Technique

The LF Surrogate technique makes it possible to re-engineer the database from DDS to DDL without recompiling the applications.

For each Physical File that is re-engineered into an SQL table, a Logical File called an LF Surrogate is generated. Existing applications will now address the Surrogate LF that has replaced the Physical File and will continue to function as before, EXCEPT that they will write and read data in the new SQL table.

For this to happen, the LF Surrogate, Transformed LF and SQL Table must have the following characteristics (a sketch follows the lists below):

Characteristics of the Surrogate LF

  • Same Name as the Original PF
  • Same Columns as the Original PF
  • Same Keywords as the Original PF
  • Same Access Path as the Original PF
  • Same FLI as the Original PF
  • Points to the New SQL Table

Characteristics of the Transformed LF

  • Same Name as the Original LF
  • Same Columns as the Original LF
  • Same (explicit) Keywords as the Original LF
  • Same Access Path as the Original LF
  • Same FLI as the Original LF
  • Points to the New SQL Table(s)

Characteristics of the SQL Table

  • New meaningful System and Long Table Names
  • Column System names are preserved
  • New meaningful Long Column Names
  • Existing columns of the Original PF preserved
  • New columns added (Identity, Auditing, True Dates, FK’s, custom)
  • Owner, Grants, Compilation parameters, Triggers and Journaling preserved
  • SQL Indexes created (only when they share an Access Path with the originating Keyed LF)
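
For illustration only, here is a minimal sketch of what the generated objects could look like; all library, table and column names (MYLIB, CUSTMAST, CUSTOMER_MASTER…) are hypothetical:

    -- Hypothetical example: the original PF CUSTMAST becomes an SQL table with
    -- new meaningful long names, while the column system names and the record
    -- format name are preserved.
    CREATE TABLE MYLIB.CUSTOMER_MASTER FOR SYSTEM NAME CUSTMAS00 (
        CUSTOMER_NUMBER FOR COLUMN CUSNUM DECIMAL(7, 0) NOT NULL,
        CUSTOMER_NAME   FOR COLUMN CUSNAM CHAR(30)      NOT NULL DEFAULT ''
    ) RCDFMT CUSTMASR;

    -- The Surrogate LF is a DDS Logical File keeping the original name CUSTMAST,
    -- its columns, keywords, access path and format level identifier, but its
    -- PFILE now points to the new SQL table, for example:
    --     A          R CUSTMASR                  PFILE(MYLIB/CUSTMAS00)
    --     A          K CUSNUM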

Completeness of the process

  • It preserves record formats and applies correct permissions, triggers, constraints, object ownership and compilation options in the SQL tables
  • It retrieves access paths, generates SQL Indexes, and transforms related Views and Materialized Query Tables
  • It preserves existing journaling scheme or, if preferred, applies automatic schema journaling
  • It ensures compatibility of existing applications by building a “Surrogate” Logical File for each Physical File and creating a transformed Logical File for each existing Logical File
  • It checks for invalid data stored in the PFs
  • It employs parallel processing and other techniques to minimize downtime when data is copied from the original PFs into the new SQL tables (see the sketch below)
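
As an illustration of the data-copy step, a hypothetical sketch of splitting the copy into key ranges so that several jobs can run in parallel (all names are illustrative):

    -- Job 1 of N: copy one key range from the original PF into the new SQL table.
    INSERT INTO MYLIB.CUSTOMER_MASTER (CUSTOMER_NUMBER, CUSTOMER_NAME)
        SELECT CUSNUM, CUSNAM
        FROM   OLDLIB.CUSTMAST
        WHERE  CUSNUM BETWEEN 1 AND 500000;
    -- Further jobs cover the remaining key ranges (500001-1000000, and so on).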

Adds an Identity Column and other Auto-Maintained Columns

As Xcase for i creates new SQL Tables, it will include the all-important Identity columns as well as your choice of other automatically maintained columns (User and Time Stamps, other auditing columns, true dates, normalized FKs based on identity parents, custom fields) without impacting existing applications.
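
A minimal, hypothetical sketch of such auto-maintained columns on the new table (column names are illustrative, not the ones the product generates):

    -- Identity and auditing columns added to the new SQL table; existing
    -- programs keep using the Surrogate LF and are not affected by them.
    ALTER TABLE MYLIB.CUSTOMER_MASTER
        ADD COLUMN CUSTOMER_ID FOR COLUMN CUSID
            INTEGER GENERATED ALWAYS AS IDENTITY
        ADD COLUMN CHANGED_BY FOR COLUMN CHGUSR
            VARCHAR(128) NOT NULL DEFAULT USER
        ADD COLUMN CHANGED_AT FOR COLUMN CHGTS
            TIMESTAMP NOT NULL GENERATED ALWAYS
            FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP;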

Relate-DB Methodology

Step One: Database relationship discovery

With its unique, data-centric approach, Xcase for i exploits all the information available in the metadata and the physical data to automate the discovery of the relationships hidden in the database.

The following outlines the algorithm used for the automatic relationship discovery:

  • Determine which tables are relational (exclude work files)
  • Qualify candidate Parent Keys and Children
  • Match PK and FK elements using metadata and common vocabulary (semantics)
  • Check candidate relationships using physical data
  • Provide a grade and detailed diagnostic to help you validate the results

Time, effort and risk of errors are reduced. Your team only has to validate the small set of relationships selected from among the hundreds of thousands of possibilities.
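
As an illustration of the physical-data check, a query of the kind that could be generated to validate a candidate relationship (table and column names are hypothetical):

    -- Count child rows whose candidate FK value has no matching Parent Key.
    -- A low orphan count supports the candidate relationship; a high count
    -- lowers its grade in the diagnostic.
    SELECT COUNT(*) AS ORPHAN_ROWS
    FROM   MYLIB.ORDHIST O
    WHERE  NOT EXISTS (SELECT 1
                       FROM   MYLIB.CUSTMAST C
                       WHERE  C.CUSNUM = O.CUSNUM);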

Step Two: Normalizing tables and fixing current integrity issues

Xcase for i helps you automatically identify structural anomalies, duplicated data, integrity issues and orphan data sources.

The database normalization process is fully assisted, allowing you to correct the database more efficiently:

  • Child table containing (redundant) Parent or Parent table containing Child in (fixed) columns: Split Child and Parent
  • Child table containing selective redundant data: Automate it and Eliminate it when applications are upgraded
  • Table with extensions: Join brothers
  • Referential integrity violated by the data in the database: Discover the violations using automatically generated SQL statements
  • Existing applications causing integrity issues in the database: Trace them using automatically generated triggers (a sketch follows this list)
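
A hypothetical sketch of such a tracing trigger (names and the log table are illustrative):

    -- Record which job or user inserts a child row violating the discovered
    -- relationship, so the offending application can be identified.
    CREATE TRIGGER MYLIB.TRACE_ORDHIST_ORPHANS
        AFTER INSERT ON MYLIB.ORDHIST
        REFERENCING NEW AS N
        FOR EACH ROW
        WHEN (NOT EXISTS (SELECT 1 FROM MYLIB.CUSTMAST C
                          WHERE  C.CUSNUM = N.CUSNUM))
        INSERT INTO MYLIB.INTEGRITY_LOG (TABLE_NAME, BAD_VALUE, INSERTED_BY, INSERTED_AT)
            VALUES ('ORDHIST', N.CUSNUM, USER, CURRENT TIMESTAMP);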

Step Three: Alternative Relationships

A Discovered Relationship cannot be implemented “as is” in the following cases:

  • The physical characteristics of the Parent and the Child fields are not identical
    For example, the Parent Key is numeric and the Child is decimal, or they do not share the same length.
  • The Foreign Key of the Child uses “Pseudo nulls”
    When the Parent Key is unknown, a Blank or Zero is used instead of a Null value, creating de facto (for SQL) an orphan if that value does not exist in the Parent Key.
  • Referential integrity is violated momentarily
    The Child is created before the Parent
  • The Parent is filtered
    The Parent Key has a filter, for example the record must be flagged as “active” to be a Parent in the relationship.
  • Conditional Parents exist
    Depending on the value in a column of a Child the FK points to Parent A or Parent B
  • Part of the FK is implicit
    A “table of tables” is used to store various possible code values along with the code type (country, tax, region…). In the Child only the code value appears; the code type is omitted and known only by the application. This creates a situation where part of the FK is missing.

In those cases, an Alternative Relationship can be implemented between the Parent Identity and a newly created FK in the Child that maps to it. When properly set up, this yields a functionally equivalent relationship, producing the same children for the same Parent and the same Parent for the same Child as the original discovered relationship.

A set of automatically generated triggers updates the new FK when the old FK is modified and vice versa. This way old applications (using the old FK) and new applications (using the new FK) can co-exist, as each gets the data it needs. A sketch of such a trigger follows.
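
For illustration, a minimal sketch of one of these synchronization triggers, assuming a hypothetical ORDHIST child with an old FK CUSNUM and a new FK CUS_ID mapping to the CUSTMAST Identity; the reverse triggers are symmetrical:

    -- When an old application inserts a row, derive the new FK from the old
    -- one; matching triggers handle updates and keep the old FK in step when
    -- new applications set CUS_ID.
    CREATE TRIGGER MYLIB.SYNC_ORDHIST_NEW_FK
        BEFORE INSERT ON MYLIB.ORDHIST
        REFERENCING NEW AS N
        FOR EACH ROW MODE DB2ROW
        WHEN (N.CUS_ID IS NULL)
        SET N.CUS_ID = (SELECT C.CUSTOMER_ID
                        FROM   MYLIB.CUSTMAST C
                        WHERE  C.CUSNUM = N.CUSNUM);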

Note that the triggers produce minimal overhead and are activated only when modifying data. When reading data (which happens on average 20 times more frequently), you immediately benefit from an optimized Identity-based relationship.

Note also that once the old applications have been upgraded to use the new relationship, you can get rid of the old FK and the triggers.

The following diagram illustrates the “Alternative Relationship” concept.

Anonymization Methodology

Identifying Column Groups

The same data often appears in multiple locations across the database. It is crucial to identify the group of columns belonging to the same “domain” in order to anonymize them consistently:

  • Redundant data (the DB is not normalized and the same data appears in multiple places)
  • Repeated data (PK/FK chains that allow linking tables)
  • Calculated data (e.g. the concatenation of the Name and Surname of the customer is stored; it needs to be re-calculated after the Name and the Surname have been anonymized to preserve coherence)

Anonymize-DB allows you to identify column groups by searching the metadata for columns belonging to the same group according to physical attributes, semantic data (column vocabulary found in the name, text, or heading of the column) and even physical data to verify that the columns in the group belong to the same domain. A sketch of such a metadata search follows.
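
As a hypothetical illustration, a catalog query of the kind that could locate candidate members of a “Customer Name” column group:

    -- Find columns whose name, text or heading suggests a customer name and
    -- whose physical attributes are compatible with the group.
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, LENGTH
    FROM   QSYS2.SYSCOLUMNS
    WHERE  (UPPER(COLUMN_NAME)    LIKE '%CUSNAM%'
         OR UPPER(COLUMN_TEXT)    LIKE '%CUSTOMER%NAME%'
         OR UPPER(COLUMN_HEADING) LIKE '%CUSTOMER%NAME%')
      AND  DATA_TYPE IN ('CHAR', 'VARCHAR');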

Which data should be anonymized?

The Dictionary Methodology

The distinct values appearing in a group of columns to be anonymized are stored in the first column of a translation dictionary; the anonymized value is stored in a second column. This dictionary makes it possible to consistently replace the same value with the same anonymized counterpart in the Dev/Test database. The anonymized value is set by the method selected by the anonymization manager.
Note that it is also possible to use an identifier (such as a customer number) instead of the value of the anonymized column to establish the dictionary. This makes you less prone to inconsistencies produced by possible misspellings in the value of the column to be anonymized. A sketch of such a dictionary follows.
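
A minimal, hypothetical sketch of such a translation dictionary and its use (all names are illustrative):

    -- One dictionary per column group, kept in a secured library.
    CREATE TABLE ANONLIB.DICT_CUSTOMER_NAME (
        ORIGINAL_VALUE   VARCHAR(50) NOT NULL PRIMARY KEY,
        ANONYMIZED_VALUE VARCHAR(50) NOT NULL
    );

    -- Consistently replace each value by its anonymized counterpart wherever
    -- the column group appears in the Dev/Test database.
    UPDATE DEVLIB.CUSTMAST C
    SET    CUSNAM = (SELECT D.ANONYMIZED_VALUE
                     FROM   ANONLIB.DICT_CUSTOMER_NAME D
                     WHERE  D.ORIGINAL_VALUE = C.CUSNAM)
    WHERE  EXISTS (SELECT 1
                   FROM   ANONLIB.DICT_CUSTOMER_NAME D
                   WHERE  D.ORIGINAL_VALUE = C.CUSNAM);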

  • Security:
    • Dictionaries are stored in a user-configurable secured location not accessible to unauthorized stakeholders
  • Reversibility:
    • Dictionaries can be used to revert to the original value if needed and authorized
  • Re-setting Dictionaries:
    • The anonymization manager can decide when to reset the dictionaries. In that case the new anonymization will be different from the previous one
  • What is “sensitive” is decided by legal, cultural and business concerns depending on the specific organization and country

Methods of anonymization

  • Shuffle: Randomly shuffle the existing values to be assigned as the anonymized ones. Supported for single and composite data such as multi-column addresses.
  • List of Values: Use a user-defined list of values to be randomly allocated as the anonymized value.
  • Increment: Allocate an automatically incremented number as the anonymized value.
  • Range: Allocate a number from a range as the anonymized value.
  • Customizable SQL Function: IBAN, Social Security, Credit Card… The SQL functions are provided with their code so they can be customized for your specific needs.
  • Constant: Allocate a constant value as the anonymized value (e.g. email = xxxxxx@yyy.com).
  • Data Set: Use provided data sets such as names and addresses corresponding to your country.
  • Table: Use data from a database table you provide.
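
For example, the Constant and Range methods could translate into generated SQL along these lines (table and column names are hypothetical):

    -- Constant: every email address in the group gets the same neutral value.
    UPDATE DEVLIB.CUSTMAST
    SET    CUSEML = 'xxxxxx@yyy.com';

    -- Range: allocate a pseudo-random number between 1 and 99999.
    UPDATE DEVLIB.CUSTMAST
    SET    CUSPHN = INT(RAND() * 99999) + 1;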

Data in multiple and heterogeneous databases

Creating unified consistent dictionaries
When your organization uses multiple databases, even ones not belonging to the same RDBMS, you can set up a model for each one. If the same column group (for example Customer Name) exists in the different databases, you provide the same Group Name for each. Anonymize-DB will make sure to use an updated dictionary table incorporating the latest additions from the other databases, so that the anonymization is coherent across all the databases (e.g. the same anonymized customer name is used across ALL its instances in ALL the databases).

Isolation between Prod and Dev
In a properly managed DevOps environment, non-anonymized Prod data should not be available in Dev.

  • Need for SQL Scripts
    To achieve this, the sample Prod data must be anonymized by an infrastructure operator before it is made available in Dev. Anonymize-DB produces an SQL script which anonymizes the data. The SQL script is run in Prod by the infrastructure operator (see the example after this list).
  • Non-Proprietary code
    The SQL script produced is pure SQL and does not include any hidden proprietary code, thus avoiding any dependency.
  • Repeatable process on multiple environments and data sets
    The script can be used for multiple data sets and environments, yielding a repeatable and automatable process.
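
For instance, on IBM i the generated script could be executed from a stream file with the standard RUNSQLSTM command (the path below is illustrative):

    RUNSQLSTM SRCSTMF('/anonymize/anon_customer_data.sql') COMMIT(*NONE) NAMING(*SQL)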

Performance

Anonymization is an additional step in making data available in Dev/Test, and it should have minimal impact on the delivery time of Dev/Test data.
When the analysis of what should be anonymized is properly conducted, you will usually find that only a small subset of the tables and total records of your database actually needs to be anonymized.

  • Use compact Test/Dev data (see Extractor)
    Test/Dev environments typically contain only a small subset of what is present in Prod. Obviously, the smaller the data set, the less time it will take to anonymize it
  • There is no need to anonymize everything; it should be a smart process. Rely on our experience to build the most efficient process with you.

Documentation

Graphical and textual
Documentation of the anonymization process is supplied both in textual form (list of fields to anonymize and the anonymization method) and as Entity Relationship Diagrams focusing on the currently anonymized columns.

This documentation is important for:

    • Checking the validity of the column groups with application developers and system analysts
    • System evolution when new potentially sensitive columns are added to the database
    • Auditing reports