Anonymization Methodology

Identifying Column Groups

The same data often appears in multiple locations across the database. It is crucial to identify the groups of columns belonging to the same “domain” in order to anonymize them consistently:

  • Redundant data (the DB is not normalized and the same data appears in multiple places)
  • Repeated data (PK/FK chains that link tables)
  • Calculated data (e.g. the concatenation of the customer's Name and Surname is stored; it needs to be recalculated after the Name and the Surname have been anonymized to preserve coherence)

Anonymize-DB allows you to identify column groups by searching the metadata for columns belonging to the same group according to physical attributes, semantic data (column vocabulary found in the name, text, or heading of the column) and even physical data, to verify that the columns in the group belong to the same domain.
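
As an illustration of the metadata search described above, here is a minimal Python sketch that groups columns into candidate domains by the vocabulary found in their names. It is not the product's algorithm: the `VOCABULARY` table, the `group_columns` helper and the sample columns are all hypothetical, and a real search would also compare physical attributes and sample data.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary: domain name -> name fragments that suggest it.
VOCABULARY = {
    "customer_name": ("cust_name", "customer_name", "client_name"),
    "email": ("email", "e_mail", "mail_addr"),
}


def group_columns(columns, vocabulary=VOCABULARY):
    """Group (table, column, data_type) triples into candidate domains by name."""
    groups = defaultdict(list)
    for table, column, data_type in columns:
        # Normalize "Customer-Name", "CUSTOMER NAME", ... to "customer_name".
        normalized = re.sub(r"[^a-z0-9]+", "_", column.lower())
        for domain, fragments in vocabulary.items():
            if any(fragment in normalized for fragment in fragments):
                groups[domain].append((table, column, data_type))
                break  # a column joins at most one candidate group
    return dict(groups)


# Hypothetical metadata, as it might be read from a database catalog.
columns = [
    ("CUSTOMER", "CUST_NAME", "VARCHAR(60)"),
    ("INVOICE", "Customer-Name", "VARCHAR(60)"),
    ("CONTACT", "EMAIL", "VARCHAR(120)"),
    ("ORDERS", "ORDER_DATE", "DATE"),
]
candidate_groups = group_columns(columns)
```

The resulting groups are only candidates; they would still need to be confirmed against the actual data and with the people who know the application.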

Which data should be anonymized?

What is « sensitive » is decided by legal, cultural and business concerns, depending on the specific organization and country.

The Dictionary Methodology

The distinct values appearing in a group of columns to be anonymized are stored in the first column of a translation dictionary, and the corresponding anonymized value is stored in a second column. This dictionary makes it possible to consistently replace the same value with the same anonymized counterpart throughout the Dev/Test database. The anonymized value is set by the method selected by the anonymization manager.
Note that it is also possible to use an identifier (such as a customer number) instead of the value of the column to be anonymized to establish the dictionary. This makes you less prone to inconsistencies caused by misspellings in the value of the column to be anonymized.

  • Security:
    • Dictionaries are stored in a user-configurable, secured location that is not accessible to unauthorized stakeholders
  • Reversibility:
    • Can be used to revert to the original value if needed and authorized
  • Re-setting Dictionaries:
    • The anonymization manager can decide when to reset the dictionaries. In that case, the new anonymization will differ from the previous one
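
The dictionary approach can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the helper names, the sample tables and the "Customer NNNN" replacement scheme are assumptions made for the example.

```python
def build_dictionary(values, make_replacement):
    """Map each distinct original value to exactly one anonymized value."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            # The replacement depends only on the order of first appearance.
            dictionary[value] = make_replacement(len(dictionary) + 1)
    return dictionary


def apply_dictionary(rows, column, dictionary):
    """Return copies of the rows with `column` translated through the dictionary."""
    return [{**row, column: dictionary[row[column]]} for row in rows]


# Hypothetical sample data: the same name appears in two tables.
customers = [{"id": 1, "name": "Alice Martin"}, {"id": 2, "name": "Bob Durand"}]
invoices = [{"no": 10, "customer_name": "Bob Durand"},
            {"no": 11, "customer_name": "Alice Martin"}]

all_names = [c["name"] for c in customers] + [i["customer_name"] for i in invoices]
name_dictionary = build_dictionary(all_names, lambda n: f"Customer {n:04d}")

anonymized_customers = apply_dictionary(customers, "name", name_dictionary)
anonymized_invoices = apply_dictionary(invoices, "customer_name", name_dictionary)

# Reversal, if needed and authorized, is a lookup in the inverted dictionary.
reverse_dictionary = {anon: original for original, anon in name_dictionary.items()}
```

Because both tables are translated through the same dictionary, "Bob Durand" becomes the same anonymized name in `customers` and in `invoices`. Resetting the dictionary simply means rebuilding it, which yields a new mapping.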

Methods of anonymization



Randomly shuffle the existing values and assign them as the anonymized values. Supported for single-column and composite data, such as multi-column addresses
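
A minimal Python sketch of shuffling, assuming rows are held as dictionaries; the `shuffle_columns` helper and the sample addresses are hypothetical. Composite values are shuffled as a unit, so a street always stays with its own city.

```python
import random


def shuffle_columns(rows, columns, rng):
    """Redistribute the existing values of `columns` among the rows.

    The columns are moved together as one tuple, so composite values
    (for example street + city) remain internally consistent.
    """
    values = [tuple(row[c] for c in columns) for row in rows]
    rng.shuffle(values)
    return [{**row, **dict(zip(columns, value))} for row, value in zip(rows, values)]


# Hypothetical sample data.
addresses = [
    {"id": 1, "street": "1 Main St", "city": "Lyon"},
    {"id": 2, "street": "22 Oak Ave", "city": "Lille"},
    {"id": 3, "street": "5 Quai Sud", "city": "Nantes"},
]
shuffled = shuffle_columns(addresses, ["street", "city"], random.Random(0))
```

Note that a plain shuffle can leave some rows with their original value; whether that is acceptable depends on the sensitivity of the data.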


List of Values

Use a user-defined list of values to be randomly allocated as the anonymized value
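
As a sketch of this method in Python (the `allocate_from_list` helper and the sample lists are assumptions for illustration), each distinct original value receives one randomly chosen entry from the user-defined list:

```python
import random


def allocate_from_list(values, replacements, rng):
    """Give each distinct value one randomly chosen entry of `replacements`."""
    # dict.fromkeys removes duplicates while keeping first-appearance order.
    return {value: rng.choice(replacements) for value in dict.fromkeys(values)}


# Hypothetical user-defined list and source values.
job_titles = ["Analyst", "Engineer", "Clerk"]
originals = ["Chief Risk Officer", "Head of Audit", "Chief Risk Officer"]
title_dictionary = allocate_from_list(originals, job_titles, random.Random(42))
```

Several original values may receive the same replacement when the list is short; a longer list reduces such collisions.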



Allocate an automatically incremented number to the anonymized value



Allocate a number from a range to the anonymized value
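
The two numeric methods (an automatically incremented number and a number drawn from a range) can be sketched as follows. The helper names, the start and step parameters, and the sample account numbers are hypothetical.

```python
import itertools
import random


def allocate_incremented(values, start=1, step=1):
    """Give each distinct value the next number of an incrementing counter."""
    counter = itertools.count(start, step)
    return {value: next(counter) for value in dict.fromkeys(values)}


def allocate_from_range(values, low, high, rng):
    """Give each distinct value a random integer in [low, high], bounds included."""
    return {value: rng.randint(low, high) for value in dict.fromkeys(values)}


# Hypothetical source values; "ACC-17" appears twice.
accounts = ["ACC-17", "ACC-99", "ACC-17", "ACC-03"]
incremented = allocate_incremented(accounts, start=100, step=10)
ranged = allocate_from_range(accounts, 1000, 9999, random.Random(7))
```

The incremented variant guarantees distinct replacements; the range variant does not, so two original values can receive the same number.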


Customizable SQL Function

IBAN, Social Security, Credit Card… The SQL functions are provided with their code so that they can be customized to your specific needs
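
To show the kind of logic such a function might contain, here is a Python sketch (not the SQL code shipped with the product) that generates a fake card number ending in a valid Luhn check digit, so that application-side format validation still passes. The function names, the "4" prefix and the 16-digit length are assumptions for the example.

```python
import random


def luhn_check_digit(payload):
    """Return the Luhn check digit for a string of digits (check digit excluded)."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number):
    """Check a full number, check digit included, against the Luhn formula."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def fake_card_number(rng, prefix="4", length=16):
    """Build a random number of `length` digits that passes the Luhn check."""
    body_length = length - len(prefix) - 1
    payload = prefix + "".join(str(rng.randint(0, 9)) for _ in range(body_length))
    return payload + str(luhn_check_digit(payload))


card = fake_card_number(random.Random(1))
```

A generated number is random but syntactically valid, so it should still be treated as test data only.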



Allocate a constant value to the anonymized value (E.g. email =


Data Set

Use provided data sets such as names and addresses corresponding to your country



Use data from a database table you provide

Data in multiple and heterogeneous databases

Creating unified consistent dictionaries
When your organization uses multiple databases, even ones that do not belong to the same RDBMS, you can set a model for each one. If the same column group (for example, Customer Name) exists in the different databases, you give it the same Group Name in each. Anonymize-DB makes sure to use an up-to-date dictionary table that includes the latest additions from the other databases, so that the anonymization is coherent across all the databases (e.g. the same anonymized customer name is used for ALL its instances in ALL the databases).
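
The idea of one dictionary per Group Name, shared by every database, can be sketched as follows. The `DictionaryRegistry` class, its token format and the sample values are hypothetical and only illustrate the principle.

```python
class DictionaryRegistry:
    """Keeps one translation dictionary per group name, shared by all databases."""

    def __init__(self):
        self._dictionaries = {}

    def register(self, group_name, values):
        """Add any new values to the group's dictionary and return it."""
        dictionary = self._dictionaries.setdefault(group_name, {})
        for value in values:
            if value not in dictionary:
                dictionary[value] = f"{group_name}-{len(dictionary) + 1:05d}"
        return dictionary


registry = DictionaryRegistry()

# Hypothetical values read from two different databases for the same group.
crm_names = ["Alice Martin", "Bob Durand"]
billing_names = ["Bob Durand", "Chloe Petit"]

crm_dictionary = registry.register("CustomerName", crm_names)
billing_dictionary = registry.register("CustomerName", billing_names)
```

Because both databases use the group name "CustomerName", "Bob Durand" keeps the token it received from the first database, and only "Chloe Petit" is added as a new entry.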

Isolation between Prod and Dev
In a properly managed DevOps environment, non-anonymized Prod data should not be available in Dev.

  • Need for SQL Scripts
    To achieve this, the sample Prod data must be anonymized by an infrastructure operator before it is made available in Dev. Anonymize-DB produces an SQL script which anonymizes the data. The SQL script is run in Prod by the infrastructure operator
  • Non-Proprietary code
    The SQL script produced is pure SQL and does not include any hidden proprietary code, thus avoiding any dependency
  • Repeatable process on multiple environments and data sets
    The script can be used for multiple data sets and environments, yielding a repeatable and automatable process
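
To make the idea of a plain SQL anonymization script concrete, the following Python sketch turns a translation dictionary into a single `UPDATE ... CASE` statement and runs it against an in-memory SQLite database. The format of the script actually produced by Anonymize-DB is not shown in this document; the helper names, table and data below are assumptions.

```python
import sqlite3


def sql_literal(value):
    """Quote a value as an SQL string literal, doubling embedded quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_update_script(table, column, dictionary):
    """Build one UPDATE that translates `column` through the dictionary.

    A single CASE expression avoids chained replacements, which could occur
    with one UPDATE per value if an anonymized value equals another original.
    Table and column names are assumed to come from a trusted model.
    """
    whens = "\n".join(
        f"    WHEN {sql_literal(original)} THEN {sql_literal(anonymized)}"
        for original, anonymized in dictionary.items()
    )
    keys = ", ".join(sql_literal(original) for original in dictionary)
    return (
        f"UPDATE {table}\n"
        f"SET {column} = CASE {column}\n{whens}\n    ELSE {column}\nEND\n"
        f"WHERE {column} IN ({keys});"
    )


# Hypothetical table and dictionary.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [("Alice",), ("O'Brien",), ("Alice",)],
)
script = build_update_script(
    "customer", "name", {"Alice": "Customer 0001", "O'Brien": "Customer 0002"}
)
connection.executescript(script)
names = [row[0] for row in connection.execute("SELECT name FROM customer ORDER BY id")]
```

The generated text contains nothing but SQL, so the same script can be reviewed, versioned and replayed on other environments.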


Anonymization is an additional step in making data available in Dev/Test, and it should have minimal impact on the delivery time of Dev/Test data.
When the analysis of what should be anonymized is properly conducted, you will usually find that only a small subset of the tables and records in your database actually needs to be anonymized.

  • Use compact Test/Dev data (see Extractor)
    Test/Dev environments typically contain only a small subset of what is present in Prod. Obviously, the smaller the data set, the less time it will take to anonymize it
  • There is no need to anonymize everything; it must be a smart process. Rely on our experience to build the most efficient process with you


Graphical and textual
Documentation of the anonymization process is supplied both in textual form (list of fields to anonymize and anonymization method) and as Entity-Relationship Diagrams focusing on the currently anonymized columns.

This documentation is important for:

    • Checking the validity of the column groups with application developers and system analysts
    • System evolution when new potentially sensitive columns are added to the database
    • Auditing reports