Pseudonymization is a privacy-enhancing technology (PET) that replaces direct or indirect identifiers in a datapoint with other, "pseudonymous" values. This protects the identity of the individual(s) that the data belongs to, as an adversary or a user of the data can no longer establish a relationship to a specific person.
Pseudonymization methods can be classified based on their properties:
Reversible pseudonymization methods allow the de-pseudonymization of data, i.e. going back from the pseudonym to the original identifier with the help of additional data such as a cryptographic key or a mapping table. Non-reversible pseudonymization methods do not allow this.
Deterministic pseudonymization methods always create the same pseudonym when applied to the same identifier. Non-deterministic methods produce a new random pseudonym each time, even when applied to the same identifier multiple times.
Structure-preserving pseudonymization methods preserve specific internal structure of identifiers. For example, a deterministic pseudonymization method is also structure-preserving with regard to the equality operator: if the pseudonyms for two identifiers are equal, the original identifiers must have been identical too. This principle can be extended to more advanced structural properties, as discussed below.
Format-preserving pseudonymization methods preserve the original format of the data. For example, a format-preserving method could generate a valid zip code pseudonym from an original zip code.
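The properties above can be illustrated with a short Python sketch. It is not Kodex code: the function names, the MappingTable helper and the choice of HMAC-SHA256 and random hex tokens are assumptions made purely for illustration.

```python
import hashlib
import hmac
import secrets


def deterministic_pseudonym(key, identifier):
    """Deterministic, non-reversible: same key and identifier give the same pseudonym."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def random_pseudonym(identifier):
    """Non-deterministic: a fresh random token on every call, unrelated to the input."""
    return secrets.token_hex(16)


class MappingTable:
    """Reversible pseudonymization backed by a lookup table (hypothetical helper)."""

    def __init__(self):
        self._forward = {}
        self._backward = {}

    def pseudonymize(self, identifier):
        # Reuse an existing pseudonym so the table stays deterministic per identifier.
        if identifier not in self._forward:
            pseudonym = secrets.token_hex(16)
            self._forward[identifier] = pseudonym
            self._backward[pseudonym] = identifier
        return self._forward[identifier]

    def depseudonymize(self, pseudonym):
        # Going back requires the additional data, here the table itself.
        return self._backward[pseudonym]
```

Because deterministic_pseudonym returns equal outputs only for equal inputs (up to hash collisions), it also preserves structure with respect to equality, whereas random_pseudonym does not.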
Kodex supports key-based reversible pseudonymization methods
based on format-preserving encryption. The resulting pseudonyms can
be decrypted (depseudonymized) using the original key. For example,
the merengue pseudonymization method operates on
arbitrary binary data and produces binary pseudonyms that can in
addition preserve the prefix structure of the data:
- name: pseudonymize-name
  type: pseudonymize
  config:
    key: name
    method: merengue
Based on the
merengue method, Kodex offers various
structure-preserving pseudonymization methods that can operate on
data types such as timestamps, IP addresses and numbers. These
methods preserve the format of the original data and can also
preserve structural information. For example, the following actions
perform structure- and format-preserving pseudonymization of the
above-mentioned data types:
- name: pseudonymize-date
  type: pseudonymize
  config:
    key: date
    method: structured
    type: date
    format: "%(2010-2030)Y-%m-%d"
- name: pseudonymize-ip
  type: pseudonymize
  config:
    key: ip
    method: structured
    type: ipv4
    preserve-prefixes: true
- name: pseudonymize-count
  type: pseudonymize
  config:
    key: count
    method: structured
    type: integer
    preserve-prefixes: true
    type-params:
      min: 0
      max: 10000
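One well-known way to obtain prefix preservation is to flip each bit of the input depending on a keyed pseudorandom function of the bits that precede it. The following Python sketch shows this idea for IPv4 addresses. It is not necessarily what Kodex implements: the function names and the use of HMAC-SHA256 as the pseudorandom function are assumptions for illustration, and the code is not intended for production use.

```python
import hashlib
import hmac
import ipaddress


def _prf_bit(key, prefix_bits):
    """Pseudorandom bit derived from the key and the input bits seen so far."""
    digest = hmac.new(key, prefix_bits.encode("ascii"), hashlib.sha256).digest()
    return digest[0] & 1


def pseudonymize_ipv4(key, address):
    bits = format(int(ipaddress.IPv4Address(address)), "032b")
    out = []
    for i, bit in enumerate(bits):
        # Bit i is masked using only the i preceding input bits, so inputs
        # sharing a k-bit prefix produce outputs sharing a k-bit prefix.
        out.append(str(int(bit) ^ _prf_bit(key, bits[:i])))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def depseudonymize_ipv4(key, pseudonym):
    bits = format(int(ipaddress.IPv4Address(pseudonym)), "032b")
    original = ""
    for bit in bits:
        # The original bits recovered so far determine the next mask bit.
        original += str(int(bit) ^ _prf_bit(key, original))
    return str(ipaddress.IPv4Address(int(original, 2)))
```

In this sketch the output is always a valid IPv4 address (format preservation), two addresses in the same /24 network map to pseudonyms in a common /24 network (prefix preservation), and the key allows recovering the original address (reversibility).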
Kodex supports non-reversible pseudonymization methods as well, notably keyed hash-based pseudonymization based on keyed-hash message authentication codes (HMAC), as well as unkeyed hash-based pseudonymization (which we strongly discourage). For example, the following action generates an HMAC-based pseudonym:
- name: pseudonymize-id
  type: pseudonymize
  config:
    key: id
    method: hmac
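A common reason for discouraging unkeyed hashes is that anyone who can guess candidate identifiers can recompute their hashes and compare. The Python sketch below contrasts the two approaches. The function names, the SHA-256 hash and the example identifiers are assumptions chosen for illustration and do not describe how Kodex computes its pseudonyms.

```python
import hashlib
import hmac


def unkeyed_pseudonym(identifier):
    """Plain hash: reproducible by anyone who knows the hash function."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def hmac_pseudonym(key, identifier):
    """Keyed hash: reproducible only with knowledge of the secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(pseudonym, candidates):
    """Try to recover an identifier by hashing guesses without any key."""
    for candidate in candidates:
        if unkeyed_pseudonym(candidate) == pseudonym:
            return candidate
    return None
```

With a list of plausible identifiers, dictionary_attack recovers the input of unkeyed_pseudonym but returns None for a pseudonym produced by hmac_pseudonym, as long as the key stays secret.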
Like all actions, the
pseudonymize action requires a
config parameter, which in turn requires a
key parameter that specifies the attribute that should
be pseudonymized, as well as a
method parameter that
specifies the pseudonymization method. Currently, the following
methods are supported:
merengue: Non-structure & non-format preserving, reversible, encryption-based pseudonymization.
structured: Structure- & format-preserving, reversible, encryption-based pseudonymization.
hmac: Non-structure & non-format preserving, non-reversible, authenticated hash-based pseudonymization.
Depending on the chosen method, additional configuration parameters are necessary, as described below.
Merengue Pseudonymization Parameters
The merengue method accepts a
parameter that specifies the encoding of the resulting byte string.
Currently the only possible value (and the default) is
Structured Pseudonymization Parameters
The structured method accepts the following parameters:
preserve-prefixes: If true, will preserve the prefixes of a structured data value. For example, when using the
date format, dates with a common prefix (e.g. the same year and month) will be mapped to pseudonyms that also share a prefix of the same length. Defaults to false.
type: Specifies the data type to pseudonymize. Must be one of
type-params: Specifies additional type parameters depending on the chosen type. Currently, only the
integer type requires the mandatory
min and max type parameters that specify its range.
format: Specifies the type-dependent format of the data to be pseudonymized (if applicable). Currently, only the
date type supports the
format parameter, as described below.
The date type can parse and produce dates in
various formats. The
format specifier is modeled after
the standard Unix time format, using %
followed by a format code. The following codes are supported:
Y: The year as a four-digit number (e.g. 2021)
m: The month as a two-digit number (e.g. 01 or 12)
d: The day of the month as a two-digit number (e.g. 03 or 30)
H: The hour of the time as a two-digit number (e.g. 11 or 07)
M: The minute of the time as a two-digit number (e.g. 01 or 59)
S: The second of the time as a two-digit number (e.g. 01 or 59)
n: The nanoseconds of the time as a nine-digit number (e.g. 000000001)
z: The time zone of the time as a signed four-digit number (e.g. +0400 or -0730)
In addition to the common syntax, the format also allows
specifying ranges, which need to be provided in parentheses after
the % sign. For example,
%(2001-2022)Y
limits the year to values between 2001 and 2022. Specifying ranges
can be useful to ensure that pseudonymized timestamps are within an
expected range. However, this requires that all input values be in
that range as well.
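To make the format syntax more concrete, the following Python sketch parses a format specifier such as "%(2010-2030)Y-%m-%d" and checks whether an input value matches it and lies within the declared ranges. This is not Kodex's parser: the function names, the regular expression and the error handling are assumptions for illustration, and only the codes listed above are recognized.

```python
import re

# Digit widths of the numeric codes described above; "z" is handled separately.
WIDTHS = {"Y": 4, "m": 2, "d": 2, "H": 2, "M": 2, "S": 2, "n": 9}
SUPPORTED = set(WIDTHS) | {"z"}

# "%", an optional "(low-high)" range, then a single-letter format code.
TOKEN = re.compile(r"%(?:\((\d+)-(\d+)\))?([A-Za-z])")


def parse_format(fmt):
    """Split a format string into literal strings and (code, bounds) fields."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(fmt):
        if match.start() > pos:
            parts.append(fmt[pos:match.start()])
        low, high, code = match.groups()
        if code not in SUPPORTED:
            raise ValueError("unsupported format code: " + code)
        bounds = (int(low), int(high)) if low is not None else None
        parts.append((code, bounds))
        pos = match.end()
    if pos < len(fmt):
        parts.append(fmt[pos:])
    return parts


def validate(fmt, text):
    """Return True if text matches fmt and every ranged field is within bounds."""
    pattern = ""
    fields = []
    for part in parse_format(fmt):
        if isinstance(part, str):
            pattern += re.escape(part)
        else:
            code, bounds = part
            pattern += r"([+-]\d{4})" if code == "z" else r"(\d{%d})" % WIDTHS[code]
            fields.append(bounds)
    match = re.fullmatch(pattern, text)
    if match is None:
        return False
    for value, bounds in zip(match.groups(), fields):
        if bounds is not None and not bounds[0] <= int(value) <= bounds[1]:
            return False
    return True
```

A check like validate can be run on input data before pseudonymization to catch values that fall outside the ranges declared in the format.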