4.1 Test Data Management

Effective test data management is fundamental to delivering reliable, high-quality software solutions at DHI. As our numerical modelling products and client project services become increasingly data-intensive, the way we handle test data directly affects our ability to validate functionality, ensure performance at scale, and protect sensitive information. This section sets out guidelines for managing test data throughout its lifecycle so that it remains secure, compliant, and efficient to operate.

4.1.1 Understanding Test Data Categories

Test data is not monolithic; different testing objectives require different data characteristics. At DHI, we recognise three primary categories of test data, each serving distinct purposes in our quality assurance process.

Functional Test Data forms the backbone of our testing processes. It includes essential datasets for validating core features, edge cases that probe boundary conditions, integration data to confirm component interactions, and standardised regression datasets to verify that existing functionality stays consistent as systems evolve. Teams should treat these datasets as living assets, versioning them with the code and updating them as requirements change.

Performance Test Data addresses a critical gap in many testing strategies: the failure to account for realistic data volumes and usage patterns. This category includes datasets that scale from current production volumes to projected future growth, load patterns that simulate everything from typical daily usage to peak stress scenarios, and growth projections that ensure our solutions remain performant as client needs expand. Given DHI's history of performance issues arising from unexpected data volume increases, this category requires particular attention.

Security and Compliance Test Data supports security testing, penetration testing, and compliance validation. It includes test cases for authentication and authorisation scenarios, datasets designed to test input validation and injection attack prevention (aligned with OWASP Top 10 requirements), and anonymised or synthetic personal data for testing GDPR compliance features such as data subject rights (access, deletion, portability). This category must never contain real personal data; any data derived from real records must first be anonymised according to GDPR standards.

4.1.2 Core Principles for Test Data Management

4.1.2.1 Data Minimisation

In alignment with DHI's Data Protection by Design and Default policy and GDPR Article 25, teams must process only the personal data necessary for each specific testing purpose. This principle applies throughout the test data lifecycle:

Collect only the minimum data fields required to validate specific functionality.
Avoid copying entire production databases when subset data would suffice.
Remove unnecessary columns containing personal or sensitive information.
Regularly review and purge test data that no longer serves an active testing purpose.

For example, when testing water meter reading aggregation functionality, include only the meter IDs, consumption values, and timestamps necessary for validation. Customer names, addresses, billing information, and contact details are not required for testing the aggregation logic and should be excluded from the test dataset.
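
The meter reading example can be sketched as a simple field filter. This is a minimal illustration, not a prescribed implementation: the column names (meter_id, consumption_m3, timestamp, customer_name, address) and the sample rows are assumptions made for the example.

```python
import csv
import io

# Hypothetical column names; only these fields are needed for aggregation tests.
REQUIRED_FIELDS = ("meter_id", "consumption_m3", "timestamp")


def minimise(rows, required=REQUIRED_FIELDS):
    """Return copies of the rows that contain only the required fields."""
    return [{field: row[field] for field in required} for row in rows]


# Illustrative extract that carries fields the aggregation test does not need.
source_csv = io.StringIO(
    "meter_id,consumption_m3,timestamp,customer_name,address\n"
    "M-001,12.5,2024-01-01T00:00:00,Example Person,1 Example Street\n"
    "M-002,7.25,2024-01-01T00:00:00,Another Person,2 Example Street\n"
)

minimised_rows = minimise(csv.DictReader(source_csv))
print(minimised_rows[0])
```

Using an explicit allow-list of fields, rather than a list of fields to remove, means that any new column added to the source is excluded by default.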

4.1.2.2 Privacy by Design and Default (For Personal Data)

Consistent with DHI's Data Protection policy, test environments must implement privacy controls from the outset rather than as an afterthought:

Default to synthetic data: Where possible, generate realistic synthetic data rather than using production data.
Automatic anonymisation: When production data must be used, implement automated anonymisation pipelines that execute before data reaches test environments.
Access restrictions by default: Test data should be accessible only to those with a documented business need, with access automatically revoked after a defined period.
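
As a rough sketch of one step in an automated masking pipeline, the example below drops direct identifiers and replaces a linking identifier with a keyed hash. The field names and the key are hypothetical. Note that keyed hashing is pseudonymisation; on its own it may not meet the GDPR bar for anonymisation, so it would normally be combined with other measures and reviewed against the Data Protection policy.

```python
import hashlib
import hmac

# Hypothetical field names: direct identifiers are dropped, linking keys are tokenised.
DROP_FIELDS = {"customer_name", "address", "email"}
PSEUDONYMISE_FIELDS = {"meter_id"}


def pseudonymise(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest, truncated for readability."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_record(record: dict, key: bytes) -> dict:
    """Drop direct identifiers and replace linking identifiers with stable tokens."""
    masked = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in PSEUDONYMISE_FIELDS:
            masked[field] = pseudonymise(str(value), key)
        else:
            masked[field] = value
    return masked


# Placeholder key; in practice it would come from a secret store and would not be
# available inside the test environment.
key = b"placeholder-key"
record = {"meter_id": "M-001", "customer_name": "Example Person", "consumption_m3": 12.5}
masked = mask_record(record, key)
print(masked)
```

Because the same input and key always yield the same token, relationships between records survive masking while the original identifier does not appear in the test environment.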

4.1.2.3 Data Quality and Consistency

Test data must accurately represent real-world scenarios while maintaining consistency across test cycles:

Representativeness: Data should reflect the variety, distribution, and complexity of production data without compromising privacy.
Data integrity: Maintain referential integrity and business rule compliance in test datasets.
Freshness: Establish refresh cycles to ensure test data remains current with evolving business scenarios.
Documentation: Maintain clear documentation of what each test dataset represents and its intended use cases.

For MIKE applications, this means ensuring test data includes the full range of environmental conditions, model complexities, and geographical variations encountered in production.
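
The data integrity point lends itself to an automated check. The sketch below looks for child rows whose foreign key has no matching parent; the station and observation tables and their field names are invented for illustration.

```python
def find_orphans(child_rows, parent_rows, foreign_key, parent_key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Hypothetical tables: monitoring stations and the observations that refer to them.
stations = [{"station_id": "S1"}, {"station_id": "S2"}]
observations = [
    {"station_id": "S1", "water_level_m": 1.2},
    {"station_id": "S3", "water_level_m": 0.8},  # refers to a station that does not exist
]

orphans = find_orphans(observations, stations, "station_id", "station_id")
print(orphans)
```

A check of this kind could run whenever a dataset is regenerated, so that broken references are caught before they cause misleading test failures.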

4.1.2.4 Environment Parity

Test data characteristics should mirror production data characteristics across all testing environments (dev, test, staging), as outlined in Section 3.4:

Volume parity: Where data volume is expected to affect performance, staging environments should use data volumes comparable to production for final validation.
Complexity parity: Include the same levels of data relationships and dependencies found in production.
Distribution parity: Maintain similar statistical distributions in test data as found in production data.
Scaled environments: Dev and test environments may use reduced data volumes for efficiency, but should maintain representative data complexity.
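
One possible way to check distribution parity is to compare quantiles of a test sample against a reference sample. The sketch below uses only the Python standard library; the sample values and the 10 % tolerance are assumptions, and the reference sample is assumed to come from an approved, anonymised source rather than raw production records.

```python
import statistics


def quantile_deviation(reference, candidate, n=10):
    """Largest relative difference between matching quantiles of two samples."""
    ref_q = statistics.quantiles(reference, n=n)
    cand_q = statistics.quantiles(candidate, n=n)
    return max(
        (abs(c - r) / abs(r) for r, c in zip(ref_q, cand_q) if r != 0),
        default=0.0,
    )


# Illustrative values only.
reference_sample = [float(v) for v in range(1, 101)]
test_sample = [v * 1.05 for v in reference_sample]

deviation = quantile_deviation(reference_sample, test_sample)
within_tolerance = deviation <= 0.10  # hypothetical 10 % tolerance
print(round(deviation, 3), within_tolerance)
```

Comparing deciles is a coarse measure; teams with stricter requirements may prefer a formal statistical test, but a quantile comparison is often enough to catch a test dataset that has drifted away from realistic values.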

4.1.2.5 Reusability and Versioning

Treat test data as a managed asset with proper version control:

Store standardised test datasets in version control systems alongside code.
For large datasets (> 1 GB), use Azure Blob Storage, as Git is poorly suited to storing large files.
Tag test data versions to correspond with application releases.
Create reusable test data packages for common testing scenarios.
Document dependencies between test data versions and application versions.
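
Where a large dataset lives outside Git, a small manifest kept in the repository can tie it to an application release. The following is a sketch under assumptions: the manifest fields, the version labels, and the storage location string are all hypothetical, and the checksum approach is one option rather than an established DHI convention.

```python
import hashlib
import json


def build_manifest(name, data_version, app_release, payload: bytes, location):
    """Describe a dataset so that its exact content can be verified later."""
    return {
        "name": name,
        "data_version": data_version,
        "app_release": app_release,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "location": location,
    }


def verify(manifest, payload: bytes) -> bool:
    """Check that a downloaded dataset matches the checksum recorded in the manifest."""
    return hashlib.sha256(payload).hexdigest() == manifest["sha256"]


# Placeholder content and location; a real payload would be read from storage.
payload = b"meter_id,consumption_m3,timestamp\nSYN-001,1.25,2024-01-01T00:00:00\n"
manifest = build_manifest(
    name="regression-core",
    data_version="1.4.0",
    app_release="2024.1",
    payload=payload,
    location="<storage-location>",
)
print(json.dumps(manifest, indent=2))
```

Committing the manifest alongside the code gives each release a verifiable pointer to the dataset it was tested with, without storing the data itself in Git.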

4.1.2.6 Traceability and Auditability (For High-Risk Personal Data)

Maintain complete audit trails for test data access and usage to support compliance and security requirements:

Log all access to test data, including who accessed what data and when.
Track the lineage of test data from source to consumption.
Maintain records of anonymisation and masking operations applied.
Document data refresh cycles and retention periods.
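
A structured audit record can capture access and lineage in one place. The sketch below writes JSON lines to an in-memory list; the field names, the example user, and the transformation labels are assumptions, and a real deployment would write to a protected, append-only log store.

```python
import json
from datetime import datetime, timezone


def audit_entry(user, dataset, action, source, transformations):
    """Build one audit record covering who did what, when, and the data lineage."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "action": action,
        "source": source,
        "transformations": list(transformations),
    }


def append_entry(log_lines, entry):
    """Append the entry as one JSON line; the list stands in for a log file."""
    log_lines.append(json.dumps(entry, sort_keys=True))


log_lines = []
append_entry(
    log_lines,
    audit_entry(
        user="example.user",
        dataset="regression-core:1.4.0",
        action="read",
        source="anonymised-extract",
        transformations=["drop_identifiers", "pseudonymise_meter_id"],
    ),
)
print(log_lines[0])
```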

4.1.3 Test Data Sources and Generation

This section provides high-level guidelines for choosing and preparing test data that supports reliable testing while protecting sensitive information.

Ensure teams use data that:

Matches the testing purpose (functional, performance, security, scientific).
Protects personal and sensitive information.
Is consistent, reusable, and understood.
Does not introduce avoidable risk or operational overhead.

4.1.3.1 Guiding Principles

The principles below provide a framework for selecting, preparing, and maintaining test data at DHI in line with business objectives, regulatory requirements, and industry good practice, while limiting risk and administrative overhead.

Privacy first: Prefer data that never contains personal data (synthetic, simulated).
Fit for purpose: Do not over-engineer realism where not needed.
Minimisation: Only include fields required for the specific test objective.
Consistency: Stable core datasets for regression; flexible datasets for exploratory testing.
Traceability: Always know the source and transformation status.
Separation: Do not mix debugging datasets with reusable test suites.
Reproducibility: Datasets used in automation must be regenerable or versioned.
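
The privacy-first and reproducibility principles can be combined by generating synthetic data from a fixed seed. The sketch below is illustrative: the field names, the SYN- identifier prefix, and the value range are assumptions, not a description of real meter data.

```python
import random
from datetime import datetime, timedelta


def generate_readings(seed, meters=3, hours=24):
    """Generate synthetic hourly meter readings; the same seed gives the same data."""
    rng = random.Random(seed)  # local generator, independent of global random state
    start = datetime(2024, 1, 1)
    readings = []
    for meter in range(1, meters + 1):
        for hour in range(hours):
            readings.append(
                {
                    "meter_id": f"SYN-{meter:03d}",
                    "timestamp": (start + timedelta(hours=hour)).isoformat(),
                    "consumption_m3": round(rng.uniform(0.0, 2.0), 3),
                }
            )
    return readings


readings = generate_readings(seed=42)
print(len(readings), readings[0]["meter_id"], readings[0]["timestamp"])
```

Storing the generator and its seed in version control means the dataset can be regenerated on demand instead of being stored, which also simplifies retention.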

4.1.3.2 Acceptable Source Types

Synthetic (generated): Default for functional, edge, negative, and security injection tests.
Transformed production (anonymised): Only when synthetic cannot reproduce behaviour (e.g., complex performance patterns).
Public/open data: When domain realism matters (e.g., hydrological reference sets).
Simulated/modelled: For time-series, scientific, stress, and extreme event scenarios.
Hybrid: Combine synthetic structure with statistical distributions from approved anonymised sources when a balance is needed.
Client-generated test data: Acceptable when provided by clients, but must be protected and handled according to the contractual agreements overseen by the Project Manager or Business Owner.

4.1.3.3 Source Selection

Source selection determines where a dataset comes from, whether it suits the test, and which compliance obligations apply. Choosing an appropriate source type supports traceability and reproducibility, guards against inadvertent exposure of sensitive information, and protects the integrity of reusable test suites.

Select the simplest source that:

Satisfies the test objective.
Does not expose raw personal or sensitive data.
Can be reused across environments where needed.
Can be refreshed or regenerated without manual effort.

If moving beyond synthetic data:

State why (gap in realism, bug reproduction, performance profiling).
Confirm anonymisation or de-identification is applied.
Limit scope (subset, time window, representative slice).
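
Limiting scope can be as simple as filtering an approved, already anonymised extract by time window and by a small set of entities. The sketch below assumes rows with station_id and ISO 8601 timestamp fields; both names are hypothetical.

```python
from datetime import datetime


def slice_window(rows, start, end, keep_ids=None):
    """Keep rows with start <= timestamp < end, optionally limited to given ids."""
    selected = []
    for row in rows:
        timestamp = datetime.fromisoformat(row["timestamp"])
        if not (start <= timestamp < end):
            continue
        if keep_ids is not None and row["station_id"] not in keep_ids:
            continue
        selected.append(row)
    return selected


# Illustrative rows standing in for an anonymised extract.
rows = [
    {"station_id": "S1", "timestamp": "2024-01-01T00:00:00", "flow_m3s": 3.1},
    {"station_id": "S2", "timestamp": "2024-01-01T06:00:00", "flow_m3s": 2.4},
    {"station_id": "S1", "timestamp": "2024-01-02T00:00:00", "flow_m3s": 3.3},
]

subset = slice_window(
    rows,
    start=datetime(2024, 1, 1),
    end=datetime(2024, 1, 2),
    keep_ids={"S1"},
)
print(subset)
```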

4.1.3.4 Shared Test Dataset Standards

Datasets shared across teams or environments must meet a minimum set of expectations so that they stay fit for purpose, protect sensitive information, and remain manageable and reusable throughout the testing lifecycle.

Teams must ensure:

Purpose and source are documented (a one-line description is sufficient).
No raw personal data is included (unless formally approved and isolated for short-lived debugging).
Relationships are intact (no broken references for integration/system tests).
Dataset size is proportionate (avoid full production copies).
Retention is defined (e.g., regenerated per release, or retired after use).
Access is limited to those needing it for the test objective.
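
Several of these expectations can be checked automatically if each shared dataset carries a small metadata record. The sketch below is one possible shape for such a check; the metadata keys (purpose, source, contains_personal_data, retention, access_group, approval_reference) are assumptions chosen to mirror the list above.

```python
# Hypothetical metadata keys that every shared dataset is expected to declare.
REQUIRED_METADATA = ("purpose", "source", "contains_personal_data", "retention", "access_group")


def check_metadata(metadata):
    """Return a list of problems; an empty list means the minimum checks pass."""
    problems = [
        f"missing: {field}"
        for field in REQUIRED_METADATA
        if metadata.get(field) in (None, "")
    ]
    if metadata.get("contains_personal_data") and not metadata.get("approval_reference"):
        problems.append("personal data present without approval_reference")
    return problems


good = {
    "purpose": "Regression data for meter aggregation",
    "source": "synthetic",
    "contains_personal_data": False,
    "retention": "regenerated per release",
    "access_group": "qa-team",
}
incomplete = {"purpose": "Debug extract", "source": "", "contains_personal_data": True}

print(check_metadata(good))
print(check_metadata(incomplete))
```

A check like this does not replace review, but it makes missing documentation or unapproved personal data visible before a dataset is shared.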