Monitoring and Observability¶

Version History¶

Version	Date	Description/Updates
1.0	7.04.2026	First observability and monitoring guidelines release.

Introduction¶

The document describes how monitoring and observability should be implemented.

Monitoring and observability are crucial software development practices that help teams maintain the health, performance, and reliability of their applications and infrastructure. While closely related, the two terms are distinct in scope and purpose.

Contacts¶

Guidelines owner/driver: Bogdan Kurbetyev
Task force:
Stakeholders:

Roles to Apply¶

Development roles:

Architects
Backend developers
Frontend developers
Infrastructure/DevOps Engineers
Product Manager/Product Owners

Enforcement Levels¶

Section	Enforcement Level	When/what
Minimum Production Baseline	Important	New and existing projects
Monitoring (reactive)	Important	New and existing projects
Observability (proactive)	Important	New and existing projects
Usage & Monetization Intelligence	Optional	Only when projects demand it
Checklist	Recommended	New and existing projects

Importance Overview¶

Section	Who	When
Minimum Production Baseline	All roles	Before any production deployment
Monitoring (reactive)	Architects, Infrastructure/DevOps Engineers	When setting up a new project or service
Observability (proactive)	Architects, Infrastructure/DevOps Engineers	When setting up distributed or complex systems
Usage & Monetization Intelligence	Architects, Product Manager/Product Owners	When planning product analytics and licensing
Checklist	Architects, Product Manager/Product Owners	When planning monitoring strategy for a project

References¶

Minimum Production Baseline¶

[!IMPORTANT] These requirements are mandatory for all production deployments at DHI. A more extensive checklist is available at the bottom of this document.

Logging (Required)¶

[ ] Structured logging with timestamps and correlation IDs
[ ] Log levels: ERROR, WARN, INFO, DEBUG properly configured
[ ] Unhandled exception logging enabled
[ ] Sensitive data (tokens, passwords, PII) masked or excluded
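
As an illustration of the first item, the following is a minimal sketch of structured logging with timestamps and correlation IDs, using only the Python standard library. The names `JsonFormatter`, `correlation_id`, and `build_logger` are hypothetical and not part of any DHI library; projects on .NET or TypeScript would use the equivalent features of their own logging stack.

```python
import json
import logging
from contextvars import ContextVar
from datetime import datetime, timezone

# Correlation ID of the request currently being handled (hypothetical name).
correlation_id = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, level=logging.INFO):
    """Create a logger that writes JSON lines to stderr at the given level."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

A request handler would call `correlation_id.set(...)` once per incoming request, after which every log line written while handling that request carries the same ID.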

Metrics (Required)¶

[ ] Error rate monitored
[ ] CPU, memory, and disk utilization tracked
[ ] Request latency (p50, p95, p99) measured
[ ] Health endpoint (/health or equivalent) available to internal teams
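
To make the error rate and latency percentile items concrete, the sketch below derives both from a batch of request samples. It assumes that "error" means an HTTP 5xx response and that samples are available in memory; `RequestSample`, `error_rate`, and `latency_percentiles` are illustrative names. In practice a platform such as Azure Monitor or Prometheus computes these values for you.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestSample:
    duration_ms: float
    status_code: int


def error_rate(samples):
    """Fraction of requests that ended in a server error (HTTP 5xx)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status_code >= 500)
    return errors / len(samples)


def latency_percentiles(samples):
    """Return p50, p95, and p99 latency in milliseconds."""
    durations = [s.duration_ms for s in samples]
    if len(durations) < 2:
        raise ValueError("at least two samples are needed to compute percentiles")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Percentiles are preferred over averages here because a small number of very slow requests can hide behind a healthy-looking mean.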

Alerting (Required)¶

[ ] Alert for service unavailability (health check failures)
[ ] Alert for error rate exceeding threshold (e.g., >1%)
[ ] Alert for latency exceeding SLO (e.g., p95 > 500ms)
[ ] On-call contact or escalation path documented
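
The alert conditions above can be expressed as simple threshold rules. The following sketch uses the example thresholds from this checklist (error rate above 1%, p95 latency above 500 ms); the `AlertRule` type, the metric names, and the `evaluate` function are assumptions made for illustration, and real thresholds should be agreed per project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float


# Thresholds mirror the examples in the checklist; real values are per project.
RULES = [
    AlertRule("High error rate", "error_rate", 0.01),
    AlertRule("Latency SLO breach", "p95_latency_ms", 500.0),
]


def evaluate(metrics, healthy, rules=RULES):
    """Return the names of all alerts that should fire for one evaluation."""
    fired = []
    if not healthy:
        fired.append("Service unavailable")
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append(rule.name)
    return fired
```

For example, `evaluate({"error_rate": 0.02, "p95_latency_ms": 320.0}, healthy=True)` reports only the error rate alert, because latency is still within its threshold.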

Tracing (Recommended for Distributed Systems)¶

[ ] Azure Monitor tracing enabled
[ ] Request correlation across service boundaries
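
Request correlation across service boundaries usually works by forwarding a trace identifier in a request header. The sketch below assumes the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`) and lowercase header keys; the function names are hypothetical. In a real service this is normally handled by an instrumentation library rather than written by hand.

```python
import re
import secrets

# version 00, 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Continue an incoming trace with a new span ID, or start a new trace."""
    match = TRACEPARENT.match(incoming or "")
    if match is None:
        return new_traceparent()
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Headers to attach to a downstream call made while handling a request."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}
```

Because the trace ID is preserved while the span ID changes at every hop, all services touched by one request can be grouped under the same trace.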

Retention & Compliance¶

[ ] Log retention period defined and compliant with local regulations
[ ] GDPR restrictions adhered to

Terminology¶

Term	Definition
Monitoring	The practice of collecting, processing, and analyzing predefined metrics and logs to detect known issues and system failures. It is reactive in nature, focusing on alerting when specific thresholds or conditions are breached.
Observability	The ability to understand the internal state of a system by examining its outputs (logs, metrics, traces). It is proactive, enabling exploration of unknown issues and understanding why problems occur, especially in distributed systems.
Logs	Immutable, timestamped records of discrete events within a system. Logs provide detailed, textual context about what happened at specific points in time, essential for debugging and auditing.
Metrics	Numeric data aggregated over time that represents system behavior and performance (e.g., CPU usage, memory consumption, request latency, error rates). Metrics enable trend analysis and alerting.
Traces	Records that track the journey of a request as it propagates through various services and components in a distributed system. Traces help identify bottlenecks and understand request flows across microservices.
KPI (Key Performance Indicator)	A measurable value that demonstrates how effectively a system, application, or business process is achieving key objectives. In software, examples include page load times, error rates, and user engagement metrics.
SLI (Service Level Indicator)	A quantitative measure of a specific aspect of service performance (e.g., availability, latency, throughput). SLIs provide the raw data used to evaluate service health.
SLO (Service Level Objective)	A target value or range for an SLI that defines the desired performance level of a service (e.g., "99.9% availability" or "p95 latency < 200ms"). SLOs guide reliability engineering efforts.
SLA (Service Level Agreement)	A formal contract between a service provider and customers that defines the expected service levels, including SLOs, and the consequences (penalties or remedies) if those levels are not met.
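
The relationship between an SLI and an SLO can be made concrete with a little arithmetic. The sketch below converts an availability SLO into the amount of downtime it tolerates, often called an error budget (a common industry term, not one defined by this guideline). The 30-day window and the function names are assumptions.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of unavailability an availability SLO allows within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - downtime_minutes) / budget
```

With these assumptions, a "99.9% availability" SLO over 30 days allows roughly 43.2 minutes of downtime, and the measured downtime (the SLI) shows how much of that allowance has been used.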

DHI Specifics¶

DHI is involved in projects with a distributed architecture, different technologies, and across the entire tech spectrum: web, mobile and desktop. Modern observability and monitoring tools generally offer support for all the above, but it is worth noting the differences.

Web & Distributed¶

DHI digital solutions and services are built on a modern web tech stack using .NET (C#), Python, and TypeScript (React). While this tech stack allows DHI developers to lean on available OSS tools, bugs, vulnerabilities, and performance issues are to be expected. These issues can and should be addressed by implementing monitoring and observability tooling.

Mobile & Desktop¶

The MIKE software suite is built on a robust foundation of C++, C#, and Fortran with some Python integration. This tech stack offers powerful tools for building high-quality, performant software, but it comes with inherent complexity and potential issues. These challenges can and should be addressed by implementing monitoring and observability tooling.

Benefits of Monitoring and Observability¶

The practice of observability and monitoring is crucial for ensuring reliability, performance, security, and excellent user experience.

Faster Issue Detection
Allows developers to spot and address performance issues, errors, security threats, and generic bugs before they significantly affect end users.
Faster Troubleshooting
Allows developers to promptly pinpoint the root cause of an issue through comprehensive insight into the application via logs, metrics, and traces, significantly reducing time to resolution.
Enhanced Security
Allows developers to detect and mitigate security vulnerabilities by monitoring unusual behavior, such as spikes in failed login attempts, unauthorized data access points, and higher than usual traffic indicating DDoS attacks.
Improved User Experience
Achieves higher user satisfaction and customer retention by tracking KPIs such as page load times, error rates, and user interactions, allowing developers to improve the software by addressing issues as they arise.
Improved Resource Utilisation
Reduces infrastructure costs by identifying and addressing bottlenecks and inefficiencies through observing resource usage, such as CPU load, memory utilization, and bandwidth.

Monitoring (reactive)¶

Monitoring means setting up specific tools, dashboards, metrics, and alerts in advance so that, when a known type of problem occurs, the team is notified and knows what to look for.

Monitoring is the process of collecting, processing, and analysing predefined metrics from your systems and applications. It is about detecting system failures or performance issues and alerting on them.

Monitoring can be divided into the following categories.

Application Performance Monitoring
Tracks application performance, slow response times, excessive error rates, and user interactions. This includes monitoring CPU, memory, disk thresholds, network, and other infrastructure resources across microservices, containers (Docker), Kubernetes, and serverless environments.
Error and Crash Monitoring
Focused on capturing runtime errors and crashes.

Examples¶

A crash occurs when CPU usage exceeds 90%, which then triggers a system alert.
API response time exceeds its threshold, which then triggers a service degradation alert.
Free disk space falls below 10%, which then triggers an alert that the infrastructure needs updating.

Tools¶

Below is a list of tools to consider for traditional monitoring, focusing on known issues and pre-defined metrics.

[!NOTE] Default DHI (Recommended Approach)

The tools below are already in use at DHI and are recommended for new and existing projects.

Tool	Description	Licensing
Microsoft Azure Monitor	Broad, full-stack observability platform	Commercial
Grafana	Dashboarding and Visualization	Open Source & commercial
Prometheus	Time-series Metrics collection	Open Source
Revenera	Monetization, licensing and compliance & more	Commercial

[!WARNING] Restricted

The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.

Tool	Description	Licensing
Nagios	Network monitoring	Open Source & commercial
Zabbix	Alerts for infrastructure and app metrics	Open Source

What these tools solve:¶

System monitoring
Software business management (Revenera), such as:
Monetisation
Licensing
Compliance

Observability (proactive)¶

Observability is the ability to understand why something happens in a system by analysing logs, metrics, and traces, especially for issues that were not anticipated in advance. Observability is most critical for distributed systems where requests cross multiple services.

Observability helps teams understand the internal state of a system from its external outputs over time, so that the cause of issues can be found through metrics, logs, and traces.

The three pillars of observability:

Logs
Immutable, timestamped records of discrete events, e.g. textual records of events happening in the system.
Metrics
Numeric data that represents system behavior over time (e.g., CPU usage, memory, response time).
Traces
A record of the journey of a request as it travels through services or workflows, for observing user behavior or real user sessions.

Examples¶

Investigating a spike in latency with no clear alert triggered.
Understanding user behavior patterns across microservices.
Correlating logs and traces to find the root cause of a failed transaction.
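
The last example, correlating logs and traces, can be sketched as a join on a shared trace ID. The record shapes below (dictionaries with `trace_id`, `status`, `level`, `service`, and `message` keys) are assumptions for illustration; real observability platforms perform this correlation through their own query languages.

```python
from collections import defaultdict


def failed_trace_ids(spans):
    """Trace IDs that contain at least one span marked as failed."""
    return {span["trace_id"] for span in spans if span["status"] == "error"}


def logs_for_failed_traces(spans, logs):
    """Group ERROR log lines by trace ID, but only for traces that failed."""
    failed = failed_trace_ids(spans)
    grouped = defaultdict(list)
    for entry in logs:
        if entry["trace_id"] in failed and entry["level"] == "ERROR":
            grouped[entry["trace_id"]].append(f'{entry["service"]}: {entry["message"]}')
    return dict(grouped)


# Hypothetical sample data: one failed transaction (t1) and one healthy one (t2).
spans = [
    {"trace_id": "t1", "service": "checkout", "status": "ok"},
    {"trace_id": "t1", "service": "payments", "status": "error"},
    {"trace_id": "t2", "service": "checkout", "status": "ok"},
]
logs = [
    {"trace_id": "t1", "service": "payments", "level": "ERROR", "message": "card gateway timeout"},
    {"trace_id": "t1", "service": "checkout", "level": "INFO", "message": "order created"},
    {"trace_id": "t2", "service": "checkout", "level": "ERROR", "message": "unrelated warning"},
]
root_cause_candidates = logs_for_failed_traces(spans, logs)
```

Here only the `payments` error from trace `t1` is returned, which narrows the investigation to the service where the failed transaction actually broke.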

Tools¶

These tools are built for modern, complex, and distributed systems offering insight into the above-mentioned pillars.

[!NOTE] Default DHI (Recommended Approach)

The tools below are already in use at DHI and are recommended for new and existing projects.

Tool	Description	Licensing
Microsoft Azure Monitor	Broad, full-stack observability platform	Commercial
Jaeger	Tracing observability platform	Open Source

[!WARNING] Restricted

The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.

Tool	Description	Licensing
Sentry	Error and crash reporting	Commercial
Bugsnag	Error and crash reporting	Commercial
New Relic	Comprehensive observability platform	Commercial
Datadog	Comprehensive observability platform	Commercial
Elastic Stack	Integrations tools	Commercial
Honeycomb	Distributed services observability	Commercial
Lightstep	Logs, metrics and traces insights	Commercial
Zipkin	Traces, logs and metrics monitoring	Open Source
Splunk	Enterprise monitoring solution	Commercial
Dynatrace	AI-powered monitoring and observability	Commercial
Better Stack	AI-powered incident response solution	Commercial

What these tools solve¶

Modern software observability
Distributed tracing, such as:
Error logs and reporting
Crash logs and reporting
Application tracing and reporting

Usage & Monetization Intelligence¶

While not strictly observability or monitoring, usage and monetization intelligence is critical for transforming raw usage data into actionable business insights.

Tooling¶

[!NOTE] Default DHI (Recommended Approach)

The tools below are already in use at DHI and are recommended for new and existing projects.

Tool	Description	Licensing
Revenera	Monetization intelligence, licensing, usage intelligence	Commercial
Matomo	Software usage analytics	Open Source & commercial

[!WARNING] Restricted

The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.

Tool	Description	Licensing
Kubit	User behavior and usage intelligence	Commercial
Smartlook	User behavior and usage intelligence	Commercial
Zylo	Management of software assets, licensing and subscriptions	Commercial
Flexera	Management of software assets, licensing and subscriptions	Commercial
Plausible Analytics	Software usage analytics	Open Source & commercial
Fathom Analytics	Software usage analytics	Open Source & commercial
Mixpanel	Software usage analytics	Commercial
Hotjar	Software usage analytics	Commercial

What these tools solve¶

Tracking feature adoption
Monitoring usage and consumption
Tracking and analysis of install and user base
Licensing enforcement
Subscription control

Checklist¶

The key steps checklist below may be used as a guideline to assist POs, PMs, and architects with decision making. Where applicable, make sure these steps are discussed when decisions are made for the project or product at hand.

[!NOTE]

This checklist is an optional guideline and not a rule set. It is recommended for reading and understanding. Feel free to adjust it to make it fit and work for your project.

Define Key Metrics & Objectives¶

[ ] Identify critical components (for the project)

For example:

- Database performance
- API response times
- Infrastructure costs

[ ] Define KPIs & Service Level Objectives (SLOs, i.e. targets)

As minimum standards are difficult to quantify across projects and tech stacks, the following critical components should be decided on a per-project basis.

For example:

- Acceptable error rates
- Acceptable latency
- Critical paths
- Acceptable response rates for API
- Acceptable execution time for DB queries
- Other vital metrics that may be relevant to the project at hand

Select & Configure Tooling¶

[ ] Choose a platform

- Refer to the Tools sections under Monitoring (reactive) and Observability (proactive) for tooling options

[ ] Support logging with tracing

- Ensure your application generates detailed, structured logs with timestamps
- Ensure your application collects the metrics defined by the KPIs & SLOs

Implementation¶

[ ] Code Instrumentation

For example:

- Implement functionality in applications to emit logs, metrics, and traces
- Implement unhandled exception monitoring
- Implement handled exception monitoring
- Implement application crash reporting
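
As one possible approach to unhandled exception monitoring in a Python application, the sketch below installs a process-wide hook that logs any exception reaching the top of the main thread. The logger name `crash-reporter` and the `install` helper are hypothetical. Exceptions raised in other threads are not covered by `sys.excepthook` and would need `threading.excepthook` in addition.

```python
import logging
import sys

logger = logging.getLogger("crash-reporter")  # hypothetical logger name


def log_unhandled(exc_type, exc_value, exc_traceback):
    """Log any exception that reaches the top of the main thread."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Let Ctrl+C terminate the process as it does by default.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logger.critical("Unhandled exception", exc_info=(exc_type, exc_value, exc_traceback))


def install():
    """Replace the default hook so crashes end up in the application log."""
    sys.excepthook = log_unhandled
```

Calling `install()` once at startup is enough; the full traceback is then recorded by whichever handlers the application has attached to its loggers.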

[ ] Setup Alerts & Digests

For example:

- Configure alerts to notify parties when SLOs are at risk or when an issue occurs
- Configure scheduled digests of system performance for proactive monitoring

[ ] Emergency Documentation

For example:

- Document clear, step-by-step instructions for:
    - Responding to common alerts
    - Handling critical issues
    - Handling emergency scenarios

Data Retention Compliance¶

[ ] Investigate and adhere to local laws & GDPR compliance

As DHI offers its consulting solutions across the globe, defining a concrete set of rules when it comes to data retention and compliance is impossible. As a minimum, POs/PMs must consult local laws and regulations to fully comply with data retention rules.

For example:

- Allowed log retention duration (days, months, years, etc.)
- Allowed log content (usernames, passwords, dates, geolocations, IPs, etc.)
- Ensure the following sensitive data is masked or redacted from retained logs:
    - Tokens
    - Secrets
    - Passwords
    - Session cookies
    - Environment variables
    - Authentication headers
    - PII (Personally Identifiable Information)
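
Masking can be applied before a log line is ever written. The following is a minimal sketch of a redacting logging filter in Python; the two patterns are illustrative only and far from exhaustive, and the names `redact` and `RedactingFilter` are hypothetical. Each project should extend the patterns to the sensitive data it actually handles.

```python
import logging
import re

# Illustrative patterns only; extend them to match the data your project handles.
SENSITIVE_KEY = re.compile(
    r"(password|secret|token|authorization|cookie)(\s*[=:]\s*)((?:bearer\s+|basic\s+)?\S+)",
    re.IGNORECASE,
)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Mask values of sensitive keys and e-mail addresses in a log message."""
    text = SENSITIVE_KEY.sub(lambda m: f"{m.group(1)}{m.group(2)}***", text)
    return EMAIL.sub("<email>", text)


class RedactingFilter(logging.Filter):
    """Logging filter that rewrites each record's message before it is emitted."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = None  # arguments are already merged into the message
        return True
```

Attaching the filter to a handler with `handler.addFilter(RedactingFilter())` applies the masking to every record that handler emits, so individual call sites do not need to remember to redact.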

Conclusion¶

By providing deep insights into alerts signaling problems and ways to resolve them, the above-mentioned practices help DHI maintain a stable suite of applications with a smooth user experience and low operational costs. Implementing dedicated solutions to track product usage provides significant business advantages and should not be overlooked.