Monitoring and Observability

Version History

| Version | Date | Description/Updates |
| --- | --- | --- |
| 1.0 | 7.04.2026 | First observability and monitoring guidelines release. |

Introduction

This document describes how monitoring and observability should be implemented.

Monitoring and observability are crucial software development practices that help teams maintain the health, performance, and reliability of their applications and infrastructure. While closely related, the two terms are distinct in scope and purpose.

Contacts

Roles to Apply

Development roles:

  • Architects
  • Backend developers
  • Frontend developers
  • Infrastructure/DevOps Engineers
  • Product Manager/Product Owners

Enforcement Levels

| Section | Enforcement Level | When/what |
| --- | --- | --- |
| Minimum Production Baseline | Important | New and existing projects |
| Monitoring (reactive) | Important | New and existing projects |
| Observability (proactive) | Important | New and existing projects |
| Usage & Monetization Intelligence | Optional | Only when projects demand it |
| Checklist | Recommended | New and existing projects |

Importance Overview

| Section | Who | When |
| --- | --- | --- |
| Minimum Production Baseline | All roles | Before any production deployment |
| Monitoring (reactive) | Architects, Infrastructure/DevOps Engineers | When setting up a new project or service |
| Observability (proactive) | Architects, Infrastructure/DevOps Engineers | When setting up distributed or complex systems |
| Usage & Monetization Intelligence | Architects, Product Manager/Product Owners | When planning product analytics and licensing |
| Checklist | Architects, Product Manager/Product Owners | When planning monitoring strategy for a project |

References

Minimum Production Baseline

[!IMPORTANT] These requirements are mandatory for all production deployments at DHI. A more extensive checklist is available at the bottom of this document.

Logging (Required)

  • [ ] Structured logging with timestamps and correlation IDs
  • [ ] Log levels: ERROR, WARN, INFO, DEBUG properly configured
  • [ ] Unhandled exception logging enabled
  • [ ] Sensitive data (tokens, passwords, PII) masked or excluded
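
As an illustration of the logging items above, the following is a minimal sketch of structured (JSON) logging with UTC timestamps and a correlation ID, using only the Python standard library. The names `correlation_id`, `JsonFormatter`, and `build_logger` are hypothetical and chosen for this example; a real project would more likely rely on its chosen logging or telemetry library.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the formatted traceback when an exception is attached.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Create (or reuse) a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

A request handler would typically call `correlation_id.set(...)` once per incoming request, after which every `build_logger("orders").info("order accepted")` call in that context carries the same ID.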

Metrics (Required)

  • [ ] Error rate monitored
  • [ ] CPU, memory, and disk utilization tracked
  • [ ] Request latency (p50, p95, p99) measured
  • [ ] Health endpoint (/health or equivalent) available to internal teams
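
To make the latency and error-rate items concrete, here is a small sketch of how p50, p95, and p99 could be derived from raw request durations. It uses the nearest-rank percentile definition, which is one of several common definitions; monitoring platforms may interpolate or use histograms instead. All function names are illustrative.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return the three percentiles named in the baseline checklist."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed_requests / total_requests if total_requests else 0.0
```

For example, `latency_summary` applied to the durations 1 ms through 100 ms reports p50 = 50, p95 = 95, and p99 = 99.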

Alerting (Required)

  • [ ] Alert for service unavailability (health check failures)
  • [ ] Alert for error rate exceeding threshold (e.g., >1%)
  • [ ] Alert for latency exceeding SLO (e.g., p95 > 500ms)
  • [ ] On-call contact or escalation path documented
  • [ ] Azure Monitor tracing enabled
  • [ ] Request correlation across service boundaries
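
The alert conditions listed above can be expressed as a simple rule evaluation. The sketch below assumes the example thresholds from the checklist (error rate above 1%, p95 latency above 500 ms); the class, function, and alert names are hypothetical, and in practice such rules would normally be configured in the alerting platform rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    # Example values from the checklist; real thresholds are decided per project.
    max_error_rate: float = 0.01        # 1%
    max_p95_latency_ms: float = 500.0   # SLO example: p95 > 500 ms


def evaluate_alerts(
    healthy: bool,
    error_rate: float,
    p95_latency_ms: float,
    thresholds: AlertThresholds = AlertThresholds(),
) -> list[str]:
    """Return the names of all alerts that should fire for the given readings."""
    alerts = []
    if not healthy:
        alerts.append("service_unavailable")
    if error_rate > thresholds.max_error_rate:
        alerts.append("error_rate_exceeded")
    if p95_latency_ms > thresholds.max_p95_latency_ms:
        alerts.append("latency_slo_breached")
    return alerts
```

The comparisons are strict, so a reading exactly at the threshold does not fire; whether that is the desired behaviour is a per-project decision.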

Retention & Compliance

  • [ ] Log retention period defined and compliant with local regulations
  • [ ] GDPR restrictions adhered to

Terminology

| Term | Definition |
| --- | --- |
| Monitoring | The practice of collecting, processing, and analyzing predefined metrics and logs to detect known issues and system failures. It is reactive in nature, focusing on alerting when specific thresholds or conditions are breached. |
| Observability | The ability to understand the internal state of a system by examining its outputs (logs, metrics, traces). It is proactive, enabling exploration of unknown issues and understanding why problems occur, especially in distributed systems. |
| Logs | Immutable, timestamped records of discrete events within a system. Logs provide detailed, textual context about what happened at specific points in time, essential for debugging and auditing. |
| Metrics | Numeric data aggregated over time that represents system behavior and performance (e.g., CPU usage, memory consumption, request latency, error rates). Metrics enable trend analysis and alerting. |
| Traces | Records that track the journey of a request as it propagates through various services and components in a distributed system. Traces help identify bottlenecks and understand request flows across microservices. |
| KPI (Key Performance Indicator) | A measurable value that demonstrates how effectively a system, application, or business process is achieving key objectives. In software, examples include page load times, error rates, and user engagement metrics. |
| SLI (Service Level Indicator) | A quantitative measure of a specific aspect of service performance (e.g., availability, latency, throughput). SLIs provide the raw data used to evaluate service health. |
| SLO (Service Level Objective) | A target value or range for an SLI that defines the desired performance level of a service (e.g., "99.9% availability" or "p95 latency < 200ms"). SLOs guide reliability engineering efforts. |
| SLA (Service Level Agreement) | A formal contract between a service provider and customers that defines the expected service levels, including SLOs, and the consequences (penalties or remedies) if those levels are not met. |
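
The relationship between an SLI and an SLO can be illustrated with a short calculation. The sketch below assumes an availability SLI measured as the share of successful requests and a 30-day evaluation window; both are assumptions made for the example, and the function names are hypothetical. The allowed downtime computed here is often referred to as an error budget.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests served successfully."""
    return successful_requests / total_requests if total_requests else 1.0


def slo_met(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO tolerates within the window."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes
```

Under these assumptions, a 99.9% availability SLO over 30 days tolerates roughly 43.2 minutes of downtime, which gives a sense of how demanding a given target is before it is written into an SLA.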

DHI Specifics

DHI works on projects with distributed architectures and varied technologies across the entire tech spectrum: web, mobile, and desktop. Modern observability and monitoring tools generally support all of these, but the differences are worth noting.

Web & Distributed

DHI digital solutions and services are built on a modern web tech stack using .NET (C#), Python, and TypeScript (React). While this tech stack allows DHI developers to lean on available OSS tools, bugs, vulnerabilities, and performance issues are to be expected. These issues can and should be addressed by implementing monitoring and observability tooling.

Mobile & Desktop

The MIKE software suite is built on a robust foundation of C++, C#, and Fortran, with some Python integration. This tech stack offers powerful tools for building high-quality, performant software, but it comes with inherent complexity and potential issues. These challenges can and should be addressed by implementing monitoring and observability tooling.


Benefits of Monitoring and Observability

The practice of observability and monitoring is crucial for ensuring reliability, performance, security, and excellent user experience.

  • Faster Issue Detection
    Allows developers to spot and address performance issues, errors, security threats, and generic bugs before they significantly impact end users.

  • Faster Troubleshooting
    Allows developers to promptly pinpoint the root cause of an issue through comprehensive insight into the application via logs, metrics, and traces, significantly reducing the time to resolution.

  • Enhanced Security
    Allows developers to detect and mitigate security vulnerabilities by monitoring unusual behavior, such as spikes in failed login attempts, unauthorized data access points, and higher than usual traffic indicating DDoS attacks.

  • Improved User Experience
    Raises user satisfaction and customer retention by tracking KPIs such as page load times, error rates, and user interactions, allowing developers to improve the software by addressing issues as they arise.

  • Improved Resource Utilisation
    Reducing infrastructure costs by identifying and addressing bottlenecks and inefficiencies through observing resource usage, such as CPU load, memory utilization, and bandwidth.

Monitoring (reactive)

Monitoring means setting up specific tools, dashboards, metrics, and alerts in advance, so that when a known kind of problem occurs it is detected and can be addressed by people who know what to look for.

Monitoring is the process of collecting, processing, and analysing predefined metrics from your systems and applications. It is about detecting, and alerting on, system failures or performance issues.

Monitoring can be divided into the following categories.

  • Application Performance Monitoring
    Tracks application performance, slow response times, elevated error rates, and user interactions. It also covers infrastructure resources such as CPU, memory, disk thresholds, and network, across microservices, containers (Docker), Kubernetes, and serverless environments.

  • Error and Crash Monitoring
    Focused on capturing runtime errors and crashes.

Examples

  • A crash occurs when CPU usage exceeds 90%, which triggers a system alert.

  • API response time exceeds its threshold, which triggers a service degradation alert.

  • Free disk space falls below 10%, which triggers an alert that the infrastructure needs upgrading.
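
The CPU and disk examples above amount to comparing readings against predefined thresholds. The following sketch shows that comparison, plus one way to read free disk space with the standard library. The function names and alert texts are hypothetical, and the CPU reading is passed in as a parameter because obtaining it portably usually requires a platform agent or third-party library.

```python
import shutil


def read_disk_free_percent(path: str = ".") -> float:
    """Percentage of free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def check_thresholds(cpu_percent: float, disk_free_percent: float) -> list[str]:
    """Return an alert message for each breached threshold from the examples."""
    alerts = []
    if cpu_percent > 90:
        alerts.append("system: CPU usage above 90%")
    if disk_free_percent < 10:
        alerts.append("infrastructure: free disk space below 10%")
    return alerts
```

A scheduled job could call `check_thresholds(cpu_reading, read_disk_free_percent())` and forward any returned messages to the alerting channel.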

Tools

Below is a list of tools to consider for traditional monitoring, focusing on known issues and pre-defined metrics.

[!NOTE] Default DHI (Recommended Approach)

The tools below are already in use at DHI and are recommended for new and existing projects.

| Tool | Description | Licensing |
| --- | --- | --- |
| Microsoft Azure Monitor | Broad, full-stack observability platform | Commercial |
| Grafana | Dashboarding and Visualization | Open Source & commercial |
| Prometheus | Time-series Metrics collection | Open Source |
| Revenera | Monetization, licensing and compliance & more | Commercial |

[!WARNING] Restricted

The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.

| Tool | Description | Licensing |
| --- | --- | --- |
| Nagios | Network monitoring | Open Source & commercial |
| Zabbix | Alerts for infrastructure and app metrics | Open Source |

What these tools solve:

  • System monitoring
  • Software business management (Revenera), such as:
      • Monetisation
      • Licensing
      • Compliance

Observability (proactive)

Observability is the ability to understand why something happens in a system by analysing logs, metrics, and traces, especially for issues that were not anticipated in advance. Observability is most critical for distributed systems where requests cross multiple services.

Observability helps teams understand the internal state of a system from its external outputs over time, so that the cause of an issue can be found through metrics, logs, and traces.

The three pillars of observability:

  • Logs
    Immutable, timestamped records of discrete events, e.g. textual records of events happening in the system.

  • Metrics
    Numeric data that represents system behavior over time (e.g., CPU usage, memory, response time).

  • Traces
    A record of the journey of a request as it travels through services or workflows,
    useful for observing user behavior or real user sessions.

Examples

  • Investigating a spike in latency with no clear alert triggered.
  • Understanding user behavior patterns across microservices.
  • Correlating logs and traces to find the root cause of a failed transaction.
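
The last example, correlating logs and traces, can be sketched as grouping structured log records by trace ID and looking for the earliest error within each trace. The record fields used here (`timestamp`, `trace_id`, `level`) are assumptions about the log schema, not a DHI standard, and the function names are illustrative.

```python
from collections import defaultdict


def logs_by_trace(records: list[dict]) -> dict[str, list[dict]]:
    """Group log records by trace ID, each group sorted by timestamp."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.get("trace_id", "unknown")].append(record)
    for entries in grouped.values():
        # ISO 8601 timestamps in the same timezone sort correctly as strings.
        entries.sort(key=lambda entry: entry["timestamp"])
    return dict(grouped)


def first_errors(records: list[dict]) -> dict[str, dict]:
    """For every trace that contains an ERROR, return its earliest ERROR record."""
    result = {}
    for trace_id, entries in logs_by_trace(records).items():
        first_error = next((e for e in entries if e["level"] == "ERROR"), None)
        if first_error is not None:
            result[trace_id] = first_error
    return result
```

The earliest error in a failed trace is often, though not always, closest to the root cause; later errors in the same trace tend to be consequences reported by calling services.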

Tools

These tools are built for modern, complex, and distributed systems offering insight into the above-mentioned pillars.

[!NOTE] Default DHI (Recommended Approach)

The tools below are already in use at DHI and are recommended for new and existing projects.

| Tool | Description | Licensing |
| --- | --- | --- |
| Microsoft Azure Monitor | Broad, full-stack observability platform | Commercial |
| Jaeger | Tracing observability platform | Open Source |

[!WARNING] Restricted

The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.

| Tool | Description | Licensing |
| --- | --- | --- |
| Sentry | Error and crash reporting | Commercial |
| Bugsnag | Error and crash reporting | Commercial |
| New Relic | Comprehensive observability platform | Commercial |
| Datadog | Comprehensive observability platform | Commercial |
| Elastic Stack | Integration tools | Commercial |
| Honeycomb | Distributed services observability | Commercial |
| Lightstep | Logs, metrics and traces insights | Commercial |
| Zipkin | Traces, logs and metrics monitoring | Open Source |
| Splunk | Enterprise monitoring solution | Commercial |
| Dynatrace | AI-powered monitoring and observability | Commercial |
| Better Stack | AI-powered incident response solution | Commercial |

What these tools solve

  • Modern software observability
  • Distributed tracing, such as:
      • Error logs and reporting
      • Crash logs and reporting
      • Application tracing and reporting

Usage & Monetization Intelligence

While not strictly observability or monitoring, usage and monetization intelligence is critical for transforming raw usage data into actionable business insights.

Tooling

[!NOTE] Default DHI (Recommended Approach)

The tools below are already in use at DHI and are recommended for new and existing projects.

| Tool | Description | Licensing |
| --- | --- | --- |
| Revenera | Monetization intelligence, licensing, usage intelligence | Commercial |
| Matomo | Software usage analytics | Open Source & commercial |

[!WARNING] Restricted

The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.

| Tool | Description | Licensing |
| --- | --- | --- |
| Kubit | User behavior and usage intelligence | Commercial |
| Smartlook | User behavior and usage intelligence | Commercial |
| Zylo | Management of software assets, licensing and subscriptions | Commercial |
| Flexera | Management of software assets, licensing and subscriptions | Commercial |
| Plausible Analytics | Software usage analytics | Open Source & commercial |
| Fathom Analytics | Software usage analytics | Open Source & commercial |
| Mixpanel | Software usage analytics | Commercial |
| Hotjar | Software usage analytics | Commercial |

What these tools solve

  • Tracking feature adoption
  • Monitoring usage and consumption
  • Tracking and analysis of install and user base
  • Licensing enforcement
  • Subscription control
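
As a rough illustration of what tracking feature adoption involves, the sketch below computes, from a list of usage events, the share of the user base that has used each feature at least once. The event shape (`(user_id, feature)` pairs) and the function name are assumptions made for this example; the tools listed above collect and aggregate such data themselves, and user identifiers should be pseudonymous.

```python
from collections import defaultdict


def feature_adoption(events: list[tuple[str, str]], total_users: int) -> dict[str, float]:
    """Fraction of all users that used each feature at least once."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    users_per_feature = defaultdict(set)
    for user_id, feature in events:
        # A set counts each user once, however often they used the feature.
        users_per_feature[feature].add(user_id)
    return {
        feature: len(users) / total_users
        for feature, users in users_per_feature.items()
    }
```

Counting distinct users rather than raw events keeps a single heavy user from inflating the adoption figure for a feature.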

Checklist

The Key Steps Checklist below may be used as a guideline to assist POs, PMs, and architects with decision making. Where applicable, these steps should be discussed when decisions are made for the project or product at hand.

[!NOTE]

This checklist is an optional guideline and not a rule set. It is recommended for reading and understanding. Feel free to adjust it to make it fit and work for your project.

Define Key Metrics & Objectives

[ ] Identify critical components (for the project)

For example:

  • Database performance
  • API response times
  • Infrastructure costs

[ ] Define KPIs & Service Level Objectives (SLOs, i.e. targets)

As minimum standards are difficult to quantify across projects and tech stacks, the following critical components should be decided on a per-project basis.

For example:

  • Acceptable error rates
  • Acceptable latency
  • Critical paths
  • Acceptable response rates for API
  • Acceptable execution time for DB queries
  • Other vital metrics that may be relevant to the project at hand

Select & Configure Tooling

[ ] Choose a platform

  • Refer to the Tools subsections under Monitoring (reactive) and Observability (proactive) for tooling options

[ ] Support logging with tracing

  • Ensure your application generates detailed, structured logs with timestamps
  • Ensure your application collects the metrics defined by the KPIs & SLOs

Implementation

[ ] Code Instrumentation

For example:

  • Implement functionality in applications to emit logs, metrics, and traces
  • Implement unhandled exception monitoring
  • Implement handled exception monitoring
  • Implement application crash reporting
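
Unhandled exception monitoring, mentioned in the examples above, can be sketched in Python by installing a process-wide exception hook that logs the failure before the process exits. The logger name and function names are hypothetical; a crash-reporting SDK would typically install an equivalent hook itself.

```python
import logging
import sys

# Hypothetical logger name; route it to the project's structured log output.
logger = logging.getLogger("crash_reporter")


def log_unhandled_exception(exc_type, exc_value, exc_traceback):
    """Log any exception that reaches the top of the main thread."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Let Ctrl+C end the process without producing a crash report.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logger.critical(
        "Unhandled exception",
        exc_info=(exc_type, exc_value, exc_traceback),
    )


def install_crash_logging() -> None:
    """Replace the default hook so unhandled exceptions are logged."""
    sys.excepthook = log_unhandled_exception
```

Calling `install_crash_logging()` once at application start-up is enough for the main thread. Exceptions raised in other threads go through `threading.excepthook` instead and need a separate hook.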

[ ] Setup Alerts & Digests

For example:

  • Configure alerts to notify parties when SLOs are at risk or when an issue occurs
  • Configure scheduled digests of system performance for proactive monitoring

[ ] Emergency Documentation

For example:

  • Document clear, step-by-step instructions for:
      • Responding to common alerts
      • Handling critical issues
      • Handling emergency scenarios

Data Retention Compliance

[ ] Investigate and adhere to local laws & GDPR compliance

As DHI offers its consulting solutions across the globe, defining a concrete set of rules when it comes to data retention and compliance is impossible. As a minimum, POs/PMs must consult local laws and regulations to fully comply with data retention rules.

For example:

  • Allowed log retention duration (days, months, years, etc.)
  • Allowed log content (usernames, passwords, dates, geolocations, IPs, etc.):
      • Ensure the following sensitive data is masked or redacted from retained logs:
          • Tokens
          • Secrets
          • Passwords
          • Session cookies
          • Environment variables
          • Authentication headers
          • PII (Personally Identifiable Information)
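
Masking of the sensitive values listed above can be applied before a record is written to the log. The following is a minimal sketch that redacts values by key name and masks bearer tokens embedded in free text. The key list, the mask text, and the function name are assumptions made for this example; PII in particular usually needs project-specific rules that key matching alone will not cover.

```python
import re

# Hypothetical list of key names whose values must never be logged.
SENSITIVE_KEYS = {"password", "token", "secret", "authorization", "cookie", "set-cookie", "api_key"}
MASK = "***REDACTED***"
_BEARER = re.compile(r"\bbearer\s+[a-z0-9._~+/=-]+", re.IGNORECASE)


def redact(value):
    """Return a copy of `value` with sensitive entries masked."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    if isinstance(value, str):
        # Mask bearer tokens that appear inside free-text messages.
        return _BEARER.sub("Bearer " + MASK, value)
    return value
```

Applying `redact` to the structured payload just before serialization keeps the masking in one place instead of relying on every call site to remember it.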


Conclusion

By surfacing alerts that signal problems and providing the insight needed to resolve them, the above-mentioned practices help DHI maintain a stable suite of applications with a smooth user experience and low operational costs. Implementing dedicated solutions to track product usage also provides significant business advantages and should not be overlooked.