Monitoring and Observability¶
Version History¶
| Version | Date | Description/Updates |
|---|---|---|
| 1.0 | 7.04.2026 | First observability and monitoring guidelines release. |
Introduction¶
The document describes how monitoring and observability should be implemented.
Monitoring and observability are crucial software development practices that help teams maintain the health, performance, and reliability of their applications and infrastructure. While closely related, the two terms are distinct in scope and purpose.
Contacts¶
- Guidelines owner/driver: Bogdan Kurbetyev
- Task force:
- Stakeholders:
Roles to Apply¶
Development roles:
- Architects
- Backend developers
- Frontend developers
- Infrastructure/DevOps Engineers
- Product Manager/Product Owners
Enforcement Levels¶
| Section | Enforcement Level | When/what |
|---|---|---|
| Minimum Production Baseline | Important | New and existing projects |
| Monitoring (reactive) | Important | New and existing projects |
| Observability (proactive) | Important | New and existing projects |
| Usage & Monetization Intelligence | Optional | Only when projects demand it |
| Checklist | Recommended | New and existing projects |
Importance Overview¶
| Section | Who | When |
|---|---|---|
| Minimum Production Baseline | All roles | Before any production deployment |
| Monitoring (reactive) | Architects, Infrastructure/DevOps Engineers | When setting up a new project or service |
| Observability (proactive) | Architects, Infrastructure/DevOps Engineers | When setting up distributed or complex systems |
| Usage & Monetization Intelligence | Architects, Product Manager/Product Owners | When planning product analytics and licensing |
| Checklist | Architects, Product Manager/Product Owners | When planning monitoring strategy for a project |
References¶
Minimum Production Baseline¶
[!IMPORTANT] These requirements are mandatory for all production deployments at DHI. A more extensive checklist is available at the bottom of this document.
Logging (Required)¶
- [ ] Structured logging with timestamps and correlation IDs
- [ ] Log levels: ERROR, WARN, INFO, DEBUG properly configured
- [ ] Unhandled exception logging enabled
- [ ] Sensitive data (tokens, passwords, PII) masked or excluded
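The logging requirements above could be met in a Python service roughly as follows. This is a minimal sketch using only the standard library, not a DHI standard: the `correlation_id` context variable, the `JsonFormatter` class, the `build_logger` helper, and the JSON field names are all illustrative assumptions. Note that Python names the WARN level `WARNING`.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical context variable holding the correlation ID of the current request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (field names are assumptions)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the formatted traceback when an exception is attached.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes structured JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger("orders")
correlation_id.set("req-1234")  # normally set once per incoming request
logger.info("order accepted")
```

Emitting one JSON object per line keeps the logs machine-parseable, so a log backend can filter on `level` or `correlation_id` without fragile text matching.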
Metrics (Required)¶
- [ ] Error rate monitored
- [ ] CPU, memory, and disk utilization tracked
- [ ] Request latency (p50, p95, p99) measured
- [ ] Health endpoint (`/health` or equivalent) available to internal teams
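To make the latency item above concrete, the sketch below computes p50, p95, and p99 from raw latency samples with the Python standard library. The function name `latency_percentiles` is hypothetical, and the "inclusive" interpolation method is one reasonable choice among several; monitoring backends often derive percentiles from histograms instead, so their values may differ slightly.

```python
import statistics
from typing import Dict, List


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return the p50, p95 and p99 latency of the given samples (milliseconds)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points p1..p99, so p50 is at index 49.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Example: 101 requests with latencies of 1 ms, 2 ms, ..., 101 ms.
print(latency_percentiles([float(ms) for ms in range(1, 102)]))
```

Tracking p95 and p99 next to the median matters because a healthy median can hide a slow tail that a small share of users still experiences.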
Alerting (Required)¶
- [ ] Alert for service unavailability (health check failures)
- [ ] Alert for error rate exceeding threshold (e.g., >1%)
- [ ] Alert for latency exceeding SLO (e.g., p95 > 500ms)
- [ ] On-call contact or escalation path documented
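The alert conditions listed above can be expressed as a small evaluation function. The sketch below is illustrative only: `WindowStats`, `evaluate_alerts`, and the alert names are hypothetical, and the default thresholds simply reuse the example values from the list (error rate above 1%, p95 above 500 ms). In practice these rules would normally live in the alerting tool rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowStats:
    """Aggregated measurements for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    p95_latency_ms: float
    health_check_ok: bool


def evaluate_alerts(
    stats: WindowStats,
    max_error_rate: float = 0.01,   # example threshold: 1%
    max_p95_ms: float = 500.0,      # example SLO: p95 <= 500 ms
) -> List[str]:
    """Return the names of all alerts that should fire for this window."""
    alerts = []
    if not stats.health_check_ok:
        alerts.append("service_unavailable")
    # Avoid division by zero when the window saw no traffic.
    error_rate = stats.failed_requests / stats.total_requests if stats.total_requests else 0.0
    if error_rate > max_error_rate:
        alerts.append("error_rate_high")
    if stats.p95_latency_ms > max_p95_ms:
        alerts.append("latency_slo_breached")
    return alerts


print(evaluate_alerts(WindowStats(1000, 25, 620.0, True)))
```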
Tracing (Recommended for Distributed Systems)¶
- [ ] Azure Monitor tracing enabled
- [ ] Request correlation across service boundaries
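Request correlation across service boundaries usually comes down to reusing an incoming identifier and forwarding it on every outgoing call. The sketch below shows that idea with plain dictionaries standing in for HTTP headers. The header name `X-Correlation-ID` and both function names are assumptions for illustration; tracing libraries typically handle this automatically using the W3C `traceparent` header.

```python
import uuid
from typing import Dict, Optional

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name, not a DHI standard


def extract_or_create_correlation_id(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's correlation ID if present, otherwise start a new one."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == CORRELATION_HEADER.lower() and value.strip():
            return value.strip()
    return uuid.uuid4().hex


def outgoing_headers(correlation_id: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return headers for a downstream call that carry the same correlation ID."""
    headers = dict(base or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


request_id = extract_or_create_correlation_id({"x-correlation-id": "req-1234"})
print(outgoing_headers(request_id, {"Accept": "application/json"}))
```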
Retention & Compliance¶
- [ ] Log retention period defined and compliant with local regulations
- [ ] GDPR restrictions adhered to
Terminology¶
| Term | Definition |
|---|---|
| Monitoring | The practice of collecting, processing, and analyzing predefined metrics and logs to detect known issues and system failures. It is reactive in nature, focusing on alerting when specific thresholds or conditions are breached. |
| Observability | The ability to understand the internal state of a system by examining its outputs (logs, metrics, traces). It is proactive, enabling exploration of unknown issues and understanding why problems occur, especially in distributed systems. |
| Logs | Immutable, timestamped records of discrete events within a system. Logs provide detailed, textual context about what happened at specific points in time, essential for debugging and auditing. |
| Metrics | Numeric data aggregated over time that represents system behavior and performance (e.g., CPU usage, memory consumption, request latency, error rates). Metrics enable trend analysis and alerting. |
| Traces | Records that track the journey of a request as it propagates through various services and components in a distributed system. Traces help identify bottlenecks and understand request flows across microservices. |
| KPI (Key Performance Indicator) | A measurable value that demonstrates how effectively a system, application, or business process is achieving key objectives. In software, examples include page load times, error rates, and user engagement metrics. |
| SLI (Service Level Indicator) | A quantitative measure of a specific aspect of service performance (e.g., availability, latency, throughput). SLIs provide the raw data used to evaluate service health. |
| SLO (Service Level Objective) | A target value or range for an SLI that defines the desired performance level of a service (e.g., "99.9% availability" or "p95 latency < 200ms"). SLOs guide reliability engineering efforts. |
| SLA (Service Level Agreement) | A formal contract between a service provider and customers that defines the expected service levels, including SLOs, and the consequences (penalties or remedies) if those levels are not met. |
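The relationship between an SLI and an SLO can be shown with a small calculation: the SLI is the measured value and the SLO is the target it is compared against. The function names and the request counts below are made up for illustration.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI reaches the SLO target (e.g. 0.999 for 99.9%)."""
    return sli >= slo_target


# Illustrative numbers: 999,500 successful requests out of 1,000,000.
sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```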
DHI Specifics¶
DHI is involved in projects with a distributed architecture, different technologies, and across the entire tech spectrum: web, mobile and desktop. Modern observability and monitoring tools generally offer support for all the above, but it is worth noting the differences.
Web & Distributed¶
DHI digital solutions and services are built on a modern web tech stack using .NET (C#), Python, and TypeScript (React). While this tech stack allows DHI developers to lean on available OSS tools, bugs, vulnerabilities, and performance issues are still to be expected. These issues can and should be addressed through the implementation of monitoring and observability tooling.
Mobile & Desktop¶
The MIKE software suite is built on a robust foundation of C++, C#, and Fortran with some Python integration. This tech stack offers powerful tools for building high-quality, performant software, but it comes with inherent complexity and potential issues. These challenges can and should be addressed through implementing monitoring and observability tooling.
Benefits of Monitoring and Observability¶
The practice of observability and monitoring is crucial for ensuring reliability, performance, security, and excellent user experience.
- Faster Issue Detection: allows developers to spot and address performance issues, errors, security threats, and generic bugs before they significantly impact end users.
- Faster Troubleshooting: allows developers to promptly pinpoint the root cause of an issue through comprehensive insight into the application via logs, metrics, and traces, significantly reducing the time to resolution.
- Enhanced Security: allows developers to detect and mitigate security vulnerabilities by monitoring unusual behavior, such as spikes in failed login attempts, unauthorized data access attempts, and higher than usual traffic indicating DDoS attacks.
- Improved User Experience: achieves higher user satisfaction and customer retention by tracking KPIs such as page load times, error rates, and user interaction, allowing developers to improve the software by addressing issues as they arise.
- Improved Resource Utilisation: reduces infrastructure costs by identifying and addressing bottlenecks and inefficiencies through observing resource usage, such as CPU load, memory utilization, and bandwidth.
Monitoring (reactive)¶
Monitoring means setting up specific tools, dashboards, metrics, and alerts in advance so that, when a known type of problem occurs, the team knows what to look for and can address it.
Monitoring is the process of collecting, processing, and analysing predefined metrics from your systems and applications. It is about detecting system failures or performance issues and alerting on them.
Monitoring can be divided into the following categories.
- Application Performance Monitoring: tracks application performance, slow response times, excessive error rates, and user interactions. This includes monitoring CPU, memory, disk, network, and other infrastructure resources against thresholds, across microservices, containers (Docker), Kubernetes, and serverless environments.
- Error and Crash Monitoring: focuses on capturing runtime errors and crashes.
Examples¶
- A crash occurs when CPU usage exceeds 90%, which triggers a system alert.
- API response time exceeds its threshold, which triggers a service degradation alert.
- Free disk space falls below 10%, which triggers an infrastructure update requirement alert.
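The examples above all follow the same pattern: a predefined metric is compared against a fixed threshold, and a named alert fires when the condition holds. A minimal sketch of that pattern is shown below. `ThresholdRule`, `RULES`, `triggered_alerts`, and the metric names are hypothetical, the API example is interpreted here as a response-time threshold, and the 500 ms value is an assumption.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ThresholdRule:
    """One reactive monitoring rule: compare a metric against a fixed threshold."""
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float
    alert: str


RULES = [
    ThresholdRule("cpu_percent", operator.gt, 90.0, "system alert"),
    # Assumed threshold; the guideline does not state one for this example.
    ThresholdRule("api_response_ms", operator.gt, 500.0, "service degradation alert"),
    ThresholdRule("disk_free_percent", operator.lt, 10.0, "infrastructure update requirement alert"),
]


def triggered_alerts(sample: Dict[str, float], rules: List[ThresholdRule] = RULES) -> List[str]:
    """Return the alerts whose rule is violated by the given metric sample."""
    return [
        rule.alert
        for rule in rules
        if rule.metric in sample and rule.compare(sample[rule.metric], rule.threshold)
    ]


print(triggered_alerts({"cpu_percent": 95.0, "disk_free_percent": 25.0}))
```

Because every rule has to be written in advance, this approach only catches problems someone anticipated, which is the sense in which monitoring is reactive.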
Tools¶
Below is a list of tools to consider for traditional monitoring, focusing on known issues and pre-defined metrics.
[!NOTE] Default DHI (Recommended Approach)
The tools below are already in use at DHI and are recommended for new and existing projects.
| Tool | Description | Licensing |
|---|---|---|
| Microsoft Azure Monitor | Broad, full-stack observability platform | Commercial |
| Grafana | Dashboarding and Visualization | Open Source & commercial |
| Prometheus | Time-series Metrics collection | Open Source |
| Revenera | Monetization, licensing and compliance & more | Commercial |
[!WARNING] Restricted
The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.
| Tool | Description | Licensing |
|---|---|---|
| Nagios | Network monitoring | Open Source & commercial |
| Zabbix | Alerts for infrastructure and app metrics | Open Source |
What these tools solve:¶
- System monitoring
- Software business management (Revenera), such as:
- Monetisation
- Licensing
- Compliance
Observability (proactive)¶
Observability is the ability to understand why something happens in a system by analysing logs, metrics, and traces, especially for issues that were not anticipated in advance. Observability is most critical for distributed systems where requests cross multiple services.
Observability helps teams understand the internal state of a system from its external outputs over time, so that the cause of an issue can be found through metrics, logs, and traces.
The three pillars of observability:
- Logs: immutable, timestamped records of discrete events, e.g. textual records of events happening in the system.
- Metrics: numeric data that represents system behavior over time (e.g., CPU usage, memory, response time).
- Traces: a record of the journey of a request as it travels through services or workflows for observing user behavior or real user sessions.
Examples¶
- Investigating a spike in latency with no clear alert triggered.
- Understanding user behavior patterns across microservices.
- Correlating logs and traces to find the root cause of a failed transaction.
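The last example above, correlating logs and traces, can be sketched in a few lines. The idea is a heuristic: within one trace, look for errored spans that have no errored child span, since those are the most likely origin of the failure, then pull the ERROR logs recorded for them. The record shapes (`trace_id`, `span_id`, `parent_id`, `status`, `service`, `level`, `message`) and the function name `failure_origins` are assumptions for illustration; real tracing backends expose richer data and their own query tools.

```python
from typing import Dict, List, Tuple

Span = Dict[str, object]
LogLine = Dict[str, str]


def failure_origins(trace_id: str, spans: List[Span], logs: List[LogLine]) -> List[Tuple[str, List[str]]]:
    """Return (service, error messages) for the deepest errored spans of a trace."""
    errored = [s for s in spans if s["trace_id"] == trace_id and s["status"] == "error"]
    # A span that is the parent of another errored span probably just propagated the error.
    parents_of_errors = {s["parent_id"] for s in errored}
    origins = [s for s in errored if s["span_id"] not in parents_of_errors]
    result = []
    for span in origins:
        messages = [
            line["message"]
            for line in logs
            if line["trace_id"] == trace_id
            and line["span_id"] == span["span_id"]
            and line["level"] == "ERROR"
        ]
        result.append((str(span["service"]), messages))
    return result


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway", "status": "error"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "orders", "status": "error"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "database", "status": "error"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "a", "service": "payments", "status": "ok"},
]
logs = [
    {"trace_id": "t1", "span_id": "b", "level": "ERROR", "message": "order lookup failed"},
    {"trace_id": "t1", "span_id": "c", "level": "ERROR", "message": "connection timeout"},
]
print(failure_origins("t1", spans, logs))
```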
Tools¶
These tools are built for modern, complex, and distributed systems offering insight into the above-mentioned pillars.
[!NOTE] Default DHI (Recommended Approach)
The tools below are already in use at DHI and are recommended for new and existing projects.
| Tool | Description | Licensing |
|---|---|---|
| Microsoft Azure Monitor | Broad, full-stack observability platform | Commercial |
| Jaeger | Tracing observability platform | Open Source |
[!WARNING] Restricted
The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.
| Tool | Description | Licensing |
|---|---|---|
| Sentry | Error and crash reporting | Commercial |
| Bugsnag | Error and crash reporting | Commercial |
| New Relic | Comprehensive observability platform | Commercial |
| Datadog | Comprehensive observability platform | Commercial |
| Elastic Stack | Integrations tools | Commercial |
| Honeycomb | Distributed services observability | Commercial |
| Lightstep | Logs, metrics and traces insights | Commercial |
| Zipkin | Traces, logs and metrics monitoring | Open Source |
| Splunk | Enterprise monitoring solution | Commercial |
| Dynatrace | AI-powered monitoring and observability | Commercial |
| Better Stack | AI-powered incident response solution | Commercial |
What these tools solve¶
- Modern software observability
- Distributed tracing, such as:
- Error logs and reporting
- Crash logs and reporting
- Application tracing and reporting
Usage & Monetization Intelligence¶
While not strictly observability or monitoring, usage and monetization intelligence is critical for transforming raw usage data into actionable business insights.
Tooling¶
[!NOTE] Default DHI (Recommended Approach)
The tools below are already in use at DHI and are recommended for new and existing projects.
| Tool | Description | Licensing |
|---|---|---|
| Revenera | Monetization intelligence, licensing, usage intelligence | Commercial |
| Matomo | Software usage analytics | Open Source & commercial |
[!WARNING] Restricted
The options below are discouraged but not prohibited. They may be used if approved by Project Owners and/or Project Managers, or if the project demands it.
| Tool | Description | Licensing |
|---|---|---|
| Kubit | User behavior and usage intelligence | Commercial |
| Smartlook | User behavior and usage intelligence | Commercial |
| Zylo | Management of software assets, licensing and subscriptions | Commercial |
| Flexera | Management of software assets, licensing and subscriptions | Commercial |
| Plausible Analytics | Software usage analytics | Open Source & commercial |
| Fathom Analytics | Software usage analytics | Open Source & commercial |
| Mixpanel | Software usage analytics | Commercial |
| Hotjar | Software usage analytics | Commercial |
What these tools solve¶
- Tracking feature adoption
- Monitoring usage and consumption
- Tracking and analysis of install and user base
- Licensing enforcement
- Subscription control
Checklist¶
The Key Steps Checklist below may be used as a guideline to assist POs, PMs, and architects with decision making. Where applicable, make sure these steps are discussed when decisions are made for the project or product at hand.
[!NOTE]
This checklist is an optional guideline and not a rule set. It is recommended for reading and understanding. Feel free to adjust it to make it fit and work for your project.
Define Key Metrics & Objectives¶
[ ] Identify critical components (for the project)
For example:
- Database performance
- API response times
- Infrastructure costs

[ ] Define KPIs & Service Level Objectives (SLOs, i.e. targets)
As minimum standards are difficult to quantify across projects and tech stacks, the following critical components should be decided on a per-project basis.
For example:
- Acceptable error rates
- Acceptable latency
- Critical paths
- Acceptable response rates for API
- Acceptable execution time for DB queries
- Other vital metrics that may be relevant to the project at hand
Select & Configure Tooling¶
[ ] Choose a platform
- Refer to the Tools subsections under Monitoring (reactive) and Observability (proactive) for tooling options

[ ] Support logging with tracing
- Ensure your application generates detailed, structured logs with timestamps
- Ensure your application collects the metrics defined by the KPIs & SLOs
Implementation¶
[ ] Code Instrumentation
For example:
- Implement functionality in applications to emit logs, metrics and traces
- Implement unhandled exception monitoring
- Implement handled exception monitoring
- Implement application crash reporting

[ ] Set up Alerts & Digests
For example:
- Configure alerts to notify parties when SLOs are at risk or when an issue occurs
- Configure scheduled digests of system performance for proactive monitoring

[ ] Emergency Documentation
For example:
- Document clear, step-by-step instructions for:
  - Responding to common alerts
  - Handling critical issues
  - Handling emergency scenarios
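For the code instrumentation step, unhandled exception monitoring in a Python application can be as small as a custom `sys.excepthook`. The sketch below is a minimal standard-library example; the logger name `app.crash` and the function names are hypothetical, and a crash reporting tool would normally install its own hook instead.

```python
import logging
import sys

# Hypothetical logger name; route it to the same handlers as the rest of the app.
crash_logger = logging.getLogger("app.crash")


def log_unhandled_exception(exc_type, exc_value, exc_traceback):
    """Log any exception that reaches the top of the main thread."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Keep Ctrl+C behaviour unchanged.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    crash_logger.critical(
        "unhandled exception",
        exc_info=(exc_type, exc_value, exc_traceback),
    )


def install_crash_logging() -> None:
    """Register the hook; call this once during application startup."""
    sys.excepthook = log_unhandled_exception


install_crash_logging()
```

`sys.excepthook` only covers the main thread. Exceptions in other threads go through `threading.excepthook` (Python 3.8+), which would need a similar handler.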
Data Retention Compliance¶
[ ] Investigate and adhere to local laws & GDPR compliance
As DHI offers its consulting solutions across the globe, defining a concrete set of rules when it comes to data retention and compliance is impossible. As a minimum, POs/PMs must consult local laws and regulations to fully comply with data retention rules.
For example:
- Allowed log retention duration (days, months, years, etc.)
- Allowed log content (usernames, passwords, dates, geolocations, IPs, etc.):
  - Ensure the following sensitive data is masked or redacted from retained logs:
    - Tokens
    - Secrets
    - Passwords
    - Session cookies
    - Environment variables
    - Authentication headers
    - PII (Personally Identifiable Information)
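Masking sensitive data is easiest to enforce in one place, just before a payload is written to the log. The sketch below masks values by key name and redacts bearer tokens in free text. The key markers in `SENSITIVE_MARKERS`, the mask string, and the function names are illustrative assumptions. Key-name matching does not detect PII stored under neutral names, so it complements, rather than replaces, a review of what is logged.

```python
import re

# Assumed substrings that mark a key as sensitive; extend per project.
SENSITIVE_MARKERS = ("token", "secret", "password", "cookie", "authorization")
MASK = "***"

# Matches "Bearer <credential>" in free text, case-insensitively.
BEARER_PATTERN = re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+")


def _is_sensitive(key: object) -> bool:
    lowered = str(key).lower()
    return any(marker in lowered for marker in SENSITIVE_MARKERS)


def redact(value):
    """Recursively mask values stored under sensitive-looking keys."""
    if isinstance(value, dict):
        return {k: MASK if _is_sensitive(k) else redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def redact_text(message: str) -> str:
    """Mask bearer credentials that appear inside a log message."""
    return BEARER_PATTERN.sub(r"\1 " + MASK, message)


event = {"user": "ana", "password": "hunter2", "headers": {"Authorization": "Bearer abc.def"}}
print(redact(event))
print(redact_text("calling API with Authorization: Bearer abc.def-123"))
```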
Conclusion¶
By providing deep insight into the alerts that signal problems and into ways to resolve them, the practices described above help DHI maintain a stable suite of applications with a smooth user experience and low operational costs. Implementing dedicated solutions to track product usage provides significant business advantages and should not be overlooked.