unleash.unleash/logging-levels.md at 0a752fbf471d80f7d2b180b11231c8c3511a05a9

mirror of https://github.com/Unleash/unleash.git synced 2025-07-12 13:48:35 +02:00

Christopher Kolstad a672292425

feat(docs): Added ADR for logging levels (#9580 )

This PR starts an ADR for logging levels, and how to make your
coworkers' days who are on-call less noisy.

This is my initial suggestion. Please feel free to help me with fleshing
out both the table for what log levels means to you or if you disagree
with what I've put as a background.

My impression is that we are pretty good at this, but we could be
better, and having an ADR to lean on when pushing back on PR's log
levels could help us.

Thanks to @kwasniew for the push to start the ADR.

Examples: 
*
038c10f612/src/lib/db/account-store.ts (L164-L173)

And probably the storeFootprint method in our
client-feature-toggle-delta
*
038c10f612/src/lib/features/client-feature-toggles/delta/client-feature-toggle-delta.ts (L363-L370)


*
293304a5d6/src/traffic-data/traffic-data-usage-service.ts (L69)

---------

Co-authored-by: Melinda Fekete <melinda.fekete@getunleash.io>

2025-03-25 14:06:37 +00:00

3.6 KiB

Raw Blame History

title
ADR: Logging levels

Date: 2025-03-20

Background

Our log levels carry semantic information. Log lines logged at the error level triggers SRE alerts if they exceed more than 1 per hour. Though we are pretty good at not excessively logging at ERROR, we do have cases where SRE alerts gets triggered, but by the time SRE can log on and check the deployment, everything is fine again. This means we never had an ERROR, we should have had a WARN message.

This ADR aims to solidify an understanding that levels are important to use correctly to avoid mental load and on-call alerts for things we can't do anything about.

Decision

We should agree on the semantic information carried in each level, and which levels are ok to ignore while scanning logs from running applications.

Current suggestion

Log level	Frequency in healthy application	Standard Availability	Configurable
ERROR	0	All environments	NO
WARN	1-10	All environments	NO
INFO	10-100	Default deploy config sets LOG_LEVEL=info	YES
DEBUG	100 - 1000	Local development	YES (specific deployments)
TRACE	1000 - 10000	NO	YES (specific deployments)

Change

Previously we might've logged an ERROR for a self-healable issue, this should change to WARN, and not be an ERROR.

The only things that should be logged at ERROR are exceptional behaviour that we need to fix immediately, everything else should be downgraded to WARN. In order to reduce WARN cardinality, this might mean that some messages at WARN today should be downgraded to INFO.

Examples

traffic-data-service in enterprise

await Promise.all(promises)
            .then(() => {
                this.logger.debug('Traffic data usage saved');
            })
            .catch((err) => {
                this.logger.error('Failed to save traffic data usage', err);
            });

async markSeenAt(secrets: string[]): Promise<void> {
        const now = new Date();
        try {
            await this.db('personal_access_tokens')
                .whereIn('secret', secrets)
                .update({ seen_at: now });
        } catch (err) {
            this.logger.error('Could not update lastSeen, error: ', err);
        }
    }

Not being able to update lastSeen is not something we can do anything about as on-calls, so this is also a good candidate for downgrading to WARN.

3.6 KiB Raw Blame History