This PR adds prometheus metrics for when users attempt to exceed the
limits for a given resource.
The implementation sets up a second function exported from the
ExceedsLimitError file that records metrics and then throws the error.
This could also be a static method on the class, but I'm not sure that'd
be better.
This adds an extended metrics format to the metrics ingested by Unleash
and sent by running SDKs in the wild. Notably, we don't store this
information anywhere new in this PR, this is just streamed out to
Victoria metrics - the point of this project is insight, not analysis.
Two things to look out for in this PR:
- I've chosen to take extend the registration event and also send that
when we receive metrics. This means that the new data is received on
startup and on heartbeat. This takes us in the direction of collapsing
these two calls into one at a later point
- I've wrapped the existing metrics events in some "type safety", it
ain't much because we have 0 type safety on the event emitter so this
also has some if checks that look funny in TS that actually check if the
data shape is correct. Existing tests that check this are more or less
preserved
Now we are also sending project id to prometheus, also querying from
database. This sets us up for grafana dashboard.
Also put the metrics behind flag, just incase it causes cpu/memory
issues.
This PR adds a counter in Prometheus for counting the number of
"environment disabled" events we get per project. The purpose of this is
to establish a baseline for one of the "project management UI" project's
key results.
## On gauges vs counters
This PR uses a counter. Using a gauge would give you the total number of
envs disabled, not the number of disable events. The difference is
subtle, but important.
For projects that were created before the new feature, the gauge might
be appropriate. Because each disabled env would require at least one
disabled event, we can get a floor of how many events were triggered for
each project.
However, for projects created after we introduce the planned change,
we're not interested in the total envs anymore, because you can disable
a hundred envs on creation with a single action. In this case, a gauge
showing 100 disabled envs would be misleading, because it didn't take
100 events to disable them.
So the interesting metric here is how many times did you specifically
disable an environment in project settings, hence the counter.
## Assumptions and future plans
To make this easier on ourselves, we make the follow assumption: people
primarily disable envs **when creating a project**.
This means that there might be a few lagging indicators granting some
projects a smaller number of events than expected, but we may be able to
filter those out.
Further, if we had a metric for each project and its creation date, we
could correlate that with the metrics to answer the question "how many
envs do people disable in the first week? Two weeks? A month?". Or
worded differently: after creating a project, how long does it take for
people to configure environments?
Similarly, if we gather that data, it will also make filtering out the
number of events for projects created **after** the new changes have
been released much easier.
The good news: Because the project creation metric with dates is a
static aggregate, it can be applied at any time, even retroactively, to
see the effects.
## About the changes
App stats is mainly used to cap the number of applications reported to
Unleash based on the last 7 days information:
cc2ccb1134/src/lib/middleware/response-time-metrics.ts (L24-L28)
Instead of getting all stats, just calculate appCount statistics
Use scheduler service instead of setInterval
Lots of work here, mostly because I didn't want to turn off the
`noImplicitAnyLet` lint. This PR tries its best to type all the untyped
lets biome complained about (Don't ask me how many hours that took or
how many lints that was >200...), which in the future will force test
authors to actually type their global variables setup in `beforeAll`.
---------
Co-authored-by: Gastón Fournier <gaston@getunleash.io>
As part of more telemetry on the usage of Unleash.
This PR adds a new `stat_` prefixed table as well as a trigger on the
events table trigger on each insert to increment a counter per
environment per day.
The trigger will trigger on every insert into the events base, but will
filter and only increment the counter for events that actually have the
environment set. (there are events, like user-created, that does not
relate to a specific environment).
Bit wary on this, but since we truncate down to row per (day,
environment) combo, finding conflict and incrementing shouldn't take too
long here.
@ivarconr was it something like this you were considering?
## About the changes
- `getActiveUsers` is using multiple stores, so it is refactored into
read-model
- Refactored Instance stats service into `features` to co-locate related
code
Closes https://linear.app/unleash/issue/UNL-230/active-users-prometheus
### Important files
`src/lib/features/instance-stats/getActiveUsers.ts`
## Discussion points
`getActiveUsers` is coded less _class-based_ then previous similar
read-models. In one file instead of 3 (read-model interface, fake read
model, sql read model). I find types and functions way more readable,
but I'm ready to refactor it to interfaces and classes if consistency is
more important.
## About the changes
This metric will expose an aggregated view of how many client
applications are registered in Unleash. Since applications are ephemeral
we are exposing this metric in different time windows based on when the
application was last seen.
The caveat is that we issue a database query for each new range we want
to add. Hopefully, this should not be a problem because:
a) the amount of ranges we'd expose is small and unlikely to grow
b) this is currently updated at startup time and even if we update it on
a scheduled basis the refresh rate will be rather sparse
## Sample data
This is how metrics will look like
```
# HELP client_apps_total Number of registered client apps aggregated by range by last seen
# TYPE client_apps_total gauge
client_apps_total{range="allTime"} 3
client_apps_total{range="30d"} 3
client_apps_total{range="7d"} 2
```