I've tried to add the audit info to all the events I could find. This
makes the PR necessarily huge, because we store quite a few events.
I realise it might not be complete yet, but tests run green, and I
think we now have a pattern to follow for other events.
We encountered an issue with a customer because this query was returning
3 million rows. The problem arose from each instance reporting
approximately 100 features, with a total of 30,000 instances. The query
was joining these, thus multiplying the data. This approach was fine for
a reasonable number of instances, but in this extreme case, it did not
perform well.
This PR modifies the logic; instead of performing outright joins, we are
now grouping features by environment into an array, resulting in just
one row returned per instance.
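Conceptually, the new query looks something like the sketch below. This is illustrative only: the table and column names are assumptions, not the actual schema.
```typescript
import type { Knex } from 'knex';

// Collapse the joined feature rows into a single array per instance and
// environment with Postgres json_agg, so each instance yields one row
// instead of one row per feature.
function instancesWithFeatures(db: Knex) {
    return db('client_instances as i')
        .leftJoin('client_applications_usage as f', 'i.app_name', 'f.app_name')
        .groupBy('i.instance_id', 'i.environment')
        .select(
            'i.instance_id',
            'i.environment',
            db.raw('json_agg(f.feature_name) as features'),
        );
}
```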
I tested locally with the same dataset. Previously, loading this large
instance took about 21 seconds; now it takes about 2 seconds. That's
still significant, but the dataset is extensive.
Previously, we were extracting the project from the token, but now we
will retrieve it from the session, which contains the full list of
projects.
This change also resolves an issue we encountered with multi-project
tokens, formatted as `[]:dev:token`. Previously, we were unable to
display the exact list of projects; now we show the exact project names.
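A minimal sketch of the idea, with assumed shapes (the real session and token types differ):
```typescript
// Prefer the session's full project list; a multi-project token like
// `[]:dev:token` cannot tell us the exact projects on its own.
type Session = { projects: string[] };
type ApiToken = { project: string };

function resolveProjects(session: Session | undefined, token: ApiToken): string[] {
    return session?.projects ?? [token.project];
}
```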
## About the changes
- Removes the feature flag for the created_by migrations.
- Adds a configuration option in IServerOption,
`ENABLE_SCHEDULED_CREATED_BY_MIGRATION`, that defaults to `false`.
- When set on startup, the new configuration option enables scheduling
of the two created_by migration services (features + events); see the
sketch below.
- Removes the dependency on the flag provider in EventStore, as it's no
longer needed.
- Adds a brief description of the new configuration option to
`configuring-unleash.md`.
- Sets the events created_by migration interval to 15 minutes, up from
2.
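As a sketch, enabling the option could look like this, assuming it is read from the environment into IServerOption at startup (the exact wiring is an assumption):
```typescript
import unleash from 'unleash-server';

// Opt in explicitly; the option defaults to false.
process.env.ENABLE_SCHEDULED_CREATED_BY_MIGRATION = 'true';

// With the option set, the two created_by migration services
// (features + events) are scheduled on startup.
unleash.start({
    databaseUrl: 'postgres://unleash_user:password@localhost:5432/unleash',
});
```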
---------
Co-authored-by: Gastón Fournier <gaston@getunleash.io>
## About the changes
This PR establishes a simple yet effective mechanism to avoid DDoS
against our DB while also protecting against memory leaks.
This will enable us to release the flag `queryMissingTokens`, making our
token validation consistent across different nodes.
---------
Co-authored-by: Nuno Góis <github@nunogois.com>
The changes to arbitraries here are to make TypeScript agree with our
schema types. It seems that somewhere between 4.8.4 and 5.4.2,
TypeScript got stricter.
This column has not been used for 1.5 years and was replaced by the
**archived_at** column, yet people still get confused about why it does
not work as its name suggests. Removing this column to remove technical
debt.
## About the changes
Some of our metrics are not labeled correctly; for example,
`<base-path>/api/frontend/client/metrics` is labeled as
`/client/metrics`. We can see that in internal-backstage/prometheus:
![image](https://github.com/Unleash/unleash/assets/455064/0d8f1f40-8b5b-49d4-8a88-70b523e9be09)
This issue affects all endpoints that fail to validate the request body.
Also, endpoints that are rejected by the authorization-middleware or the
api-token-middleware are reported as `(hidden)`.
To gain more insight into our API usage while protecting metrics
cardinality, we prefix `(hidden)` with some well-known base URLs:
https://github.com/Unleash/unleash/pull/6400/files#diff-1ed998ca46ffc97c9c0d5d400bfd982dbffdb3004b78a230a8a38e7644eee9b6R17-R33
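A minimal sketch of the idea (not the exact implementation; the prefix list here is an assumption):
```typescript
// Keep well-known base URLs when a route can't be resolved, so the label
// stays informative without exploding metric cardinality.
const WELL_KNOWN_PREFIXES = ['/api/admin', '/api/frontend', '/api/client', '/edge'];

function labelPath(requestPath: string, resolvedRoute?: string): string {
    if (resolvedRoute) return resolvedRoute;
    const prefix = WELL_KNOWN_PREFIXES.find((p) => requestPath.startsWith(p));
    return prefix ? `${prefix}/(hidden)` : '(hidden)';
}

// labelPath('/api/client/metrics') === '/api/client/(hidden)'
```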
## How to reproduce:
Make an invalid call to metrics (e.g. with stop set to null), then check
/internal-backstage/prometheus and find the 400 error. It is expected to
be at `path="/api/client/metrics"` but will have `path=""`:
```shell
curl -H "Authorization: *:development.unleash-insecure-client-api-token" \
  -H 'Content-type: application/json' \
  localhost:4242/api/client/metrics -d '{
  "appName": "bash-test",
  "instanceId": "application-name-dacb1234",
  "environment": "development",
  "bucket": {
    "start": "2023-07-27T11:23:44Z",
    "stop": null,
    "toggles": {
      "myCoolToggle": {
        "yes": 25,
        "no": 42,
        "variants": {
          "blue": 6,
          "green": 15,
          "red": 46
        }
      },
      "myOtherToggle": {
        "yes": 0,
        "no": 100
      }
    }
  }
}'
```
In order to stop privilege escalation via
`/api/admin/projects/:project/users/:userId/roles` and
`/api/admin/projects/:project/groups/:groupId/roles`, this PR adds the
same check we added to the setAccess methods to the two methods updating
access for these endpoints.
Also adds tests that verify that we throw an exception if you try to
assign roles you do not have.
Thank you @nunogois for spotting this during testing.
## About the changes
When edge is configured to automatically generate tokens, it requires
the token to be present in all Unleash instances.
It's behind a flag, which enables us to turn it on on a case-by-case
basis.
The risk of this implementation is that we'd be adding load to the
database in the middleware that evaluates tokens (which runs on almost
all our API calls). We only query when the token is missing, but because
the /client and /frontend endpoints that will be affected are high
throughput, we want to be extra careful to avoid DDoSing ourselves.
## Alternatives:
One alternative would be that we merge the two endpoints into one.
Currently, Edge does the following:
If the token is not valid, it tries to create a token using a service
account token and the /api/admin/create-token endpoint. Then it uses the
generated token (which is returned from the prior endpoint) to query
/api/frontend. What if we could call /api/frontend with the same service
account we use to create the token? It may sound risky, but if the
application holding the service account token with permission to create
a token can call /api/frontend via the generated token, shouldn't it be
able to call the endpoint directly?
The purpose of the token is authentication and authorization. With the
two tokens we are authenticating the same app with 2 different
authorization scopes, but because it's the same app we are
authenticating, can't we just use one token and assume that the app has
both scopes?
If the service account already has permission to create a token and
then use that token for further actions, allowing it to directly call
/api/frontend does not necessarily introduce new security risks. The
only risk is allowing the app to generate new tokens, which leads to the
third alternative: should we just remove this option from edge?
This PR adds a property `issues` to the application schema, and also
adds all the missing features that have been reported by SDKs but do not
exist in Unleash.
In order to prevent users from being able to assign roles/permissions
they don't have, this PR adds a check that the user performing the
action is either Admin or Project owner, or has the same role they are
trying to grant/add.
This addAccess method is only used from Enterprise, so there will be a
separate PR there, updating how we return the roles list for a user, so
that our frontend can only present the roles a user is actually allowed
to grant.
This adds the validation to the backend to ensure that even if the
frontend thinks we're allowed to add any role to any user here, the
backend can be smart enough to stop it.
We should still update frontend as well, so that it doesn't look like we
can add roles we won't be allowed to.
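A sketch of the backend guard, under assumed names and shapes:
```typescript
// Throw unless the granter is Admin or Project owner, or already holds
// every role they are trying to hand out.
function validateRoleAssignment(
    granterRoleIds: Set<number>,
    rolesToGrant: number[],
    isAdmin: boolean,
    isProjectOwner: boolean,
): void {
    if (isAdmin || isProjectOwner) return;
    const missing = rolesToGrant.filter((roleId) => !granterRoleIds.has(roleId));
    if (missing.length > 0) {
        throw new Error(`Cannot assign roles you do not hold: ${missing.join(', ')}`);
    }
}
```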
Fixes #5799 and #5785
When you do not provide a token, we should resolve to the "default"
environment to maintain backward compatibility. If you actually provide
a token, we should prefer that, and even block the request if it is not
valid.
An interesting fact is that the "default" environment is not available
on a fresh installation of Unleash. This means that you need to provide
a token to actually get access to toggle configurations.
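Sketched out, the resolution rule is roughly this (assumed shapes, not the actual code):
```typescript
type ApiToken = { environment: string; valid: boolean };

function resolveEnvironment(token?: ApiToken): string {
    if (!token) return 'default'; // backward compatibility when no token is given
    if (!token.valid) throw new Error('invalid token'); // block the request
    return token.environment; // prefer the token's environment
}
```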
---------
Co-authored-by: Thomas Heartman <thomas@getunleash.io>
## About the changes
Implements a new store for collected traffic data usage that connects to
the new table `stat_traffic_data`, primary keyed on [day, trafficGroup,
status_code_series]:
- day: a date
- trafficGroup: which endpoint is being counted, e.g. /api/admin,
/api/frontend
- status_code_series: groups 2xx responses and 304 into their respective
200 / 300 series

No service here; this is for pro/enterprise.
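A hypothetical sketch of the upsert such a store might run; the table and key names are from the description, while the `count` column and the exact query are assumptions:
```typescript
import type { Knex } from 'knex';

// Count one request for (day, trafficGroup, statusCodeSeries): insert the
// row on first sight, increment it afterwards.
async function countRequest(
    db: Knex,
    day: string, // a date, e.g. '2024-02-26'
    trafficGroup: string, // e.g. '/api/admin', '/api/frontend'
    statusCodeSeries: number, // 200 or 300
): Promise<void> {
    await db('stat_traffic_data')
        .insert({
            day,
            traffic_group: trafficGroup,
            status_code_series: statusCodeSeries,
            count: 1,
        })
        .onConflict(['day', 'traffic_group', 'status_code_series'])
        .merge({ count: db.raw('stat_traffic_data.count + 1') });
}
```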
## About the changes
App stats is mainly used to cap the number of applications reported to
Unleash, based on the last 7 days of information:
cc2ccb1134/src/lib/middleware/response-time-metrics.ts (L24-L28)
- Instead of getting all stats, just calculate appCount statistics
- Use the scheduler service instead of setInterval
## About the changes
getAllActive from the api-tokens store is the second most frequent query:
![image](https://github.com/Unleash/unleash/assets/455064/63c5ae76-bb62-41b2-95b4-82aca59a7c16)
To prevent starving our DB connections, we can cache this data, which
rarely changes, and clear the cache when we see changes. Because we will
only clear the cache in the node receiving the change, we're only
caching the data for 1 minute.
This should give us some room to test whether this solution will work.
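The caching idea in a minimal sketch; the 1-minute TTL is from the text, everything else is assumed:
```typescript
type CacheEntry<T> = { value: T; expiresAt: number };

class TtlCache<T> {
    private entry: CacheEntry<T> | undefined;

    constructor(private readonly ttlMs: number) {}

    async get(load: () => Promise<T>): Promise<T> {
        if (this.entry && this.entry.expiresAt > Date.now()) return this.entry.value;
        const value = await load();
        this.entry = { value, expiresAt: Date.now() + this.ttlMs };
        return value;
    }

    // Called when this node sees a token change; other nodes rely on the TTL.
    clear(): void {
        this.entry = undefined;
    }
}

// Cache active tokens for 1 minute.
const activeTokensCache = new TtlCache<unknown[]>(60_000);
```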
---------
Co-authored-by: Nuno Góis <github@nunogois.com>
Adds a new Inactive Users list component to admin/users for easier cleanup of users that are counted as inactive: no sign of activity (logins or API token usage) in the last 180 days.
---------
Co-authored-by: David Leek <david@getunleash.io>
## About the changes
The created_by_user_id data migration (resolving events.created_by, for
both events and features) now emits events with how many rows were
updated.
Adds listeners for these events that record these metrics with
Prometheus.
![image](https://github.com/Unleash/unleash/assets/707867/3bb02645-0919-4a9a-83fe-a07383ac0be1)
## About the changes
Schedules a best-effort task that sets the value of
events.created_by_user_id based on what is found in the created_by
column, when it can resolve that to a user id or a system id.
The process is executed in the events-store: it takes a chunk of events
that haven't been processed yet, attempts to join the users and
api_tokens tables on created_by = username/email, loops through the
results and tries to figure out an id to set, then updates the record.
---------
Co-authored-by: Gastón Fournier <gaston@getunleash.io>
## About the changes
Sets the data migration of features and events created_by_user_id to
disabled by default.
Map to promises and await them all in the created_by_user_id migration
for features.
## About the changes
Adds a scheduled task that every 5 seconds updates 500 entries in the
features table, setting `created_by_user_id`.
It does this by looking at the related event, checking created_by,
joining the users table for a match on username or email, and joining
the api_tokens table on username matches. It then picks either the
user's id, if set, or uses -42 (the admin token user).
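A hypothetical sketch of one batch, with table and column names taken from the description but the actual query being an assumption:
```typescript
import type { Knex } from 'knex';

async function setCreatedByUserId(db: Knex, batchSize = 500): Promise<void> {
    // Find features missing created_by_user_id, together with the user or
    // api token that the related event's created_by resolves to.
    const rows = await db('features as f')
        .join('events as e', 'f.name', 'e.feature_name') // assumed link to the creation event
        .leftJoin('users as u', function () {
            this.on('e.created_by', '=', 'u.username').orOn('e.created_by', '=', 'u.email');
        })
        .leftJoin('api_tokens as t', 'e.created_by', 't.username')
        .whereNull('f.created_by_user_id')
        .limit(batchSize)
        .select('f.name as feature', 'u.id as userId', 't.username as tokenName');

    for (const row of rows) {
        // Pick the user's id if set, otherwise -42 (the admin token user).
        const userId = row.userId ?? (row.tokenName ? -42 : null);
        if (userId != null) {
            await db('features')
                .where({ name: row.feature })
                .update({ created_by_user_id: userId });
        }
    }
}
```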
It only triggers if there are any rows in client instances that have
sdk_version: unleash-edge with version < 17.0.0.
The function that checks this memoizes the result for 10 minutes to
avoid scanning the client instances table too often.
## About the changes
Whenever we get a call from an admin token we want to associate it with
the [admin token
user](4d42093a07/src/lib/types/core.ts (L34-L41)).
This should give us the needed audit trail for this type of call, which
currently lacked a user id (we only stored a string with the token name
in the event log).
We consciously decided not to use `id` as the property, to prevent any
unforeseen side effects. The reason is that only the `IUser` type has an
id, and adding an id to `IApiUser` might lead to confusion.
Lots of work here, mostly because I didn't want to turn off the
`noImplicitAnyLet` lint. This PR tries its best to type all the untyped
lets Biome complained about (don't ask me how many hours that took, or
how many lints that was: >200...), which in the future will force test
authors to actually type their global variables set up in `beforeAll`.
---------
Co-authored-by: Gastón Fournier <gaston@getunleash.io>
Was having some trouble running these migration tests locally due to
`dbm` not correctly picking up the passed-in config. This fixes it by
setting the custom config property after it has been initialized, always
overriding any wrong values.
PS: I think I found the issue. `dbm` was prioritizing my `DATABASE_URL`
for some reason, as I started having issues when it was set, and stopped
having issues when I unset it.
I still think this is a good change, as it prevents similar
hard-to-debug issues in the future.
To help clarify this, running this locally:
- `export
DATABASE_URL=postgres://unleash_user:passord@localhost:5432/unleash`
- `yarn test dedupe-permissions`
Fails on `main`, but passes on this branch. For some reason the `dbm`
instance prioritizes whatever is set in `DATABASE_URL` over the options
that are passed to `getInstance`.
## About the changes
Migrations for:
- Adding the column is_system to users
- Inserting the unleash_system_user (id -1337) into users

Also includes `is_system: false` in the activeUsers and activeAccounts
where-filters.
Tested by running:
```sql
select * into users_pre_check from users where id > -1;
delete from users where id > -1;
```
before starting Unleash, then inspecting the users table after Unleash
has started and verifying that an 'admin' user has been created.
---------
Co-authored-by: Christopher Kolstad <chriswk@getunleash.ai>
## About the changes
Adds the new nullable column created_by_user_id to the data used by
feature-tag-store and feature-tag-service. Also updates the OpenAPI
schemas.
### What
Adds `createdByUserId` to all events exposed by Unleash. In addition,
this PR updates all tests and usages of the methods in this codebase to
include the required number.
This PR fixes the issue discussed in SR-234, where you would get a 200
OK response even if your POST request to
`/api/admin/projects/<project-name>/access` contains invalid data (and
nothing is persisted).
Adds a new project overview endpoint and deprecates the old one.
The new one has extra info about feature types, but no longer includes
features, because features now come from the search endpoint.
## About the changes
Adds user ids to group changes. This also modifies the payload of the group-created event to include only the user id, and creates events for the SSO sync functionality.
This adds more data to the setting events, so that it's possible to see
what has changed.
Used to look like:
```
{
"id": "maintenance.mode"
}
```
Now it looks like this:
```
{
"id": "maintenance.mode",
"enabled": false
}
```
Because these are setting events, the default behaviour is to hide the content.
This PR checks that the Unleash instance is an enterprise instance
before fetching change request data. This is to stop Change Request
usage from blocking OSS users from deleting segments (when they don't
have access to change requests).
This PR also does a little bit of refactoring (which we can remove if
you want)
This PR updates the returned value about segments to also include the CR
title and to be one list item per strategy per change request. This
means that if the same strategy is used multiple times in multiple
change requests, they each get their own line (as has been discussed
with Nicolae).
Because of this, this PR removes a collection step in the query and
fixes some test cases.
The previous check would return `false` if the value was 0, causing a
bug where the usage data wouldn't be included.
This also adds tests to ensure that usage data for CR segments is
propagated correctly because that's where I first encountered the issue.
Before this fix, if the values were 0, the data would display like the
bottom element in the screenshot:
![image](https://github.com/Unleash/unleash/assets/17786332/9642b945-12c4-4217-aec9-7fef4a88e9af)
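The pitfall, illustrated (not the exact code):
```typescript
const usageCount: number | undefined = 0;

// Buggy: 0 is falsy, so a segment used zero times was treated as "no data".
if (usageCount) {
    // include usage data -- never runs for 0
}

// Fixed: check for presence explicitly.
if (usageCount !== undefined) {
    // include usage data -- runs for 0 as well
}
```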
This PR changes the behavior of the API a little bit. Instead of
removing strategies from `changeRequestStrategies` when they also appear
in `strategies`, we now keep them in both lists.
The reason for this is that the overview of where a segment is used is
incomplete if it shows only strategies but not CRs. Imagine this:
You want to delete a segment, but you're told it's only used in strategy
S.
So you go and remove it from strategy S, but then you're told it's
suddenly used in CRs A, B, and C. This is now a two-step operation
with a bad surprise. Instead, we could show you immediately that this
segment is used in strategy S and CRs A, B, and C.
This PR handles the case where a single strategy is used in multiple
change requests. Instead of listing the strategy several times in the
output, we consolidate the entries and add a new `changeRequestIds`
property. This is a non-empty list that points to all the change
requests it is used in.
This is required for us to be able to link back to the change requests
from the UI overview.
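For illustration, a consolidated entry could look like this; field names other than `changeRequestIds` are assumptions:
```typescript
const consolidatedStrategy = {
    id: 'strategy-S',
    featureName: 'my-feature',
    projectId: 'default',
    // Non-empty: every change request this strategy is used in,
    // so the UI can link back to each one.
    changeRequestIds: [101, 102, 103],
};
```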
This PR changes the payload of the strategiesBySegment endpoint when the
flag is active. In addition to returning just the strategies, the object
will also contain a new property, called `changeRequestStrategies`
containing the strategies that are used in change requests.
This PR does not update the schema. That can be done later when the
changes go into beta. This also allows us some time to iterate on the
payload without changing the public API.
## Discussion points:
Should `strategies` and `changeRequestStrategies` ever contain
duplicates? Take this scenario:
- Strategy S uses segment T.
- There is an open change request that updates the list of segments for
S to T and a new segment U.
- In this case, strategy S would show up both in `strategies` _and_ in
`changeRequestStrategies`.
We have two options:
1. Filter the list of change request strategies, so that they don't
contain any duplicates (this is currently how it's implemented)
2. Ignore the duplicates and just send both lists as is.
We're doing option 2 for now.
Removing a user from a project was impossible if you only had one owner;
it worked fine when there was more than one owner. This should fix it,
and we'll add tests later.
This PR adds the ability to detect which strategies use a specific
segment in active change requests.
It does not wire this functionality up to anything just yet. Follow-up
PRs will integrate this with the segment service and eventually with the
front end.
This PR updates the segment usage counting to also include segment usage
in pending change requests.
The changes include:
- Updating the schema to explicitly call out that change request usage
is included.
- Adding two tests to verify the new features
- Writing an alternate query to count this data
Specifically, it'll update the part of the UI that tells you how many
places a segment is used:
![image](https://github.com/Unleash/unleash/assets/17786332/a77cf932-d735-4a13-ae43-a2840f7106cb)
## Implementation
Implementing this was a little tricky. Previously, we'd just count
distinct instances of feature names and project names on the
feature_strategy table. However, to merge this with change request data,
we can't just count existing usage and change request usage separately,
because that could cause duplicates.
Instead of turning this into a complex DB query, I've broken it up into
a few separate queries and done the merging in JS. I think that's more
readable and it was easier to reason about.
Here's the breakdown:
1. Get the list of pending change requests. We need their IDs and their
project.
2. Get the list of updateStrategy and addStrategy events that have
segment data.
3. Take the result from step 2 and turn it into a dictionary of segment
id to usage data.
4. Query the feature_strategy_segment and feature_strategies table, to
get existing segment usage data
5. Fold that data into the change request data.
6. Perform the preexisting segment query (without counting logic) to get
other segment data
7. Enrich the results of the query from step 2 with usage data.
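As a sketch of the merging in steps 3-5, with assumed shapes, deduplicating by (project, feature) so usage that exists both in the current state and in a change request is only counted once:
```typescript
type Usage = { project: string; feature: string };

function mergeSegmentUsage(
    existing: Map<number, Usage[]>, // segment id -> current strategy usage
    fromChangeRequests: Map<number, Usage[]>, // segment id -> CR usage
): Map<number, Usage[]> {
    const merged = new Map<number, Usage[]>();
    const segmentIds = new Set([...existing.keys(), ...fromChangeRequests.keys()]);
    for (const id of segmentIds) {
        const seen = new Set<string>();
        const combined: Usage[] = [];
        const all = [...(existing.get(id) ?? []), ...(fromChangeRequests.get(id) ?? [])];
        for (const usage of all) {
            const key = `${usage.project}:${usage.feature}`;
            if (!seen.has(key)) {
                seen.add(key);
                combined.push(usage);
            }
        }
        merged.set(id, combined);
    }
    return merged;
}
```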
## Discussion points
I feel like this could be done in a nicer way, so any ideas on how to
achieve that (whether as a DB query or just by breaking up the code
differently) are very welcome.
Second, using multiple queries obviously yields more overhead than just
a single one. However, I do not think this is in the hot path, so I
don't consider performance to be critical here, but I'm open to hearing
opposing thoughts on this of course.
This PR hooks up the changes introduced in #5301 to the API and puts
them behind a feature flag. A new test has been added and the test setup
has been slightly tweaked to allow this test.
When the flag is enabled, the API will no longer let you delete a
segment that's used in any active CRs.
Switch the express-openapi implementation from our internal fork to the
upstream version. We have upstreamed our changes and a new version has
been released, so this should be the last step before we can retire our
fork.
Because some of the dependencies have been updated since our internal
fork, we also need to update some of our error handling to reflect this.
This PR adds a cleanup job that removes unknown feature flags from the
last_seen_at_metrics table every 24 hours, since we no longer have a
foreign key on the name column referencing the features table.
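A hypothetical sketch of the cleanup query (the column names are assumptions):
```typescript
import type { Knex } from 'knex';

// Delete last-seen rows whose feature no longer exists; scheduled every 24 hours.
async function cleanLastSeen(db: Knex): Promise<void> {
    await db('last_seen_at_metrics')
        .whereNotIn('feature_name', db.select('name').from('features'))
        .del();
}
```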
## About the changes
This makes sure that projects have at least one owner, either a group or
a user. This is to prevent accidentally losing access to a project.
We check this when removing a user/group or when changing the role of a
user/group
**Note**: We can still leave a group empty as the only owner of the
project, but that's okay because we can still add more users to the
group
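A sketch of the guard, with assumed shapes:
```typescript
type Owner = { type: 'user' | 'group'; id: number };

// Refuse the operation if it would leave the project without any owner.
function assertProjectKeepsOwners(owners: Owner[], removed: Owner): void {
    const remaining = owners.filter(
        (owner) => !(owner.type === removed.type && owner.id === removed.id),
    );
    if (remaining.length === 0) {
        throw new Error('A project must always have at least one owner (user or group)');
    }
}
```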
This PR is the first step in separating the client and admin stores.
Currently our feature toggle service uses the client store to serve
multiple purposes.
The admin API uses the feature toggle service to serve both the feature
toggle list and playground features, while the client API uses the
feature toggle service to serve client features. The admin API can
change often and has very different requirements than the client API,
which changes infrequently and generally keeps the same stable structure
for long periods of time. This architecture is error prone, because when
you need to make changes to the admin API, you can very easily affect
the client API.
I aim to put up a stone wall between the two APIs: complete separation
between them, at the cost of some duplication.
In this PR I have created a feature-oriented architecture for client
features and disconnected the client API from the feature toggle
service. It now goes through its own service to its own store. For the
feature toggle service I have duplicated and replaced the functionality
that serves /api/admin/features; I have kept a lot of the ugliness in
the code and haven't removed anything, in order to avoid breaking
changes.
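A minimal sketch of the separation, under assumed names (not the actual classes):
```typescript
type ClientFeature = { name: string; enabled: boolean; strategies: object[] };

// The client API now reads through its own store...
class ClientFeatureToggleStore {
    async getAll(): Promise<ClientFeature[]> {
        return []; // would query the DB for the stable client-facing shape
    }
}

// ...via its own service, so admin API refactors no longer ripple into it.
class ClientFeatureToggleService {
    constructor(private readonly store: ClientFeatureToggleStore) {}

    async getClientFeatures(): Promise<ClientFeature[]> {
        return this.store.getAll();
    }
}
```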
Next steps:
* Move playground to admin API
* Remove client-feature-toggle-store from feature-toggle-service