Overview
The past few months have been rather busy with work. Unfortunately, I have a handful of incomplete posts and projects that I can't give the time they need right now, so instead I will briefly talk about some of the lessons I learned on the way to our first deployment.
Context
CarbinX produces small-scale carbon capture devices that also function as economizers, saving money by saving electricity. Each unit contains an embedded Linux platform that sends telemetry, monitors for updates, and self-adjusts its parameters to maximize the amount of carbon it captures. On the cloud side of things, we have a SaaS consisting of ~14 services and 3 web portals that provide the functionality our customers, contractors, and employees see. All of the code other than the frontends is written in Rust (other than OPA policies, I suppose). All of our cloud code runs on a K3s cluster that we operate in house, with fallbacks to AWS in case things go (very) wrong.
Lesson 1: Isolate Deployment from Code
When I started writing our backend Rust services, I was undecided whether I wanted to store the deployment configuration and helm chart alongside the application, or in a separate directory. I initially chose the former, meaning that my services looked something like this:
tenant_service ❯ tree
.
├── Cargo.lock
├── Cargo.toml
├── README.md
├── chef.debian.dockerfile
├── configuration
│   ├── base.yaml
│   ├── docker.yaml
│   ├── local.yaml
│   └── production.yaml
├── deny.toml
├── deployment
│   ├── README.md
│   ├── argocd
│   │   └── app.yaml
│   ├── database
│   │   └── cluster.yaml
│   └── secrets
│       └── ...
├── docker-compose.yaml
├── helm
│   ├── Chart.yaml
│   ├── templates
│   │   ├── _helpers.tpl
│   │   ├── configmap.yaml
│   │   ├── deployment.yaml
│   │   ├── ingressroute.yaml
│   │   ├── service.yaml
│   │   └── servicemonitor.yaml
│   └── values.yaml
├── migrations
│   └── ...
├── src
│   └── ...
└── tests
    └── ...

The draw to me was obvious: everything related to the application lives in one place. No need for any additional repositories, nice. The obvious downside, as I soon found out, is that when you use Dependabot / ArgoCD-Image-Updater to open PRs, you'll run your CI twice every time your deployment configuration changes. Even though I've reduced the build time for our images to ~2 minutes, these runs still add up over time, especially if you frequently update chart metadata, like adding CVE scan data.
What I ended up doing instead was keeping a separate repository for each major deployment (customer portal, contractor portal, and employee portal) and maintaining the helm charts there. Here is an idea of what that looks like:
iot_deployment ❯ tree
.
├── telemetry_service
│   ├── README.md
│   ├── argocd
│   │   └── app.yaml
│   ├── cache
│   │   └── redis-values.yaml
│   ├── database
│   │   └── cluster.yaml
│   └── helm
│       ├── Chart.yaml
│       ├── templates
│       │   ├── _helpers.tpl
│       │   ├── configmap.yaml
│       │   ├── deployment.yaml
│       │   ├── ingressroute.yaml
│       │   ├── sample-data-configmap.yaml
│       │   ├── sample-data-job.yaml
│       │   ├── service.yaml
│       │   └── servicemonitor.yaml
│       ├── values-dev.yaml
│       └── values.yaml
├── tenant_service
│   ├── README.md
│   ├── argocd
│   │   └── app.yaml
│   ├── database
│   │   └── cluster.yaml
│   ├── helm
│   │   ├── Chart.yaml
│   │   ├── templates
│   │   │   ├── _helpers.tpl
│   │   │   ├── configmap.yaml
│   │   │   ├── deployment.yaml
│   │   │   ├── ingressroute.yaml
│   │   │   ├── service.yaml
│   │   │   └── servicemonitor.yaml
│   │   └── values.yaml
│   └── secrets
│       └── ...
*Rest of services here*

This provides a clear separation of code and deployment info, and assuming you never delete tags, you can easily trace back which versions of each application were running at any point in time.
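For completeness, here is roughly what one of those argocd/app.yaml files could look like; the repo URL, revision, and namespaces are illustrative rather than our real values:

# Illustrative ArgoCD Application pointing at the deployment repo
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/iot_deployment.git
    targetRevision: main
    path: tenant_service/helm
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Because the Application tracks the deployment repo, chart and image bumps land as commits there, and the application repos' CI never sees them.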
Lesson 2: Minimize Image Bloat
When you are running many services, you are running many images. On top of that, each service is really two images because of Istio sidecars. Also, most of our users are in the same timezone, so traffic is idle for much of the day and very bursty at other times. All of these things combined create high memory pressure, which means less file system caching, page scanning from kswapd, and potentially even processes being reaped by the glorious OOM killer.
By spending a small amount of time optimizing image size, we were able to reduce the base footprint of each image from ~150MB to ~30MB without affecting the time it takes to build the image. A happy side effect is that your images are much simpler to audit, because they contain less stuff, which means a smaller risk surface.
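For reference, here is a minimal sketch of the kind of multi-stage build this implies, following the stage layout from the cargo-chef README; the binary name and the distroless runtime image are assumptions on my part, not necessarily our exact production file:

# Cache dependency builds with cargo-chef
FROM lukemathwalker/cargo-chef:latest-rust-1 AS chef
WORKDIR /app

FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
# Build dependencies first so they are cached across source-only changes
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
# "tenant_service" is an illustrative binary name
RUN cargo build --release --bin tenant_service

# Ship only the compiled binary on a minimal base: no shell, no package
# manager, far less to audit
FROM gcr.io/distroless/cc-debian12 AS runtime
COPY --from=builder /app/target/release/tenant_service /usr/local/bin/tenant_service
ENTRYPOINT ["/usr/local/bin/tenant_service"]

The last stage is where the savings come from: the final image carries only the binary and the C runtime it links against.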
Lesson 3: Choose an OPA Compatible Proxy
Open Policy Agent is a language-agnostic tool that handles authorization in your microservices. Prior to OPA, I'd write something similar to the following:
use axum::{extract::State, http::HeaderMap, Json};
use jsonwebtoken::{decode, Algorithm, DecodingKey, Validation};
use serde::{Deserialize, Serialize};

// Whatever your JWT format is according to your authorization server
#[derive(Debug, Serialize, Deserialize)]
struct Claims {
    sub: String,
    exp: usize,
    iat: Option<usize>,
    // ...
    #[serde(rename = "X-Company-Admin")]
    x_company_admin: bool,
}

async fn admin_only_handler(
    State(app_state): State<AppState>,
    headers: HeaderMap,
) -> Result<Json<&'static str>, Error> {
    let auth_header = headers
        .get("authorization")
        .and_then(|h| h.to_str().ok())
        .ok_or(Error::Unauthorized("Missing authorization header".to_string()))?;
    let token = auth_header
        .strip_prefix("Bearer ")
        .ok_or(Error::Unauthorized("Invalid authorization format".to_string()))?;
    let validation = Validation::new(Algorithm::HS256);
    // app_state contains the public key (if using PKC) or symmetric key from the
    // authorization server, which is fetched at runtime from a trusted endpoint
    // defined at build time
    let token_data = decode::<Claims>(
        token,
        &DecodingKey::from_secret(app_state.secret.as_ref()),
        &validation,
    )
    .map_err(|e| Error::Unauthorized(format!("Invalid token: {}", e)))?;
    let now = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_secs() as usize;
    if let Some(iat) = token_data.claims.iat {
        // 24 hours; this should be a type-safe check, but you get the idea
        if now.saturating_sub(iat) > 86400 {
            return Err(Error::Unauthorized("Token too old".to_string()));
        }
    }
    if !token_data.claims.x_company_admin {
        return Err(Error::Forbidden("Requires company admin access".to_string()));
    }
    Ok(Json("Admin access granted"))
}

Here is (roughly) that same policy in OPA:
package authz

default allow := false

allow if {
    # Extract token
    auth_header := input.headers.authorization
    startswith(auth_header, "Bearer ")
    token := substring(auth_header, 7, -1)

    # Verify and decode JWT
    [valid, _, claims] := io.jwt.decode_verify(
        token,
        {"secret": "your-secret-key", "alg": "HS256"}
    )
    valid == true

    # Check not expired
    now := time.now_ns() / 1000000000
    claims.exp >= now

    # Check admin claim
    claims["X-Company-Admin"] == true
}

As with the axum handler, you'd fetch the public key (if using PKC) or symmetric key for verifying tokens from the authorization server ahead of time. If I had looked at this before implementing all of it, I would have said, "I don't think learning a new language just for the sake of authorization policies is a good idea". Most people would probably agree, and to me it's a valid argument; the cost of learning a new language is higher than most people think. So why was I happy with OPA? Two reasons:
- Having all of your authentication and authorization done before the request even reaches the service means less code in your microservices. Less code makes a codebase easier to hold in your head (i.e. smaller mental models).
- All of your authentication / authorization policies are located in one place, and they are checked into git.
These two pretty much just mean I have to spend less time thinking about authorization in general, a win in my books.
Okay, so OPA removes the need for us to handle authentication and authorization in our code, but that is only part of the battle. The other thing we have to consider is that we need to tell OPA about the request. We have a few options:
- Call the OPA API from our axum handler
- Use a tower-http service that does this for you
- Use a reverse proxy that automatically does this before the request even gets to our application
Option 1 is explicit, which is good (other folks see the authentication / authorization happen in the handler), but the drawback is you still need to write that code for every handler.
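To make option 1 concrete, here is a rough sketch of that per-handler call, using OPA's standard data API (POST /v1/data/<policy path> with an input document). The URL, policy path, and opa_allowed helper are hypothetical, and Error is the same error type as in the handler above:

use serde_json::json;

// Hypothetical helper: ask OPA whether this request is allowed.
// OPA's data API responds with {"result": <value>}.
async fn opa_allowed(
    client: &reqwest::Client,
    headers: &axum::http::HeaderMap,
) -> Result<bool, Error> {
    let input = json!({
        "input": {
            "headers": {
                "authorization": headers
                    .get("authorization")
                    .and_then(|h| h.to_str().ok())
                    .unwrap_or_default(),
            }
        }
    });
    let resp: serde_json::Value = client
        .post("http://opa.authz.svc:8181/v1/data/authz/allow") // illustrative URL
        .json(&input)
        .send()
        .await
        .map_err(|e| Error::Unauthorized(format!("OPA unreachable: {}", e)))?
        .json()
        .await
        .map_err(|e| Error::Unauthorized(format!("Invalid OPA response: {}", e)))?;
    Ok(resp["result"].as_bool().unwrap_or(false))
}

Every handler then starts with a call to opa_allowed and an early return, which is exactly the repetition this option suffers from.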
Option 2 is nice, because you don't have to remember to validate the token each time; you can just check the claims.
Option 3 is really nice, because your application doesn't even need to know about authorization or authentication. It just receives a set of headers (i.e. containing tenant context), and lets you get to work.
I would recommend option 3 in most cases. Some reverse proxies like Istio's gateway (i.e. Envoy) have native integrations for OPA, which reduces the amount of effort needed to get your system up and running. Other proxies like Traefik should be able to work with OPA too, perhaps with forward auth, similar to how you'd set up Authentik or Authelia for apps with no native authentication.
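For the Traefik route, a forward-auth middleware might look something like the sketch below. The authz-shim service is hypothetical: Traefik treats any 2xx from the forward-auth address as "allowed", while OPA's own API returns 200 even on a deny (with result: false), so you'd want a thin shim in between that translates a deny into a 403:

# Hypothetical Traefik forward-auth setup; authz-shim is a small service
# that queries OPA and returns 200 (allow) or 403 (deny)
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: opa-auth
  namespace: production
spec:
  forwardAuth:
    address: http://authz-shim.production.svc:8080/check
    # Headers the shim derives from the verified token (e.g. tenant context)
    authResponseHeaders:
      - X-Tenant-Id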
Lesson 4: Tenant Isolation is Tricky
For most clients, we use a pooled model for our services. This means that requests for tenant A and tenant B flow through the same services, and ultimately to the same database. Compare this to the siloed model, which our premium-tier customers get: each tenant runs its own copy of the services in its own namespace.
When we operate in a siloed model, our life as developers is rather straightforward. We need to confirm (through testing) that requests to tenantA.nullspaces.io do indeed get routed to the service in the correct namespace. Assuming we have proper namespace policies, we can be confident that once the request lands in that service in the tenantA namespace, it won't be able to access tenant B data.
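Since each service already ships an ingressroute.yaml (see the trees above), the routing under test can be sketched as a Traefik IngressRoute; the hostnames and service names here are illustrative:

# Illustrative per-tenant route: host-based match pinned to tenant A's namespace
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: customer-portal
  namespace: tenant-a
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`tenanta.nullspaces.io`)
      kind: Rule
      services:
        - name: customer-portal # resolves within tenant-a only
          port: 8080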
In the pooled model, life is a little more interesting. We need some mechanisms in place to prevent tenants from accessing each other's data. I will assume we are talking about a standard REST CRUD interface at this point, no asynchronous messaging funny business.
The first thing we might think of is having a separate database for each tenant. This would provide a fairly high degree of isolation, as the cluster is the only shared resource. But now we have to talk about backing up a database per tenant. If we were using purely cloud backups, this would be feasible because they back up the disks directly, but we also run logical replication in our cluster, so that's a no-go.
Going down a level, perhaps we share a single database among all of our tenants, but create individual tables for each tenant. One problem with this solution is that with 10 tables per tenant, you'll soon be managing thousands of tables, which also doesn't seem super practical. Further, imagine we have an admin portal where we should be able to list all devices, regardless of tenant. With the table-per-tenant solution, we'd be running thousands of selects, and still need to stitch the results together at the end, yikes.
Moving another level down, we get to a shared table, where each piece of data is marked with a tenant identifier of some sort. We'd get a schema similar to the following:
CREATE TABLE devices (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
device_name VARCHAR(255) NOT NULL,
location VARCHAR(255),
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
install_date TIMESTAMPTZ,
is_deleted BOOLEAN NOT NULL DEFAULT FALSE,
UNIQUE(tenant_id, device_name)
);

This setup makes our onboarding process a whole lot easier. All we need to do when adding a new tenant is create a new row in our tenants table. The price we pay for a solution like this is that developers will always need to remember to filter by tenant context, for example:
// ...
let devices = sqlx::query_as::<_, Device>(
    "SELECT id, tenant_id, device_name, location, latitude, longitude, install_date
     FROM devices WHERE is_deleted = FALSE AND tenant_id = $1",
)
.bind(tenant_id)
.fetch_all(&mut *tx)
.await?;

Part of our job as developers is to prevent possible misuse. It isn't good enough to assume everyone will always remember to filter by tenant_id. This is where a solution like RLS comes in.
Postgres has the notion of Row Level Security (RLS), which can be used to ensure that you never leak information to other tenants. Each time we run a query, we set a variable that our database then uses to filter our data for us. First, the definition:
ALTER TABLE devices ENABLE ROW LEVEL SECURITY;
ALTER TABLE devices FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON devices
USING (tenant_id = current_setting('app.tenant_id', true)::uuid);

And how we use it:
// scope our transaction to tenant specific data
// tenant_id is a Uuid, so this interpolation cannot inject SQL; SET LOCAL
// does not accept bind parameters, which is why we use format! here
sqlx::query(&format!("SET LOCAL app.tenant_id = '{}'", tenant_id))
    .execute(&mut *tx)
    .await?;
// query for the data
// notice how we no longer need to check for a matching tenant_id
let devices = sqlx::query_as::<_, Device>(
    "SELECT id, tenant_id, device_name, location, latitude, longitude, install_date
     FROM devices WHERE is_deleted = FALSE",
)
.fetch_all(&mut *tx)
.await?;

And if we forget to set the variable, we get nothing back. Much better, and something our test suite should easily catch:
#[tokio::test]
pub async fn test_devices_tenant_isolation() -> Result<()> {
    let app = spawn_app().await.context("spawn testing app")?;
    let client = &app.api_client;

    // Create two tenants
    let tenant1_id = Uuid::new_v4();
    let tenant2_id = Uuid::new_v4();

    // Create devices for each tenant
    let device1 = TestDevice::new()
        .with_device_name("Tenant 1 Device")
        .with_tenant_id(tenant1_id);
    let device2 = TestDevice::new()
        .with_device_name("Tenant 2 Device")
        .with_tenant_id(tenant2_id);
    let _device1_id = insert_test_device(&app.db_pool, device1)
        .await
        .context("insert device for tenant 1")?;
    let _device2_id = insert_test_device(&app.db_pool, device2)
        .await
        .context("insert device for tenant 2")?;

    // Test that tenant1 can only see its own devices
    let response = client
        .get(&format!("{}/v1/device", app.api_base_url()))
        .header("X-Tenant-Id", tenant1_id.to_string())
        .send()
        .await
        .context("get devices for tenant 1")?;
    assert_eq!(response.status(), StatusCode::OK);
    let paginated: PaginatedResponse<Device> = response.json().await.context("parse response")?;
    assert_eq!(paginated.data.len(), 1);
    assert_eq!(paginated.data[0].device_name, "Tenant 1 Device");
    assert_eq!(paginated.data[0].tenant_id, tenant1_id);
    Ok(())
}

Great, we are protecting ourselves from accidental misuse, but was there any cost? If you remember, we discussed an admin portal where we list all devices with a single query. Is that still possible? Well, no! So what are our options?
- Create a separate user that has the BYPASSRLS attribute, and run a separate instance of the microservice that connects as that role.
- Each microservice maintains a connection pool for both the RLS-enforced user and the user that bypasses RLS. You decide which pool to use depending on the route (i.e. /v1/admin/devices uses the pool that bypasses RLS; see the sketch after this list).
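Here is a rough sketch of what option (2) could look like, with illustrative names, using axum's per-router state so the type system tracks which pool a handler can reach:

use axum::{extract::State, routing::get, Router};
use sqlx::PgPool;

// Hypothetical states: one per database role
#[derive(Clone)]
struct TenantState {
    pool: PgPool, // connects as the RLS-enforced role
}

#[derive(Clone)]
struct AdminState {
    pool: PgPool, // connects as the BYPASSRLS role
}

async fn list_devices(State(_state): State<TenantState>) -> &'static str {
    // queries here go through _state.pool and are always RLS-filtered
    "tenant-scoped devices"
}

async fn list_all_devices(State(_state): State<AdminState>) -> &'static str {
    // queries here go through _state.pool and bypass RLS
    "all devices"
}

fn app(tenant_pool: PgPool, admin_pool: PgPool) -> Router {
    // The split happens once, at the router level, not in each handler
    let tenant_routes = Router::new()
        .route("/v1/device", get(list_devices))
        .with_state(TenantState { pool: tenant_pool });
    let admin_routes = Router::new()
        .route("/v1/admin/devices", get(list_all_devices))
        .with_state(AdminState { pool: admin_pool });
    tenant_routes.merge(admin_routes)
}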
Option (2) is nice because we don't need to have 2 separate applications running, but we do run the risk of using the wrong connection pool when developing. Assuming no mistake is made, another upside to this is that it's easy to define network policies in Cilium or Istio that ensure services don't call the wrong endpoint (i.e. the admin portal can call /v1/admin/* while the customer portal cannot):
# Cilium
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: tenant-service-admin-access
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: tenant-service
  ingress:
    # Admin portal can access admin routes
    - fromEndpoints:
        - matchLabels:
            app: admin-portal
      toPorts:
        - ports:
            - port: '8080'
              protocol: TCP
          rules:
            http:
              - method: 'GET|POST|PUT|DELETE'
                path: '/v1/admin/.*'
              - method: 'GET'
                path: '/health'
---
# Istio
# Assume service account is already made
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: tenant-service-admin-access
  namespace: production
spec:
  selector:
    matchLabels:
      app: tenant-service
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ['cluster.local/ns/production/sa/admin-portal']
      to:
        - operation:
            methods: ['GET', 'POST', 'PUT', 'DELETE']
            paths: ['/v1/admin/*']
    - from:
        - source:
            principals: ['cluster.local/ns/production/sa/admin-portal']
      to:
        - operation:
            methods: ['GET']
            paths: ['/health']

Option (1) has the benefit of a clean separation of concerns. The admin portal shares no services with the customer portal, meaning the customer portal always has RLS enabled for every request. This makes auditing a little simpler, and also makes deploying changes to our admin flow less worrisome, because the customer deployment is completely separate.
Ultimately I picked option (1) to begin with, because it seemed simpler to implement and provides clear boundaries between our portals. Because our services are fairly lightweight (~30MB), the resource cost of running a second instance is fairly low. I do believe there is probably a better method that I haven't found yet, like disabling RLS for particular transactions (something we could write a wrapper for). For now, I am happy with this, and when time permits, I will come back and see if there is a more elegant solution.
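For what it's worth, that wrapper idea might look something like the following sketch, which keeps the bypassing pool private to a single type so every RLS-free access is greppable; this is entirely hypothetical, not something we run today:

use sqlx::{PgPool, Postgres, Transaction};

// Hypothetical: the only type in the codebase allowed to hold the
// BYPASSRLS pool
pub struct AdminDb {
    bypass_pool: PgPool, // connects as the role with BYPASSRLS
}

impl AdminDb {
    pub fn new(bypass_pool: PgPool) -> Self {
        Self { bypass_pool }
    }

    /// Begin a transaction that is not subject to RLS. Auditing for
    /// RLS-free access reduces to grepping for this one method.
    pub async fn begin_bypass(&self) -> sqlx::Result<Transaction<'static, Postgres>> {
        self.bypass_pool.begin().await
    }
}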