Zero-Trust on Istio: The Complete Architecture

Zero-trust is a property that emerges when several independent checks each verify a request on their own, without assuming the layers around them already did the work. It often gets described as a product you install or a setting you switch on, which badly undersells how much has to line up for it to actually hold.

I learned that while building an HR directory designed to be attacked. The rule I wanted it to hold was simple: if any single pod gets compromised, it still can't read data it has no business touching. That means salaries, bank details, device serial numbers: the things an attacker would actually go after.

No single mechanism enforces that rule. mTLS proves which service is on the other end of a connection, but it has nothing to say about whether that service should be making the call. A signed identity token establishes who the user is, yet a service that trusts raw headers will believe whatever an attacker sends it. Authorization decides what a user is allowed to do, and then an application bug skips the check and the database returns the whole table anyway. Every layer has a blind spot, and the thing that covers it is another layer that doesn't share the same one.

So the architecture is a stack: every hop independently verified, every request authenticated and authorized, each layer built on the assumption that the one before it might fail. Remove a single layer and the others are left exposed in the exact spots they were relying on it to cover.

This post is the architecture overview for the series: the application, the five security layers, how they compose into a single request flow, and what specifically breaks when you remove one. It all runs end-to-end on Istio on a local Kind cluster with real config. Later posts go deep into the YAML, Python, and policy files.

What you need to know going in: Istio is a service mesh for Kubernetes. The key idea is that every pod gets an Envoy sidecar proxy. All inbound and outbound traffic passes through the sidecar, which means identity verification, routing rules, and policy enforcement happen at the infrastructure layer before your application code sees the request. If you understand containers and basic networking, the rest will make sense as we go.

The Application

The domain is deliberately simple: an HR Employee Directory. Five microservices, a few database tables, a handful of roles. The security architecture is the interesting part. The demo exercises four of those roles through four personas:

Alice (employee): on her own profile she sees full PII and exact salary; for everyone else she gets only name, title, and department.
Mary (manager): for Alice, her direct report, she sees the salary band but not exact pay or SSN; for everyone else she gets the same public view as any employee.
Henry (HR admin): full HR fields on all employees (salary, SSN, etc.), but no full IT asset serial numbers.
Ivan (IT admin): only basic employee fields everywhere, but full hardware/asset details including serials.

The services:

Profile Aggregator (ms1-profile-aggregator) is a fan-out orchestrator that calls Employee Records and Device Inventory, merges the results, and returns a single profile response. Gateway-facing, requires authentication.
Employee Records (ms2-employee-details) is the source of truth for employee records, PII, and financial data (salary, bank details). Internal only, not directly reachable from outside.
Device Inventory (ms3-hardware-assets) tracks which devices are assigned to which employees (serial numbers, MAC addresses). Internal only, same as Employee Records.
Holiday Calendar (ms4-holiday-calendar) handles company holidays. Gateway-facing, requires authentication. Read by everyone, written by HR.
Office Directory (ms5-office-locations) serves office location data. Gateway-facing, no authentication required for reads. The only fully public endpoint in the system.

The system defines six roles: employee, manager, hr_admin, it_admin, public_data_admin, and security_auditor. The first four drive the HR/profile flow and are the focus of this series. public_data_admin manages office and holiday data, and security_auditor has read-only access for compliance. The point that matters for security: each role sees a different slice of data from the same endpoints, controlled entirely by policy.

The tiering matters for security:

Tier 1 (gateway-facing): Profile Aggregator, Holiday Calendar, Office Directory. Reachable from the Istio ingress gateway.
Tier 2 (internal only): Employee Records, Device Inventory. Reachable exclusively from Profile Aggregator. No direct external path exists.
Tier 3 (data): PostgreSQL. Reachable only from specific service identities.

Holiday Calendar and Office Directory cannot reach Employee Records or Device Inventory regardless of what code runs inside them. Profile Aggregator also cannot connect directly to PostgreSQL. The mesh intercepts every connection attempt and checks whether the calling service is allowed to talk to the destination. If it's not on the allow-list, the connection is rejected before any application code even sees the request.

An attacker who compromises Holiday Calendar cannot reach Employee Records. The network topology itself is a security boundary, enforced by Istio AuthorizationPolicy.
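
To make the tiering concrete, here is a minimal Python sketch of the default-deny allow-list that the mesh enforces. It is a model of the idea, not the actual AuthorizationPolicy YAML. The caller name istio-ingressgateway and the set of services allowed to reach PostgreSQL are assumptions for illustration; the post only says the database is reachable from specific service identities and that Profile Aggregator is not one of them.

```python
# Hypothetical allow-list mirroring the tiers described above.
# Keys are destinations; values are the callers permitted to reach them.
ALLOWED_CALLERS = {
    # Tier 1: reachable from the ingress gateway (name is an assumption).
    "ms1-profile-aggregator": {"istio-ingressgateway"},
    "ms4-holiday-calendar": {"istio-ingressgateway"},
    "ms5-office-locations": {"istio-ingressgateway"},
    # Tier 2: reachable exclusively from Profile Aggregator.
    "ms2-employee-details": {"ms1-profile-aggregator"},
    "ms3-hardware-assets": {"ms1-profile-aggregator"},
    # Tier 3: which services own a database connection is an assumption.
    "postgresql": {"ms2-employee-details", "ms3-hardware-assets"},
}


def can_connect(source: str, destination: str) -> bool:
    """Default deny: a connection is allowed only if explicitly listed."""
    return source in ALLOWED_CALLERS.get(destination, set())


if __name__ == "__main__":
    # A compromised Holiday Calendar pod still cannot reach Employee Records.
    print(can_connect("ms4-holiday-calendar", "ms2-employee-details"))
    print(can_connect("ms1-profile-aggregator", "ms2-employee-details"))
```

The important property is the default: an unknown destination or an unlisted caller gets a rejection, so adding a new service exposes nothing until someone writes a rule for it.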

The Threat Model

Every security layer exists because of a specific assumption about what can go wrong. These are the threats we're designing against:

An external user may send arbitrary headers, including internal trust headers, in their HTTP request.
Any single application pod may be compromised and act maliciously within whatever network access it has.
Application code may have bugs: missing authorization checks, overly broad queries, endpoints that skip validation.
Database queries may return more data than intended if the application doesn't filter correctly.
Signing keys and secrets must not live inside application pods, because any compromised pod leaks whatever it holds.

Each layer in the architecture addresses one or more of these. When we get to "What Happens When You Skip a Layer," you can map each failure directly back to this list.

The Layers

Five distinct security layers, each addressing an attack surface the others can't cover.

1. Transport Security (mTLS)

Mutual TLS encrypts data in transit and verifies the identity of both ends of the connection. Every workload in the mesh presents a SPIFFE identity certificate through its sidecar, and the receiving sidecar validates it before any application data is exchanged.

This gives you encrypted communication and peer identity verification. A rogue service without a valid mesh certificate cannot establish a connection. But mTLS has a specific scope: it verifies that Service A is Service A. It does not determine whether Service A is allowed to call Service B with this particular payload for this particular user. That's a different problem entirely.
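
For a feel of what that peer identity looks like, Istio encodes it as a SPIFFE ID of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. The sketch below parses one in Python. The namespace hr and the service account name are assumptions for illustration, not values taken from the demo repository.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split an Istio-style SPIFFE ID into its three identity components."""
    parsed = urlparse(spiffe_id)
    parts = parsed.path.strip("/").split("/")
    well_formed = (
        parsed.scheme == "spiffe"
        and parsed.netloc != ""
        and len(parts) == 4
        and parts[0] == "ns"
        and parts[2] == "sa"
        and all(parts)
    )
    if not well_formed:
        raise ValueError(f"not an Istio-style SPIFFE ID: {spiffe_id!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])


if __name__ == "__main__":
    # Hypothetical identity for Profile Aggregator in an "hr" namespace.
    print(parse_spiffe_id("spiffe://cluster.local/ns/hr/sa/ms1-profile-aggregator"))
```

Note what is absent from the identity: there is no user, no role, and no request path. That is exactly the scope limit described above.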

2. Identity Propagation (Signed Mesh Tokens)

Once a request enters the mesh and passes authentication at the gateway, the system needs to carry the user's verified identity through every downstream hop. A request from Alice hits Profile Aggregator, which then fans out to Employee Records and Device Inventory. Both downstream services need to know it's Alice, what role she holds, and that the request is legitimate.

This is done via a cryptographically signed JWT, the x-mesh-identity token, minted by Auth Service (auth-service) and signed by Vault Transit. The token carries who the user is (sub), what roles they hold (roles_csv), which services this token is valid for (aud), and who is acting on behalf of the user (act, for service-to-service delegation).

Any downstream service can verify this token using the public key from the Auth Service JWKS endpoint. No callback to Auth Service needed. But the token alone isn't sufficient. If projected headers like x-ms2-user arrive from outside the mesh without being stripped, a service reading those headers trusts attacker-controlled values. That's why header stripping at the gateway is non-negotiable, and why each destination sidecar also strips and re-projects these headers from the verified JWT before the request reaches the application.

3. Request Authorization (Policy Enforcement)

Knowing who someone is doesn't mean they're allowed to do what they're asking. Authorization in this architecture operates at four granularities, and each one catches what the coarser one before it misses:

Network-level: Can this pod even reach that pod? Istio AuthorizationPolicy enforces which service identities are allowed to initiate connections.
Request-level: Is this user, with this role, allowed to call this endpoint? The ExtAuthz check at the gateway applies coarse route-level decisions.
Resource-level: Can this manager see this specific employee's record? Cerbos evaluates fine-grained, context-aware policies.
Field-level: Can this role see salary data, or should it be masked? Cerbos again, with policies that return which fields to include per role.

Why four levels instead of one? Because a single policy engine can't efficiently handle all of these. The network-level check happens at the sidecar before any application code runs. The field-level check requires knowing the actual resource attributes at query time. Different levels have different information available and different enforcement points.
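
The field-level idea is easiest to see in code. The sketch below is plain Python, not a Cerbos policy, and the field names (salary_band, ssn, device_serial, and so on) are hypothetical. It only mirrors the persona rules from the start of the post: the decision returns a set of visible fields, and the service drops everything else before responding.

```python
PUBLIC_FIELDS = {"name", "title", "department"}


def visible_fields(role: str, is_self: bool = False, is_direct_report: bool = False) -> set:
    """Return the fields a caller may see on one employee record."""
    fields = set(PUBLIC_FIELDS)
    if is_self:
        # An employee sees full PII and exact salary on their own profile.
        fields |= {"ssn", "salary", "bank_details"}
    if role == "manager" and is_direct_report:
        # Managers see the band, never exact pay or SSN.
        fields |= {"salary_band"}
    if role == "hr_admin":
        fields |= {"ssn", "salary", "bank_details"}
    if role == "it_admin":
        fields |= {"device_serial"}
    return fields


def mask(record: dict, allowed: set) -> dict:
    """Drop every field that is not explicitly allowed."""
    return {key: value for key, value in record.items() if key in allowed}


if __name__ == "__main__":
    alice = {"name": "Alice", "title": "Engineer", "department": "R&D",
             "salary": 100000, "salary_band": "B3", "ssn": "000-00-0000",
             "device_serial": "SN-123"}
    print(mask(alice, visible_fields("manager", is_direct_report=True)))
```

Returning an allow-set rather than a deny-set means a newly added column stays hidden until a policy explicitly grants it.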

4. Data Access Security (Row-Level Security)

The database is the last line of defense. PostgreSQL Row-Level Security policies enforce visibility at the engine level based on the current transaction context. Even if application code has a bug, a missing authorization check or a broken filter, RLS ensures the database itself drops rows the user shouldn't see.

Application developers make mistakes. A query like SELECT * FROM employees hitting production without a proper WHERE clause should still not leak the entire company directory. RLS makes that guarantee at the database layer, independent of whatever the application did or didn't check.

There is a trust boundary here: the application sets app.current_user_id on the transaction, and RLS uses that value. If the application sets it incorrectly, RLS enforces the wrong identity. In normal operation, the identity comes from sidecar-projected headers, so the application is a pass-through, not a source. But RLS does not protect against a fully compromised application process that sets arbitrary transaction context. RLS catches bugs (missing filters, broad queries). It does not stop a malicious process that controls its own DB session. The defense against that scenario comes from the other layers: AuthorizationPolicy limiting which pods can reach the database, scoped DB roles limiting which tables they can query, and Vault ensuring compromised pods can't escalate signing privileges.
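
Here is a minimal sketch of that hand-off on the application side. It assumes a DB-API driver that uses %s placeholders (psycopg is one such driver) and a cursor that is already inside a transaction; the helper name query_as_user is hypothetical. The third argument to PostgreSQL's set_config is is_local: passing true scopes the value to the current transaction, so it cannot leak to the next request that reuses a pooled connection. The RecordingCursor class exists only so the sketch runs without a database.

```python
def query_as_user(cursor, user_id: str, sql: str, params: tuple = ()):
    """Bind the caller's identity to the transaction, then run the query."""
    # is_local=true: the setting is discarded when the transaction ends.
    cursor.execute(
        "SELECT set_config('app.current_user_id', %s, true)", (user_id,)
    )
    # RLS policies read the setting; the query itself needs no WHERE clause
    # for the database to filter rows.
    cursor.execute(sql, params)
    return cursor.fetchall()


class RecordingCursor:
    """Stand-in for a real cursor that records what would be executed."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=()):
        self.statements.append((sql, tuple(params)))

    def fetchall(self):
        return []


if __name__ == "__main__":
    cur = RecordingCursor()
    query_as_user(cur, "alice", "SELECT * FROM employees")
    for statement in cur.statements:
        print(statement)
```

The user_id argument is the trust boundary in miniature: whatever the application passes here is what RLS will enforce.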

5. Secrets Management (Vault)

Secrets stored in pods are one of the most common attack surfaces in Kubernetes: environment variables, mounted files, ConfigMaps. Assume any compromised pod leaks whatever secrets it holds.

In this architecture, Vault serves a specific critical function: the Transit engine signs mesh identity tokens without ever exposing the private key to Auth Service or any application pod. Auth Service sends bytes to Vault; Vault signs them and returns the signature. If Auth Service is compromised, the attacker can ask Vault to sign things (until you revoke access), but they never get the key itself, so they can't take it offline and mint tokens indefinitely.
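
The shape of that delegation can be sketched without any Vault client at all. In the Python below, the sign parameter is a hypothetical callable standing in for the Vault Transit call: it takes the bytes to sign and returns raw signature bytes. The RS256 algorithm name is an assumption, since this post does not say which key type the demo uses. The point is that the minting code only ever handles the signing input and the resulting signature.

```python
import base64
import json
from typing import Callable


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by the JWT compact format."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_token(claims: dict, sign: Callable[[bytes], bytes], alg: str = "RS256") -> str:
    """Assemble a compact JWT, delegating the signature to `sign`."""
    header = {"alg": alg, "typ": "JWT"}
    encoded_header = b64url(json.dumps(header, separators=(",", ":")).encode())
    encoded_claims = b64url(json.dumps(claims, separators=(",", ":")).encode())
    signing_input = f"{encoded_header}.{encoded_claims}".encode("ascii")
    # The private key lives behind `sign`; this process never sees it.
    signature = sign(signing_input)
    return f"{signing_input.decode('ascii')}.{b64url(signature)}"


if __name__ == "__main__":
    def fake_sign(data: bytes) -> bytes:
        # Placeholder for the remote signing call; NOT a real signature.
        return b"sig"

    print(mint_token({"sub": "alice"}, fake_sign))
```

Revoking the service's access to the signing call is then enough to stop new tokens, which is the property the local-key alternative cannot offer.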

Beyond signing, Vault is the natural home for database credentials, API keys, TLS certificates, and anything that would otherwise be scattered across pod specs. The pattern matters more than our specific usage: services request secrets through controlled APIs rather than storing them at rest inside the pod.

How the Layers Compose

These layers form a chain where each one depends on the guarantees of the ones around it. Here's what happens when a request enters the system.

An external request hits the Istio Gateway. TLS terminates. Before anything else, an EnvoyFilter strips all internal trust headers (x-mesh-identity, x-ms*-user, x-role-*), ensuring nothing spoofable survives from outside. This is the single most important security mechanism in the entire architecture. Without it, everything downstream can be bypassed by setting a header.
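
The stripping rule itself is small. The sketch below expresses it in Python rather than as the real EnvoyFilter, purely to show the matching logic. The first three patterns come straight from the list above; x-ms*-role is an assumption added here because the per-service role header (x-ms2-role) that appears later in the flow would not be matched by the other three.

```python
from fnmatch import fnmatchcase

INTERNAL_HEADER_PATTERNS = (
    "x-mesh-identity",
    "x-ms*-user",
    "x-role-*",
    "x-ms*-role",  # assumption: covers projected role headers such as x-ms2-role
)


def strip_internal_headers(headers: dict) -> dict:
    """Return a copy of `headers` without any internal trust header."""
    return {
        name: value
        for name, value in headers.items()
        # HTTP header names are case-insensitive, so compare in lowercase.
        if not any(fnmatchcase(name.lower(), p) for p in INTERNAL_HEADER_PATTERNS)
    }


if __name__ == "__main__":
    incoming = {
        "Accept": "application/json",
        "X-MS2-User": "henry",         # spoofed by the caller
        "x-mesh-identity": "forged",   # spoofed by the caller
    }
    print(strip_internal_headers(incoming))
```

Whatever form the real filter takes, the test worth writing is the one in the usage example: send every internal header from outside and confirm none of them survive.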

The gateway's ExtAuthz filter calls Auth Service. Auth Service validates either the session cookie (opaque, database-backed) or the Bearer token (validated against Keycloak's JWKS). If valid, it calls Vault Transit to sign a short-lived mesh identity token with a 5-minute TTL. This token comes back to Envoy as a response header and gets injected into the request.
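
As a rough sketch of what gets signed at this step, the Python below builds the claim set with the 5-minute TTL. The claim names sub, roles_csv, aud, and act are the ones named earlier in this post; iat and exp are the standard JWT timestamps. The shape of act (an object with a sub field) and the helper names are assumptions.

```python
import time
from typing import Optional, Sequence

MESH_TOKEN_TTL_SECONDS = 5 * 60


def build_mesh_claims(sub: str, roles: Sequence[str], audience: str,
                      actor: Optional[str] = None,
                      now: Optional[float] = None) -> dict:
    """Build the claim set for a short-lived mesh identity token."""
    issued_at = int(time.time() if now is None else now)
    claims = {
        "sub": sub,
        "roles_csv": ",".join(roles),
        "aud": audience,
        "iat": issued_at,
        "exp": issued_at + MESH_TOKEN_TTL_SECONDS,
    }
    if actor is not None:
        # Assumed shape: the service acting on behalf of the user.
        claims["act"] = {"sub": actor}
    return claims


def is_expired(claims: dict, now: Optional[float] = None) -> bool:
    """A token must be rejected at or after its exp timestamp."""
    current = time.time() if now is None else now
    return current >= claims["exp"]


if __name__ == "__main__":
    print(build_mesh_claims("alice", ["employee"], "ms1-profile-aggregator"))
```

A short TTL bounds how long a stolen token is useful, which matters because these tokens are verified offline against the JWKS and cannot be revoked individually.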

Envoy routes the request to the target service. At the destination, a RequestAuthentication resource validates the x-mesh-identity token against the Auth Service JWKS endpoint. Istio's native outputClaimToHeaders projects the validated claims into service-specific headers (x-ms2-user, x-ms2-role). The service never touches the JWT directly. It just reads simple headers that the sidecar guarantees are legitimate.
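
The projection step can be modelled in a few lines. This Python sketch is a stand-in for what the sidecar does, not Istio code: it discards any caller-supplied headers in the service's namespace and rewrites them from claims that have already passed signature validation. Mapping sub to the user header and roles_csv to the role header is an assumption about the demo's configuration.

```python
def project_claims(headers: dict, verified_claims: dict, service: str = "ms2") -> dict:
    """Replace x-<service>-* headers with values from verified claims."""
    prefix = f"x-{service}-"
    # Drop anything the caller sent under the trusted prefix.
    projected = {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(prefix)
    }
    # Re-create the headers from the validated token only.
    projected[f"{prefix}user"] = verified_claims["sub"]
    projected[f"{prefix}role"] = verified_claims["roles_csv"]
    return projected


if __name__ == "__main__":
    request_headers = {"accept": "application/json", "X-MS2-User": "henry"}
    claims = {"sub": "alice", "roles_csv": "employee"}
    print(project_claims(request_headers, claims))
```

The order matters: strip first, then project. A projection that only fills in missing headers would let a forged value win.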

AuthorizationPolicy at the destination checks both the mTLS peer identity (is the calling pod's service account allowed?) and the token claims (is the audience correct? is the delegation chain valid?). A token minted for Employee Records cannot be replayed against Device Inventory because the AuthorizationPolicy checks the aud claim. Both the source principal and the token audience must match for the request to proceed.
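
The "both must match" rule is worth spelling out, because either check alone is bypassable. The Python below is a model of the decision, not the AuthorizationPolicy resource itself. The principal strings follow Istio's <trust-domain>/ns/<namespace>/sa/<service-account> form, with the hr namespace and the service account names assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DestinationPolicy:
    allowed_principals: frozenset
    required_audience: str


def authorize(policy: DestinationPolicy, peer_principal: str, claims: dict) -> bool:
    """Allow only if the mTLS peer AND the token audience both match."""
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    peer_ok = peer_principal in policy.allowed_principals
    audience_ok = policy.required_audience in audiences
    return peer_ok and audience_ok


# Hypothetical policies for the two internal services.
AGGREGATOR = "cluster.local/ns/hr/sa/ms1-profile-aggregator"
EMPLOYEE_RECORDS = DestinationPolicy(frozenset({AGGREGATOR}), "ms2-employee-details")
DEVICE_INVENTORY = DestinationPolicy(frozenset({AGGREGATOR}), "ms3-hardware-assets")

if __name__ == "__main__":
    token = {"sub": "alice", "aud": "ms2-employee-details"}
    print(authorize(EMPLOYEE_RECORDS, AGGREGATOR, token))  # intended use
    print(authorize(DEVICE_INVENTORY, AGGREGATOR, token))  # replay attempt
```

The second call is the replay case from the paragraph above: the caller is a legitimate peer, but the token was minted for a different audience, so the request is denied.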

The service calls Cerbos for fine-grained authorization. Can this user, with this role, perform this action on this specific resource? Cerbos returns both an allow/deny decision and a list of visible fields.

Finally, the database query executes with RLS active. The application calls set_config('app.current_user_id', ...) on the transaction before querying, and PostgreSQL enforces row visibility at the engine level.

Header stripping, authentication, token minting, token validation with header projection, authorization policy, Cerbos, and RLS. Seven checkpoints. Each one reduces the blast radius of failures in the others.

What Happens When You Skip a Layer

Each layer exists because the others have blind spots.

Skip header stripping? An external attacker sets x-mesh-identity: <forged-token> in their HTTP request. The token won't pass signature validation at the destination sidecar. But the projected headers (x-ms2-user, x-ms2-role) are the real danger. If these arrive from outside without being stripped, and the service reads them directly, the attacker controls the identity. Header stripping ensures no internal trust header survives from outside the mesh, regardless of whether it's a signed token or a raw projected header.

Skip signed tokens (use raw headers instead)? Any compromised pod can forge headers. Without cryptographic signing, there's no way to distinguish a legitimate identity assertion from one injected by a rogue process.

Skip authorization policies? A valid user with a valid token can call any service and access any resource, regardless of whether they should. Authentication without authorization is an all-access pass.

Skip RLS? A bug in your application's authorization logic, a missing check or a broken filter, becomes a data breach. The database has no independent enforcement, so it returns whatever the query asks for.

Skip Vault (use local keys)? If Auth Service is compromised, the attacker has the signing key. They can mint tokens for any user with any role, indefinitely, until you detect the breach and rotate. With Vault Transit, the key never leaves Vault. You revoke Auth Service access and rotate immediately.

Why a Service Mesh

The core idea: your application service has a simple contract. It reads x-ms2-user and x-ms2-role from request headers and does its business logic. It doesn't know about mTLS certificates, JWT validation, OIDC flows, or policy engines. The sidecar proxy handles all of that.

This matters for three reasons.

First, you can add services to the mesh without rewriting their auth logic. The security infrastructure wraps around them.
Second, when the security architecture evolves (new signing algorithm, new policy engine, new identity provider) application code doesn't change. Only mesh configuration does.
Third, application developers focus on domain logic while security is configured at the infrastructure layer. These concerns don't bleed into each other.

The complexity is real. You need Kubernetes knowledge, you need to understand Envoy, you need to debug sidecar injection and filter ordering. But the tradeoff is concrete: instead of implementing auth in every single service (and maybe getting it wrong in one of them), you configure it once at the infrastructure layer. The complexity is upfront and centralized instead of distributed across every service where it's invisible until it fails.

What's Next

The next post peels back the identity flow, from the gateway EnvoyFilter that strips headers, through the ExtAuthz check, to the signed mesh token landing on a downstream service. Real YAML, real Python, real attack scenarios.

https://blogs.kaustavsarkar.dev/identity-flow-gateway-to-database

Code:

https://github.com/Kaustav-Sarkar/Istio-Defense-In-Depth
