doc: Add FAQ on invalid internal config.

2026-02-25 18:20:59 +01:00
parent 3f50782f58
commit 17a18c335b


@@ -0,0 +1,246 @@
# FAQ: Internal Services Not Accessible via `envoy-internal`
> **Symptoms**: A service exposed via `envoy-internal` is unreachable from LAN clients even though the Flux HelmRelease, HTTPRoute, and pods all appear healthy.
---
## How Internal DNS Works in This Cluster
This cluster uses two separate ingress paths and two separate DNS mechanisms:
```mermaid
graph TD
subgraph Internet
CF[Cloudflare DNS]
end
subgraph LAN
Win[Windows / LAN Client]
PH[Pi-hole\n10.0.0.156 → k8s-gateway]
end
subgraph Kubernetes - network namespace
EXT[envoy-external\nLB: 10.0.0.158]
INT[envoy-internal\nLB: 10.0.0.157]
EXTDNS[cloudflare-dns\nexternal-dns\n--gateway-name=envoy-external]
K8SGW[k8s-gateway\nCoreDNS plugin\n10.0.0.156:53\nwatches HTTPRoutes]
end
CF -->|CNAME: external.laurivan.com| EXT
EXTDNS -->|creates DNS records| CF
EXTDNS -.->|ignores envoy-internal routes| INT
PH -->|resolves *.laurivan.com| K8SGW
K8SGW -->|reads HTTPRoute hostnames| INT
Win -->|DNS via Pi-hole| PH
Win -->|TCP 443| INT
```
| Gateway | DNS mechanism | Accessible from |
|---|---|---|
| `envoy-external` | `cloudflare-dns` (external-dns) pushes records to Cloudflare | Internet + LAN |
| `envoy-internal` | `k8s-gateway` (CoreDNS) serves DNS at `10.0.0.156` | LAN only (via Pi-hole conditional forwarding) |
> **Key insight**: `cloudflare-dns` runs with `--gateway-name=envoy-external`, so it **only** creates Cloudflare DNS records for routes attached to `envoy-external`. Routes on `envoy-internal` are handled exclusively by `k8s-gateway` — no Cloudflare involvement.
---
## Issue 1: k8s-gateway Fails to Watch HTTPRoutes After Restart
### Root Cause
At cluster startup, `k8s-gateway` can fail to initialize its `HTTPRoute` controller due to a transient API server connection reset:
```
[WARNING] plugin/k8s_gateway: error getting crd httproutes.gateway.networking.k8s.io,
error: unexpected error when reading response body. Please retry.
Original error: read tcp ...: connection reset by peer
```
When this happens, `k8s-gateway` falls back to only watching `Service` resources, returning `NXDOMAIN` for **all** internal hostnames — even ones that were previously working.
### Diagnosis
```bash
# Check if k8s-gateway has HTTPRoute controller initialized
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error"
# Test DNS resolution directly against k8s-gateway
dig +short myservice.laurivan.com @10.0.0.156
```
If `dig` returns nothing and the logs show the HTTPRoute CRD error quoted above, proceed to the fix.
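The log check can be wrapped in a small helper so the result is unambiguous. This is a minimal sketch: `gateway_log_state` is a hypothetical name, and it matches only the two log strings quoted in this FAQ, so other log formats fall through to `unknown`. It also assumes that an "initialized" line anywhere in the input means the controller is healthy.

```bash
# Classify k8s-gateway logs read from stdin.
# Prints "ok", "crd-error" or "unknown".
gateway_log_state() {
  gls_logs="$(cat)"
  if printf '%s\n' "$gls_logs" | grep -qF "HTTPRoute controller initialized"; then
    echo "ok"
  elif printf '%s\n' "$gls_logs" | grep -qF "error getting crd httproutes.gateway.networking.k8s.io"; then
    echo "crd-error"
  else
    echo "unknown"
  fi
}

# Usage against the live cluster:
#   kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | gateway_log_state
```

A result of `crd-error` corresponds to the failure described in the Root Cause section.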
### Fix
Restart the `k8s-gateway` deployment to force a clean re-initialization:
```bash
kubectl rollout restart deployment/k8s-gateway -n network
kubectl rollout status deployment/k8s-gateway -n network
```
After rollout, confirm all controllers initialized successfully:
```bash
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep "controller initialized"
# Expected output:
# [INFO] plugin/k8s_gateway: GatewayAPI controller initialized
# [INFO] plugin/k8s_gateway: HTTPRoute controller initialized
# [INFO] plugin/k8s_gateway: Service controller initialized
```
Verify DNS resolves:
```bash
dig +short myservice.laurivan.com @10.0.0.156
# Should return 10.0.0.157 (envoy-internal LB IP)
```
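Right after a rollout the record may take a moment to appear, so a single `dig` can give a false negative. The sketch below polls until the expected IP shows up. `wait_for_dns` is a hypothetical helper, and the default attempt count and delay are arbitrary assumptions, not values tuned for this cluster.

```bash
# Poll a DNS server until <host> resolves to <expected_ip>, or give up.
# Arguments: host expected_ip [server] [attempts] [delay_seconds]
wait_for_dns() {
  wfd_host="$1"; wfd_expected="$2"
  wfd_server="${3:-10.0.0.156}"; wfd_attempts="${4:-10}"; wfd_delay="${5:-3}"
  wfd_i=1
  while [ "$wfd_i" -le "$wfd_attempts" ]; do
    # dig +short prints one answer per line; accept if any line is the expected IP
    if dig +short "$wfd_host" "@$wfd_server" | grep -qxF "$wfd_expected"; then
      echo "resolved after $wfd_i attempt(s)"
      return 0
    fi
    wfd_i=$((wfd_i + 1))
    sleep "$wfd_delay"
  done
  echo "no matching answer after $wfd_attempts attempt(s)" >&2
  return 1
}

# Usage:
#   wait_for_dns myservice.laurivan.com 10.0.0.157
```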
---
## Issue 2: New Internal Service Has No DNS Record
### Root Cause
`k8s-gateway` has not picked up a new HTTPRoute attached to `envoy-internal`. This can happen after a restart (see Issue 1), when the route is added while `k8s-gateway` has been running for a long time and otherwise appears healthy, or when `k8s-gateway` was never restarted after a CRD re-sync.
Unlike `envoy-external` routes, which `cloudflare-dns` handles automatically, `envoy-internal` routes get DNS records only while `k8s-gateway` is running and watching HTTPRoutes correctly.
### Diagnosis
```bash
# Does the HTTPRoute exist and is it accepted?
kubectl get httproute -A
kubectl describe httproute <name> -n <namespace>
# Does DNS resolve?
dig +short <hostname>.laurivan.com @10.0.0.156
# Are the pods healthy?
kubectl get pods -n <namespace>
```
### Fix
If the HTTPRoute exists and is `Accepted` but DNS doesn't resolve, restart `k8s-gateway` (see Issue 1 above).
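To check the `Accepted` condition without reading the full `describe` output, a JSONPath query can pull it out directly. This is a sketch under the assumption that the route reports status in the standard Gateway API layout (`status.parents[].conditions[]`); `route_accepted` is a hypothetical helper name.

```bash
# Succeed only if every parent Gateway reports Accepted=True for the route.
# Arguments: route_name namespace
route_accepted() {
  # One status value is printed per parent the route is attached to
  ra_status="$(kubectl get httproute "$1" -n "$2" \
    -o jsonpath='{.status.parents[*].conditions[?(@.type=="Accepted")].status}')"
  # No status at all means no Gateway has processed the route yet
  [ -n "$ra_status" ] || return 1
  for ra_s in $ra_status; do
    [ "$ra_s" = "True" ] || return 1
  done
}

# Usage:
#   route_accepted <name> <namespace> && echo "accepted" || echo "not accepted"
```

If this succeeds and `dig` still returns nothing, the problem is on the `k8s-gateway` side rather than in the route itself.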
---
## Issue 3: Pi-hole Conditional Forwarding Not Configured
For LAN clients to resolve internal hostnames, Pi-hole must forward `laurivan.com` queries to `k8s-gateway` at `10.0.0.156`.
### Setup
In Pi-hole → **Settings → DNS → Conditional Forwarding**:
| Field | Value |
|---|---|
| Local network CIDR | `10.0.0.0/24` |
| Router / DNS IP | `10.0.0.156` |
| Local domain name | `laurivan.com` |
> Without this, LAN clients will resolve `*.laurivan.com` via public Cloudflare DNS, which has no records for `envoy-internal` services.
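One way to confirm the forwarding works is to ask both resolvers the same question and compare the answers. The sketch below assumes you pass in the Pi-hole's own IP address, which this FAQ does not state; `check_forwarding` is a hypothetical helper.

```bash
# Compare the answer from the LAN resolver with the answer from k8s-gateway.
# Arguments: host pihole_ip [k8s_gateway_ip]
check_forwarding() {
  cf_host="$1"; cf_pihole="$2"; cf_gw="${3:-10.0.0.156}"
  cf_direct="$(dig +short "$cf_host" "@$cf_gw")"
  cf_via="$(dig +short "$cf_host" "@$cf_pihole")"
  if [ -z "$cf_direct" ]; then
    # Nothing to compare against: fix k8s-gateway first (Issues 1 and 2)
    echo "k8s-gateway has no record for $cf_host"; return 2
  elif [ "$cf_direct" = "$cf_via" ]; then
    echo "forwarding ok: $cf_via"; return 0
  else
    echo "mismatch: k8s-gateway=$cf_direct pihole=${cf_via:-<empty>}"; return 1
  fi
}

# Usage (replace <pihole-ip> with the address LAN clients use for DNS):
#   check_forwarding myservice.laurivan.com <pihole-ip>
```

A mismatch suggests that Pi-hole is answering from an upstream resolver instead of forwarding to `10.0.0.156`.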
---
## Diagnosing "Destination Host Unreachable" from a LAN Client
When you `ping` an internal service VIP (e.g. `10.0.0.157`), you may see:
```
Reply from 10.0.0.147: Destination host unreachable.
```
**This is normal and not an error.** Here's why:
```mermaid
sequenceDiagram
participant Win as Windows Client
participant Node as esxi-2cu-8g-03 (10.0.0.147)<br/>holds L2 ARP lease for 10.0.0.157
participant Envoy as Envoy Proxy (VIP: 10.0.0.157)
Win->>Node: ICMP Echo Request → 10.0.0.157
Note over Node: ARP resolves 10.0.0.157 to Node's MAC<br/>(Cilium L2 announcement)
Node->>Envoy: forwards packet
Note over Envoy: Envoy only handles TCP 80/443<br/>ICMP is not forwarded
Node-->>Win: ICMP "Destination Host Unreachable"
Win->>Node: TCP SYN → 10.0.0.157:443
Node->>Envoy: proxies connection
Envoy-->>Win: TLS handshake + HTTP response ✅
```
- `10.0.0.147` is the **cluster node** (`esxi-2cu-8g-03`) holding the Cilium L2 ARP announcement lease for `10.0.0.157` — it is not a router.
- Envoy Gateway only listens on TCP 80/443. ICMP ping packets are not handled, so the node's kernel returns "host unreachable".
- **Use `curl` instead of `ping` to verify connectivity:**
```bash
# Should return HTTP 200
curl -sk -o /dev/null -w "%{http_code}" https://myservice.laurivan.com
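# Or bypass DNS by pinning the hostname to the VIP; unlike a bare Host header,
# this also sends the hostname as TLS SNI
curl -sk -o /dev/null -w "%{http_code}" --resolve myservice.laurivan.com:443:10.0.0.157 https://myservice.laurivan.com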
```
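To see which node is answering ARP for a VIP, and therefore which address shows up in the `ping` reply, the Cilium L2 leases can be listed together with their holders. A minimal sketch, assuming the leases live in `kube-system` and contain `l2announce` in their name, as in the checklist command in this FAQ; `l2_lease_holders` is a hypothetical helper.

```bash
# List Cilium L2 announcement leases and the node currently holding each one.
l2_lease_holders() {
  kubectl -n kube-system get lease --no-headers \
    -o custom-columns='NAME:.metadata.name,HOLDER:.spec.holderIdentity' \
    | grep l2announce
}

# Usage:
#   l2_lease_holders
```

The `HOLDER` column should name the node whose IP appears in the "Destination host unreachable" reply.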
---
## Adding a New Service on `envoy-internal`
### How to expose a service internally (no public DNS)
1. In your HelmRelease (or a standalone `HTTPRoute`), set `parentRefs` to `envoy-internal` in the `network` namespace:
```yaml
route:
  app:
    hostnames: ["myservice.${SECRET_DOMAIN}"]
    parentRefs:
      - name: envoy-internal
        namespace: network
        sectionName: https
    rules:
      - backendRefs:
          - identifier: app
            port: 80
```
2. Ensure your Flux `Kustomization` has `postBuild.substituteFrom` set so `${SECRET_DOMAIN}` is substituted:
```yaml
postBuild:
  substituteFrom:
    - name: cluster-secrets
      kind: Secret
```
3. After Flux reconciles, `k8s-gateway` will automatically pick up the new HTTPRoute and start serving DNS for `myservice.laurivan.com` → `10.0.0.157`.
4. No `DNSEndpoint` CRD or Cloudflare changes are needed — `k8s-gateway` handles it.
> **Do NOT** add a `DNSEndpoint` pointing `myservice.laurivan.com` → `internal.laurivan.com`. That would push the record to public Cloudflare DNS, exposing an internal service to the internet.
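For the standalone `HTTPRoute` variant mentioned in step 1, the manifest could look roughly like the output of the helper below. This is a sketch, not the cluster's actual template: `render_internal_route` is a hypothetical function, the `gateway.networking.k8s.io/v1` API version assumes Gateway API v1.0 or later is installed, and the hostname is passed in fully expanded because the shell, not Flux, would otherwise try to substitute `${SECRET_DOMAIN}`.

```bash
# Print a standalone HTTPRoute manifest attached to envoy-internal.
# Arguments: route_name namespace hostname backend_service backend_port
render_internal_route() {
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: $1
  namespace: $2
spec:
  hostnames: ["$3"]
  parentRefs:
    - name: envoy-internal
      namespace: network
      sectionName: https
  rules:
    - backendRefs:
        - name: $4
          port: $5
EOF
}

# Usage (review the output before applying it):
#   render_internal_route myservice default myservice.laurivan.com myservice 80
#   render_internal_route myservice default myservice.laurivan.com myservice 80 | kubectl apply -f -
```

The `parentRefs` entry is the only part that ties the route to the internal path; pointing it at `envoy-external` instead would hand the hostname to `cloudflare-dns`.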
---
## Quick Diagnostic Checklist
```bash
# 1. Is the pod running?
kubectl get pods -n <namespace>
# 2. Is the HTTPRoute accepted?
kubectl get httproute -n <namespace> -o yaml | grep -A5 "status:"
# 3. Does k8s-gateway resolve the hostname?
dig +short <hostname>.laurivan.com @10.0.0.156
# 4. Is the k8s-gateway HTTPRoute controller active?
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error|NXDOMAIN"
# 5. Can you reach the service on TCP?
curl -sk -o /dev/null -w "%{http_code}" https://<hostname>.laurivan.com
# 6. Is the Cilium L2 lease active?
kubectl -n kube-system get lease | grep l2announce
```