From 17a18c335bc3049bf459667d65f5e5d831d82fd6 Mon Sep 17 00:00:00 2001
From: Laur IVAN
Date: Wed, 25 Feb 2026 18:20:59 +0100
Subject: [PATCH] doc: Add FAQ on invalid internal config.

---
 docs/faqs/internal-services-not-accessible.md | 246 ++++++++++++++++++
 1 file changed, 246 insertions(+)
 create mode 100644 docs/faqs/internal-services-not-accessible.md

diff --git a/docs/faqs/internal-services-not-accessible.md b/docs/faqs/internal-services-not-accessible.md
new file mode 100644
index 0000000..9b8023c
--- /dev/null
+++ b/docs/faqs/internal-services-not-accessible.md
@@ -0,0 +1,246 @@
+# FAQ: Internal Services Not Accessible via `envoy-internal`
+
+> **Symptoms**: A service exposed via `envoy-internal` is unreachable from LAN clients even though the Flux HelmRelease, HTTPRoute, and pods all appear healthy.
+
+---
+
+## How Internal DNS Works in This Cluster
+
+This cluster uses two separate ingress paths and two separate DNS mechanisms:
+
+```mermaid
+graph TD
+    subgraph Internet
+        CF[Cloudflare DNS]
+    end
+
+    subgraph LAN
+        Win[Windows / LAN Client]
+        PH[Pi-hole\n10.0.0.156 → k8s-gateway]
+    end
+
+    subgraph Kubernetes - network namespace
+        EXT[envoy-external\nLB: 10.0.0.158]
+        INT[envoy-internal\nLB: 10.0.0.157]
+        EXTDNS[cloudflare-dns\nexternal-dns\n--gateway-name=envoy-external]
+        K8SGW[k8s-gateway\nCoreDNS plugin\n10.0.0.156:53\nwatches HTTPRoutes]
+    end
+
+    CF -->|CNAME: external.laurivan.com| EXT
+    EXTDNS -->|creates DNS records| CF
+    EXTDNS -.->|ignores envoy-internal routes| INT
+
+    PH -->|resolves *.laurivan.com| K8SGW
+    K8SGW -->|reads HTTPRoute hostnames| INT
+    Win -->|DNS via Pi-hole| PH
+    Win -->|TCP 443| INT
+```
+
+| Gateway | DNS mechanism | Accessible from |
+|---|---|---|
+| `envoy-external` | `cloudflare-dns` (external-dns) pushes records to Cloudflare | Internet + LAN |
+| `envoy-internal` | `k8s-gateway` (CoreDNS) serves DNS at `10.0.0.156` | LAN only (via Pi-hole conditional forwarding) |
+
+> **Key insight**: `cloudflare-dns` runs with `--gateway-name=envoy-external`, so it **only** creates Cloudflare DNS records for routes attached to `envoy-external`. Routes on `envoy-internal` are handled exclusively by `k8s-gateway`, with no Cloudflare involvement.
+
+---
+
+## Issue 1: k8s-gateway Fails to Watch HTTPRoutes After Restart
+
+### Root Cause
+
+At cluster startup, `k8s-gateway` can fail to initialize its `HTTPRoute` controller due to a transient API server connection reset:
+
+```
+[WARNING] plugin/k8s_gateway: error getting crd httproutes.gateway.networking.k8s.io,
+    error: unexpected error when reading response body. Please retry.
+    Original error: read tcp ...: connection reset by peer
+```
+
+When this happens, `k8s-gateway` falls back to only watching `Service` resources, returning `NXDOMAIN` for **all** internal hostnames, even ones that were previously working.
+
+### Diagnosis
+
+```bash
+# Check if k8s-gateway has HTTPRoute controller initialized
+kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error"
+
+# Test DNS resolution directly against k8s-gateway
+dig +short myservice.laurivan.com @10.0.0.156
+```
+
+If `dig` returns empty and logs show the HTTPRoute controller error, proceed to the fix.
+
+### Fix
+
+Restart the `k8s-gateway` deployment to force a clean re-initialization:
+
+```bash
+kubectl rollout restart deployment/k8s-gateway -n network
+kubectl rollout status deployment/k8s-gateway -n network
+```
+
+After rollout, confirm all controllers initialized successfully:
+
+```bash
+kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep "controller initialized"
+# Expected output:
+# [INFO] plugin/k8s_gateway: GatewayAPI controller initialized
+# [INFO] plugin/k8s_gateway: HTTPRoute controller initialized
+# [INFO] plugin/k8s_gateway: Service controller initialized
+```
+
+Verify DNS resolves:
+
+```bash
+dig +short myservice.laurivan.com @10.0.0.156
+# Should return 10.0.0.157 (envoy-internal LB IP)
+```
+
+---
+
+## Issue 2: New Internal Service Has No DNS Record
+
+### Root Cause
+
+A new HTTPRoute attached to `envoy-internal` is not picked up by `k8s-gateway`. This can happen when the HTTPRoute controller failed to initialize at the last restart (Issue 1), when `k8s-gateway` looks healthy but has been running for a long time and the route was added afterwards, or when `k8s-gateway` was never restarted after a CRD re-sync.
+
+In contrast to `envoy-external` routes (handled automatically by `cloudflare-dns`), `envoy-internal` routes require `k8s-gateway` to be running and watching HTTPRoutes correctly.
+
+### Diagnosis
+
+```bash
+# Does the HTTPRoute exist and is it accepted?
+kubectl get httproute -A
+kubectl describe httproute <name> -n <namespace>
+
+# Does DNS resolve?
+dig +short <hostname>.laurivan.com @10.0.0.156
+
+# Are the pods healthy?
+kubectl get pods -n <namespace>
+```
+
+### Fix
+
+If the HTTPRoute exists and is `Accepted` but DNS doesn't resolve, restart `k8s-gateway` (see Issue 1 above).
+
+---
+
+## Issue 3: Pi-hole Conditional Forwarding Not Configured
+
+For LAN clients to resolve internal hostnames, Pi-hole must forward `laurivan.com` queries to `k8s-gateway` at `10.0.0.156`.
+
+### Setup
+
+In Pi-hole → **Settings → DNS → Conditional Forwarding**:
+
+| Field | Value |
+|---|---|
+| Local network CIDR | `10.0.0.0/24` |
+| Router / DNS IP | `10.0.0.156` |
+| Local domain name | `laurivan.com` |
+
+> Without this, LAN clients will resolve `*.laurivan.com` via public Cloudflare DNS, which has no records for `envoy-internal` services.
+
+---
+
+## Diagnosing "Destination Host Unreachable" from a LAN Client
+
+When you `ping` an internal service VIP (e.g. `10.0.0.157`), you may see:
+
+```
+Reply from 10.0.0.147: Destination host unreachable.
+```
+
+**This is normal and not an error.** Here's why:
+
+```mermaid
+sequenceDiagram
+    participant Win as Windows Client
+    participant Node as esxi-2cu-8g-03 (10.0.0.147)<br/>holds L2 ARP lease for 10.0.0.157
+    participant Envoy as Envoy Proxy (VIP: 10.0.0.157)
+
+    Win->>Node: ICMP Echo Request → 10.0.0.157
+    Note over Node: ARP resolves 10.0.0.157 to Node's MAC<br/>(Cilium L2 announcement)
+    Node->>Envoy: forwards packet
+    Note over Envoy: Envoy only handles TCP 80/443<br/>ICMP is not forwarded
+    Node-->>Win: ICMP "Destination Host Unreachable"
+
+    Win->>Node: TCP SYN → 10.0.0.157:443
+    Node->>Envoy: proxies connection
+    Envoy-->>Win: TLS handshake + HTTP response ✅
+```
+
+- `10.0.0.147` is the **cluster node** (`esxi-2cu-8g-03`) holding the Cilium L2 ARP announcement lease for `10.0.0.157`; it is not a router.
+- Envoy Gateway only listens on TCP 80/443. ICMP ping packets are not handled, so the node's kernel returns "host unreachable".
+- **Use `curl` instead of `ping` to verify connectivity:**
+
+```bash
+# Should print 200 (or whatever status the service normally returns)
+curl -sk -o /dev/null -w "%{http_code}" https://myservice.laurivan.com
+
+# Or bypass DNS by pinning the hostname to the VIP, so both the TLS SNI and the Host header match
+curl -sk -o /dev/null -w "%{http_code}" --resolve myservice.laurivan.com:443:10.0.0.157 https://myservice.laurivan.com
+```
+
+---
+
+## Adding a New Service on `envoy-internal`
+
+### How to expose a service internally (no public DNS)
+
+1. In your HelmRelease (or a standalone `HTTPRoute`), set `parentRefs` to `envoy-internal` in the `network` namespace:
+
+```yaml
+route:
+  app:
+    hostnames: ["myservice.${SECRET_DOMAIN}"]
+    parentRefs:
+      - name: envoy-internal
+        namespace: network
+        sectionName: https
+    rules:
+      - backendRefs:
+          - identifier: app
+            port: 80
+```
+
+2. Ensure your Flux `Kustomization` has `postBuild.substituteFrom` set so `${SECRET_DOMAIN}` is substituted:
+
+```yaml
+postBuild:
+  substituteFrom:
+    - name: cluster-secrets
+      kind: Secret
+```
+
+3. After Flux reconciles, `k8s-gateway` will automatically pick up the new HTTPRoute and start serving DNS for `myservice.laurivan.com` → `10.0.0.157`.
+
+4. No `DNSEndpoint` resource or Cloudflare changes are needed; `k8s-gateway` handles it.
+
+> **Do NOT** add a `DNSEndpoint` pointing `myservice.laurivan.com` → `internal.laurivan.com`. That would push the record to public Cloudflare DNS, exposing an internal service to the internet.
+
+---
+
+## Quick Diagnostic Checklist
+
+```bash
+# 1. Is the pod running?
+kubectl get pods -n <namespace>
+
+# 2. Is the HTTPRoute accepted?
+kubectl get httproute <name> -n <namespace> -o yaml | grep -A5 "status:"
+
+# 3. Does k8s-gateway resolve the hostname?
+dig +short <hostname>.laurivan.com @10.0.0.156
+
+# 4. Is the k8s-gateway HTTPRoute controller active?
+kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error|NXDOMAIN"
+
+# 5. Can you reach the service on TCP?
+curl -sk -o /dev/null -w "%{http_code}" https://<hostname>.laurivan.com
+
+# 6. Is the Cilium L2 lease active?
+kubectl -n kube-system get lease | grep l2announce
+```
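
The checklist above could be wrapped in a single script along the following lines. This is a minimal sketch and not part of the committed FAQ: the function names `classify_dns` and `run_checks`, the two positional arguments, and the `GATEWAY_DNS` / `INTERNAL_VIP` variables are hypothetical names introduced here. The addresses `10.0.0.156` and `10.0.0.157` are the ones used in the FAQ, and the script assumes `kubectl`, `dig`, and `curl` are on the `PATH`.

```bash
#!/usr/bin/env bash
# Sketch only: runs the six checklist steps for one service.
# Hypothetical usage: ./check-internal.sh <namespace> <fqdn>

GATEWAY_DNS="${GATEWAY_DNS:-10.0.0.156}"    # k8s-gateway address used in this FAQ
INTERNAL_VIP="${INTERNAL_VIP:-10.0.0.157}"  # envoy-internal LB IP used in this FAQ

# Interpret the first line of `dig +short` output (passed as $1).
classify_dns() {
  if [ -z "$1" ]; then
    echo "no-record"     # k8s-gateway has no answer: see Issues 1 and 2
  elif [ "$1" = "$INTERNAL_VIP" ]; then
    echo "ok"            # resolves to envoy-internal
  else
    echo "unexpected"    # resolves, but to some other address
  fi
}

run_checks() {
  namespace="$1"
  fqdn="$2"

  echo "== 1. pods"
  kubectl get pods -n "$namespace"

  echo "== 2. HTTPRoute status"
  kubectl get httproute -n "$namespace" -o yaml | grep -A5 "status:"

  echo "== 3. DNS via k8s-gateway"
  answer="$(dig +short "$fqdn" @"$GATEWAY_DNS" | head -n1)"
  echo "dns: $(classify_dns "$answer") (${answer:-empty})"

  echo "== 4. k8s-gateway logs"
  kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error|NXDOMAIN"

  echo "== 5. TCP reachability"
  curl -sk -o /dev/null -w "%{http_code}\n" "https://$fqdn"

  echo "== 6. Cilium L2 lease"
  kubectl -n kube-system get lease | grep l2announce
}

# Only touch the cluster when both arguments are supplied.
if [ "$#" -eq 2 ]; then
  run_checks "$1" "$2"
fi
```

A `no-record` result at step 3 points back to Issues 1 and 2 (restart `k8s-gateway`), while `unexpected` suggests the hostname is being served from somewhere other than `envoy-internal`.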