doc: Add FAQ on invalid internal config.

2026-02-25 18:20:59 +01:00
parent 3f50782f58
commit 17a18c335b
1 changed files with 246 additions and 0 deletions
--- a/docs/faqs/internal-services-not-accessible.md
+++ b/docs/faqs/internal-services-not-accessible.md
@@ -0,0 +1,246 @@
+# FAQ: Internal Services Not Accessible via `envoy-internal`
+
+> **Symptoms**: A service exposed via `envoy-internal` is unreachable from LAN clients even though the Flux HelmRelease, HTTPRoute, and pods all appear healthy.
+
+---
+
+## How Internal DNS Works in This Cluster
+
+This cluster uses two separate ingress paths and two separate DNS mechanisms:
+
+```mermaid
+graph TD
+    subgraph Internet
+        CF[Cloudflare DNS]
+    end
+
+    subgraph LAN
+        Win[Windows / LAN Client]
+        PH[Pi-hole\n10.0.0.156 → k8s-gateway]
+    end
+
+    subgraph Kubernetes - network namespace
+        EXT[envoy-external\nLB: 10.0.0.158]
+        INT[envoy-internal\nLB: 10.0.0.157]
+        EXTDNS[cloudflare-dns\nexternal-dns\n--gateway-name=envoy-external]
+        K8SGW[k8s-gateway\nCoreDNS plugin\n10.0.0.156:53\nwatches HTTPRoutes]
+    end
+
+    CF -->|CNAME: external.laurivan.com| EXT
+    EXTDNS -->|creates DNS records| CF
+    EXTDNS -.->|ignores envoy-internal routes| INT
+
+    PH -->|resolves *.laurivan.com| K8SGW
+    K8SGW -->|reads HTTPRoute hostnames| INT
+    Win -->|DNS via Pi-hole| PH
+    Win -->|TCP 443| INT
+```
+
+| Gateway | DNS mechanism | Accessible from |
+|---|---|---|
+| `envoy-external` | `cloudflare-dns` (external-dns) pushes records to Cloudflare | Internet + LAN |
+| `envoy-internal` | `k8s-gateway` (CoreDNS) serves DNS at `10.0.0.156` | LAN only (via Pi-hole conditional forwarding) |
+
+> **Key insight**: `cloudflare-dns` runs with `--gateway-name=envoy-external`, so it **only** creates Cloudflare DNS records for routes attached to `envoy-external`. Routes on `envoy-internal` are handled exclusively by `k8s-gateway` — no Cloudflare involvement.
+
+---
+
+## Issue 1: k8s-gateway Fails to Watch HTTPRoutes After Restart
+
+### Root Cause
+
+At cluster startup, `k8s-gateway` can fail to initialize its `HTTPRoute` controller due to a transient API server connection reset:
+
+```
+[WARNING] plugin/k8s_gateway: error getting crd httproutes.gateway.networking.k8s.io,
+  error: unexpected error when reading response body. Please retry.
+  Original error: read tcp ...: connection reset by peer
+```
+
+When this happens, `k8s-gateway` falls back to only watching `Service` resources, returning `NXDOMAIN` for **all** internal hostnames — even ones that were previously working.
+
+### Diagnosis
+
+```bash
+# Check if k8s-gateway has HTTPRoute controller initialized
+kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error"
+
+# Test DNS resolution directly against k8s-gateway
+dig +short myservice.laurivan.com @10.0.0.156
+```
+
+If `dig` returns empty and logs show the HTTPRoute controller error, proceed to the fix.
+
+### Fix
+
+Restart the `k8s-gateway` deployment to force a clean re-initialization:
+
+```bash
+kubectl rollout restart deployment/k8s-gateway -n network
+kubectl rollout status deployment/k8s-gateway -n network
+```
+
+After rollout, confirm all controllers initialized successfully:
+
+```bash
+kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep "controller initialized"
+# Expected output:
+# [INFO] plugin/k8s_gateway: GatewayAPI controller initialized
+# [INFO] plugin/k8s_gateway: HTTPRoute controller initialized
+# [INFO] plugin/k8s_gateway: Service controller initialized
+```
+
+Verify DNS resolves:
+
+```bash
+dig +short myservice.laurivan.com @10.0.0.156
+# Should return 10.0.0.157 (envoy-internal LB IP)
+```
+
+---
+
+## Issue 2: New Internal Service Has No DNS Record
+
+### Root Cause
+
+A new HTTPRoute attached to `envoy-internal` is not picked up by `k8s-gateway` after the restart (or `k8s-gateway` is healthy but the route was added after a long-running session). This can also happen if `k8s-gateway` was never restarted after a CRD re-sync.
+
+In contrast to `envoy-external` routes (handled automatically by `cloudflare-dns`), `envoy-internal` routes require `k8s-gateway` to be running and watching HTTPRoutes correctly.
+
+### Diagnosis
+
+```bash
+# Does the HTTPRoute exist and is it accepted?
+kubectl get httproute -A
+kubectl describe httproute <name> -n <namespace>
+
+# Does DNS resolve?
+dig +short <hostname>.laurivan.com @10.0.0.156
+
+# Are the pods healthy?
+kubectl get pods -n <namespace>
+```
+
+### Fix
+
+If the HTTPRoute exists and is `Accepted` but DNS doesn't resolve, restart `k8s-gateway` (see Issue 1 above).
+
+---
+
+## Issue 3: Pi-hole Conditional Forwarding Not Configured
+
+For LAN clients to resolve internal hostnames, Pi-hole must forward `laurivan.com` queries to `k8s-gateway` at `10.0.0.156`.
+
+### Setup
+
+In Pi-hole → **Settings → DNS → Conditional Forwarding**:
+
+| Field | Value |
+|---|---|
+| Local network CIDR | `10.0.0.0/24` |
+| Router / DNS IP | `10.0.0.156` |
+| Local domain name | `laurivan.com` |
+
+> Without this, LAN clients will resolve `*.laurivan.com` via public Cloudflare DNS, which has no records for `envoy-internal` services.
+
+---
+
+## Diagnosing "Destination Host Unreachable" from a LAN Client
+
+When you `ping` an internal service VIP (e.g. `10.0.0.157`), you may see:
+
+```
+Reply from 10.0.0.147: Destination host unreachable.
+```
+
+**This is normal and not an error.** Here's why:
+
+```mermaid
+sequenceDiagram
+    participant Win as Windows Client
+    participant Node as esxi-2cu-8g-03 (10.0.0.147)<br/>holds L2 ARP lease for 10.0.0.157
+    participant Envoy as Envoy Proxy (VIP: 10.0.0.157)
+
+    Win->>Node: ICMP Echo Request → 10.0.0.157
+    Note over Node: ARP resolves 10.0.0.157 to Node's MAC<br/>(Cilium L2 announcement)
+    Node->>Envoy: forwards packet
+    Note over Envoy: Envoy only handles TCP 80/443<br/>ICMP is not forwarded
+    Node-->>Win: ICMP "Destination Host Unreachable"
+
+    Win->>Node: TCP SYN → 10.0.0.157:443
+    Node->>Envoy: proxies connection
+    Envoy-->>Win: TLS handshake + HTTP response ✅
+```
+
+- `10.0.0.147` is the **cluster node** (`esxi-2cu-8g-03`) holding the Cilium L2 ARP announcement lease for `10.0.0.157` — it is not a router.
+- Envoy Gateway only listens on TCP 80/443. ICMP ping packets are not handled, so the node's kernel returns "host unreachable".
+- **Use `curl` instead of `ping` to verify connectivity:**
+
+```bash
+# Should return HTTP 200
+curl -sk -o /dev/null -w "%{http_code}" https://myservice.laurivan.com
+
+# Or directly by IP with a Host header
+curl -sk -o /dev/null -w "%{http_code}" https://10.0.0.157 -H "Host: myservice.laurivan.com"
+```
+
+---
+
+## Adding a New Service on `envoy-internal`
+
+### How to expose a service internally (no public DNS)
+
+1. In your HelmRelease (or a standalone `HTTPRoute`), set `parentRefs` to `envoy-internal` in the `network` namespace:
+
+```yaml
+route:
+  app:
+    hostnames: ["myservice.${SECRET_DOMAIN}"]
+    parentRefs:
+      - name: envoy-internal
+        namespace: network
+        sectionName: https
+    rules:
+      - backendRefs:
+          - identifier: app
+            port: 80
+```
+
+2. Ensure your Flux `Kustomization` has `postBuild.substituteFrom` set so `${SECRET_DOMAIN}` is substituted:
+
+```yaml
+postBuild:
+  substituteFrom:
+    - name: cluster-secrets
+      kind: Secret
+```
+
+3. After Flux reconciles, `k8s-gateway` will automatically pick up the new HTTPRoute and start serving DNS for `myservice.laurivan.com` → `10.0.0.157`.
+
+4. No `DNSEndpoint` CRD or Cloudflare changes are needed — `k8s-gateway` handles it.
+
+> **Do NOT** add a `DNSEndpoint` pointing `myservice.laurivan.com` → `internal.laurivan.com`. That would push the record to public Cloudflare DNS, exposing an internal service to the internet.
+
+---
+
+## Quick Diagnostic Checklist
+
+```bash
+# 1. Is the pod running?
+kubectl get pods -n <namespace>
+
+# 2. Is the HTTPRoute accepted?
+kubectl get httproute -n <namespace> -o yaml | grep -A5 "status:"
+
+# 3. Does k8s-gateway resolve the hostname?
+dig +short <hostname>.laurivan.com @10.0.0.156
+
+# 4. Is the k8s-gateway HTTPRoute controller active?
+kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error|NXDOMAIN"
+
+# 5. Can you reach the service on TCP?
+curl -sk -o /dev/null -w "%{http_code}" https://<hostname>.laurivan.com
+
+# 6. Is the Cilium L2 lease active?
+kubectl -n kube-system get lease | grep l2announce
+```