talos-cluster/docs/faqs/internal-services-not-accessible.md
# FAQ: Internal Services Not Accessible via envoy-internal

**Symptoms:** A service exposed via envoy-internal is unreachable from LAN clients even though the Flux HelmRelease, HTTPRoute, and pods all appear healthy.


## How Internal DNS Works in This Cluster

This cluster uses two separate ingress paths and two separate DNS mechanisms:

```mermaid
graph TD
    subgraph Internet
        CF[Cloudflare DNS]
    end

    subgraph LAN
        Win[Windows / LAN Client]
        PH[Pi-hole<br/>10.0.0.156 → k8s-gateway]
    end

    subgraph "Kubernetes - network namespace"
        EXT[envoy-external<br/>LB: 10.0.0.158]
        INT[envoy-internal<br/>LB: 10.0.0.157]
        EXTDNS[cloudflare-dns<br/>external-dns<br/>--gateway-name=envoy-external]
        K8SGW[k8s-gateway<br/>CoreDNS plugin<br/>10.0.0.156:53<br/>watches HTTPRoutes]
    end

    CF -->|CNAME: external.laurivan.com| EXT
    EXTDNS -->|creates DNS records| CF
    EXTDNS -.->|ignores envoy-internal routes| INT

    PH -->|resolves *.laurivan.com| K8SGW
    K8SGW -->|reads HTTPRoute hostnames| INT
    Win -->|DNS via Pi-hole| PH
    Win -->|TCP 443| INT
```
| Gateway | DNS mechanism | Accessible from |
| --- | --- | --- |
| envoy-external | cloudflare-dns (external-dns) pushes records to Cloudflare | Internet + LAN |
| envoy-internal | k8s-gateway (CoreDNS) serves DNS at 10.0.0.156 | LAN only (via Pi-hole conditional forwarding) |

**Key insight:** cloudflare-dns runs with `--gateway-name=envoy-external`, so it only creates Cloudflare DNS records for routes attached to envoy-external. Routes on envoy-internal are handled exclusively by k8s-gateway — no Cloudflare involvement.
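You can verify which gateway your external-dns instance is filtering on by reading the container args. The snippet below parses a simulated copy of the jsonpath output (the `--source` and `--provider` values are illustrative; only `--gateway-name=envoy-external` is taken from this document); on a live cluster you would run the `kubectl` command shown in the comment and grep its output the same way.

```shell
# On a live cluster, inspect the deployment args directly:
#   kubectl -n network get deploy cloudflare-dns \
#     -o jsonpath='{.spec.template.spec.containers[0].args}'
# ARGS simulates that jsonpath output (surrounding args are illustrative):
ARGS='["--source=gateway-httproute","--gateway-name=envoy-external","--provider=cloudflare"]'

# Strip the JSON punctuation, split on commas, and pick out the gateway filter
GATEWAY=$(echo "$ARGS" | tr -d '[]"' | tr ',' '\n' | grep -- '--gateway-name')
echo "$GATEWAY"   # --gateway-name=envoy-external
```

If the flag is missing or names a different gateway, external-dns will publish (or skip) a different set of routes than this document assumes.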


## Issue 1: k8s-gateway Fails to Watch HTTPRoutes After Restart

### Root Cause

At cluster startup, k8s-gateway can fail to initialize its HTTPRoute controller due to a transient API server connection reset:

```text
[WARNING] plugin/k8s_gateway: error getting crd httproutes.gateway.networking.k8s.io,
  error: unexpected error when reading response body. Please retry.
  Original error: read tcp ...: connection reset by peer
```

When this happens, k8s-gateway falls back to only watching Service resources, returning NXDOMAIN for all internal hostnames — even ones that were previously working.
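To confirm the failure mode is really NXDOMAIN (rather than a timeout or SERVFAIL), check the status field in the dig header. This sketch parses a sample header line; against the live resolver you would pipe `dig myservice.laurivan.com @10.0.0.156` into the same grep.

```shell
# HEADER simulates one line of dig's output; the id value is arbitrary.
HEADER=';; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 40489'

# Extract just the response status (NXDOMAIN, NOERROR, SERVFAIL, ...)
STATUS=$(echo "$HEADER" | grep -o 'status: [A-Z]*')
echo "$STATUS"   # status: NXDOMAIN
```

NXDOMAIN for a hostname that has a healthy HTTPRoute is the signature of this issue; a timeout instead points at network reachability to 10.0.0.156.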

### Diagnosis

```bash
# Check if k8s-gateway has HTTPRoute controller initialized
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error"

# Test DNS resolution directly against k8s-gateway
dig +short myservice.laurivan.com @10.0.0.156
```

If dig returns empty and logs show the HTTPRoute controller error, proceed to the fix.

### Fix

Restart the k8s-gateway deployment to force a clean re-initialization:

```bash
kubectl rollout restart deployment/k8s-gateway -n network
kubectl rollout status deployment/k8s-gateway -n network
```

After rollout, confirm all controllers initialized successfully:

```bash
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep "controller initialized"
# Expected output:
# [INFO] plugin/k8s_gateway: GatewayAPI controller initialized
# [INFO] plugin/k8s_gateway: HTTPRoute controller initialized
# [INFO] plugin/k8s_gateway: Service controller initialized
```

Verify DNS resolves:

```bash
dig +short myservice.laurivan.com @10.0.0.156
# Should return 10.0.0.157 (envoy-internal LB IP)
```

## Issue 2: New Internal Service Has No DNS Record

### Root Cause

A new HTTPRoute attached to envoy-internal is not being served by k8s-gateway. This usually means the HTTPRoute controller never initialized after a restart (see Issue 1), or the route was created during a long-running session and the watch has gone stale, for example when k8s-gateway was never restarted after a CRD re-sync.

In contrast to envoy-external routes (handled automatically by cloudflare-dns), envoy-internal routes require k8s-gateway to be running and watching HTTPRoutes correctly.
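A quick way to see what k8s-gateway *should* be serving is to list each route's hostname alongside its parent gateway. The snippet filters a simulated copy of that listing (hostnames are illustrative); on a live cluster you would run the `kubectl` command in the comment and pipe it into the same awk.

```shell
# On a live cluster, list each route's first hostname and parent gateway:
#   kubectl get httproute -A \
#     -o custom-columns='HOST:.spec.hostnames[0],GATEWAY:.spec.parentRefs[0].name'
# ROUTES simulates that output:
ROUTES='myservice.laurivan.com envoy-internal
blog.laurivan.com envoy-external'

# Keep only the envoy-internal rows: these are the names k8s-gateway must resolve
INTERNAL=$(echo "$ROUTES" | awk '$2 == "envoy-internal" {print $1}')
echo "$INTERNAL"   # myservice.laurivan.com
```

Every hostname printed here should resolve to 10.0.0.157 via `dig @10.0.0.156`; any that do not point to the stale-watch problem above.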

### Diagnosis

```bash
# Does the HTTPRoute exist and is it accepted?
kubectl get httproute -A
kubectl describe httproute <name> -n <namespace>

# Does DNS resolve?
dig +short <hostname>.laurivan.com @10.0.0.156

# Are the pods healthy?
kubectl get pods -n <namespace>
```

### Fix

If the HTTPRoute exists and is Accepted but DNS doesn't resolve, restart k8s-gateway (see Issue 1 above).


## Issue 3: Pi-hole Conditional Forwarding Not Configured

For LAN clients to resolve internal hostnames, Pi-hole must forward laurivan.com queries to k8s-gateway at 10.0.0.156.

### Setup

In Pi-hole → Settings → DNS → Conditional Forwarding:

| Field | Value |
| --- | --- |
| Local network CIDR | 10.0.0.0/24 |
| Router / DNS IP | 10.0.0.156 |
| Local domain name | laurivan.com |

Without this, LAN clients will resolve *.laurivan.com via public Cloudflare DNS, which has no records for envoy-internal services.
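Under the hood, conditional forwarding amounts to dnsmasq directives roughly like the ones below (a sketch for reference only; the exact file Pi-hole writes and the directive names can vary by Pi-hole version):

```
# Forward all laurivan.com queries to k8s-gateway
server=/laurivan.com/10.0.0.156

# Send reverse (PTR) lookups for the LAN subnet there too
rev-server=10.0.0.0/24,10.0.0.156
```

If you manage Pi-hole declaratively, dropping an equivalent file into its dnsmasq config directory achieves the same result as the UI settings.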


## Diagnosing "Destination Host Unreachable" from a LAN Client

When you ping an internal service VIP (e.g. 10.0.0.157), you may see:

```text
Reply from 10.0.0.147: Destination host unreachable.
```

This is normal and not an error. Here's why:

```mermaid
sequenceDiagram
    participant Win as Windows Client
    participant Node as esxi-2cu-8g-03 (10.0.0.147)<br/>holds L2 ARP lease for 10.0.0.157
    participant Envoy as Envoy Proxy (VIP: 10.0.0.157)

    Win->>Node: ICMP Echo Request → 10.0.0.157
    Note over Node: ARP resolves 10.0.0.157 to Node's MAC<br/>(Cilium L2 announcement)
    Node->>Envoy: forwards packet
    Note over Envoy: Envoy only handles TCP 80/443<br/>ICMP is not forwarded
    Node-->>Win: ICMP "Destination Host Unreachable"

    Win->>Node: TCP SYN → 10.0.0.157:443
    Node->>Envoy: proxies connection
    Envoy-->>Win: TLS handshake + HTTP response ✅
```
- 10.0.0.147 is the cluster node (esxi-2cu-8g-03) holding the Cilium L2 ARP announcement lease for 10.0.0.157 — it is not a router.
- Envoy Gateway only listens on TCP 80/443. ICMP ping packets are not handled, so the node's kernel returns "host unreachable".
- Use curl instead of ping to verify connectivity:

```bash
# Should return HTTP 200
curl -sk -o /dev/null -w "%{http_code}" https://myservice.laurivan.com

# Or directly by IP with a Host header
curl -sk -o /dev/null -w "%{http_code}" https://10.0.0.157 -H "Host: myservice.laurivan.com"
```

## Adding a New Service on envoy-internal

To expose a service internally (with no public DNS record):

1. In your HelmRelease (or a standalone HTTPRoute), set parentRefs to envoy-internal in the network namespace:

   ```yaml
   route:
     app:
       hostnames: ["myservice.${SECRET_DOMAIN}"]
       parentRefs:
         - name: envoy-internal
           namespace: network
           sectionName: https
       rules:
         - backendRefs:
             - identifier: app
               port: 80
   ```

2. Ensure your Flux Kustomization has `postBuild.substituteFrom` set so `${SECRET_DOMAIN}` is substituted:

   ```yaml
   postBuild:
     substituteFrom:
       - name: cluster-secrets
         kind: Secret
   ```

3. After Flux reconciles, k8s-gateway will automatically pick up the new HTTPRoute and start serving DNS for myservice.laurivan.com → 10.0.0.157.

4. No DNSEndpoint CRD or Cloudflare changes are needed — k8s-gateway handles it.

Do NOT add a DNSEndpoint pointing myservice.laurivan.com → internal.laurivan.com. That would push the record to public Cloudflare DNS, exposing an internal service to the internet.
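A cheap guardrail is to audit what external-dns is already publishing and check that no internal hostname appears. The snippet checks a simulated copy of that listing ("myservice" is the hypothetical internal host from the example above); on a live cluster you would run the `kubectl` command in the comment and grep its output the same way.

```shell
# On a live cluster, list every hostname external-dns manages:
#   kubectl get dnsendpoints.externaldns.k8s.io -A -o yaml | grep dnsName
# DNSNAMES simulates that grep output:
DNSNAMES='- dnsName: external.laurivan.com'

# Flag the audit if the internal hostname shows up in any public record
if echo "$DNSNAMES" | grep -q 'myservice'; then RESULT='LEAKED'; else RESULT='OK'; fi
echo "$RESULT"   # OK
```

Seeing `LEAKED` here means a DNSEndpoint (or an HTTPRoute on envoy-external) is publishing the name to Cloudflare and should be removed.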


## Quick Diagnostic Checklist

```bash
# 1. Is the pod running?
kubectl get pods -n <namespace>

# 2. Is the HTTPRoute accepted?
kubectl get httproute -n <namespace> -o yaml | grep -A5 "status:"

# 3. Does k8s-gateway resolve the hostname?
dig +short <hostname>.laurivan.com @10.0.0.156

# 4. Is the k8s-gateway HTTPRoute controller active?
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error|NXDOMAIN"

# 5. Can you reach the service on TCP?
curl -sk -o /dev/null -w "%{http_code}" https://<hostname>.laurivan.com

# 6. Is the Cilium L2 lease active?
kubectl -n kube-system get lease | grep l2announce
```