247 lines
7.9 KiB
Markdown
247 lines
7.9 KiB
Markdown
# FAQ: Internal Services Not Accessible via `envoy-internal`
|
|
|
|
> **Symptoms**: A service exposed via `envoy-internal` is unreachable from LAN clients even though the Flux HelmRelease, HTTPRoute, and pods all appear healthy.
|
|
|
|
---
|
|
|
|
## How Internal DNS Works in This Cluster
|
|
|
|
This cluster uses two separate ingress paths and two separate DNS mechanisms:
|
|
|
|
```mermaid
|
|
graph TD
|
|
subgraph Internet
|
|
CF[Cloudflare DNS]
|
|
end
|
|
|
|
subgraph LAN
|
|
Win[Windows / LAN Client]
|
|
PH[Pi-hole\n10.0.0.156 → k8s-gateway]
|
|
end
|
|
|
|
subgraph Kubernetes - network namespace
|
|
EXT[envoy-external\nLB: 10.0.0.158]
|
|
INT[envoy-internal\nLB: 10.0.0.157]
|
|
EXTDNS[cloudflare-dns\nexternal-dns\n--gateway-name=envoy-external]
|
|
K8SGW[k8s-gateway\nCoreDNS plugin\n10.0.0.156:53\nwatches HTTPRoutes]
|
|
end
|
|
|
|
CF -->|CNAME: external.laurivan.com| EXT
|
|
EXTDNS -->|creates DNS records| CF
|
|
EXTDNS -.->|ignores envoy-internal routes| INT
|
|
|
|
PH -->|resolves *.laurivan.com| K8SGW
|
|
K8SGW -->|reads HTTPRoute hostnames| INT
|
|
Win -->|DNS via Pi-hole| PH
|
|
Win -->|TCP 443| INT
|
|
```
|
|
|
|
| Gateway | DNS mechanism | Accessible from |
|
|
|---|---|---|
|
|
| `envoy-external` | `cloudflare-dns` (external-dns) pushes records to Cloudflare | Internet + LAN |
|
|
| `envoy-internal` | `k8s-gateway` (CoreDNS) serves DNS at `10.0.0.156` | LAN only (via Pi-hole conditional forwarding) |
|
|
|
|
> **Key insight**: `cloudflare-dns` runs with `--gateway-name=envoy-external`, so it **only** creates Cloudflare DNS records for routes attached to `envoy-external`. Routes on `envoy-internal` are handled exclusively by `k8s-gateway` — no Cloudflare involvement.
|
|
|
|
---
|
|
|
|
## Issue 1: k8s-gateway Fails to Watch HTTPRoutes After Restart
|
|
|
|
### Root Cause
|
|
|
|
At cluster startup, `k8s-gateway` can fail to initialize its `HTTPRoute` controller due to a transient API server connection reset:
|
|
|
|
```
|
|
[WARNING] plugin/k8s_gateway: error getting crd httproutes.gateway.networking.k8s.io,
|
|
error: unexpected error when reading response body. Please retry.
|
|
Original error: read tcp ...: connection reset by peer
|
|
```
|
|
|
|
When this happens, `k8s-gateway` falls back to only watching `Service` resources, returning `NXDOMAIN` for **all** internal hostnames — even ones that were previously working.
|
|
|
|
### Diagnosis
|
|
|
|
```bash
|
|
# Check if k8s-gateway has HTTPRoute controller initialized
|
|
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error"
|
|
|
|
# Test DNS resolution directly against k8s-gateway
|
|
dig +short myservice.laurivan.com @10.0.0.156
|
|
```
|
|
|
|
If `dig` returns empty and logs show the HTTPRoute controller error, proceed to the fix.
|
|
|
|
### Fix
|
|
|
|
Restart the `k8s-gateway` deployment to force a clean re-initialization:
|
|
|
|
```bash
|
|
kubectl rollout restart deployment/k8s-gateway -n network
|
|
kubectl rollout status deployment/k8s-gateway -n network
|
|
```
|
|
|
|
After rollout, confirm all controllers initialized successfully:
|
|
|
|
```bash
|
|
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep "controller initialized"
|
|
# Expected output:
|
|
# [INFO] plugin/k8s_gateway: GatewayAPI controller initialized
|
|
# [INFO] plugin/k8s_gateway: HTTPRoute controller initialized
|
|
# [INFO] plugin/k8s_gateway: Service controller initialized
|
|
```
|
|
|
|
Verify DNS resolves:
|
|
|
|
```bash
|
|
dig +short myservice.laurivan.com @10.0.0.156
|
|
# Should return 10.0.0.157 (envoy-internal LB IP)
|
|
```
|
|
|
|
---
|
|
|
|
## Issue 2: New Internal Service Has No DNS Record
|
|
|
|
### Root Cause
|
|
|
|
A new HTTPRoute attached to `envoy-internal` is not picked up by `k8s-gateway` after the restart (or `k8s-gateway` is healthy but the route was added after a long-running session). This can also happen if `k8s-gateway` was never restarted after a CRD re-sync.
|
|
|
|
In contrast to `envoy-external` routes (handled automatically by `cloudflare-dns`), `envoy-internal` routes require `k8s-gateway` to be running and watching HTTPRoutes correctly.
|
|
|
|
### Diagnosis
|
|
|
|
```bash
|
|
# Does the HTTPRoute exist and is it accepted?
|
|
kubectl get httproute -A
|
|
kubectl describe httproute <name> -n <namespace>
|
|
|
|
# Does DNS resolve?
|
|
dig +short <hostname>.laurivan.com @10.0.0.156
|
|
|
|
# Are the pods healthy?
|
|
kubectl get pods -n <namespace>
|
|
```
|
|
|
|
### Fix
|
|
|
|
If the HTTPRoute exists and is `Accepted` but DNS doesn't resolve, restart `k8s-gateway` (see Issue 1 above).
|
|
|
|
---
|
|
|
|
## Issue 3: Pi-hole Conditional Forwarding Not Configured
|
|
|
|
For LAN clients to resolve internal hostnames, Pi-hole must forward `laurivan.com` queries to `k8s-gateway` at `10.0.0.156`.
|
|
|
|
### Setup
|
|
|
|
In Pi-hole → **Settings → DNS → Conditional Forwarding**:
|
|
|
|
| Field | Value |
|
|
|---|---|
|
|
| Local network CIDR | `10.0.0.0/24` |
|
|
| Router / DNS IP | `10.0.0.156` |
|
|
| Local domain name | `laurivan.com` |
|
|
|
|
> Without this, LAN clients will resolve `*.laurivan.com` via public Cloudflare DNS, which has no records for `envoy-internal` services.
|
|
|
|
---
|
|
|
|
## Diagnosing "Destination Host Unreachable" from a LAN Client
|
|
|
|
When you `ping` an internal service VIP (e.g. `10.0.0.157`), you may see:
|
|
|
|
```
|
|
Reply from 10.0.0.147: Destination host unreachable.
|
|
```
|
|
|
|
**This is normal and not an error.** Here's why:
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
participant Win as Windows Client
|
|
participant Node as esxi-2cu-8g-03 (10.0.0.147)<br/>holds L2 ARP lease for 10.0.0.157
|
|
participant Envoy as Envoy Proxy (VIP: 10.0.0.157)
|
|
|
|
Win->>Node: ICMP Echo Request → 10.0.0.157
|
|
Note over Node: ARP resolves 10.0.0.157 to Node's MAC<br/>(Cilium L2 announcement)
|
|
Node->>Envoy: forwards packet
|
|
Note over Envoy: Envoy only handles TCP 80/443<br/>ICMP is not forwarded
|
|
Node-->>Win: ICMP "Destination Host Unreachable"
|
|
|
|
Win->>Node: TCP SYN → 10.0.0.157:443
|
|
Node->>Envoy: proxies connection
|
|
Envoy-->>Win: TLS handshake + HTTP response ✅
|
|
```
|
|
|
|
- `10.0.0.147` is the **cluster node** (`esxi-2cu-8g-03`) holding the Cilium L2 ARP announcement lease for `10.0.0.157` — it is not a router.
|
|
- Envoy Gateway only listens on TCP 80/443. ICMP ping packets are not handled, so the node's kernel returns "host unreachable".
|
|
- **Use `curl` instead of `ping` to verify connectivity:**
|
|
|
|
```bash
|
|
# Should return HTTP 200
|
|
curl -sk -o /dev/null -w "%{http_code}" https://myservice.laurivan.com
|
|
|
|
# Or directly by IP with a Host header
|
|
curl -sk -o /dev/null -w "%{http_code}" https://10.0.0.157 -H "Host: myservice.laurivan.com"
|
|
```
|
|
|
|
---
|
|
|
|
## Adding a New Service on `envoy-internal`
|
|
|
|
### How to expose a service internally (no public DNS)
|
|
|
|
1. In your HelmRelease (or a standalone `HTTPRoute`), set `parentRefs` to `envoy-internal` in the `network` namespace:
|
|
|
|
```yaml
|
|
route:
|
|
app:
|
|
hostnames: ["myservice.${SECRET_DOMAIN}"]
|
|
parentRefs:
|
|
- name: envoy-internal
|
|
namespace: network
|
|
sectionName: https
|
|
rules:
|
|
- backendRefs:
|
|
- identifier: app
|
|
port: 80
|
|
```
|
|
|
|
2. Ensure your Flux `Kustomization` has `postBuild.substituteFrom` set so `${SECRET_DOMAIN}` is substituted:
|
|
|
|
```yaml
|
|
postBuild:
|
|
substituteFrom:
|
|
- name: cluster-secrets
|
|
kind: Secret
|
|
```
|
|
|
|
3. After Flux reconciles, `k8s-gateway` will automatically pick up the new HTTPRoute and start serving DNS for `myservice.laurivan.com` → `10.0.0.157`.
|
|
|
|
4. No `DNSEndpoint` CRD or Cloudflare changes are needed — `k8s-gateway` handles it.
|
|
|
|
> **Do NOT** add a `DNSEndpoint` pointing `myservice.laurivan.com` → `internal.laurivan.com`. That would push the record to public Cloudflare DNS, exposing an internal service to the internet.
|
|
|
|
---
|
|
|
|
## Quick Diagnostic Checklist
|
|
|
|
```bash
|
|
# 1. Is the pod running?
|
|
kubectl get pods -n <namespace>
|
|
|
|
# 2. Is the HTTPRoute accepted?
|
|
kubectl get httproute -n <namespace> -o yaml | grep -A5 "status:"
|
|
|
|
# 3. Does k8s-gateway resolve the hostname?
|
|
dig +short <hostname>.laurivan.com @10.0.0.156
|
|
|
|
# 4. Is the k8s-gateway HTTPRoute controller active?
|
|
kubectl logs -n network -l app.kubernetes.io/name=k8s-gateway | grep -E "HTTPRoute|error|NXDOMAIN"
|
|
|
|
# 5. Can you reach the service on TCP?
|
|
curl -sk -o /dev/null -w "%{http_code}" https://<hostname>.laurivan.com
|
|
|
|
# 6. Is the Cilium L2 lease active?
|
|
kubectl -n kube-system get lease | grep l2announce
|
|
```
|