From 7177aebd37ea9681f08b8c4bcfaa67d514dedc27 Mon Sep 17 00:00:00 2001 From: Kristoffer Dalby Date: Fri, 18 Jul 2025 15:27:40 +0200 Subject: [PATCH] CLAUDE.md: add project file, lets call it a test Signed-off-by: Kristoffer Dalby --- CLAUDE.md | 395 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 395 insertions(+) create mode 100644 CLAUDE.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..8f2571ab --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,395 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Overview + +Headscale is an open-source implementation of the Tailscale control server written in Go. It provides self-hosted coordination for Tailscale networks (tailnets), managing node registration, IP allocation, policy enforcement, and DERP routing. + +## Development Commands + +### Quick Setup +```bash +# Recommended: Use Nix for dependency management +nix develop + +# Full development workflow +make dev # runs fmt + lint + test + build +``` + +### Essential Commands +```bash +# Build headscale binary +make build + +# Run tests +make test +go test ./... # All unit tests +go test -race ./... # With race detection + +# Run specific integration test +go run ./cmd/hi run "TestName" --postgres + +# Code formatting and linting +make fmt # Format all code (Go, docs, proto) +make lint # Lint all code (Go, proto) +make fmt-go # Format Go code only +make lint-go # Lint Go code only + +# Protocol buffer generation (after modifying proto/) +make generate + +# Clean build artifacts +make clean +``` + +### Integration Testing +```bash +# Use the hi (Headscale Integration) test runner +go run ./cmd/hi doctor # Check system requirements +go run ./cmd/hi run "TestPattern" # Run specific test +go run ./cmd/hi run "TestPattern" --postgres # With PostgreSQL backend + +# Test artifacts are saved to control_logs/ with logs and debug data +``` + +## Project Structure & Architecture + +### Top-Level Organization + +``` +headscale/ +├── cmd/ # Command-line applications +│ ├── headscale/ # Main headscale server binary +│ └── hi/ # Headscale Integration test runner +├── hscontrol/ # Core control plane logic +├── integration/ # End-to-end Docker-based tests +├── proto/ # Protocol buffer definitions +├── gen/ # Generated code (protobuf) +├── docs/ # Documentation +└── packaging/ # Distribution packaging +``` + +### Core Packages (`hscontrol/`) + +**Main Server (`hscontrol/`)** +- `app.go`: Application setup, dependency injection, server lifecycle +- `handlers.go`: HTTP/gRPC API endpoints for management operations +- `grpcv1.go`: gRPC service implementation for headscale API +- `poll.go`: **Critical** - Handles Tailscale MapRequest/MapResponse protocol +- `noise.go`: Noise protocol implementation for secure client communication +- `auth.go`: Authentication flows (web, OIDC, command-line) +- `oidc.go`: OpenID Connect integration for user authentication + +**State Management (`hscontrol/state/`)** +- `state.go`: Central coordinator for all subsystems (database, policy, IP allocation, DERP) +- `node_store.go`: **Performance-critical** - In-memory cache with copy-on-write semantics +- Thread-safe operations with deadlock detection +- Coordinates between database persistence and real-time operations + +**Database Layer (`hscontrol/db/`)** +- `db.go`: Database abstraction, GORM setup, migration management +- `node.go`: Node lifecycle, registration, expiration, IP assignment +- `users.go`: User management, namespace isolation +- `api_key.go`: API authentication tokens +- `preauth_keys.go`: Pre-authentication keys for automated node registration +- `ip.go`: IP address allocation and management +- `policy.go`: Policy storage and retrieval +- Schema migrations in `schema.sql` with extensive test data coverage + +**Policy Engine (`hscontrol/policy/`)** +- `policy.go`: Core ACL evaluation logic, HuJSON parsing +- `v2/`: Next-generation policy system with improved filtering +- `matcher/`: ACL rule matching and evaluation engine +- Determines peer visibility, route approval, and network access rules +- Supports both file-based and database-stored policies + +**Network Management (`hscontrol/`)** +- `derp/`: DERP (Designated Encrypted Relay for Packets) server implementation + - NAT traversal when direct connections fail + - Fallback relay for firewall-restricted environments +- `mapper/`: Converts internal Headscale state to Tailscale's wire protocol format + - `tail.go`: Tailscale-specific data structure generation +- `routes/`: Subnet route management and primary route selection +- `dns/`: DNS record management and MagicDNS implementation + +**Utilities & Support (`hscontrol/`)** +- `types/`: Core data structures, configuration, validation +- `util/`: Helper functions for networking, DNS, key management +- `templates/`: Client configuration templates (Apple, Windows, etc.) +- `notifier/`: Event notification system for real-time updates +- `metrics.go`: Prometheus metrics collection +- `capver/`: Tailscale capability version management + +### Key Subsystem Interactions + +**Node Registration Flow** +1. **Client Connection**: `noise.go` handles secure protocol handshake +2. **Authentication**: `auth.go` validates credentials (web/OIDC/preauth) +3. **State Creation**: `state.go` coordinates IP allocation via `db/ip.go` +4. **Storage**: `db/node.go` persists node, `NodeStore` caches in memory +5. **Network Setup**: `mapper/` generates initial Tailscale network map + +**Ongoing Operations** +1. **Poll Requests**: `poll.go` receives periodic client updates +2. **State Updates**: `NodeStore` maintains real-time node information +3. **Policy Application**: `policy/` evaluates ACL rules for peer relationships +4. **Map Distribution**: `mapper/` sends network topology to all affected clients + +**Route Management** +1. **Advertisement**: Clients announce routes via `poll.go` Hostinfo updates +2. **Storage**: `db/` persists routes, `NodeStore` caches for performance +3. **Approval**: `policy/` auto-approves routes based on ACL rules +4. **Distribution**: `routes/` selects primary routes, `mapper/` distributes to peers + +### Command-Line Tools (`cmd/`) + +**Main Server (`cmd/headscale/`)** +- `headscale.go`: CLI parsing, configuration loading, server startup +- Supports daemon mode, CLI operations (user/node management), database operations + +**Integration Test Runner (`cmd/hi/`)** +- `main.go`: Test execution framework with Docker orchestration +- `run.go`: Individual test execution with artifact collection +- `doctor.go`: System requirements validation +- `docker.go`: Container lifecycle management +- Essential for validating changes against real Tailscale clients + +### Generated & External Code + +**Protocol Buffers (`proto/` → `gen/`)** +- Defines gRPC API for headscale management operations +- Client libraries can generate from these definitions +- Run `make generate` after modifying `.proto` files + +**Integration Testing (`integration/`)** +- `scenario.go`: Docker test environment setup +- `tailscale.go`: Tailscale client container management +- Individual test files for specific functionality areas +- Real end-to-end validation with network isolation + +### Critical Performance Paths + +**High-Frequency Operations** +1. **MapRequest Processing** (`poll.go`): Every 15-60 seconds per client +2. **NodeStore Reads** (`node_store.go`): Every operation requiring node data +3. **Policy Evaluation** (`policy/`): On every peer relationship calculation +4. **Route Lookups** (`routes/`): During network map generation + +**Database Write Patterns** +- **Frequent**: Node heartbeats, endpoint updates, route changes +- **Moderate**: User operations, policy updates, API key management +- **Rare**: Schema migrations, bulk operations + +### Configuration & Deployment + +**Configuration** (`hscontrol/types/config.go`)** +- Database connection settings (SQLite/PostgreSQL) +- Network configuration (IP ranges, DNS settings) +- Policy mode (file vs database) +- DERP relay configuration +- OIDC provider settings + +**Key Dependencies** +- **GORM**: Database ORM with migration support +- **Tailscale Libraries**: Core networking and protocol code +- **Zerolog**: Structured logging throughout the application +- **Buf**: Protocol buffer toolchain for code generation + +### Development Workflow Integration + +The architecture supports incremental development: +- **Unit Tests**: Focus on individual packages (`*_test.go` files) +- **Integration Tests**: Validate cross-component interactions +- **Database Tests**: Extensive migration and data integrity validation +- **Policy Tests**: ACL rule evaluation and edge cases +- **Performance Tests**: NodeStore and high-frequency operation validation + +## Integration Test System + +### Overview +Integration tests use Docker containers running real Tailscale clients against a Headscale server. Tests validate end-to-end functionality including routing, ACLs, node lifecycle, and network coordination. + +### Running Integration Tests + +**System Requirements** +```bash +# Check if your system is ready +go run ./cmd/hi doctor +``` +This verifies Docker, Go, required images, and disk space. + +**Test Execution Patterns** +```bash +# Run a single test (recommended for development) +go run ./cmd/hi run "TestSubnetRouterMultiNetwork" + +# Run with PostgreSQL backend (for database-heavy tests) +go run ./cmd/hi run "TestExpireNode" --postgres + +# Run multiple tests with pattern matching +go run ./cmd/hi run "TestSubnet*" + +# Run all integration tests (CI/full validation) +go test ./integration -timeout 30m +``` + +**Test Categories & Timing** +- **Fast tests** (< 2 min): Basic functionality, CLI operations +- **Medium tests** (2-5 min): Route management, ACL validation +- **Slow tests** (5+ min): Node expiration, HA failover +- **Long-running tests** (10+ min): `TestNodeOnlineStatus` (12 min duration) + +### Test Infrastructure + +**Docker Setup** +- Headscale server container with configurable database backend +- Multiple Tailscale client containers with different versions +- Isolated networks per test scenario +- Automatic cleanup after test completion + +**Test Artifacts** +All test runs save artifacts to `control_logs/TIMESTAMP-ID/`: +``` +control_logs/20250713-213106-iajsux/ +├── hs-testname-abc123.stderr.log # Headscale server logs +├── hs-testname-abc123.stdout.log +├── hs-testname-abc123.db # Database snapshot +├── hs-testname-abc123_metrics.txt # Prometheus metrics +├── hs-testname-abc123-mapresponses/ # Protocol debug data +├── ts-client-xyz789.stderr.log # Tailscale client logs +├── ts-client-xyz789.stdout.log +└── ts-client-xyz789_status.json # Client status dump +``` + +### Test Development Guidelines + +**Timing Considerations** +Integration tests involve real network operations and Docker container lifecycle: + +```go +// ❌ Wrong: Immediate assertions after async operations +client.Execute([]string{"tailscale", "set", "--advertise-routes=10.0.0.0/24"}) +nodes, _ := headscale.ListNodes() +require.Len(t, nodes[0].GetAvailableRoutes(), 1) // May fail due to timing + +// ✅ Correct: Wait for async operations to complete +client.Execute([]string{"tailscale", "set", "--advertise-routes=10.0.0.0/24"}) +require.EventuallyWithT(t, func(c *assert.CollectT) { + nodes, err := headscale.ListNodes() + assert.NoError(c, err) + assert.Len(c, nodes[0].GetAvailableRoutes(), 1) +}, 10*time.Second, 100*time.Millisecond, "route should be advertised") +``` + +**Common Test Patterns** +- **Route Advertisement**: Use `EventuallyWithT` for route propagation +- **Node State Changes**: Wait for NodeStore synchronization +- **ACL Policy Changes**: Allow time for policy recalculation +- **Network Connectivity**: Use ping tests with retries + +**Test Data Management** +```go +// Node identification: Don't assume array ordering +expectedRoutes := map[string]string{"1": "10.33.0.0/16"} +for _, node := range nodes { + nodeIDStr := fmt.Sprintf("%d", node.GetId()) + if route, shouldHaveRoute := expectedRoutes[nodeIDStr]; shouldHaveRoute { + // Test the node that should have the route + } +} +``` + +### Troubleshooting Integration Tests + +**Common Failure Patterns** +1. **Timing Issues**: Test assertions run before async operations complete + - **Solution**: Use `EventuallyWithT` with appropriate timeouts + - **Timeout Guidelines**: 3-5s for route operations, 10s for complex scenarios + +2. **Infrastructure Problems**: Disk space, Docker issues, network conflicts + - **Check**: `go run ./cmd/hi doctor` for system health + - **Clean**: Remove old test containers and networks + +3. **NodeStore Synchronization**: Tests expecting immediate data availability + - **Key Points**: Route advertisements must propagate through poll requests + - **Fix**: Wait for NodeStore updates after Hostinfo changes + +4. **Database Backend Differences**: SQLite vs PostgreSQL behavior differences + - **Use**: `--postgres` flag for database-intensive tests + - **Note**: Some timing characteristics differ between backends + +**Debugging Failed Tests** +1. **Check test artifacts** in `control_logs/` for detailed logs +2. **Examine MapResponse JSON** files for protocol-level debugging +3. **Review Headscale stderr logs** for server-side error messages +4. **Check Tailscale client status** for network-level issues + +**Resource Management** +- Tests require significant disk space (each run ~100MB of logs) +- Docker containers are cleaned up automatically on success +- Failed tests may leave containers running - clean manually if needed +- Use `docker system prune` periodically to reclaim space + +### Best Practices for Test Modifications + +1. **Always test locally** before committing integration test changes +2. **Use appropriate timeouts** - too short causes flaky tests, too long slows CI +3. **Clean up properly** - ensure tests don't leave persistent state +4. **Handle both success and failure paths** in test scenarios +5. **Document timing requirements** for complex test scenarios + +## NodeStore Implementation Details + +**Key Insight from Recent Work**: The NodeStore is a critical performance optimization that caches node data in memory while ensuring consistency with the database. When working with route advertisements or node state changes: + +1. **Timing Considerations**: Route advertisements need time to propagate from clients to server. Use `require.EventuallyWithT()` patterns in tests instead of immediate assertions. + +2. **Synchronization Points**: NodeStore updates happen at specific points like `poll.go:420` after Hostinfo changes. Ensure these are maintained when modifying the polling logic. + +3. **Peer Visibility**: The NodeStore's `peersFunc` determines which nodes are visible to each other. Policy-based filtering is separate from monitoring visibility - expired nodes should remain visible for debugging but marked as expired. + +## Testing Guidelines + +### Integration Test Patterns +```go +// Use EventuallyWithT for async operations +require.EventuallyWithT(t, func(c *assert.CollectT) { + nodes, err := headscale.ListNodes() + assert.NoError(c, err) + // Check expected state +}, 10*time.Second, 100*time.Millisecond, "description") + +// Node route checking by actual node properties, not array position +var routeNode *v1.Node +for _, node := range nodes { + if nodeIDStr := fmt.Sprintf("%d", node.GetId()); expectedRoutes[nodeIDStr] != "" { + routeNode = node + break + } +} +``` + +### Running Problematic Tests +- Some tests require significant time (e.g., `TestNodeOnlineStatus` runs for 12 minutes) +- Infrastructure issues like disk space can cause test failures unrelated to code changes +- Use `--postgres` flag when testing database-heavy scenarios + +## Important Notes + +- **Dependencies**: Use `nix develop` for consistent toolchain (Go, buf, protobuf tools, linting) +- **Protocol Buffers**: Changes to `proto/` require `make generate` and should be committed separately +- **Code Style**: Enforced via golangci-lint with golines (width 88) and gofumpt formatting +- **Database**: Supports both SQLite (development) and PostgreSQL (production/testing) +- **Integration Tests**: Require Docker and can consume significant disk space +- **Performance**: NodeStore optimizations are critical for scale - be careful with changes to state management + +## Debugging Integration Tests + +Test artifacts are preserved in `control_logs/TIMESTAMP-ID/` including: +- Headscale server logs (stderr/stdout) +- Tailscale client logs and status +- Database dumps and network captures +- MapResponse JSON files for protocol debugging + +When tests fail, check these artifacts first before assuming code issues.