Implement automatic client reconnection with exponential backoff and heartbeat timeout

- Add heartbeat timeout to client control connection using server heartbeats for dead connection detection
- Introduce exponential backoff with jitter for reconnection delays
- Add CLI flags: --no-reconnect to disable auto-reconnect, --max-reconnect-delay to configure backoff cap
- Classify authentication errors as fatal (never retried); retry all other errors automatically
- Configure TCP keepalive on control connections for OS-level dead connection detection
- Update documentation (README.md, CLAUDE.md) to describe reconnection behavior and new flags
- Add unit tests for backoff logic and error classification
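
As an illustration of the backoff bullet above, here is a minimal sketch using only the standard library. It is not the commit's `ExponentialBackoff`: the type name `BackoffSketch`, its fields, the seeded xorshift generator, and the jitter rule (a delay drawn uniformly from the upper half of the current ceiling) are all assumptions made for this example. Only the 1s base and the configurable cap come from the commit.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the commit's `ExponentialBackoff`.
/// The real type's fields and jitter strategy are not shown in this diff.
pub struct BackoffSketch {
    base: Duration,
    max: Duration,
    attempt: u32,
    rng_state: u64,
}

impl BackoffSketch {
    pub fn new(base: Duration, max: Duration, seed: u64) -> Self {
        // xorshift must not start from zero, so force the low bit on.
        Self { base, max, attempt: 0, rng_state: seed | 1 }
    }

    /// Delay ceiling for the current attempt: base * 2^attempt, capped at max.
    fn ceiling(&self) -> Duration {
        let factor = 1u32.checked_shl(self.attempt).unwrap_or(u32::MAX);
        self.base.checked_mul(factor).unwrap_or(self.max).min(self.max)
    }

    /// Small xorshift64 generator so the sketch needs no external crate.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// Returns a delay in [ceiling / 2, ceiling] and advances the attempt counter.
    pub fn next_delay(&mut self) -> Duration {
        let ceiling = self.ceiling();
        self.attempt = self.attempt.saturating_add(1);
        let half = ceiling / 2;
        let span_ms = (ceiling - half).as_millis() as u64;
        let jitter_ms = if span_ms == 0 { 0 } else { self.next_u64() % (span_ms + 1) };
        half + Duration::from_millis(jitter_ms)
    }

    /// Call after a successful connection so the next failure starts from the base delay.
    pub fn reset(&mut self) {
        self.attempt = 0;
    }
}
```

With a 1s base and an 8s cap, successive ceilings are 1s, 2s, 4s, 8s, 8s, and so on. The jitter keeps many clients that lost the same server from reconnecting in lockstep.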
kfirfer 2026-02-17 14:35:36 +07:00
parent 042fa78742
commit a13e03372e
GPG key ID: B2103FE1471D8A5E
9 changed files with 438 additions and 126 deletions
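
The retry policy in the commit message (authentication errors are fatal, everything else is retried unless `--no-reconnect` is set) might look roughly like the sketch below. The names `is_auth_error_sketch` and `run_with_reconnect`, the matched substrings, and the injected `session` and `sleep` closures are assumptions for illustration. The real `is_auth_error()` and the loop in `main.rs` are not shown in this diff, and jitter is omitted here for brevity.

```rust
use std::time::Duration;

/// Hypothetical string-based check. The substrings are assumptions, since the
/// body of the real `is_auth_error()` is not part of this diff.
fn is_auth_error_sketch(message: &str) -> bool {
    let lower = message.to_lowercase();
    ["authentication", "invalid secret"]
        .iter()
        .any(|needle| lower.contains(*needle))
}

/// Runs `session` (one connect-and-listen cycle) until it ends cleanly or fails
/// fatally. `sleep` is injected so the loop can be exercised without waiting.
fn run_with_reconnect<S, W>(
    reconnect: bool,     // false models `--no-reconnect`
    max_delay: Duration, // models `--max-reconnect-delay`
    mut session: S,
    mut sleep: W,
) -> Result<(), String>
where
    S: FnMut() -> Result<(), String>,
    W: FnMut(Duration),
{
    let mut delay = Duration::from_secs(1); // 1s base, as stated in the diff
    loop {
        let err = match session() {
            Ok(()) => return Ok(()),
            Err(e) => e,
        };
        // Fatal cases: reconnection disabled, or the server rejected our credentials.
        if !reconnect || is_auth_error_sketch(&err) {
            return Err(err);
        }
        sleep(delay);
        delay = (delay * 2).min(max_delay);
    }
}
```

Matching on error text is simple but brittle: if the wording of a server message changes, the check stops classifying it as fatal. A typed error enum would be the sturdier alternative.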


@@ -23,10 +23,10 @@ The codebase is ~400 lines of async Rust using Tokio. No unsafe code (`#![forbid
 ### Modules
-- **`main.rs`** — CLI entry point using clap. Two subcommands: `local` (client) and `server`.
-- **`shared.rs`** — Protocol definitions. `ClientMessage`/`ServerMessage` enums serialized as JSON over TCP with null-byte delimiters. `Delimited<U>` wraps any async stream for framed JSON I/O. Key constants: `CONTROL_PORT = 7835`, `MAX_FRAME_LENGTH = 256`, `NETWORK_TIMEOUT = 3s`.
+- **`main.rs`** — CLI entry point using clap. Two subcommands: `local` (client) and `server`. The `local` subcommand includes a reconnection loop with exponential backoff (enabled by default, disable with `--no-reconnect`). Authentication errors are classified as fatal via `is_auth_error()` and never retried.
+- **`shared.rs`** — Protocol definitions. `ClientMessage`/`ServerMessage` enums serialized as JSON over TCP with null-byte delimiters. `Delimited<U>` wraps any async stream for framed JSON I/O. Key constants: `CONTROL_PORT = 7835`, `MAX_FRAME_LENGTH = 256`, `NETWORK_TIMEOUT = 3s`, `HEARTBEAT_TIMEOUT = 8s`. Also contains `ExponentialBackoff` for reconnection delays and `set_tcp_keepalive()` for OS-level dead connection detection.
 - **`auth.rs`** — Optional HMAC-SHA256 challenge-response authentication. Secret is SHA256-hashed before use. Constant-time comparison.
-- **`client.rs`** — `Client` connects to server's control port, sends `Hello(port)`, receives assigned port. For each incoming `Connection(uuid)`, opens a new TCP connection, sends `Accept(uuid)`, then bidirectionally proxies between local service and tunnel.
+- **`client.rs`** — `Client` connects to server's control port, sends `Hello(port)`, receives assigned port. The `listen()` method wraps `recv()` in a heartbeat timeout (8s) to detect dead connections, returning an error instead of blocking forever. TCP keepalive is set on the control connection. For each incoming `Connection(uuid)`, opens a new TCP connection, sends `Accept(uuid)`, then bidirectionally proxies between local service and tunnel.
 - **`server.rs`** — `Server` listens on control port. Allocates tunnel ports (random selection, 150 attempts). Stores pending connections in `DashMap<Uuid, TcpStream>` with 10-second expiry. Sends heartbeats every 500ms.
 ### Protocol Flow
@@ -36,6 +36,7 @@ The codebase is ~400 lines of async Rust using Tokio. No unsafe code (`#![forbid
 3. Client sends `Hello(desired_port)`, server responds with `Hello(actual_port)` and starts tunnel listener
 4. When external traffic hits the tunnel port, server stores the connection by UUID, sends `Connection(uuid)` to client
 5. Client opens a new connection to server, sends `Accept(uuid)`, server pairs streams, bidirectional copy begins
+6. If the control connection drops (heartbeat timeout or EOF), the client reconnects automatically with exponential backoff (unless `--no-reconnect` is set)
 ### Key Patterns
@@ -44,6 +45,10 @@ The codebase is ~400 lines of async Rust using Tokio. No unsafe code (`#![forbid
 - `Arc<Client>`/`Arc<Server>` shared across spawned Tokio tasks
 - `tokio::io::copy_bidirectional` for efficient TCP proxying
 - `anyhow::Result` with `.context()` for error propagation
+- Heartbeat timeout on client `listen()` loop to detect dead connections (8s timeout, server heartbeats every 500ms)
+- Exponential backoff with jitter for reconnection delays (1s base, configurable max)
+- TCP keepalive via `socket2` as defense-in-depth for dead connection detection
+- String-based error classification (`is_auth_error()`) to distinguish fatal from retriable errors
 ## Testing