Implement automatic client reconnection with exponential backoff and heartbeat timeout

- Add heartbeat timeout to client control connection using server heartbeats for dead connection detection
- Introduce exponential backoff with jitter for reconnection delays
- Add CLI flags: --no-reconnect to disable auto-reconnect, --max-reconnect-delay to configure backoff cap
- Classify authentication errors as fatal (never retried); all other errors are retried automatically
- Configure TCP keepalive on control connections for OS-level dead connection detection
- Update documentation (README.md, CLAUDE.md) to describe reconnection behavior and new flags
- Add unit tests for backoff logic and error classification
kfirfer 2026-02-17 14:35:36 +07:00
parent 042fa78742
commit a13e03372e
No known key found for this signature in database
GPG key ID: B2103FE1471D8A5E
9 changed files with 438 additions and 126 deletions

View file

@ -23,10 +23,10 @@ The codebase is ~400 lines of async Rust using Tokio. No unsafe code (`#![forbid
### Modules
- **`main.rs`** — CLI entry point using clap. Two subcommands: `local` (client) and `server`.
- **`shared.rs`** — Protocol definitions. `ClientMessage`/`ServerMessage` enums serialized as JSON over TCP with null-byte delimiters. `Delimited<U>` wraps any async stream for framed JSON I/O. Key constants: `CONTROL_PORT = 7835`, `MAX_FRAME_LENGTH = 256`, `NETWORK_TIMEOUT = 3s`.
- **`main.rs`** — CLI entry point using clap. Two subcommands: `local` (client) and `server`. The `local` subcommand includes a reconnection loop with exponential backoff (enabled by default, disable with `--no-reconnect`). Authentication errors are classified as fatal via `is_auth_error()` and never retried.
- **`shared.rs`** — Protocol definitions. `ClientMessage`/`ServerMessage` enums serialized as JSON over TCP with null-byte delimiters. `Delimited<U>` wraps any async stream for framed JSON I/O. Key constants: `CONTROL_PORT = 7835`, `MAX_FRAME_LENGTH = 256`, `NETWORK_TIMEOUT = 3s`, `HEARTBEAT_TIMEOUT = 8s`. Also contains `ExponentialBackoff` for reconnection delays and `set_tcp_keepalive()` for OS-level dead connection detection.
- **`auth.rs`** — Optional HMAC-SHA256 challenge-response authentication. Secret is SHA256-hashed before use. Constant-time comparison.
- **`client.rs`** — `Client` connects to server's control port, sends `Hello(port)`, receives assigned port. For each incoming `Connection(uuid)`, opens a new TCP connection, sends `Accept(uuid)`, then bidirectionally proxies between local service and tunnel.
- **`client.rs`** — `Client` connects to server's control port, sends `Hello(port)`, receives assigned port. The `listen()` method wraps `recv()` in a heartbeat timeout (8s) to detect dead connections, returning an error instead of blocking forever. TCP keepalive is set on the control connection. For each incoming `Connection(uuid)`, opens a new TCP connection, sends `Accept(uuid)`, then bidirectionally proxies between local service and tunnel.
- **`server.rs`** — `Server` listens on control port. Allocates tunnel ports (random selection, 150 attempts). Stores pending connections in `DashMap<Uuid, TcpStream>` with 10-second expiry. Sends heartbeats every 500ms.
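The null-byte framing that `Delimited<U>` provides can be sketched without the real codec. This is a dependency-free illustration only: the actual implementation uses `tokio-util` codecs and `serde_json`, and payloads here are plain strings rather than the message enums.

```rust
// Dependency-free sketch of the framing described above: each message is a
// JSON text terminated by a single null byte, with frames capped at
// MAX_FRAME_LENGTH (256 in bore).
const MAX_FRAME_LENGTH: usize = 256;

fn encode_frame(json: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(json.len() + 1);
    out.extend_from_slice(json.as_bytes());
    out.push(0); // null-byte delimiter
    out
}

fn decode_frames(buf: &[u8]) -> Result<Vec<String>, String> {
    buf.split(|&b| b == 0)
        .filter(|frame| !frame.is_empty()) // trailing delimiter leaves an empty tail
        .map(|frame| {
            if frame.len() > MAX_FRAME_LENGTH {
                return Err(format!("frame exceeds {MAX_FRAME_LENGTH} bytes"));
            }
            String::from_utf8(frame.to_vec()).map_err(|e| e.to_string())
        })
        .collect()
}
```

The `MAX_FRAME_LENGTH` check doubles as a cheap defense against unframed garbage on the control port.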
### Protocol Flow
@ -36,6 +36,7 @@ The codebase is ~400 lines of async Rust using Tokio. No unsafe code (`#![forbid
3. Client sends `Hello(desired_port)`, server responds with `Hello(actual_port)` and starts tunnel listener
4. When external traffic hits the tunnel port, server stores the connection by UUID, sends `Connection(uuid)` to client
5. Client opens a new connection to server, sends `Accept(uuid)`, server pairs streams, bidirectional copy begins
6. If the control connection drops (heartbeat timeout or EOF), the client reconnects automatically with exponential backoff (unless `--no-reconnect` is set)
### Key Patterns
@ -44,6 +45,10 @@ The codebase is ~400 lines of async Rust using Tokio. No unsafe code (`#![forbid
- `Arc<Client>`/`Arc<Server>` shared across spawned Tokio tasks
- `tokio::io::copy_bidirectional` for efficient TCP proxying
- `anyhow::Result` with `.context()` for error propagation
- Heartbeat timeout on client `listen()` loop to detect dead connections (8s timeout, server heartbeats every 500ms)
- Exponential backoff with jitter for reconnection delays (1s base, configurable max)
- TCP keepalive via `socket2` as defense-in-depth for dead connection detection
- String-based error classification (`is_auth_error()`) to distinguish fatal from retriable errors
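The heartbeat-timeout pattern above can be illustrated with std primitives standing in for the async versions: an `mpsc::Receiver` plays the control connection, and `recv_timeout` plays `timeout(HEARTBEAT_TIMEOUT, conn.recv())`. This is a synchronous analogy, not the real tokio code.

```rust
use std::sync::mpsc;
use std::time::Duration;

// One iteration of the listen loop, sketched synchronously. A timeout (or a
// closed channel) is surfaced as an error so an outer reconnection loop can
// trigger, instead of the client blocking forever on a dead connection.
fn listen_step(
    conn: &mpsc::Receiver<&'static str>,
    heartbeat_timeout: Duration,
) -> Result<&'static str, String> {
    conn.recv_timeout(heartbeat_timeout)
        .map_err(|_| "heartbeat timeout, connection to server lost".to_string())
}
```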
## Testing

Cargo.lock generated (90 changed lines)
View file

@ -127,6 +127,7 @@ dependencies = [
"serde",
"serde_json",
"sha2",
"socket2 0.5.10",
"tokio",
"tokio-util",
"tracing",
@ -474,9 +475,9 @@ checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646"
[[package]]
name = "libc"
version = "0.2.142"
version = "0.2.182"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6a987beff54b60ffa6d51982e1aa1146bc42f19bd26be28b0586f252fccf5317"
checksum = "6800badb6cb2082ffd7b6a67e6125bb39f18782f793520caee8cb8846be06112"
[[package]]
name = "linux-raw-sys"
@ -768,6 +769,16 @@ dependencies = [
"winapi",
]
[[package]]
name = "socket2"
version = "0.5.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e22376abed350d73dd1cd119b57ffccad95b4e585a7cda43e286245ce23c0678"
dependencies = [
"libc",
"windows-sys 0.52.0",
]
[[package]]
name = "strsim"
version = "0.10.0"
@ -824,7 +835,7 @@ dependencies = [
"mio",
"num_cpus",
"pin-project-lite",
"socket2",
"socket2 0.4.9",
"tokio-macros",
"windows-sys 0.48.0",
]
@ -998,6 +1009,15 @@ dependencies = [
"windows-targets 0.48.0",
]
[[package]]
name = "windows-sys"
version = "0.52.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d"
dependencies = [
"windows-targets 0.52.6",
]
[[package]]
name = "windows-targets"
version = "0.42.2"
@ -1028,6 +1048,22 @@ dependencies = [
"windows_x86_64_msvc 0.48.0",
]
[[package]]
name = "windows-targets"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973"
dependencies = [
"windows_aarch64_gnullvm 0.52.6",
"windows_aarch64_msvc 0.52.6",
"windows_i686_gnu 0.52.6",
"windows_i686_gnullvm",
"windows_i686_msvc 0.52.6",
"windows_x86_64_gnu 0.52.6",
"windows_x86_64_gnullvm 0.52.6",
"windows_x86_64_msvc 0.52.6",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.42.2"
@ -1040,6 +1076,12 @@ version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "91ae572e1b79dba883e0d315474df7305d12f569b400fcf90581b06062f7e1bc"
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3"
[[package]]
name = "windows_aarch64_msvc"
version = "0.42.2"
@ -1052,6 +1094,12 @@ version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b2ef27e0d7bdfcfc7b868b317c1d32c641a6fe4629c171b8928c7b08d98d7cf3"
[[package]]
name = "windows_aarch64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469"
[[package]]
name = "windows_i686_gnu"
version = "0.42.2"
@ -1064,6 +1112,18 @@ version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "622a1962a7db830d6fd0a69683c80a18fda201879f0f447f065a3b7467daa241"
[[package]]
name = "windows_i686_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b"
[[package]]
name = "windows_i686_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66"
[[package]]
name = "windows_i686_msvc"
version = "0.42.2"
@ -1076,6 +1136,12 @@ version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4542c6e364ce21bf45d69fdd2a8e455fa38d316158cfd43b3ac1c5b1b19f8e00"
[[package]]
name = "windows_i686_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66"
[[package]]
name = "windows_x86_64_gnu"
version = "0.42.2"
@ -1088,6 +1154,12 @@ version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ca2b8a661f7628cbd23440e50b05d705db3686f894fc9580820623656af974b1"
[[package]]
name = "windows_x86_64_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78"
[[package]]
name = "windows_x86_64_gnullvm"
version = "0.42.2"
@ -1100,6 +1172,12 @@ version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7896dbc1f41e08872e9d5e8f8baa8fdd2677f29468c4e156210174edc7f7b953"
[[package]]
name = "windows_x86_64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d"
[[package]]
name = "windows_x86_64_msvc"
version = "0.42.2"
@ -1111,3 +1189,9 @@ name = "windows_x86_64_msvc"
version = "0.48.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1a515f5799fe4961cb532f983ce2b23082366b898e52ffbce459c86f67c8378a"
[[package]]
name = "windows_x86_64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec"

View file

@ -26,7 +26,8 @@ hmac = "0.12.1"
serde = { version = "1.0.136", features = ["derive"] }
serde_json = "1.0.79"
sha2 = "0.10.2"
tokio = { version = "1.17.0", features = ["rt-multi-thread", "io-util", "macros", "net", "time"] }
socket2 = { version = "0.5", features = ["all"] }
tokio = { version = "1.21.0", features = ["rt-multi-thread", "io-util", "macros", "net", "time"] }
tokio-util = { version = "0.7.1", features = ["codec"] }
tracing = "0.1.32"
tracing-subscriber = "0.3.18"
@ -35,4 +36,4 @@ uuid = { version = "1.2.1", features = ["serde", "v4"] }
[dev-dependencies]
lazy_static = "1.4.0"
rstest = "0.15.0"
tokio = { version = "1.17.0", features = ["sync"] }
tokio = { version = "1.21.0", features = ["sync"] }

View file

@ -96,11 +96,13 @@ Arguments:
<LOCAL_PORT> The local port to expose [env: BORE_LOCAL_PORT=]
Options:
-l, --local-host <HOST> The local host to expose [default: localhost]
-t, --to <TO> Address of the remote server to expose local ports to [env: BORE_SERVER=]
-p, --port <PORT> Optional port on the remote server to select [default: 0]
-s, --secret <SECRET> Optional secret for authentication [env: BORE_SECRET]
-h, --help Print help
-l, --local-host <HOST> The local host to expose [default: localhost]
-t, --to <TO> Address of the remote server to expose local ports to [env: BORE_SERVER=]
-p, --port <PORT> Optional port on the remote server to select [default: 0]
-s, --secret <SECRET> Optional secret for authentication [env: BORE_SECRET]
--no-reconnect Disable automatic reconnection on connection loss
--max-reconnect-delay <SECONDS> Maximum delay between reconnection attempts [default: 64]
-h, --help Print help
```
### Self-Hosting
@ -139,6 +141,17 @@ Whenever the server obtains a connection on the remote port, it generates a secu
For correctness reasons and to avoid memory leaks, incoming connections are only stored by the server for up to 10 seconds before being discarded if the client does not accept them.
## Reconnection
By default, `bore` automatically reconnects to the server when the connection is lost (e.g., due to network interruptions). This makes it suitable for long-running deployments with service managers like systemd or launchd.
- **Automatic reconnection** is enabled by default with exponential backoff (1s, 2s, 4s, ... up to 64s max)
- **Authentication failures** (wrong secret) are never retried — the client exits immediately
- **`--no-reconnect`** disables automatic reconnection, restoring the legacy exit-on-disconnect behavior
- **`--max-reconnect-delay <SECONDS>`** configures the maximum backoff delay (default: 64 seconds)
Dead connections are detected via a heartbeat timeout: the server sends heartbeats every 500ms, and if no message is received within 8 seconds, the client treats the connection as dead and begins reconnecting. TCP keepalive is also configured as an additional safety net.
## Authentication
On a custom deployment of `bore server`, you can optionally require a _secret_ to prevent the server from being used by others. The protocol requires clients to verify possession of the secret on each TCP connection by answering random challenges in the form of HMAC codes. (This secret is only used for the initial handshake, and no further traffic is encrypted by default.)

View file

@ -79,7 +79,7 @@ main.rs OUTER RECONNECTION LOOP:
**Goal**: Make the client actively detect dead connections using the server's existing heartbeat mechanism, instead of blocking indefinitely on `recv()`.
#### [ ] Task 1.1: Add heartbeat timeout constant to `shared.rs`
#### [x] Task 1.1: Add heartbeat timeout constant to `shared.rs`
**File**: `src/shared.rs`
**Change**: Add a new constant after line 21.
@ -95,13 +95,13 @@ pub const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(8);
**Rationale**: 8 seconds = 16 missed heartbeats at 500ms interval. This is generous enough to handle brief network hiccups (packet loss, temporary congestion) while still detecting dead connections quickly. The value should be at least 3-4x the heartbeat interval to avoid false positives.
**Subtasks**:
- [ ] Add `HEARTBEAT_TIMEOUT` constant (8 seconds)
- [ ] Add doc comment explaining the relationship to server heartbeat interval
- [ ] Update the `use` import in `client.rs` to include `HEARTBEAT_TIMEOUT`
- [x] Add `HEARTBEAT_TIMEOUT` constant (8 seconds)
- [x] Add doc comment explaining the relationship to server heartbeat interval
- [x] Update the `use` import in `client.rs` to include `HEARTBEAT_TIMEOUT`
---
#### [ ] Task 1.2: Add timeout to `listen()` loop in `client.rs`
#### [x] Task 1.2: Add timeout to `listen()` loop in `client.rs`
**File**: `src/client.rs`
**Change**: Modify the `listen()` method (lines 78-103) to wrap `recv()` in a timeout.
@ -166,12 +166,12 @@ pub async fn listen(mut self) -> Result<()> {
3. `None` (EOF) now returns `Err` instead of `Ok(())` — this is critical for the outer reconnection loop to trigger
**Subtasks**:
- [ ] Import `HEARTBEAT_TIMEOUT` from `shared.rs`
- [ ] Verify `timeout` is already imported at `client.rs:6` (`use tokio::{..., time::timeout}`) — no new import needed
- [ ] Wrap `conn.recv()` in timeout
- [ ] Change timeout arm to `bail!("heartbeat timeout")`
- [ ] Change `None` arm from `return Ok(())` to `bail!("server closed connection")`
- [ ] Verify existing match arms are preserved identically
- [x] Import `HEARTBEAT_TIMEOUT` from `shared.rs`
- [x] Verify `timeout` is already imported at `client.rs:6` (`use tokio::{..., time::timeout}`) — no new import needed
- [x] Wrap `conn.recv()` in timeout
- [x] Change timeout arm to `bail!("heartbeat timeout")`
- [x] Change `None` arm from `return Ok(())` to `bail!("server closed connection")`
- [x] Verify existing match arms are preserved identically
---
@ -179,7 +179,7 @@ pub async fn listen(mut self) -> Result<()> {
**Goal**: Add an outer retry loop in `main.rs` that catches connection failures and reconnects with increasing delays.
#### [ ] Task 2.1: Implement exponential backoff helper
#### [x] Task 2.1: Implement exponential backoff helper
**File**: `src/shared.rs` (or inline in `main.rs` — prefer `shared.rs` for reusability)
**Change**: Add a simple exponential backoff struct. No external crate dependency — keep bore minimal.
@ -220,15 +220,15 @@ impl ExponentialBackoff {
```
**Subtasks**:
- [ ] Add `ExponentialBackoff` struct to `shared.rs`
- [ ] Implement `new()`, `next_delay()`, `reset()`
- [ ] Add jitter using existing `fastrand` dependency (already in Cargo.toml)
- [ ] Add unit test for backoff sequence and reset behavior
- [ ] Add unit test for jitter bounds (delay always between 0.75x and 1.25x expected)
- [x] Add `ExponentialBackoff` struct to `shared.rs`
- [x] Implement `new()`, `next_delay()`, `reset()`
- [x] Add jitter using existing `fastrand` dependency (already in Cargo.toml)
- [x] Add unit test for backoff sequence and reset behavior
- [x] Add unit test for jitter bounds (delay always between 0.75x and 1.25x expected)
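A dependency-free sketch of the struct this task describes. The field layout and the tiny LCG jitter source are assumptions for illustration; the real implementation draws jitter from `fastrand`.

```rust
use std::time::Duration;

/// Sketch of the backoff helper: doubles the delay on each call, caps it at
/// `max`, and applies +/-25% jitter to the returned value.
pub struct ExponentialBackoff {
    base: Duration,
    max: Duration,
    current: Duration,
    rng_state: u64, // LCG state standing in for fastrand in this sketch
}

impl ExponentialBackoff {
    pub fn new(base: Duration, max: Duration) -> Self {
        Self { base, max, current: base, rng_state: 0x9E3779B97F4A7C15 }
    }

    /// Returns the next (jittered) delay and doubles the stored delay,
    /// capped at `max`.
    pub fn next_delay(&mut self) -> Duration {
        let delay = self.current;
        self.current = (self.current * 2).min(self.max);
        // LCG step; the real code uses fastrand::f64() here.
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let unit = (self.rng_state >> 11) as f64 / (1u64 << 53) as f64; // in [0, 1)
        delay.mul_f64(0.75 + unit * 0.5) // jitter in [0.75x, 1.25x)
    }

    /// Start over from the base delay after a successful connection.
    pub fn reset(&mut self) {
        self.current = self.base;
    }
}
```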
---
#### [ ] Task 2.2: Add CLI flags for reconnection control
#### [x] Task 2.2: Add CLI flags for reconnection control
**File**: `src/main.rs`
**Change**: Add new optional flags to the `Local` subcommand.
@ -254,13 +254,13 @@ Command::Local {
- No `--max-reconnect-attempts` — infinite retries is the right default for a tunnel daemon. Users who want limited retries can use external tooling or the `--no-reconnect` flag with service manager restart limits.
**Subtasks**:
- [ ] Add `no_reconnect: bool` field with `#[clap(long)]`
- [ ] Add `max_reconnect_delay: u64` field with default 64
- [ ] Pass these values to the reconnection loop
- [x] Add `no_reconnect: bool` field with `#[clap(long)]`
- [x] Add `max_reconnect_delay: u64` field with default 64
- [x] Pass these values to the reconnection loop
---
#### [ ] Task 2.3: Implement reconnection loop in `main.rs`
#### [x] Task 2.3: Implement reconnection loop in `main.rs`
**File**: `src/main.rs`
**Change**: Replace the current direct call pattern with a reconnection loop.
@ -339,18 +339,18 @@ Command::Local {
4. **Backoff resets on successful connection** — if the tunnel runs for hours then drops, we start with short delays again.
**Subtasks**:
- [ ] Destructure new CLI fields in the match arm
- [ ] Create `ExponentialBackoff` with configured max delay
- [ ] Keep first `Client::new()` call outside the loop (fail fast on first attempt)
- [ ] Add reconnection loop after first disconnection
- [ ] Implement `is_auth_error()` helper function
- [ ] Reset backoff on successful `Client::new()`
- [ ] Add info/warn logging for reconnection state transitions
- [ ] Import `Duration` and `ExponentialBackoff` in `main.rs`
- [x] Destructure new CLI fields in the match arm
- [x] Create `ExponentialBackoff` with configured max delay
- [x] Keep first `Client::new()` call outside the loop (fail fast on first attempt)
- [x] Add reconnection loop after first disconnection
- [x] Implement `is_auth_error()` helper function
- [x] Reset backoff on successful `Client::new()`
- [x] Add info/warn logging for reconnection state transitions
- [x] Import `Duration` and `ExponentialBackoff` in `main.rs`
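The shape of this loop can be sketched synchronously so the control flow is testable: here `attempt` stands in for `Client::new()` plus `listen()`, `is_fatal` for `is_auth_error()`, and delays are collected instead of slept. All names beyond those in the plan are placeholders.

```rust
use std::time::Duration;

/// Exit reasons for the sketched loop below.
#[derive(Debug, PartialEq)]
enum Exit {
    Fatal(String), // auth error: never retried
    NoReconnect,   // --no-reconnect: exit on first disconnection
}

// Retry transient failures with exponential backoff; stop on fatal errors
// or when reconnection is disabled. Jitter is omitted in this sketch.
fn reconnect_loop(
    mut attempt: impl FnMut() -> Result<(), String>,
    is_fatal: impl Fn(&str) -> bool,
    reconnect: bool,
    max_delay: Duration,
    slept: &mut Vec<Duration>,
) -> Exit {
    let mut delay = Duration::from_secs(1);
    loop {
        match attempt() {
            // A run that ends cleanly still means the tunnel dropped; reset
            // the backoff so the next outage starts from 1s again.
            Ok(()) => delay = Duration::from_secs(1),
            Err(e) if is_fatal(&e) => return Exit::Fatal(e),
            Err(_) if !reconnect => return Exit::NoReconnect,
            Err(_) => {
                slept.push(delay); // real code sleeps here
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```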
---
#### [ ] Task 2.4: Implement error classification
#### [x] Task 2.4: Implement error classification
**File**: `src/main.rs` (or `src/shared.rs`)
**Change**: Add a helper to distinguish fatal errors from retriable ones.
@ -375,9 +375,9 @@ fn is_auth_error(err: &anyhow::Error) -> bool {
**Note**: String matching on error messages is fragile but pragmatic here. The alternative (custom error types throughout the codebase) would require significant refactoring. The error messages being matched are all hardcoded strings in the bore source code, so they're stable.
**Subtasks**:
- [ ] Implement `is_auth_error()` function
- [ ] Verify all auth-related error messages in `auth.rs` and `client.rs` are covered (see matched paths above)
- [ ] Add unit test with sample error messages
- [x] Implement `is_auth_error()` function
- [x] Verify all auth-related error messages in `auth.rs` and `client.rs` are covered (see matched paths above)
- [x] Add unit test with sample error messages
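The classifier can be exercised directly once specialized to a message string. This sketch takes `&str` instead of `&anyhow::Error` to stay dependency-free; the real helper formats the error with `{:#}` first, and the matched substrings are the four auth-related messages the plan enumerates.

```rust
// &str version of is_auth_error for this sketch. Anything not matching one
// of these auth-related substrings is treated as retriable.
fn is_auth_error(msg: &str) -> bool {
    msg.contains("server requires authentication")
        || msg.contains("invalid secret")
        || msg.contains("server requires secret")
        || msg.contains("expected authentication challenge")
}
```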
---
@ -385,7 +385,7 @@ fn is_auth_error(err: &anyhow::Error) -> bool {
**Goal**: Configure OS-level TCP keepalive on control connections to detect dead connections even if the application-level heartbeat mechanism fails.
#### [ ] Task 3.1: Add `socket2` dependency
#### [x] Task 3.1: Add `socket2` dependency
**File**: `Cargo.toml`
**Change**: Add `socket2` crate for TCP keepalive configuration.
@ -406,13 +406,13 @@ tokio = { version = "1.21.0", features = ["rt-multi-thread", "io-util", "macros"
The actual resolved version is already 1.28.0 (via Cargo.lock), so no downstream impact. This change only updates the declared minimum to match the actual API requirements.
**Subtasks**:
- [ ] Add `socket2` to `[dependencies]` in `Cargo.toml`
- [ ] Bump minimum `tokio` version from `1.17.0` to `1.21.0` in both `[dependencies]` and `[dev-dependencies]`
- [ ] Verify it compiles on Linux (server) and macOS (client)
- [x] Add `socket2` to `[dependencies]` in `Cargo.toml`
- [x] Bump minimum `tokio` version from `1.17.0` to `1.21.0` in both `[dependencies]` and `[dev-dependencies]`
- [x] Verify it compiles on Linux (server) and macOS (client)
---
#### [ ] Task 3.2: Create TCP keepalive configuration helper
#### [x] Task 3.2: Create TCP keepalive configuration helper
**File**: `src/shared.rs`
**Change**: Add a function to configure TCP keepalive on a `TcpStream`.
@ -445,13 +445,13 @@ pub fn set_tcp_keepalive(stream: &TcpStream) -> Result<()> {
**Note on `SockRef::from(stream)`**: This calls `AsFd::as_fd()` on the tokio `TcpStream`. The reference should be passed directly (not `&stream`) since `SockRef::from` takes `&impl AsFd`.
**Subtasks**:
- [ ] Implement `set_tcp_keepalive()` function
- [ ] Verify `.with_retries()` compiles on Linux and macOS (target platforms)
- [ ] Add doc comment explaining the timing parameters and `AsFd` requirement
- [x] Implement `set_tcp_keepalive()` function
- [x] Verify `.with_retries()` compiles on Linux and macOS (target platforms)
- [x] Add doc comment explaining the timing parameters and `AsFd` requirement
---
#### [ ] Task 3.3: Apply TCP keepalive to client control connection
#### [x] Task 3.3: Apply TCP keepalive to client control connection
**File**: `src/client.rs`
**Change**: In `Client::new()`, after establishing the control connection, set TCP keepalive.
@ -473,13 +473,13 @@ impl<U> Delimited<U> {
```
**Subtasks**:
- [ ] Add `get_ref()` method to `Delimited<U>` in `shared.rs`
- [ ] Call `set_tcp_keepalive()` on the control stream in `Client::new()`
- [ ] Also apply to per-connection streams in `handle_connection()` (optional, lower priority)
- [x] Add `get_ref()` method to `Delimited<U>` in `shared.rs`
- [x] Call `set_tcp_keepalive()` on the control stream in `Client::new()`
- [x] Also apply to per-connection streams in `handle_connection()` (optional, lower priority)
---
#### [ ] Task 3.4: Apply TCP keepalive to server control connections
#### [x] Task 3.4: Apply TCP keepalive to server control connections
**File**: `src/server.rs`
**Change**: In `handle_connection()`, after accepting the stream, set TCP keepalive.
@ -491,8 +491,8 @@ let mut stream = Delimited::new(stream);
```
**Subtasks**:
- [ ] Apply `set_tcp_keepalive()` to incoming control connections
- [ ] Import the function from `shared.rs`
- [x] Apply `set_tcp_keepalive()` to incoming control connections
- [x] Import the function from `shared.rs`
---
@ -500,7 +500,7 @@ let mut stream = Delimited::new(stream);
**Goal**: Ensure the server handles rapid client reconnection gracefully.
#### [ ] Task 4.1: Handle port-in-use during client reconnection
#### [x] Task 4.1: Handle port-in-use during client reconnection
**Context**: When a client disconnects and reconnects quickly (requesting the same port), the old server task may still be running because:
- The heartbeat send at `server.rs:147` hasn't failed yet (TCP write buffers haven't flushed)
@ -512,13 +512,13 @@ let mut stream = Delimited::new(stream);
**No code change needed** — the existing behavior + reconnection loop handles this. But we should verify and document this.
**Subtasks**:
- [ ] Verify that "port already in use" error from server doesn't match `is_auth_error()`
- [ ] Add integration test for rapid reconnection with same port
- [ ] Document this behavior in code comments
- [x] Verify that "port already in use" error from server doesn't match `is_auth_error()`
- [x] Add integration test for rapid reconnection with same port
- [x] Document this behavior in code comments
---
#### [ ] Task 4.2: Add server-side heartbeat response requirement (Optional / Future Enhancement)
#### [x] Task 4.2: Add server-side heartbeat response requirement (Optional / Future Enhancement)
**Current behavior**: Server sends `Heartbeat`, client ignores it. Server detects dead client only when `stream.send(Heartbeat)` fails — which depends on TCP write buffer flushing.
@ -530,8 +530,8 @@ let mut stream = Delimited::new(stream);
3. The current approach (client-side heartbeat timeout + reconnection) is sufficient
**Subtasks**:
- [ ] Document as future enhancement
- [ ] No implementation in this phase
- [x] Document as future enhancement
- [x] No implementation in this phase
---
@ -541,7 +541,7 @@ let mut stream = Delimited::new(stream);
**NOTE — Existing test impact**: The `spawn_client()` helper in `tests/e2e_test.rs:30` spawns `client.listen()` via `tokio::spawn(client.listen())`. After Phase 1 changes, `listen()` always returns `Err` (never `Ok(())`). Since the spawned task's result is dropped (not `.await`ed), existing tests still pass — the error is silently ignored. However, for clarity and to avoid confusing error logs during test runs, consider adding a `.map_err(|e| warn!(...))` or similar handling in `spawn_client()`.
#### [ ] Task 5.1: Unit test for `ExponentialBackoff`
#### [x] Task 5.1: Unit test for `ExponentialBackoff`
**File**: `src/shared.rs` (inline `#[cfg(test)]` module) or `tests/backoff_test.rs`
@ -579,14 +579,14 @@ fn test_backoff_reset() {
```
**Subtasks**:
- [ ] Test exponential growth sequence
- [ ] Test max cap is respected
- [ ] Test reset returns to base delay
- [ ] Test jitter bounds
- [x] Test exponential growth sequence
- [x] Test max cap is respected
- [x] Test reset returns to base delay
- [x] Test jitter bounds
---
#### [ ] Task 5.2: Unit test for error classification
#### [x] Task 5.2: Unit test for error classification
**File**: `tests/reconnect_test.rs` or inline in `main.rs`
@ -608,14 +608,14 @@ fn test_auth_error_detection() {
```
**Subtasks**:
- [ ] Test auth errors are classified as fatal
- [ ] Test connection errors are classified as retriable
- [ ] Test heartbeat timeout is classified as retriable
- [ ] Test port conflict is classified as retriable
- [x] Test auth errors are classified as fatal
- [x] Test connection errors are classified as retriable
- [x] Test heartbeat timeout is classified as retriable
- [x] Test port conflict is classified as retriable
---
#### [ ] Task 5.3: Integration test — reconnection after server restart
#### [x] Task 5.3: Integration test — reconnection after server restart
**File**: `tests/e2e_test.rs`
@ -634,15 +634,15 @@ async fn reconnect_after_server_restart() {
```
**Subtasks**:
- [ ] Set up test infrastructure for server restart
- [ ] Verify client detects disconnection via heartbeat timeout
- [ ] Verify client reconnects after server restart
- [ ] Verify tunnel is functional after reconnection
- [ ] Use `SERIAL_GUARD` mutex for test isolation
- [x] Set up test infrastructure for server restart
- [x] Verify client detects disconnection via heartbeat timeout
- [x] Verify client reconnects after server restart
- [x] Verify tunnel is functional after reconnection
- [x] Use `SERIAL_GUARD` mutex for test isolation
---
#### [ ] Task 5.4: Integration test — auth failure does not retry
#### [x] Task 5.4: Integration test — auth failure does not retry
**File**: `tests/e2e_test.rs`
@ -656,13 +656,13 @@ async fn auth_failure_no_retry() {
```
**Subtasks**:
- [ ] Start server with one secret, client with another
- [ ] Verify client exits immediately (doesn't retry)
- [ ] Verify error message indicates auth failure
- [x] Start server with one secret, client with another
- [x] Verify client exits immediately (doesn't retry)
- [x] Verify error message indicates auth failure
---
#### [ ] Task 5.5: Integration test — `--no-reconnect` flag
#### [x] Task 5.5: Integration test — `--no-reconnect` flag
**File**: `tests/e2e_test.rs`
@ -677,12 +677,12 @@ async fn no_reconnect_flag_exits_on_disconnect() {
```
**Subtasks**:
- [ ] Test that `--no-reconnect` preserves legacy exit behavior
- [ ] Verify process exits after connection loss
- [x] Test that `--no-reconnect` preserves legacy exit behavior
- [x] Verify process exits after connection loss
---
#### [ ] Task 5.6: Integration test — rapid reconnection with same port
#### [x] Task 5.6: Integration test — rapid reconnection with same port
**File**: `tests/e2e_test.rs`
@ -698,15 +698,15 @@ async fn reconnect_same_port() {
```
**Subtasks**:
- [ ] Test specific port re-binding after reconnection
- [ ] Verify the client gets the same port after reconnection
- [ ] Handle transient "port in use" errors during the test
- [x] Test specific port re-binding after reconnection
- [x] Verify the client gets the same port after reconnection
- [x] Handle transient "port in use" errors during the test
---
### Phase 6: Documentation (Polish)
#### [ ] Task 6.1: Update README with reconnection behavior
#### [x] Task 6.1: Update README with reconnection behavior
**File**: `README.md`
@ -718,13 +718,13 @@ Add a section documenting:
- Auth failures are never retried
**Subtasks**:
- [ ] Add "Reconnection" section to README
- [ ] Document CLI flags
- [ ] Add example showing reconnection in logs
- [x] Add "Reconnection" section to README
- [x] Document CLI flags
- [x] Add example showing reconnection in logs
---
#### [ ] Task 6.2: Update CLAUDE.md architecture section
#### [x] Task 6.2: Update CLAUDE.md architecture section
**File**: `CLAUDE.md`
@ -736,9 +736,9 @@ Update the architecture documentation to reflect:
- New CLI flags
**Subtasks**:
- [ ] Update Architecture section
- [ ] Update Key Patterns section
- [ ] Update Protocol Flow section
- [x] Update Architecture section
- [x] Update Key Patterns section
- [x] Update Protocol Flow section
---

View file

@ -8,7 +8,10 @@ use tracing::{error, info, info_span, warn, Instrument};
use uuid::Uuid;
use crate::auth::Authenticator;
use crate::shared::{ClientMessage, Delimited, ServerMessage, CONTROL_PORT, NETWORK_TIMEOUT};
use crate::shared::{
set_tcp_keepalive, ClientMessage, Delimited, ServerMessage, CONTROL_PORT, HEARTBEAT_TIMEOUT,
NETWORK_TIMEOUT,
};
/// State structure for the client.
pub struct Client {
@ -40,7 +43,9 @@ impl Client {
port: u16,
secret: Option<&str>,
) -> Result<Self> {
let mut stream = Delimited::new(connect_with_timeout(to, CONTROL_PORT).await?);
let tcp_stream = connect_with_timeout(to, CONTROL_PORT).await?;
set_tcp_keepalive(&tcp_stream)?;
let mut stream = Delimited::new(tcp_stream);
let auth = secret.map(Authenticator::new);
if let Some(auth) = &auth {
auth.client_handshake(&mut stream).await?;
@ -79,25 +84,32 @@ impl Client {
let mut conn = self.conn.take().unwrap();
let this = Arc::new(self);
loop {
match conn.recv().await? {
Some(ServerMessage::Hello(_)) => warn!("unexpected hello"),
Some(ServerMessage::Challenge(_)) => warn!("unexpected challenge"),
Some(ServerMessage::Heartbeat) => (),
Some(ServerMessage::Connection(id)) => {
let this = Arc::clone(&this);
tokio::spawn(
async move {
info!("new connection");
match this.handle_connection(id).await {
Ok(_) => info!("connection exited"),
Err(err) => warn!(%err, "connection exited with error"),
}
}
.instrument(info_span!("proxy", %id)),
);
match timeout(HEARTBEAT_TIMEOUT, conn.recv()).await {
Err(_elapsed) => {
// No message received for HEARTBEAT_TIMEOUT seconds.
// Server sends heartbeats every 500ms, so connection is dead.
bail!("heartbeat timeout, connection to server lost");
}
Some(ServerMessage::Error(err)) => error!(%err, "server error"),
None => return Ok(()),
Ok(msg) => match msg? {
Some(ServerMessage::Hello(_)) => warn!("unexpected hello"),
Some(ServerMessage::Challenge(_)) => warn!("unexpected challenge"),
Some(ServerMessage::Heartbeat) => (),
Some(ServerMessage::Connection(id)) => {
let this = Arc::clone(&this);
tokio::spawn(
async move {
info!("new connection");
match this.handle_connection(id).await {
Ok(_) => info!("connection exited"),
Err(err) => warn!(%err, "connection exited with error"),
}
}
.instrument(info_span!("proxy", %id)),
);
}
Some(ServerMessage::Error(err)) => error!(%err, "server error"),
None => bail!("server closed connection"),
},
}
}
}

View file

@ -1,8 +1,11 @@
use std::net::IpAddr;
use std::time::Duration;
use anyhow::Result;
use bore_cli::shared::ExponentialBackoff;
use bore_cli::{client::Client, server::Server};
use clap::{error::ErrorKind, CommandFactory, Parser, Subcommand};
use tracing::{info, warn};
#[derive(Parser, Debug)]
#[clap(author, version, about)]
@ -34,6 +37,14 @@ enum Command {
/// Optional secret for authentication.
#[clap(short, long, env = "BORE_SECRET", hide_env_values = true)]
secret: Option<String>,
/// Disable automatic reconnection on connection loss.
#[clap(long, default_value_t = false)]
no_reconnect: bool,
/// Maximum delay between reconnection attempts, in seconds.
#[clap(long, default_value_t = 64, value_name = "SECONDS")]
max_reconnect_delay: u64,
},
/// Runs the remote proxy server.
@ -60,6 +71,15 @@ enum Command {
},
}
/// Check if an error is an authentication error that should not be retried.
fn is_auth_error(err: &anyhow::Error) -> bool {
let msg = format!("{err:#}");
msg.contains("server requires authentication")
|| msg.contains("invalid secret")
|| msg.contains("server requires secret")
|| msg.contains("expected authentication challenge")
}
#[tokio::main]
async fn run(command: Command) -> Result<()> {
match command {
@ -69,9 +89,56 @@ async fn run(command: Command) -> Result<()> {
to,
port,
secret,
no_reconnect,
max_reconnect_delay,
} => {
// First attempt — propagate errors directly for immediate feedback
let client = Client::new(&local_host, local_port, &to, port, secret.as_deref()).await?;
client.listen().await?;
if no_reconnect {
// Legacy behavior: exit on any disconnection
client.listen().await?;
} else {
// Reconnection mode: retry on transient failures
let mut backoff = ExponentialBackoff::new(
Duration::from_secs(1),
Duration::from_secs(max_reconnect_delay),
);
// Run the first listen (we already have a connected client)
if let Err(e) = client.listen().await {
warn!("connection lost: {e:#}");
}
// Reconnection loop
loop {
let delay = backoff.next_delay();
info!("reconnecting in {delay:.1?}...");
tokio::time::sleep(delay).await;
match Client::new(&local_host, local_port, &to, port, secret.as_deref()).await {
Ok(client) => {
backoff.reset();
info!("reconnected successfully");
match client.listen().await {
Ok(()) => unreachable!("listen() now always returns Err"),
Err(e) => {
if is_auth_error(&e) {
return Err(e);
}
warn!("connection lost: {e:#}");
}
}
}
Err(e) => {
if is_auth_error(&e) {
return Err(e);
}
warn!("reconnection failed: {e:#}");
}
}
}
}
}
Command::Server {
min_port,
@ -100,3 +167,37 @@ fn main() -> Result<()> {
tracing_subscriber::fmt::init();
run(Args::parse().command)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_auth_error_detection() {
// Fatal auth errors — should NOT be retried
assert!(is_auth_error(&anyhow::anyhow!(
"server requires authentication, but no client secret was provided"
)));
assert!(is_auth_error(&anyhow::anyhow!(
"server error: invalid secret"
)));
assert!(is_auth_error(&anyhow::anyhow!(
"server error: server requires secret, but no secret was provided"
)));
assert!(is_auth_error(&anyhow::anyhow!(
"expected authentication challenge, but no secret was required"
)));
// Retriable errors — should be retried
assert!(!is_auth_error(&anyhow::anyhow!(
"could not connect to server:7835"
)));
assert!(!is_auth_error(&anyhow::anyhow!(
"heartbeat timeout, connection to server lost"
)));
assert!(!is_auth_error(&anyhow::anyhow!(
"server error: port already in use"
)));
assert!(!is_auth_error(&anyhow::anyhow!("server closed connection")));
}
}

View file

@ -12,7 +12,7 @@ use tracing::{info, info_span, warn, Instrument};
use uuid::Uuid;
use crate::auth::Authenticator;
use crate::shared::{ClientMessage, Delimited, ServerMessage, CONTROL_PORT};
use crate::shared::{set_tcp_keepalive, ClientMessage, Delimited, ServerMessage, CONTROL_PORT};
/// State structure for the server.
pub struct Server {
@ -116,6 +116,7 @@ impl Server {
}
async fn handle_connection(&self, stream: TcpStream) -> Result<()> {
set_tcp_keepalive(&stream)?;
let mut stream = Delimited::new(stream);
if let Some(auth) = &self.auth {
if let Err(err) = auth.server_handshake(&mut stream).await {

View file

@ -5,7 +5,9 @@ use std::time::Duration;
use anyhow::{Context, Result};
use futures_util::{SinkExt, StreamExt};
use serde::{de::DeserializeOwned, Deserialize, Serialize};
use socket2::{SockRef, TcpKeepalive};
use tokio::io::{AsyncRead, AsyncWrite};
use tokio::net::TcpStream;
use tokio::time::timeout;
use tokio_util::codec::{AnyDelimiterCodec, Framed, FramedParts};
use tracing::trace;
@ -20,6 +22,12 @@ pub const MAX_FRAME_LENGTH: usize = 256;
/// Timeout for network connections and initial protocol messages.
pub const NETWORK_TIMEOUT: Duration = Duration::from_secs(3);
/// Timeout for detecting a dead control connection.
///
/// The server sends heartbeats every 500ms. If no message is received within
/// this duration, the connection is considered dead.
pub const HEARTBEAT_TIMEOUT: Duration = Duration::from_secs(8);
/// A message from the client on the control connection.
#[derive(Debug, Serialize, Deserialize)]
pub enum ClientMessage {
@ -92,8 +100,95 @@ impl<U: AsyncRead + AsyncWrite + Unpin> Delimited<U> {
Ok(())
}
/// Get a reference to the underlying transport stream.
pub fn get_ref(&self) -> &U {
self.0.get_ref()
}
/// Consume this object, returning current buffers and the inner transport.
pub fn into_parts(self) -> FramedParts<U, AnyDelimiterCodec> {
self.0.into_parts()
}
}
/// Simple exponential backoff with jitter for reconnection delays.
pub struct ExponentialBackoff {
current: Duration,
base: Duration,
max: Duration,
}
impl ExponentialBackoff {
/// Create a new exponential backoff starting at `base` delay, capped at `max`.
pub fn new(base: Duration, max: Duration) -> Self {
Self {
current: base,
base,
max,
}
}
/// Get the next delay and advance the backoff state.
/// Includes random jitter of +/- 25% to prevent thundering herd.
pub fn next_delay(&mut self) -> Duration {
let delay = self.current;
self.current = (self.current * 2).min(self.max);
// Add jitter: multiply by random factor between 0.75 and 1.25
let jitter_factor = 0.75 + fastrand::f64() * 0.5;
delay.mul_f64(jitter_factor)
}
/// Reset backoff to initial delay (call after successful connection).
pub fn reset(&mut self) {
self.current = self.base;
}
}
/// Configure TCP keepalive on a stream for faster dead connection detection.
///
/// This sets the OS to start probing after 30s of idle, probe every 10s,
/// and give up after 3 failed probes (~60s total to detect a dead connection).
pub fn set_tcp_keepalive(stream: &TcpStream) -> Result<()> {
let sock_ref = SockRef::from(stream);
let keepalive = TcpKeepalive::new()
.with_time(Duration::from_secs(30))
.with_interval(Duration::from_secs(10))
.with_retries(3);
sock_ref
.set_tcp_keepalive(&keepalive)
.context("failed to set TCP keepalive")?;
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_backoff_sequence() {
let mut backoff = ExponentialBackoff::new(Duration::from_secs(1), Duration::from_secs(30));
// Delays should roughly double: 1, 2, 4, 8, 16, 30 (capped), 30, ...
// With jitter, each delay is between 0.75x and 1.25x the base
for expected_base in [1, 2, 4, 8, 16, 30, 30] {
let delay = backoff.next_delay();
let min = Duration::from_secs(expected_base).mul_f64(0.75);
let max = Duration::from_secs(expected_base).mul_f64(1.25);
assert!(
delay >= min && delay <= max,
"delay {delay:?} out of range [{min:?}, {max:?}]"
);
}
}
#[test]
fn test_backoff_reset() {
let mut backoff = ExponentialBackoff::new(Duration::from_secs(1), Duration::from_secs(60));
backoff.next_delay(); // 1s
backoff.next_delay(); // 2s
backoff.next_delay(); // 4s
backoff.reset();
let delay = backoff.next_delay();
// After reset, should be back to ~1s (with jitter)
assert!(delay < Duration::from_secs(2));
}
}