# MariaDB/MySQL Binlog Replication Service
A robust MySQL/MariaDB binlog streaming replication service with automatic initial data transfer, resilience features, and comprehensive error handling.
## Features
### Core Functionality
- **Binlog Streaming**: Real-time replication from MySQL/MariaDB binlog events
- **Initial Data Transfer**: Automatic bulk data transfer when resync is needed
- **Multi-table Support**: Replicates INSERT, UPDATE, DELETE events for multiple tables
- **Position Persistence**: Saves and resumes from last processed binlog position
- **Schema Filtering**: Only replicates events for the configured schema (default: "replica")
### Resilience & Error Handling
- **Panic Recovery**: Automatic recovery from unexpected panics in event handlers
- **Retry Logic**: Exponential backoff retry for failed SQL operations
- **Connection Health Checks**: Periodic connection health monitoring with auto-reconnect
- **Schema Drift Detection**: Detects and handles schema changes during replication
- **Graceful Degradation**: Skips problematic tables after repeated failures
- **Auto-Reconnect**: Automatic reconnection on connection errors
### Transfer Features
- **Chunked Transfers**: Efficient batch transfers using primary key ranges
- **Progress Checkpointing**: Saves transfer progress for resume after interruption
- **Pause/Resume**: Support for pausing and resuming initial transfers
- **Transfer Statistics**: Detailed logging of transfer progress and errors
## Architecture
### Components
| File | Purpose |
|------|---------|
| [`main.go`](main.go) | Application entry point and configuration |
| [`service.go`](service.go) | BinlogSyncService - core replication orchestration |
| [`handlers.go`](handlers.go) | EventHandlers - binlog event processing with resilience |
| [`initial_transfer.go`](initial_transfer.go) | InitialTransfer - bulk data transfer management |
| [`position.go`](position.go) | PositionManager - binlog position persistence |
| [`sqlbuilder.go`](sqlbuilder.go) | SQLBuilder - SQL statement generation |
| [`config.go`](config.go) | Configuration types |
### Data Flow
```
Primary DB (MySQL/MariaDB)
        |
        v
   Binlog Stream
        |
        v
BinlogSyncService.processEvent()
        |
        +---> Panic Recovery Wrapper
        |
        +---> EventHandlers.HandleRows()
        |         |
        |         +---> Schema Filter (only "replica" schema)
        |         +---> Schema Drift Check
        |         +---> Retry Logic (if needed)
        |         +---> Execute SQL
        |
        +---> PositionManager.Save()
        |
        +---> Health Checks (every 30s)
```
## Usage
### Quick Start
```bash
# Build the service
go build -o replica

# Run the service
./replica
```
### Configuration
Configuration is done in [`main.go`](main.go):
```go
// Primary MySQL/MariaDB connection (binlog source)
primaryDSN := "root:password@tcp(localhost:3306)/?multiStatements=true"

// Secondary (replica) connection
secondaryDSN := "root:password@tcp(localhost:3307)/replica?multiStatements=true"

// MariaDB primary configuration
primaryConfig := BinlogConfig{
    Host:     "localhost",
    Port:     3306,
    User:     "root",
    Password: "password",
    ServerID: 100,
    Name:     "mariadb-primary",
}

// Create service with both connections
service := NewBinlogSyncService(primaryConfig, primaryDB, secondaryDB)

// Start with automatic resync (checks if transfer is needed)
if err := service.StartWithResync(
    1000, // batch size: rows per transfer chunk
    []string{ // schemas to skip during transfer
        "information_schema",
        "performance_schema",
        "mysql",
        "sys",
    },
); err != nil {
    log.Fatalf("Service error: %v", err)
}
```
## Resilience Features
### Panic Recovery
All event handlers include automatic panic recovery:
```go
defer func() {
    if r := recover(); r != nil {
        log.Printf("[PANIC RECOVERED] %v", r)
    }
}()
```
This prevents a single malformed event from crashing the entire service. Recovery is implemented in:
- [`HandleRows()`](handlers.go:42)
- [`HandleQuery()`](handlers.go:81)
- [`HandleTableMap()`](handlers.go:100)
- [`processEvent()`](service.go:177)
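The same pattern can be factored into a reusable wrapper; the sketch below is illustrative only (`withRecovery` is a hypothetical helper, not a function in this codebase, which inlines the deferred `recover()` in each handler listed above):

```go
// Hypothetical helper showing the recovery pattern in one place.
func withRecovery(name string, fn func() error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            log.Printf("[PANIC RECOVERED] %s: %v", name, r)
            err = fmt.Errorf("panic in %s: %v", name, r)
        }
    }()
    return fn()
}
```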
### Retry Logic
Failed SQL operations are retried with exponential backoff:
```go
func (h *EventHandlers) executeWithRetry(query string) error {
    var lastErr error
    for attempt := 0; attempt <= h.retryAttempts; attempt++ {
        if attempt > 0 {
            delay := h.retryDelay * time.Duration(1<<attempt)
            log.Printf("[RETRY] Retrying in %v (attempt %d/%d)", delay, attempt, h.retryAttempts)
            time.Sleep(delay)
        }
        _, err := h.secondaryDB.Exec(query)
        if err == nil {
            return nil
        }
        lastErr = err
        // On connection errors, reconnect before the next attempt
        if h.isConnectionError(err) {
            h.reconnect()
        }
    }
    return lastErr
}
```
**Configuration:**
- `retryAttempts`: Number of retry attempts (default: 3)
- `retryDelay`: Base delay between retries (default: 100ms)
### Connection Health Checks
The service performs periodic health checks every 30 seconds:
```go
// Checks:
// 1. Ping secondary database connection
// 2. Verify syncer is still active
// 3. Log health status
```
If a health check fails, the error is logged. The next SQL operation will trigger an auto-reconnect.
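A minimal sketch of such a loop, assuming a `time.Ticker` and a shutdown channel (`s.done` is an illustrative name, not necessarily the field in `service.go`):

```go
// Illustrative 30-second health-check loop.
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for {
    select {
    case <-ticker.C:
        if err := s.secondaryDB.Ping(); err != nil {
            log.Printf("[HEALTH CHECK] Failed: %v", err)
            // The next SQL operation triggers auto-reconnect.
        }
    case <-s.done: // assumed shutdown channel
        return
    }
}
```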
### Schema Drift Detection
Before applying events, the service checks for schema changes:
```go
func (h *EventHandlers) detectSchemaDrift(schema, table string) bool {
    currentHash, _ := h.getSchemaHash(schema, table)
    lastHash := h.lastSchemaHash[schema+"."+table]
    if lastHash != "" && lastHash != currentHash {
        log.Printf("[DRIFT] Schema changed for %s.%s", schema, table)
        return true // Drift detected
    }
    h.lastSchemaHash[schema+"."+table] = currentHash
    return false
}
```
If schema drift is detected, the table is marked as failed and temporarily skipped.
### Graceful Degradation
Tables with repeated failures are temporarily skipped:
```go
// After 5 consecutive failures, skip the table
if h.failedTables[key] >= h.maxFailures {
    log.Printf("[SKIPPED] Too many failures for %s.%s", schema, table)
    return nil // Skip this event
}

// Reset count on successful operation
h.failedTables[key] = 0
```
**Configuration:**
- `maxFailures`: Consecutive failures before skipping (default: 5)
### Auto-Reconnect
Connection errors trigger automatic reconnection:
```go
func (h *EventHandlers) reconnect() {
    h.secondaryDB.Close()
    maxRetries := 5
    for i := 0; i < maxRetries; i++ {
        db, err := sql.Open("mysql", dsn) // dsn: secondary connection string
        if err == nil {
            // Configure connection pool, then ping to verify
            db.SetMaxOpenConns(25)
            err = db.Ping()
        }
        if err == nil {
            h.secondaryDB = db
            log.Printf("[RECONNECT] Successfully reconnected")
            return
        }
        time.Sleep(time.Duration(i+1) * time.Second)
    }
}
```
**Detected connection errors:**
- "connection refused"
- "connection reset"
- "broken pipe"
- "timeout"
- "driver: bad connection"
- "invalid connection"
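Classification is substring-based; a sketch consistent with the list above (the real `isConnectionError` may differ in detail):

```go
// Matches the connection-error substrings documented above.
func (h *EventHandlers) isConnectionError(err error) bool {
    if err == nil {
        return false
    }
    msg := strings.ToLower(err.Error())
    for _, substr := range []string{
        "connection refused", "connection reset", "broken pipe",
        "timeout", "driver: bad connection", "invalid connection",
    } {
        if strings.Contains(msg, substr) {
            return true
        }
    }
    return false
}
```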
## Initial Transfer
When resync is needed (empty replica or no saved position), the service performs an initial data transfer:
### Transfer Process
1. **Detection**: Check if secondary database is empty or no position saved
2. **Database Enumeration**: List all databases from primary
3. **Schema Exclusion**: Skip excluded schemas (information_schema, mysql, etc.)
4. **Table Transfer**: For each table:
- Get table schema (column definitions)
- Check row count
- Transfer in chunks using primary key or LIMIT/OFFSET
5. **Progress Checkpointing**: Save progress to JSON file every 1000 rows
6. **Position Reset**: Clear saved binlog position after successful transfer
7. **Binlog Streaming**: Start streaming from current position
### Chunked Transfer
Tables are transferred in chunks for efficiency and memory safety:
```sql
-- Using primary key (efficient, preserves order)
SELECT * FROM table WHERE pk >= 1000 AND pk < 2000 ORDER BY pk
-- Without a primary key (slower; may skip or duplicate rows if data changes between chunks)
SELECT * FROM table LIMIT 1000 OFFSET 1000
```
**Batch Size:** Configurable (default: 1000 rows per chunk)
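In Go terms, the PK-range strategy boils down to a loop like the following sketch (`insertRows`, `minPK`, `maxPK`, and the `pk`/`db`/`table` variables are illustrative, not the exact names in `initial_transfer.go`):

```go
// Transfer one table in PK-range chunks of batchSize rows.
for lo := minPK; lo <= maxPK; lo += int64(batchSize) {
    hi := lo + int64(batchSize)
    query := fmt.Sprintf(
        "SELECT * FROM `%s`.`%s` WHERE `%s` >= ? AND `%s` < ? ORDER BY `%s`",
        db, table, pk, pk, pk)
    rows, err := t.primaryDB.Query(query, lo, hi)
    if err != nil {
        return err
    }
    err = insertRows(t.secondaryDB, db, table, rows) // hypothetical bulk-insert helper
    rows.Close()
    if err != nil {
        return err
    }
}
```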
### Progress Checkpointing
Transfer progress is saved to `transfer_progress_{instance}.json`:
```json
{
  "DatabasesProcessed": 2,
  "CurrentDatabase": "mydb",
  "TablesProcessed": {
    "mydb.users": 5000,
    "mydb.orders": 10000
  },
  "LastCheckpoint": "2024-01-15T10:30:00Z"
}
```
If the transfer is interrupted, it resumes from the last checkpoint.
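Checkpointing itself is plain JSON marshalling of `TransferProgress` (see Configuration Reference below); a sketch using only the standard library (helper names are illustrative):

```go
// Persist progress so an interrupted transfer can resume.
func saveCheckpoint(path string, p TransferProgress) error {
    data, err := json.MarshalIndent(p, "", "  ")
    if err != nil {
        return err
    }
    return os.WriteFile(path, data, 0o644)
}

// Load a previous checkpoint; a missing file means "start fresh".
func loadCheckpoint(path string) (TransferProgress, error) {
    var p TransferProgress
    data, err := os.ReadFile(path)
    if err != nil {
        return p, err
    }
    return p, json.Unmarshal(data, &p)
}
```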
### Pause/Resume
Transfers can be paused and resumed programmatically:
```go
transfer := NewInitialTransfer(primaryDSN, secondaryDSN, 1000, 1)
// Pause during transfer
transfer.Pause()
// Resume later
transfer.Resume()
```
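One common way to implement such a gate is an atomic flag checked between chunks; the sketch below is illustrative only, not the actual `InitialTransfer` internals:

```go
// Illustrative pause gate, polled between transfer chunks.
type pauseGate struct{ paused atomic.Bool }

func (g *pauseGate) Pause()  { g.paused.Store(true) }
func (g *pauseGate) Resume() { g.paused.Store(false) }

// waitIfPaused blocks while the gate is closed.
func (g *pauseGate) waitIfPaused() {
    for g.paused.Load() {
        time.Sleep(100 * time.Millisecond)
    }
}
```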
## Configuration Reference
### BinlogConfig
```go
type BinlogConfig struct {
    Host     string // MySQL/MariaDB host
    Port     int    // MySQL/MariaDB port
    User     string // Replication user
    Password string // Password
    ServerID uint32 // Unique server ID for replication
    Name     string // Instance name for logging
}
```
### EventHandlers
```go
type EventHandlers struct {
    secondaryDB    *sql.DB
    tableMapCache  map[uint64]*replication.TableMapEvent
    sqlBuilder     *SQLBuilder
    failedTables   map[string]int
    maxFailures    int           // Default: 5
    retryAttempts  int           // Default: 3
    retryDelay     time.Duration // Default: 100ms
    lastSchemaHash map[string]string
}
```
### InitialTransfer
```go
type InitialTransfer struct {
    primaryDB      *sql.DB
    secondaryDB    *sql.DB
    batchSize      int // Default: 1000
    workerCount    int // Default: 1
    excludedDBs    map[string]bool
    checkpointFile string
    progress       TransferProgress
}
```
### TransferProgress
```go
type TransferProgress struct {
    DatabasesProcessed int
    CurrentDatabase    string
    TablesProcessed    map[string]int64 // "schema.table" -> rows transferred
    LastCheckpoint     time.Time
}
```
### TransferStats
```go
type TransferStats struct {
    TotalRows    int64
    TotalTables  int
    TransferTime int64 // milliseconds
    Errors       []string
    Progress     TransferProgress
}
```
## Logging
The service uses structured logging with prefixes:
| Prefix | Example | Description |
|--------|---------|-------------|
| `[INSERT]` | `[INSERT] 1 row(s) affected` | Successful INSERT |
| `[UPDATE]` | `[UPDATE] 1 row(s) affected` | Successful UPDATE |
| `[DELETE]` | `[DELETE] 1 row(s) affected` | Successful DELETE |
| `[SUCCESS]` | `[SUCCESS] 1 row(s) affected` | SQL operation succeeded |
| `[ERROR]` | `[ERROR] INSERT failed after retries` | Operation failed |
| `[WARN]` | `[WARN] Schema drift detected` | Warning condition |
| `[PANIC]` | `[PANIC RECOVERED] panic message` | Recovered panic |
| `[RETRY]` | `[RETRY] Retrying in 200ms` | Retry attempt |
| `[DRIFT]` | `[DRIFT] Schema changed` | Schema drift detected |
| `[SKIPPED]` | `[SKIPPED] Too many failures` | Table skipped |
| `[FAILURE]` | `[FAILURE] failure count: 3/5` | Table failure count |
| `[HEALTH]` | `[HEALTH CHECK] Failed` | Health check result |
| `[TRANSFER]` | `[TRANSFER] Found 5 tables` | Transfer progress |
| `[INFO]` | `[INFO] Saved position` | Informational |
| `[ROTATE]` | `[mariadb-primary] Rotated to binlog.000001` | Binlog rotation |
| `[RECONNECT]` | `[RECONNECT] Successfully reconnected` | Reconnection result |
## Schema Filtering
By default, only events for the "replica" schema are replicated:
```go
// handlers.go line 54
if schemaName != "replica" {
    return nil // Skip all other schemas
}
```
To replicate all schemas, comment out or modify this filter.
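A slightly more flexible variant (an assumption, not current code) would drive the filter from a configurable allow-list:

```go
// Hypothetical allow-list filter; an empty set means "replicate everything".
var replicatedSchemas = map[string]bool{"replica": true}

if len(replicatedSchemas) > 0 && !replicatedSchemas[schemaName] {
    return nil // skip schemas outside the allow-list
}
```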
## Error Handling
### Common Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `connection refused` | Secondary DB down | Check secondary DB status |
| `schema drift detected` | Table schema changed | Manual intervention required |
| `too many failures` | Table repeatedly failing | Check table compatibility |
| `failed to get event` | Binlog stream interrupted | Service auto-recovers |
| `expected 4 destination arguments` | SHOW MASTER STATUS columns | Fixed in code |
| `assignment to entry in nil map` | Uninitialized map | Fixed in code |
### Recovery Procedures
1. **Secondary DB Connection Lost**:
   - Service auto-reconnects (up to 5 attempts)
   - Check secondary DB logs
   - Verify network connectivity
2. **Schema Drift Detected**:
   - Stop service
   - Compare schemas: `SHOW CREATE TABLE schema.table` on both DBs
   - Sync schemas manually
   - Reset position: `DELETE FROM binlog_position`
   - Restart service
3. **Transfer Interrupted**:
   - Service resumes from `transfer_progress_{instance}.json`
   - Check checkpoint file for progress
   - Delete checkpoint file to restart transfer
4. **Event Processing Stuck**:
   - Check health check logs
   - Verify binlog position: `SELECT * FROM binlog_position`
   - Restart service if needed
   - Clear position: `DELETE FROM binlog_position`
## Monitoring
### Key Metrics to Watch
- **Replication Lag**: Time since last event processed
- **Rows Replicated**: Total INSERT/UPDATE/DELETE count
- **Tables Synced**: Number of tables synchronized
- **Error Count**: Failed operations counter
- **Success Rate**: Successful / total operations
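Until the planned Prometheus endpoint exists, these can be tracked in-process; a minimal sketch (type and method names are illustrative, not part of the current code):

```go
// Hypothetical in-process counters for the metrics listed above.
type replicationStats struct {
    mu        sync.Mutex
    rows      int64
    errors    int64
    lastEvent time.Time
}

func (s *replicationStats) record(rows int64, err error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.rows += rows
    if err != nil {
        s.errors++
    }
    s.lastEvent = time.Now()
}

// Lag is approximated as time since the last processed event.
func (s *replicationStats) lag() time.Duration {
    s.mu.Lock()
    defer s.mu.Unlock()
    return time.Since(s.lastEvent)
}
```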
### Manual Verification
```sql
-- Check position on secondary
SELECT * FROM binlog_position;
-- Check row counts
SELECT COUNT(*) FROM replica.your_table;
-- Compare with primary
SELECT COUNT(*) FROM your_table;
```
## Performance Considerations
1. **Batch Size**: Start with 1000, adjust based on table size and memory
2. **Connection Pooling**: `SetMaxOpenConns(25)` for moderate load (see the sketch after this list)
3. **Worker Count**: Currently single-threaded (multi-worker planned)
4. **Schema Caching**: Table schemas cached in memory (auto-updated on drift)
5. **Index Usage**: Chunked transfers require indexed primary key
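For item 2, the standard `database/sql` pool knobs look like this (the values are starting points, not tuned recommendations):

```go
// Connection pool settings for moderate load.
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(5)
db.SetConnMaxLifetime(5 * time.Minute)
```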
## Limitations
- **Single-threaded**: One worker processes events sequentially
- **Position-based**: No GTID support yet (position-based only)
- **Integer PKs**: Chunking requires integer primary key for efficiency
- **No Conflict Resolution**: Concurrent writes not handled
- **No Data Validation**: Assumes source data is valid
## Known Issues & Fixes
### Fixed Issues
- [x] `SHOW MASTER STATUS` returning 4 columns (fixed by scanning 4 variables)
- [x] `nil map` panic in transfer (fixed by initializing TablesProcessed)
- [x] Event deduplication skipping valid events (disabled deduplication)
- [x] Checkpoint file path empty (fixed by setting instance-specific path)
### Workarounds
- **Position 0 Events**: Deduplication disabled for now
- **Schema Changes**: Service marks table as failed, manual restart required
## Future Enhancements
- [ ] GTID-based replication for better consistency
- [ ] Multi-threaded replication for higher throughput
- [ ] Conflict detection and resolution
- [ ] Prometheus metrics endpoint
- [ ] REST API for management
- [ ] Kubernetes operator
- [ ] Configuration file (YAML/JSON)
- [ ] Signal handling for dynamic config
## File Structure
```
replica/
├── main.go # Entry point, connections
├── service.go # Replication orchestration
├── handlers.go # Event processing
├── initial_transfer.go # Bulk data transfer
├── position.go # Position persistence
├── sqlbuilder.go # SQL generation
├── config.go # Config types
├── README.md # This file
├── docker-compose.yml # Local development
├── .env # Environment (optional)
└── go.mod # Go module
```
## License
MIT License