# MariaDB/MySQL Binlog Replication Service

A robust MySQL/MariaDB binlog streaming replication service with automatic initial data transfer, resilience features, and comprehensive error handling. Supports single or multi-secondary replica configurations with optional Graylog logging.

## Installation

### Quick Install (Go)

```bash
go install git.ma-al.com/goc_marek/replica/cmd/replica@latest
```

### Build from Source

```bash
# Clone the repository
git clone https://git.ma-al.com/goc_marek/replica.git
cd replica

# Build the service
go build -o replica ./cmd/replica

# Or install globally
go install ./cmd/replica
```

### Docker

```bash
# Build the image
docker build -t replica .

# Run with docker-compose
docker-compose up -d
```

## Quick Start

```bash
# Copy example environment
cp example.env .env

# Edit .env with your configuration
nano .env

# Run the service
./replica
```

## Features

### Core Functionality

- **Binlog Streaming**: Real-time replication from MySQL/MariaDB binlog events
- **Initial Data Transfer**: Automatic bulk data transfer when a resync is needed
- **Multi-table Support**: Replicates INSERT, UPDATE, DELETE events for multiple tables
- **Position Persistence**: Saves and resumes from the last processed binlog position
- **Schema Filtering**: Only replicates events for the configured schema (default: "replica")

### Resilience & Error Handling

- **Panic Recovery**: Automatic recovery from unexpected panics in event handlers
- **Retry Logic**: Exponential backoff retry for failed SQL operations
- **Connection Health Checks**: Periodic connection health monitoring with auto-reconnect
- **Schema Drift Detection**: Detects and handles schema changes during replication
- **Graceful Degradation**: Skips problematic tables after repeated failures
- **Auto-Reconnect**: Automatic reconnection on connection errors

### Transfer Features

- **Chunked Transfers**: Efficient batch transfers using primary key ranges
- **Progress Checkpointing**: Saves transfer progress for resume after interruption
- **Pause/Resume**: Support for pausing and resuming initial transfers
- **Transfer Statistics**: Detailed logging of transfer progress and errors

## Architecture

### Components

| File | Purpose |
|------|---------|
| [`cmd/replica/main.go`](cmd/replica/main.go) | Application entry point and configuration |
| [`pkg/replica/service.go`](pkg/replica/service.go) | BinlogSyncService - core replication orchestration |
| [`pkg/replica/handlers.go`](pkg/replica/handlers.go) | EventHandlers - binlog event processing with resilience |
| [`pkg/replica/initial_transfer.go`](pkg/replica/initial_transfer.go) | InitialTransfer - bulk data transfer management |
| [`pkg/replica/position.go`](pkg/replica/position.go) | PositionManager - binlog position persistence |
| [`pkg/replica/sqlbuilder.go`](pkg/replica/sqlbuilder.go) | SQLBuilder - SQL statement generation |
| [`pkg/replica/config.go`](pkg/replica/config.go) | Configuration types |
| [`pkg/replica/logging.go`](pkg/replica/logging.go) | Structured logging with Graylog support |

### Data Flow

```
Primary DB (MySQL/MariaDB)
        |
        v
  Binlog Stream
        |
        v
BinlogSyncService.processEvent()
        |
        +---> Panic Recovery Wrapper
        |
        +---> EventHandlers.HandleRows()
        |           |
        |           +---> Schema Filter (only "replica" schema)
        |           +---> Schema Drift Check
        |           +---> Retry Logic (if needed)
        |           +---> Execute SQL
        |
        +---> PositionManager.Save()
        |
        +---> Health Checks (every 30s)
```

## Configuration

All configuration is done via environment variables in the `.env` file:

```bash
# Copy example environment
cp example.env .env

# Edit with your settings
nano .env
```

### Primary Database Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `MARIA_PRIMARY_HOST` | Primary database hostname | `mariadb-primary` |
| `MARIA_PRIMARY_PORT` | Primary database port | `3306` |
| `MARIA_USER` | Replication user | `replica` |
| `MARIA_PASS` | Replication password | `replica` |
| `MARIA_SERVER_ID` | Unique server ID for binlog | `100` |
| `MARIA_PRIMARY_NAME` | Instance name for logging | `mariadb-primary` |

### Multi-Secondary Replica Configuration

The service supports replicating to multiple secondary databases simultaneously. Configure secondaries using comma-separated values:

| Variable | Description | Example |
|----------|-------------|---------|
| `MARIA_SECONDARY_HOSTS` | Comma-separated hostnames | `secondary-1,secondary-2,secondary-3` |
| `MARIA_SECONDARY_PORTS` | Comma-separated ports | `3307,3308,3309` |
| `MARIA_SECONDARY_NAMES` | Comma-separated instance names | `replica-1,replica-2,replica-3` |
| `MARIA_SECONDARY_USERS` | Per-secondary users (optional) | `replica1,replica2,replica3` |
| `MARIA_SECONDARY_PASSWORDS` | Per-secondary passwords (optional) | `pass1,pass2,pass3` |

#### Example: Single Secondary

```bash
MARIA_SECONDARY_HOSTS=mariadb-secondary
MARIA_SECONDARY_PORTS=3307
MARIA_SECONDARY_NAMES=secondary-1
```

#### Example: Three Secondaries

```bash
MARIA_SECONDARY_HOSTS=secondary-1,secondary-2,secondary-3
MARIA_SECONDARY_PORTS=3307,3308,3309
MARIA_SECONDARY_NAMES=replica-1,replica-2,replica-3
MARIA_SECONDARY_USERS=replica1,replica2,replica3
MARIA_SECONDARY_PASSWORDS=secret1,secret2,secret3
```

#### Example: Two Secondaries with Different Credentials

```bash
MARIA_SECONDARY_HOSTS=secondary-1,secondary-2
MARIA_SECONDARY_PORTS=3307,3308
MARIA_SECONDARY_NAMES=replica-east,replica-west
MARIA_SECONDARY_USERS=replica_east,replica_west
MARIA_SECONDARY_PASSWORDS=east_secret,west_secret
```

**Note:** If `MARIA_SECONDARY_USERS` or `MARIA_SECONDARY_PASSWORDS` are not provided, the default `MARIA_USER` and `MARIA_PASS` will be used for all secondaries.
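The comma-separated convention above can be sketched as follows. This is a minimal illustration of parsing such values with per-secondary fallback to the shared credentials; the helper names (`splitCSV`, `credentialFor`) are hypothetical and not the service's actual API.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// splitCSV splits a comma-separated env value into trimmed parts.
// An empty value yields a nil slice.
func splitCSV(value string) []string {
	if value == "" {
		return nil
	}
	parts := strings.Split(value, ",")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// credentialFor returns the per-secondary credential at index i,
// falling back to the shared default when none is configured.
func credentialFor(perSecondary []string, i int, fallback string) string {
	if i < len(perSecondary) && perSecondary[i] != "" {
		return perSecondary[i]
	}
	return fallback
}

func main() {
	hosts := splitCSV(os.Getenv("MARIA_SECONDARY_HOSTS"))
	users := splitCSV(os.Getenv("MARIA_SECONDARY_USERS"))
	for i, host := range hosts {
		user := credentialFor(users, i, os.Getenv("MARIA_USER"))
		fmt.Printf("secondary %d: host=%s user=%s\n", i, host, user)
	}
}
```

With `MARIA_SECONDARY_USERS` unset, every secondary falls back to `MARIA_USER`, matching the note above.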
### Graylog Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `GRAYLOG_ENABLED` | Enable Graylog logging | `false` |
| `GRAYLOG_ENDPOINT` | Graylog GELF endpoint | `localhost:12201` |
| `GRAYLOG_PROTOCOL` | Protocol (udp/tcp) | `udp` |
| `GRAYLOG_TIMEOUT` | Connection timeout | `5s` |
| `GRAYLOG_SOURCE` | Source name for logs | `binlog-sync` |

#### Enable Graylog Logging

```bash
# Edit .env and set:
GRAYLOG_ENABLED=true
GRAYLOG_ENDPOINT=graylog.example.com:12201
GRAYLOG_PROTOCOL=udp
GRAYLOG_SOURCE=binlog-sync-prod
```

### Other Settings

| Variable | Description | Default |
|----------|-------------|---------|
| `TRANSFER_BATCH_SIZE` | Rows per transfer chunk | `1000` |
| `LOCAL_PROJECT_NAME` | Project name for logging | `naluconcept` |

## Resilience Features

### Panic Recovery

All event handlers include automatic panic recovery:

```go
defer func() {
    if r := recover(); r != nil {
        log.Printf("[PANIC RECOVERED] %v", r)
    }
}()
```

This prevents a single malformed event from crashing the entire service.
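A recovered panic can also be surfaced to the caller as an ordinary error, so it feeds into the retry and failure-count machinery. The following is a minimal sketch of that pattern; `safeHandle` is an illustrative name, not the service's actual API.

```go
package main

import (
	"fmt"
	"log"
)

// safeHandle runs an event handler and converts any panic into an
// ordinary error, so one malformed event cannot crash the service.
func safeHandle(handler func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("[PANIC RECOVERED] %v", r)
			err = fmt.Errorf("handler panicked: %v", r)
		}
	}()
	return handler()
}

func main() {
	err := safeHandle(func() error {
		var rows []string
		_ = rows[3] // out-of-range access panics
		return nil
	})
	fmt.Println("service still running, err:", err)
}
```

Because the named return value `err` is assigned inside the deferred function, the caller sees a normal error instead of a crash.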
Recovery is implemented in:

- [`HandleRows()`](handlers.go:42)
- [`HandleQuery()`](handlers.go:81)
- [`HandleTableMap()`](handlers.go:100)
- [`processEvent()`](service.go:177)

### Retry Logic

Failed SQL operations are retried with exponential backoff:

```go
func (h *EventHandlers) executeWithRetry(query string) error {
    var err error
    for attempt := 0; attempt <= h.retryAttempts; attempt++ {
        if attempt > 0 {
            delay := h.retryDelay * time.Duration(1<<(attempt-1))
            log.Printf("[RETRY] Retrying in %v", delay)
            time.Sleep(delay)
        }
        if _, err = h.secondaryDB.Exec(query); err == nil {
            return nil
        }
    }
    return err
}
```

**Configuration:**
- `retryAttempts`: Retries per operation (default: 3)
- `retryDelay`: Base backoff delay (default: 100ms)

### Graceful Degradation

Tables that fail repeatedly are skipped so the rest of the stream keeps flowing:

```go
key := schema + "." + table
if h.failedTables[key] >= h.maxFailures {
    log.Printf("[SKIPPED] Too many failures for %s.%s", schema, table)
    return nil // Skip this event
}

// Reset count on successful operation
h.failedTables[key] = 0
```

**Configuration:**
- `maxFailures`: Consecutive failures before skipping (default: 5)

### Auto-Reconnect

Connection errors trigger automatic reconnection:

```go
func (h *EventHandlers) reconnect() {
    h.secondaryDB.Close()

    maxRetries := 5
    for i := 0; i < maxRetries; i++ {
        var err error
        h.secondaryDB, err = sql.Open("mysql", dsn)
        // Configure connection pool
        // Ping to verify
        if err == nil {
            log.Printf("[RECONNECT] Successfully reconnected")
            return
        }
        time.Sleep(time.Duration(i+1) * time.Second)
    }
}
```

**Detected connection errors:**
- "connection refused"
- "connection reset"
- "broken pipe"
- "timeout"
- "driver: bad connection"
- "invalid connection"

## Initial Transfer

When a resync is needed (empty replica or no saved position), the service performs an initial data transfer:

### Transfer Process

1. **Detection**: Check whether the secondary database is empty or no position is saved
2. **Database Enumeration**: List all databases from the primary
3. **Schema Exclusion**: Skip excluded schemas (information_schema, mysql, etc.)
4. **Table Transfer**: For each table:
   - Get the table schema (column definitions)
   - Check the row count
   - Transfer in chunks using the primary key or LIMIT/OFFSET
5. **Progress Checkpointing**: Save progress to a JSON file every 1000 rows
6. **Position Reset**: Clear the saved binlog position after a successful transfer
7. **Binlog Streaming**: Start streaming from the current position

### Chunked Transfer

Tables are transferred in chunks for efficiency and memory safety:

```sql
-- Using primary key (efficient, preserves order)
SELECT * FROM table WHERE pk >= 1000 AND pk < 2000 ORDER BY pk

-- Without primary key (slower, may skip rows on updates)
SELECT * FROM table LIMIT 1000 OFFSET 1000
```

**Batch Size:** Configurable (default: 1000 rows per chunk)

### Progress Checkpointing

Transfer progress is saved to `transfer_progress_{instance}.json`:

```json
{
  "DatabasesProcessed": 2,
  "CurrentDatabase": "mydb",
  "TablesProcessed": {
    "mydb.users": 5000,
    "mydb.orders": 10000
  },
  "LastCheckpoint": "2024-01-15T10:30:00Z"
}
```

If the transfer is interrupted, it resumes from the last checkpoint.

### Pause/Resume

Transfers can be paused and resumed programmatically:

```go
transfer := NewInitialTransfer(dsn, dsn, 1000, 1)

// Pause during transfer
transfer.Pause()

// Resume later
transfer.Resume()
```

## Configuration Reference

### BinlogConfig

```go
type BinlogConfig struct {
    Host     string // MySQL/MariaDB host
    Port     int    // MySQL/MariaDB port
    User     string // Replication user
    Password string // Password
    ServerID uint32 // Unique server ID for replication
    Name     string // Instance name for logging
}
```

### EventHandlers

```go
type EventHandlers struct {
    secondaryDB    *sql.DB
    tableMapCache  map[uint64]*replication.TableMapEvent
    sqlBuilder     *SQLBuilder
    failedTables   map[string]int
    maxFailures    int           // Default: 5
    retryAttempts  int           // Default: 3
    retryDelay     time.Duration // Default: 100ms
    lastSchemaHash map[string]string
}
```

### InitialTransfer

```go
type InitialTransfer struct {
    primaryDB      *sql.DB
    secondaryDB    *sql.DB
    batchSize      int // Default: 1000
    workerCount    int // Default: 1
    excludedDBs    map[string]bool
    checkpointFile string
    progress       TransferProgress
}
```

### TransferProgress

```go
type TransferProgress struct {
    DatabasesProcessed int
    CurrentDatabase    string
    TablesProcessed    map[string]int64 // "schema.table" -> rows transferred
    LastCheckpoint     time.Time
}
```

### TransferStats

```go
type TransferStats struct {
    TotalRows    int64
    TotalTables  int
    TransferTime int64 // milliseconds
    Errors       []string
    Progress     TransferProgress
}
```

## Logging

The service uses structured logging with prefixes:

| Prefix | Example | Description |
|--------|---------|-------------|
| `[INSERT]` | `[INSERT] 1 row(s) affected` | Successful INSERT |
| `[UPDATE]` | `[UPDATE] 1 row(s) affected` | Successful UPDATE |
| `[DELETE]` | `[DELETE] 1 row(s) affected` | Successful DELETE |
| `[SUCCESS]` | `[SUCCESS] 1 row(s) affected` | SQL operation succeeded |
| `[ERROR]` | `[ERROR] INSERT failed after retries` | Operation failed |
| `[WARN]` | `[WARN] Schema drift detected` | Warning condition |
| `[PANIC]` | `[PANIC RECOVERED] panic message` | Recovered panic |
| `[RETRY]` | `[RETRY] Retrying in 200ms` | Retry attempt |
| `[DRIFT]` | `[DRIFT] Schema changed` | Schema drift detected |
| `[SKIPPED]` | `[SKIPPED] Too many failures` | Table skipped |
| `[FAILURE]` | `[FAILURE] failure count: 3/5` | Table failure count |
| `[HEALTH]` | `[HEALTH CHECK] Failed` | Health check result |
| `[TRANSFER]` | `[TRANSFER] Found 5 tables` | Transfer progress |
| `[INFO]` | `[INFO] Saved position` | Informational |
| `[ROTATE]` | `[mariadb-primary] Rotated to binlog.000001` | Binlog rotation |
| `[RECONNECT]` | `[RECONNECT] Successfully reconnected` | Reconnection result |

## Schema Filtering

By default, only events for the "replica" schema are replicated:

```go
// handlers.go line 54
if schemaName != "replica" {
    return nil // Skip all other schemas
}
```

To replicate all schemas, comment out or modify this filter.
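If you need to replicate more than one schema, the hard-coded check can be generalized to an allow-list. This is a hedged sketch, not the service's current implementation; the `allowedSchemas` variable and `shouldReplicate` helper are hypothetical, and in practice you would load the list from an environment variable.

```go
package main

import "fmt"

// allowedSchemas generalizes the hard-coded `schemaName != "replica"`
// check to an allow-list; "app_db" is a made-up example entry.
var allowedSchemas = map[string]bool{
	"replica": true,
	"app_db":  true,
}

// shouldReplicate reports whether events for the schema should be applied.
func shouldReplicate(schema string) bool {
	return allowedSchemas[schema]
}

func main() {
	for _, s := range []string{"replica", "mysql", "app_db"} {
		fmt.Printf("%s -> replicate=%v\n", s, shouldReplicate(s))
	}
}
```

An empty map would filter out everything, so replicating all schemas still means removing the check rather than clearing the list.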
## Error Handling

### Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `connection refused` | Secondary DB down | Check secondary DB status |
| `schema drift detected` | Table schema changed | Manual intervention required |
| `too many failures` | Table repeatedly failing | Check table compatibility |
| `failed to get event` | Binlog stream interrupted | Service auto-recovers |
| `expected 4 destination arguments` | SHOW MASTER STATUS columns | Fixed in code |
| `assignment to entry in nil map` | Uninitialized map | Fixed in code |

### Recovery Procedures

1. **Secondary DB Connection Lost**:
   - Service auto-reconnects (up to 5 attempts)
   - Check secondary DB logs
   - Verify network connectivity

2. **Schema Drift Detected**:
   - Stop the service
   - Compare schemas: `SHOW CREATE TABLE schema.table` on both DBs
   - Sync the schemas manually
   - Reset the position: `DELETE FROM binlog_position`
   - Restart the service

3. **Transfer Interrupted**:
   - Service resumes from `transfer_progress_{instance}.json`
   - Check the checkpoint file for progress
   - Delete the checkpoint file to restart the transfer

4. **Event Processing Stuck**:
   - Check the health check logs
   - Verify the binlog position: `SELECT * FROM binlog_position`
   - Restart the service if needed
   - Clear the position: `DELETE FROM binlog_position`

## Monitoring

### Key Metrics to Watch

- **Replication Lag**: Time since the last event was processed
- **Rows Replicated**: Total INSERT/UPDATE/DELETE count
- **Tables Synced**: Number of tables synchronized
- **Error Count**: Failed operations counter
- **Success Rate**: Successful / total operations

### Manual Verification

```sql
-- Check position on secondary
SELECT * FROM binlog_position;

-- Check row counts
SELECT COUNT(*) FROM replica.your_table;

-- Compare with primary
SELECT COUNT(*) FROM your_table;
```

## Performance Considerations

1. **Batch Size**: Start with 1000; adjust based on table size and memory
2. **Connection Pooling**: `SetMaxOpenConns(25)` for moderate load
3. **Worker Count**: Currently single-threaded (multi-worker planned)
4. **Schema Caching**: Table schemas are cached in memory (auto-updated on drift)
5. **Index Usage**: Chunked transfers require an indexed primary key

## Limitations

- **Single-threaded**: One worker processes events sequentially
- **Position-based**: No GTID support yet (position-based only)
- **Integer PKs**: Chunking requires an integer primary key for efficiency
- **No Conflict Resolution**: Concurrent writes are not handled
- **No Data Validation**: Assumes source data is valid

## Known Issues & Fixes

### Fixed Issues

- [x] `SHOW MASTER STATUS` returning 4 columns (fixed by scanning 4 variables)
- [x] `nil map` panic in transfer (fixed by initializing TablesProcessed)
- [x] Event deduplication skipping valid events (disabled deduplication)
- [x] Checkpoint file path empty (fixed by setting an instance-specific path)

### Workarounds

- **Position 0 Events**: Deduplication disabled for now
- **Schema Changes**: Service marks the table as failed; manual restart required

## Future Enhancements

- [ ] GTID-based replication for better consistency
- [ ] Multi-threaded replication for higher throughput
- [ ] Conflict detection and resolution
- [ ] Prometheus metrics endpoint
- [ ] REST API for management
- [ ] Kubernetes operator
- [ ] Configuration file (YAML/JSON)
- [ ] Signal handling for dynamic config

## File Structure

```
replica/
├── cmd/
│   └── replica/
│       └── main.go              # Entry point
├── pkg/
│   └── replica/
│       ├── service.go           # Replication orchestration
│       ├── handlers.go          # Event processing
│       ├── initial_transfer.go  # Bulk data transfer
│       ├── position.go          # Position persistence
│       ├── sqlbuilder.go        # SQL generation
│       ├── config.go            # Configuration types
│       └── logging.go           # Structured logging
├── example.env                  # Environment template
├── .env                         # Environment (gitignored)
├── docker-compose.yml           # Local development
├── go.mod                       # Go module
└── README.md                    # This file
```

## License

MIT License