Courier MFT

Architecture Overview

System architecture, deployment units, and dependency layers for Courier MFT.

Courier is deployed as three independent processes: a REST API host, a background Worker host, and a Next.js frontend. All three share a single PostgreSQL database and coordinate indirectly through the database and Azure Key Vault — there is no inter-process messaging bus in V1. This polling-based coordination has a documented throughput ceiling (~50–100 jobs/hour, 3–10s pickup latency) that is acceptable for V1's target workload; Section 15 describes the migration to event-driven scheduling.

2.1 System Context

┌─────────────────────────────────────────────────────────────────────────┐
│                            COURIER PLATFORM                             │
│                                                                         │
│  ┌──────────────┐     ┌──────────────────┐     ┌────────────────────┐  │
│  │   Frontend    │────►│     API Host     │     │   Worker Host      │  │
│  │   (Next.js)   │◄────│  (ASP.NET Core)  │     │  (.NET Worker)     │  │
│  │               │     │                  │     │                    │  │
│  │  • Dashboard  │HTTPS│  • REST API      │     │  • Job Engine      │  │
│  │  • Job Builder│     │  • Auth (Entra)  │     │  • Quartz Scheduler│  │
│  │  • Monitor UI │     │  • Validation    │     │  • File Monitors   │  │
│  │  • Key Mgmt   │     │ • OpenAPI/Swagger│     │  • Key Rotation    │  │
│  │  • Audit Log  │     │                  │     │  • Partition Maint.│  │
│  └──────────────┘     └────────┬─────────┘     └─────────┬──────────┘  │
│                                │                          │             │
│                        ┌───────▼──────────────────────────▼──────┐      │
│                        │            PostgreSQL 16+               │      │
│                        │                                         │      │
│                        │  Jobs, Connections, Keys, Monitors,     │      │
│                        │  Executions, Audit Log, Quartz Tables   │      │
│                        └─────────────────────────────────────────┘      │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                │                    │                      │
                │                    │                      │
        ┌───────▼──────┐    ┌───────▼──────┐     ┌────────▼────────┐
        │ Azure Entra  │    │  Azure Key   │     │  Partner SFTP/  │
        │ ID (Auth)    │    │  Vault       │     │  FTP Servers    │
        └──────────────┘    └──────────────┘     └─────────────────┘

2.2 Deployment Units

| Unit | Technology | Responsibilities | Scaling |
|---|---|---|---|
| API Host | ASP.NET Core 10 | REST API, authentication, request validation, CRUD operations, OpenAPI spec | Horizontal — stateless, any number of replicas behind a load balancer |
| Worker Host | .NET 10 Worker Service | Quartz.NET scheduler, Job Engine execution, File Monitor polling, key rotation checks, partition maintenance | Single instance (V1) — Quartz AdoJobStore handles clustered failover if scaled later |
| Frontend | Next.js (standalone) | User interface, OAuth 2.0 Authorization Code + PKCE flow, API consumption | Horizontal — self-contained Node.js server in container |

Why separate API and Worker? The API host is request-driven and benefits from horizontal scaling. The Worker host is long-running and CPU/IO-bound (file transfers, encryption, compression). Separating them allows independent scaling, independent deployment, and prevents a runaway job from starving API response times. Both processes share the same domain logic via shared class libraries.

2.3 Internal Architecture — Vertical Slices

Courier organizes code by feature, not by technical layer. Each feature folder contains everything needed for that domain: API controllers, request/response DTOs, validators, application services, domain entities, and infrastructure adapters.

Solution structure:

Courier.sln
│
├── src/
│   ├── Courier.Api/                        ← API Host (ASP.NET Core)
│   │   ├── Program.cs                      ← Startup, middleware, DI
│   │   ├── Middleware/                      ← Exception handler, auth, CORS
│   │   └── appsettings.json
│   │
│   ├── Courier.Worker/                     ← Worker Host (.NET Worker Service)
│   │   ├── Program.cs                      ← Startup, hosted services, DI
│   │   └── appsettings.json
│   │
│   ├── Courier.Features/                   ← Shared feature library (API + Worker)
│   │   ├── Jobs/
│   │   │   ├── Entities/                   ← Job, JobStep, JobVersion, JobExecution, StepExecution
│   │   │   ├── Dtos/                       ← JobDto, CreateJobRequest, JobFilter
│   │   │   ├── Validators/                 ← CreateJobValidator, UpdateJobValidator
│   │   │   ├── Services/                   ← JobService, JobExecutionService
│   │   │   ├── Controllers/                ← JobsController
│   │   │   ├── StepTypes/                  ← IStepHandler implementations
│   │   │   └── Mapping/                    ← EF Core entity configuration
│   │   │
│   │   ├── Chains/
│   │   │   ├── Entities/
│   │   │   ├── Dtos/
│   │   │   ├── Validators/
│   │   │   ├── Services/
│   │   │   ├── Controllers/
│   │   │   └── Mapping/
│   │   │
│   │   ├── Connections/
│   │   │   ├── Entities/                   ← Connection, KnownHost
│   │   │   ├── Dtos/
│   │   │   ├── Validators/
│   │   │   ├── Services/                   ← ConnectionService, ConnectionTester
│   │   │   ├── Controllers/
│   │   │   ├── Protocols/                  ← SftpClient, FtpClient, FtpsClient adapters
│   │   │   └── Mapping/
│   │   │
│   │   ├── Keys/
│   │   │   ├── Pgp/                        ← PgpKey entity, PGP services, controllers
│   │   │   ├── Ssh/                        ← SshKey entity, SSH services, controllers
│   │   │   └── Shared/                     ← ICryptoProvider, Key rotation service
│   │   │
│   │   ├── Monitors/
│   │   │   ├── Entities/                   ← FileMonitor, MonitorJobBinding, MonitorFileLog
│   │   │   ├── Dtos/
│   │   │   ├── Validators/
│   │   │   ├── Services/                   ← MonitorService, LocalWatcher, RemotePoller
│   │   │   ├── Controllers/
│   │   │   └── Mapping/
│   │   │
│   │   ├── Tags/
│   │   │   ├── Entities/                   ← Tag, EntityTag
│   │   │   ├── Dtos/
│   │   │   ├── Services/
│   │   │   ├── Controllers/
│   │   │   └── Mapping/
│   │   │
│   │   ├── Audit/
│   │   │   ├── Entities/                   ← AuditLogEntry, DomainEvent
│   │   │   ├── Dtos/
│   │   │   ├── Services/                   ← AuditService
│   │   │   ├── Controllers/
│   │   │   └── Mapping/
│   │   │
│   │   ├── Dashboard/
│   │   │   ├── Dtos/                       ← SummaryDto, RecentExecutionDto
│   │   │   ├── Services/                   ← DashboardService
│   │   │   └── Controllers/
│   │   │
│   │   └── Settings/
│   │       ├── Entities/                   ← SystemSetting
│   │       ├── Dtos/
│   │       ├── Services/
│   │       ├── Controllers/
│   │       └── Mapping/
│   │
│   ├── Courier.Domain/                     ← Shared domain primitives
│   │   ├── Common/                         ← ApiResponse<T>, ErrorCodes, enums
│   │   ├── ValueObjects/                   ← FailurePolicy, StepConfiguration, etc.
│   │   └── Interfaces/                     ← ITransferClient, ICryptoProvider, IStepHandler
│   │
│   ├── Courier.Infrastructure/             ← Cross-cutting infrastructure
│   │   ├── Persistence/                    ← CourierDbContext, global filters, interceptors
│   │   ├── Encryption/                     ← EnvelopeEncryptionService, KeyVaultClient
│   │   ├── Compression/                    ← ZIP, GZIP, TAR, 7z providers
│   │   └── Migrations/                     ← DbUp runner, embedded SQL scripts
│   │
│   └── Courier.Frontend/                   ← Next.js project (separate build pipeline)
│       ├── src/
│       │   ├── app/                        ← Next.js app router
│       │   ├── components/                 ← Shared UI components
│       │   └── lib/                        ← API client, auth, utilities
│       └── package.json
│
├── tests/
│   ├── Courier.Tests.Unit/
│   ├── Courier.Tests.Integration/
│   └── Courier.Tests.Architecture/
│
└── infra/                                  ← Deployment configs
    ├── docker/                             ← Dockerfiles for API, Worker
    ├── k8s/                                ← Kubernetes manifests
    └── scripts/                            ← CI/CD, seed data

Key principle: Feature folders own their entire vertical slice. If you need to understand how Jobs work, you open Courier.Features/Jobs/ — the entities, DTOs, validators, services, controllers, and EF mappings are all there. Cross-cutting concerns (database context, encryption, compression) live in Courier.Infrastructure and are injected via DI.

2.4 Dependency Rules

┌───────────────────────────────────────────┐
│             Courier.Api                    │  ← Thin host: startup, middleware
│             Courier.Worker                 │  ← Thin host: startup, hosted services
├───────────────────────────────────────────┤
│         Courier.Features                   │  ← All feature slices
├───────────────────────────────────────────┤
│       Courier.Infrastructure               │  ← EF Core, encryption, compression
├───────────────────────────────────────────┤
│          Courier.Domain                    │  ← Entities, interfaces, value objects
└───────────────────────────────────────────┘

  References flow downward only. No project references upward.
  • Courier.Api → references Courier.Features, Courier.Infrastructure, Courier.Domain
  • Courier.Worker → references Courier.Features, Courier.Infrastructure, Courier.Domain
  • Courier.Features → references Courier.Infrastructure, Courier.Domain
  • Courier.Infrastructure → references Courier.Domain
  • Courier.Domain → references nothing (no NuGet dependencies except value types)

These rules are enforced by architecture tests in Courier.Tests.Architecture using NetArchTest or ArchUnitNET.
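The layering rule is mechanical, which is what makes it testable. As an illustration only (the real enforcement lives in Courier.Tests.Architecture via NetArchTest/ArchUnitNET, not in Python), a sketch of the check — the `ALLOWED` map below is copied from the reference list above, and `violations` is a hypothetical helper name:

```python
# Allowed project references, copied from the dependency rules above.
ALLOWED = {
    "Courier.Api": {"Courier.Features", "Courier.Infrastructure", "Courier.Domain"},
    "Courier.Worker": {"Courier.Features", "Courier.Infrastructure", "Courier.Domain"},
    "Courier.Features": {"Courier.Infrastructure", "Courier.Domain"},
    "Courier.Infrastructure": {"Courier.Domain"},
    "Courier.Domain": set(),  # references nothing
}

def violations(references: dict) -> list:
    """Return human-readable violations for an actual reference graph."""
    errors = []
    for project, refs in references.items():
        for target in refs - ALLOWED.get(project, set()):
            errors.append(f"{project} must not reference {target}")
    return errors

# Example: Infrastructure sneaking a reference back up to Features.
actual = {
    "Courier.Features": {"Courier.Infrastructure", "Courier.Domain"},
    "Courier.Infrastructure": {"Courier.Domain", "Courier.Features"},
}
assert violations(actual) == ["Courier.Infrastructure must not reference Courier.Features"]
```

The same shape expressed with NetArchTest would assert that types in `Courier.Infrastructure` have no dependency on `Courier.Features`, and so on down the stack.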

2.5 Request Flow (API)

A typical API request flows through the following pipeline:

Client Request (HTTPS)
    │
    ▼
┌─────────────────────────┐
│  Kestrel / Reverse Proxy │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  ApiExceptionMiddleware  │  ← Catches unhandled exceptions → ApiResponse envelope
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  CORS Middleware         │  ← Validates origin against Frontend:Origin
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Authentication          │  ← Validates Entra ID JWT bearer token
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Authorization           │  ← Checks [Authorize(Roles = "...")] attributes
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Security Headers        │  ← X-Content-Type-Options, CSP, HSTS, etc.
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Serilog Request Logging │  ← Structured log with method, path, status, duration
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Controller Action       │  ← Route matched, model binding
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  FluentValidation Filter │  ← Validates request body → 400 if invalid
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Application Service     │  ← Business logic, domain operations
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  EF Core / DbContext     │  ← Query or persist via PostgreSQL
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  ApiResponse<T> Envelope │  ← Wrap result in standard response model
└────────────┬────────────┘
             ▼
        JSON Response
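The ordering above matters because each middleware wraps everything after it: the exception handler is registered first so it catches failures from any later stage, and validation runs before business logic so invalid requests never reach a service. A minimal Python sketch of that wrapping (illustrative only — the real pipeline is ASP.NET Core middleware, and the envelope field names here are assumptions, not the exact ApiResponse<T> shape):

```python
# Each middleware wraps the next handler; registration order = wrapping order,
# so the outermost (exception) handler sees failures from everything inside it.
def exception_middleware(next_handler):
    def handle(request):
        try:
            return next_handler(request)
        except Exception as exc:  # unhandled exception -> standard envelope
            return {"success": False,
                    "error": {"code": "INTERNAL_ERROR", "message": str(exc)}}
    return handle

def validation_middleware(next_handler):
    def handle(request):
        if not request.get("name"):  # stand-in for a FluentValidation rule
            return {"success": False,
                    "error": {"code": "VALIDATION_FAILED", "message": "name is required"}}
        return next_handler(request)
    return handle

def controller(request):
    # stand-in for a controller action + application service
    return {"success": True, "data": {"id": 1, "name": request["name"]}}

# Compose inside-out: controller <- validation <- exception handling.
pipeline = exception_middleware(validation_middleware(controller))

assert pipeline({"name": "nightly-sync"})["success"] is True
assert pipeline({})["error"]["code"] == "VALIDATION_FAILED"
```

Reversing the composition would mean an exception thrown during validation escapes the envelope — the same bug you get in ASP.NET Core by registering the exception middleware late.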

2.6 Execution Flow (Worker)

A scheduled job execution flows through the Worker host:

Quartz.NET Trigger Fires
    │
    ▼
┌─────────────────────────┐
│  QuartzJobAdapter        │  ← Quartz IJob → resolves Courier's JobExecutionService via DI
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  JobExecutionService     │  ← Creates JobExecution record, loads Job + Steps + Version
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  Step Loop               │  ← For each Step in order:
│  │                       │
│  │  ┌─────────────────┐ │
│  │  │ IStepHandler     │ │  ← Resolved by typeKey from DI container
│  │  │ .ExecuteAsync()  │ │
│  │  └────────┬────────┘ │
│  │           │           │
│  │  ┌────────▼────────┐ │
│  │  │ Transfer /       │ │  ← ITransferClient, ICryptoProvider, ICompressionProvider
│  │  │ Encrypt /        │ │
│  │  │ Compress         │ │
│  │  └────────┬────────┘ │
│  │           │           │
│  │  StepExecution saved  │  ← State, duration, bytes, output
│  │  JobContext updated   │  ← Output variables for next step
│  │                       │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  JobExecution completed  │  ← Final state: Completed / Failed
│  Temp directory cleaned  │  ← Immediate cleanup on completion
│  Audit event logged      │
│  Downstream triggers     │  ← Dependent jobs / chain next member
└─────────────────────────┘
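The step loop's two key behaviors — handler resolution by typeKey and output variables flowing into the next step's context — can be sketched as follows (Python for illustration; the real engine is the .NET JobExecutionService, and the names `run_job`, `StepResult`, `JobContext` here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    succeeded: bool
    outputs: dict = field(default_factory=dict)

@dataclass
class JobContext:
    variables: dict = field(default_factory=dict)

def run_job(steps, handlers, stop_on_failure=True):
    """Minimal step loop: resolve a handler by typeKey, thread outputs forward."""
    ctx = JobContext()
    executions = []
    for step in steps:
        handler = handlers[step["typeKey"]]      # stand-in for DI resolution by typeKey
        result = handler(step, ctx)
        executions.append((step["name"], "Completed" if result.succeeded else "Failed"))
        ctx.variables.update(result.outputs)     # outputs become inputs for later steps
        if not result.succeeded and stop_on_failure:
            break                                # stand-in for the failure policy
    state = "Completed" if all(s == "Completed" for _, s in executions) else "Failed"
    return state, executions, ctx

handlers = {
    "download": lambda step, ctx: StepResult(True, {"file": "report.csv"}),
    "encrypt":  lambda step, ctx: StepResult(True, {"file": ctx.variables["file"] + ".pgp"}),
}
state, execs, ctx = run_job(
    [{"typeKey": "download", "name": "Fetch"}, {"typeKey": "encrypt", "name": "Encrypt"}],
    handlers,
)
assert state == "Completed" and ctx.variables["file"] == "report.csv.pgp"
```

The real engine additionally persists a StepExecution row (state, duration, bytes) after each step and applies the configured FailurePolicy instead of a simple stop-on-failure flag.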

2.7 Data Flow Between API and Worker

The API and Worker hosts do not communicate directly. They coordinate through the database:

┌──────────────┐                              ┌──────────────┐
│   API Host   │                              │ Worker Host  │
│              │                              │              │
│  POST /jobs/{id}/execute                    │              │
│       │                                     │              │
│       ▼                                     │              │
│  Insert job_executions                      │              │
│  row (state: 'queued')  ──────────────────► │  Quartz polls│
│                            PostgreSQL        │  AdoJobStore │
│  Insert job_schedules  ──────────────────►  │              │
│  row (cron expression)                      │  Quartz picks│
│                                             │  up trigger  │
│                                             │       │      │
│  GET /jobs/{id}/executions                  │       ▼      │
│       │                                     │  Execute job │
│       ▼                                     │  Update rows │
│  Read job_executions ◄──────────────────────│  (state,     │
│  rows (state, results)     PostgreSQL       │   results)   │
│                                             │              │
└──────────────┘                              └──────────────┘

For manual execution (POST /api/v1/jobs/{id}/execute), the API host creates a job_executions record with state queued and schedules an immediate Quartz trigger. The Worker's Quartz scheduler picks up the trigger via its AdoJobStore poll interval and begins execution. The API host reads execution status by querying the same job_executions table.
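The dequeue side of this handoff plausibly takes the shape below — a sketch only, since the production query belongs to Section 5.8 and the exact column names (`started_at`, `created_at`) are assumptions; the table name and states come from this section. FOR UPDATE SKIP LOCKED is what lets a future second Worker claim distinct rows without blocking:

```python
def dequeue_sql(batch_size: int = 1) -> str:
    """Illustrative shape of a SKIP LOCKED dequeue against job_executions.

    Not the production query (see Section 5.8): column names are assumed.
    Competing pollers each lock a disjoint set of 'queued' rows, so rows
    are picked up at-least-once but not concurrently.
    """
    return f"""
        UPDATE job_executions
           SET state = 'running', started_at = now()
         WHERE id IN (
               SELECT id
                 FROM job_executions
                WHERE state = 'queued'
                ORDER BY created_at
                  FOR UPDATE SKIP LOCKED
                LIMIT {int(batch_size)}
         )
        RETURNING id;
    """

sql = dequeue_sql(5)
assert "FOR UPDATE SKIP LOCKED" in sql and "LIMIT 5" in sql
```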

Throughput ceiling and known limitations:

Database-as-bus is a deliberate V1 tradeoff: zero additional infrastructure, simple debugging (query the tables), and transactional consistency. It has predictable limits that should inform capacity planning:

| Metric | V1 Design Point | Bottleneck |
|---|---|---|
| Job throughput | ~50–100 jobs/hour | Concurrency limit (default 5) × avg job duration. Most file transfer jobs run 10–120 seconds. |
| Queue poll latency | 3–10 seconds (p95) | Quartz AdoJobStore poll interval (default 5s) + queue dequeue poll (default 5s). Worst case on a fresh trigger is the sum of both intervals. |
| File Monitor throughput | ~200 files/minute (local), ~30 files/minute (remote per connection) | Local: FileSystemWatcher + stability window. Remote: poll interval × connection overhead. |
| Typical file size | 1 KB – 500 MB | Streaming architecture handles large files. Memory usage scales with step buffer size (default 8 KB), not file size. |
| Concurrent polling load | Quartz (1 query/5s) + queue dequeue (1 query/5s) + N monitors (1 query/interval each) | Single Worker instance: ~2–10 queries/second to PostgreSQL. Negligible for a dedicated database. |
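The pickup-latency figures follow directly from the two poll intervals. A back-of-envelope check using the defaults quoted above (both polls at 5s; the uniform-arrival assumption is ours, not the document's):

```python
# Pickup latency for a freshly queued execution, with default intervals.
quartz_poll_s = 5.0   # Quartz AdoJobStore poll interval
queue_poll_s = 5.0    # queue dequeue poll interval

# Best case: both pollers happen to fire immediately after enqueue.
best_case = 0.0

# Worst case: the job is enqueued just after both pollers ran,
# so it waits one full interval at each stage.
worst_case = quartz_poll_s + queue_poll_s

# Expected wait, assuming enqueue time is uniform within each interval:
# half an interval at each stage.
expected = quartz_poll_s / 2 + queue_poll_s / 2

assert worst_case == 10.0
assert expected == 5.0
```

This brackets the quoted 3–10 second p95: halving the intervals roughly halves pickup latency, at the cost of doubling poll load on PostgreSQL.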

Known failure modes at scale:

  • Poll jitter under load: When the database is under heavy write load (e.g., bulk audit logging during many concurrent jobs), poll queries experience variable latency. This manifests as inconsistent job pickup times. Mitigation: Quartz and queue polls use dedicated read connections, not the same connection pool as writes.
  • Thundering herd on restart: If the Worker restarts with a backlog of queued jobs, Quartz fires all pending triggers simultaneously, exceeding the concurrency limit. Mitigation: the concurrency semaphore (Section 5.8) gates actual execution — excess triggers enter Queued state and wait.
  • No backpressure from API to Worker: The API can queue jobs faster than the Worker can execute them. There is no feedback mechanism to slow down callers. Mitigation: the API returns the queue position in the response, and the dashboard shows queue depth. Alert on queue depth > configurable threshold.
  • "Exactly once" is not guaranteed: Database polling provides at-least-once pickup semantics. FOR UPDATE SKIP LOCKED (Section 5.8) prevents duplicate pickup in the steady state, but crash recovery could re-execute a job that was in progress. Mitigation: jobs are designed to be re-runnable (overwrite semantics on upload), and the job_executions table records the outcome of each attempt.

These limits are acceptable for V1's target workload (internal file transfer operations, not high-frequency event processing). Section 15 documents the V2 migration to event-driven scheduling that removes the polling bottleneck.

2.8 External Integration Points

| External System | Direction | Protocol | Purpose | Section |
|---|---|---|---|---|
| Azure Entra ID | Inbound | OAuth 2.0 / OIDC | User authentication, role claims | 12.1 |
| Azure Key Vault | Outbound | HTTPS (REST) | Master key wrap/unwrap, application secrets | 7, 12 |
| Azure Blob Storage | Outbound | HTTPS (REST) | Archived partition data (cold storage) | 13.6 |
| Azure Application Insights | Outbound | HTTPS | Telemetry, tracing, alerting (prod) | 3.10 |
| Seq | Outbound | HTTP | Structured log search (dev only) | 3.10 |
| Azure Function Apps | Outbound | HTTPS (Admin API) | Trigger serverless functions as job steps; poll completion via App Insights | 5.2, 6.1 |
| Azure Log Analytics | Outbound | HTTPS (REST) | Query Application Insights for function execution status and traces | 5.2 |
| Partner SFTP servers | Outbound | SFTP (SSH) | File transfer — upload, download, directory listing | 6 |
| Partner FTP/FTPS servers | Outbound | FTP / FTPS | File transfer — upload, download, directory listing | 6 |
| PostgreSQL | Both | TCP (SSL) | Primary data store | 13 |

2.9 Cross-Cutting Concerns

| Concern | Implementation | Owner |
|---|---|---|
| Authentication | Entra ID JWT validation via Microsoft.Identity.Web | API Host middleware |
| Authorization | Role-based [Authorize] attributes (Admin, Operator, Viewer) | API Host controllers |
| Logging | Serilog → Seq (dev) / App Insights (prod) with sensitive data redaction | Both hosts |
| Audit | AuditService writes to audit_log_entries on every state change | Both hosts |
| Encryption | EnvelopeEncryptionService wrapping Key Vault + AES-256-GCM | Both hosts |
| FIPS compliance | Algorithm restrictions + validated module detection (Section 12.10) | Both hosts |
| Error handling | ApiExceptionMiddleware (API), try/catch in hosted services (Worker) | Per host |
| Health checks | .NET Aspire health endpoints (DB, Key Vault, Quartz, disk space) | Both hosts |
| Configuration | Key Vault (prod) + User Secrets (dev) via .NET Configuration | Both hosts |

2.10 Architecture Decision Records

Key architectural decisions and their rationale, for future reference:

| Decision | Chosen | Alternatives Considered | Rationale |
|---|---|---|---|
| Three deployables | API + Worker + Frontend | Monolith, two deployables | Independent scaling; CPU-bound jobs don't starve API; independent deploy cycles |
| Vertical slices | Feature folders | Layered architecture | Cohesion by feature; easier to navigate; each slice owns its full stack |
| Database as coordination | PostgreSQL polling | RabbitMQ, Azure Service Bus | V1 simplicity; no additional infrastructure; Quartz already polls. Ceiling: ~50–100 jobs/hour, 3–10s pickup latency. See Section 2.7 for throughput limits. |
| EF Core (query only) | DbUp for migrations | EF Core migrations | Raw SQL gives full control over partitioning, triggers, indexes; DbUp is simpler for teams with strong SQL skills |
| Single PostgreSQL instance | One database, all tables | Separate databases per concern | Simpler operations; transactional consistency; partitioning handles scale |
| Quartz.NET for scheduling | AdoJobStore | Hangfire, custom timer | Mature, persistent, cron support, clustered failover, battle-tested |
| BouncyCastle for PGP | FIPS-approved algorithms only | GnuPG CLI, custom PGP | Only .NET library with full PGP format support; FIPS algorithms enforced in config |
| Azure Key Vault for KEK | Envelope encryption | Local key file, AWS KMS | Azure-native; FIPS 140-2 Level 2/3; hardware-backed; no key material on disk |
| System.Text.Json primary | Newtonsoft for Quartz only | Newtonsoft everywhere | Performance; .NET native; smaller dependency surface; Quartz requires Newtonsoft |
| No inter-process messaging (V1) | Database polling + FOR UPDATE SKIP LOCKED | RabbitMQ, gRPC, SignalR | V1 simplicity; poll jitter and DB load are acceptable at target throughput. V2 migrates to event-driven scheduling via outbox + message bus (Section 15). |

2.11 Non-Functional Requirements & Design Targets

Without explicit targets, claims like "polling is fine" and "partition monthly" cannot be evaluated. These are the design-point assumptions for V1. They are not SLAs — they are the workload profile the architecture was designed to support. Exceeding them requires the V2 changes documented in Section 15.

Throughput & capacity:

| Metric | V1 Design Target | Notes |
|---|---|---|
| Max file size (per step) | 10 GB | Streaming architecture; memory bounded to ~2× buffer size (default 80 KB). Tested to 10 GB; larger files should work but are not validated. |
| Concurrent job executions | 5 (configurable up to 20) | Global semaphore, not per-job. Bounded by Worker CPU/memory and IOPS. |
| Job throughput | ~50–100 jobs/hour | Depends on avg job duration. Bottleneck is concurrency limit × job runtime. |
| File Monitor throughput | ~200 files/min (local), ~30 files/min (remote) | Local: limited by FileSystemWatcher + stability window. Remote: limited by poll interval + connection overhead. |
| Concurrent active monitors | 50 | Beyond this, poll scheduling contention and database load from directory state become measurable. |
| Concurrent transfers per connection | 1 | SSH.NET and FluentFTP connections are not shared across jobs. Each job opens its own connection. |
| PGP/SSH keys stored | ~500 | No hard limit; performance degrades on key list queries if tags are heavily used and unindexed. |
| Audit log write rate | ~10–50 entries/second sustained | Partitioned by month; insert performance is stable. Querying across partitions degrades beyond 12 months of retained data. |

Latency:

| Metric | V1 Design Target | Notes |
|---|---|---|
| Job pickup latency (queued → running) | 3–10 seconds (p95) | Sum of Quartz poll interval + queue dequeue poll. |
| API response time (p95) | < 200ms | For CRUD operations. Excludes connection test (network-bound) and key generation (CPU-bound). |
| File Monitor detection (local, new file) | < 10 seconds | Watcher provides ~instant detection; stability window adds 5s before trigger. |
| File Monitor detection (remote, new file) | 1–2× poll interval | Depends on configured interval (min 30s, recommended 60s+). |
| Key Vault wrap/unwrap latency | ~20ms per operation | Azure Key Vault REST call. Adds to every encrypt/decrypt operation. |

Retention & storage:

| Metric | V1 Design Target | Notes |
|---|---|---|
| Audit log retention | 12 months online | Monthly partitions. Older partitions archived to Azure Blob (cold storage). |
| Job execution history | 12 months online | Same partitioning strategy as audit log. |
| Temp directory retention (orphaned) | 7 days | A background cleanup service purges orphaned directories. |
| Database size (1 year, typical) | ~10–50 GB | Depends heavily on audit log volume and JSONB column sizes. |

Availability & recovery:

| Metric | V1 Design Target | Notes |
|---|---|---|
| RPO (Recovery Point Objective) | < 1 hour | Azure Database for PostgreSQL continuous backup with PITR. RPO depends on WAL archival frequency. |
| RTO (Recovery Time Objective) | < 30 minutes | Container restart + migration check + Quartz re-acquisition. Does not include database restore time if the DB itself is lost. |
| Planned downtime tolerance | < 5 minutes | Rolling deployment: API hosts can be cycled independently. Worker requires brief stop for Quartz trigger handoff. |
| Unplanned Worker crash | Jobs in Running state are marked Failed on next startup; Queued jobs are re-picked up automatically. | No automatic failover to a second Worker in V1. |
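The crash-recovery behavior in the last row is the piece most worth internalizing: on startup the Worker reconciles execution state rather than resuming in-flight work. A sketch of that reconciliation pass (Python for illustration; `recover_on_startup` and the row shape are hypothetical names, and the real pass runs against job_executions in PostgreSQL):

```python
def recover_on_startup(executions):
    """Mark executions left 'running' by a crash as failed; leave 'queued'
    rows alone so normal pickup re-runs them. Illustrative sketch only."""
    recovered = []
    for ex in executions:
        if ex["state"] == "running":
            ex["state"] = "failed"
            ex["error"] = "Worker restarted during execution"
            recovered.append(ex["id"])
    return recovered

rows = [
    {"id": 1, "state": "running"},    # crashed mid-flight -> failed
    {"id": 2, "state": "queued"},     # untouched, picked up normally
    {"id": 3, "state": "completed"},  # untouched
]
assert recover_on_startup(rows) == [1]
assert [r["state"] for r in rows] == ["failed", "queued", "completed"]
```

This is also why jobs must be re-runnable (Section 2.7's at-least-once note): a recovered "failed" execution may already have partially uploaded files before the crash.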

What these targets do NOT cover (V2):

  • Multi-region or active-active deployment
  • Sub-second job pickup latency (requires event-driven scheduling)
  • Horizontal Worker scaling (requires Quartz cluster mode + event bus)
  • Zero-downtime database migrations
  • Formal SLA commitments with contractual penalties