Performance, Scaling, and Monitoring

Performance and Scaling Overview

Generator consists of two main services that work together:

  • Backend service: Handles API requests and user interactions
  • Worker service: Processes background tasks and content generation

Both services are designed to be stateless and horizontally scalable. This guide covers two main aspects of scaling:

  • Container Provisioning: How to size individual containers (vertical scaling)
  • Scaling Strategies: Different approaches to handling load, including both vertical and horizontal scaling

Both approaches can achieve similar performance, but horizontal scaling offers more flexibility for dynamic workloads.

Container Provisioning

This section covers how to size individual containers (vertical scaling) by allocating appropriate CPU and memory resources.

Backend Service Provisioning

The backend service automatically scales to use available CPU resources:

  • Automatic scaling: One worker process per CPU core (no configuration required)
  • Stateless design: Multiple backend containers can run simultaneously
  • Load balancing: Use a load balancer to distribute requests across backend instances
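
For example, a minimal sketch of running a backend container with an explicit CPU allocation; the image name generator-backend is hypothetical, so substitute your own. The backend starts one worker process per core it is given, with no extra configuration:

▶ docker run -d --cpus=4 --name generator-backend generator-backend:latest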

Worker Service Provisioning

Worker containers process background tasks and can be scaled independently:

Environment Variables:

  • THREADS__DRAMATIQ_PROCESSES: Number of Dramatiq processes per container
  • THREADS__DRAMATIQ_THREADS: Number of threads per process

Scaling Guidelines:

  • Memory-based: Start with ~1 thread per GB of memory on worker nodes
  • Process scaling: Each process contains THREADS__DRAMATIQ_THREADS threads
  • Resource monitoring: Increase the thread count if host CPU and memory metrics show headroom
  • PDF processing: Uses significant memory; if your workload includes few PDFs, you can likely run more threads per GB of memory

Example Configuration:

# For a 4GB worker node
THREADS__DRAMATIQ_THREADS=4
THREADS__DRAMATIQ_PROCESSES=1

# For a 16GB worker node with 2 processes
THREADS__DRAMATIQ_THREADS=8
THREADS__DRAMATIQ_PROCESSES=2
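
Total concurrency per container is THREADS__DRAMATIQ_PROCESSES × THREADS__DRAMATIQ_THREADS, so both examples target roughly one thread per GB. As a minimal sketch, these variables can be passed to the worker container at startup (the image name generator-worker is hypothetical):

▶ docker run -d \
    --memory=16g \
    -e THREADS__DRAMATIQ_PROCESSES=2 \
    -e THREADS__DRAMATIQ_THREADS=8 \
    generator-worker:latest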

Scaling Strategies

This section covers different approaches to handling increased load, including both vertical scaling (bigger containers) and horizontal scaling (more containers).

Vertical Scaling

Advantages:

  • Simpler deployment and management
  • Fewer moving parts
  • Easier to troubleshoot

Configuration:

  • Increase container CPU/memory allocation
  • Backend automatically uses additional CPUs
  • Adjust THREADS__DRAMATIQ_THREADS for worker containers
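
As a sketch, a running container's allocation can be raised in place with docker update; the container name generator-worker is hypothetical, and depending on how the container was created you may need to recreate it with larger limits instead:

▶ docker update --cpus=8 --memory=16g generator-worker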

Horizontal Scaling

Advantages:

  • Better resource utilization
  • Easier to scale in/out based on demand
  • Better fault tolerance

Configuration:

  • Deploy multiple backend containers behind load balancer
  • Deploy multiple worker containers
  • Use container orchestration (ECS, Kubernetes, Docker Swarm, etc.)
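
For example, with Docker Compose and hypothetical service names backend and worker, additional containers can be started with the --scale flag; ECS, Kubernetes, and Docker Swarm offer equivalent controls:

▶ docker compose up -d --scale backend=3 --scale worker=4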

Autoscaling

Generator’s stateless design makes it well-suited for autoscaling. Since both backend and worker containers can be scaled independently, you can adjust capacity based on demand. Your load balancer will automatically distribute work across the backend containers, and the queue system will distribute work across the workers. This set-up makes it easy to scale up during peak usage and scale down during quieter periods.

Monitoring Resource Usage: While Generator provides a metrics endpoint (see Monitoring section below), container orchestration platforms like ECS or Kubernetes typically provide more comprehensive resource monitoring and autoscaling capabilities. These platforms can monitor CPU, memory, and queue depth to make more informed scaling decisions than the basic metrics endpoint alone.
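
As a sketch on Kubernetes, a CPU-based autoscaler for the workers could be created as follows; the Deployment name generator-worker is hypothetical and the thresholds are illustrative only:

▶ kubectl autoscale deployment generator-worker --cpu-percent=70 --min=2 --max=10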

Monitoring

Metrics Endpoint

Generator exposes Prometheus-style metrics at a /metrics endpoint. For example:

▶ curl http://[generator server]/metrics
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 1305.0
python_gc_objects_collected_total{generation="1"} 351.0
python_gc_objects_collected_total{generation="2"} 89.0
[...]

Available Metrics

  • HTTP request volume and response times
  • Python runtime metrics (memory usage, garbage collection)
  • Additional metrics will be added in the future

To avoid leaking this information publicly, the metrics endpoint will reject any requests that were forwarded through a load balancer (determined by the presence of any X-Forwarded-* headers).
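
For illustration, a request carrying any X-Forwarded-* header is rejected (the exact rejection response is not shown here), while the direct request in the example above is served:

▶ curl -H "X-Forwarded-For: 203.0.113.10" http://[generator server]/metrics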

If you want to disable this endpoint, set the HOSTING__ENABLE_METRICS environment variable to false.
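
For example, in the environment configuration:

# Disable the /metrics endpoint
HOSTING__ENABLE_METRICS=false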