
CI/CD Pipeline Architecture: How Automating Deployments Eliminated Our Weekend Hotfixes

Design CI/CD pipelines from scratch with GitHub Actions. Covers build, test, deploy stages, blue-green deployments, canary releases, and infrastructure as code.

By SouvenirList

For the first two years of our product, deployments were a ritual. Every Thursday afternoon, the lead developer would SSH into the production server, pull the latest code from the main branch, run the database migrations manually, restart the application, and then spend the next hour monitoring the logs for errors. If something went wrong, the rollback process was to git checkout the previous commit and restart again.

This process had predictable consequences. We deployed only once a week because each deployment was risky and time-consuming. Bugs that were discovered on Friday often required weekend hotfixes — which followed the same manual process but under more pressure and with more mistakes. When we finally automated the pipeline, weekly deployments became daily deployments, then multiple times per day. Weekend hotfixes dropped from monthly occurrences to zero.

This guide covers the CI/CD pipeline architecture I have built and refined across multiple teams, from the basic stages to advanced deployment strategies like blue-green and canary releases.


TL;DR — Pipeline Stages

| Stage | What It Does | Blocks Deploy? | Typical Duration |
|---|---|---|---|
| Lint | Code style, formatting checks | Yes | 15-30 seconds |
| Build | Compile, bundle, create artifacts | Yes | 1-5 minutes |
| Unit Tests | Test individual functions and modules | Yes | 1-3 minutes |
| Integration Tests | Test service interactions, database queries | Yes | 3-10 minutes |
| Security Scan | Dependency vulnerabilities, secrets detection | Yes (critical) | 1-3 minutes |
| Staging Deploy | Deploy to staging environment | No (informational) | 2-5 minutes |
| Smoke Tests | Verify staging deployment works | Yes | 1-2 minutes |
| Production Deploy | Deploy to production | N/A | 2-5 minutes |
| Health Check | Verify production health post-deploy | Triggers rollback | 1-2 minutes |

The Pipeline That Changed Everything

Here is the GitHub Actions workflow I use as a starting point for every new project. It covers the essential stages and can be extended with the advanced patterns described later.

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Lint
        run: npm run lint

      - name: Type check
        run: npm run typecheck

      - name: Unit tests
        run: npm run test:unit -- --coverage

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage/lcov.info

  integration-tests:
    runs-on: ubuntu-latest
    needs: lint-and-test
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test_db
          POSTGRES_USER: test_user
          POSTGRES_PASSWORD: test_pass
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - name: Run migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgresql://test_user:test_pass@localhost:5432/test_db
      - name: Integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test_user:test_pass@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379

  security-scan:
    runs-on: ubuntu-latest
    needs: lint-and-test
    steps:
      - uses: actions/checkout@v4
      - name: Dependency audit
        run: npm audit --audit-level=high
      - name: Secret scanning
        uses: trufflesecurity/trufflehog@main
        with:
          extra_args: --only-verified

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [integration-tests, security-scan]
    if: github.ref == 'refs/heads/main'
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          echo "Deploying to staging..."
          # Your staging deployment command here

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          echo "Deploying to production..."
          # Your production deployment command here
      - name: Health check
        run: |
          sleep 30
          curl --fail https://api.example.com/health || exit 1
      - name: Notify team
        if: success()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text":"Production deployment successful: ${{ github.sha }}"}'

Why This Structure Works

The pipeline runs lint and unit tests first because they are fast (under a minute). If code style or basic logic is broken, there is no point running expensive integration tests or security scans. Integration tests and security scans run in parallel after the fast checks pass. Staging deployment happens only on the main branch after all checks pass. Production deployment requires staging to succeed first.

This structure catches 90% of issues in the first 60 seconds and only runs expensive checks when the basics pass.
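One cheap addition in the same fail-fast spirit is a concurrency group at the top of the workflow, so a new push to the same branch cancels the superseded run instead of queueing behind it:

```yaml
# Optional: cancel superseded runs for the same branch/ref
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

Be careful with cancel-in-progress on branches that deploy — cancelling mid-deployment can leave an environment half-updated, so you may want to scope cancellation to pull requests only.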


Each Stage in Detail

Linting: Catching Problems Before They Exist

Linting is the lowest-cost, highest-value stage. It catches formatting issues, unused variables, potential bugs, and style inconsistencies in seconds. I enforce linting with a pre-commit hook locally and in CI as a gate.

{
  "scripts": {
    "lint": "eslint . --ext .ts,.tsx --max-warnings 0",
    "lint:fix": "eslint . --ext .ts,.tsx --fix"
  }
}

The --max-warnings 0 flag is important. Without it, warnings accumulate until nobody reads them. With it, every warning is a build failure that must be addressed.

Testing: The Three Levels

I organize tests into three levels, each with different scope and speed:

Unit Tests (fast, isolated)
├── Test individual functions
├── Mock external dependencies
├── Run in milliseconds
└── 80% of test count

Integration Tests (medium, realistic)
├── Test service interactions
├── Use real database and Redis
├── Run in seconds
└── 15% of test count

End-to-End Tests (slow, comprehensive)
├── Test full user workflows
├── Use staging environment
├── Run in minutes
└── 5% of test count

The ratio matters. I once worked on a project where 70% of tests were end-to-end browser tests. The CI pipeline took 45 minutes, developers stopped running tests locally, and the test suite became so fragile that flaky tests were just re-run until they passed. After rebalancing to the pyramid above, the pipeline dropped to 8 minutes.

Security Scanning: Catching Vulnerabilities Early

Every pipeline should include dependency vulnerability scanning and secret detection. I use npm audit for known vulnerability detection and TruffleHog for leaked secrets (API keys, passwords, private keys accidentally committed):

- name: Dependency audit
  run: npm audit --audit-level=high
  
- name: Secret scanning
  uses: trufflesecurity/trufflehog@main
  with:
    extra_args: --only-verified

I set the audit level to high rather than moderate to avoid blocking deployments for low-severity issues that do not have fixes available yet. Critical and high-severity vulnerabilities always block the pipeline.


Deployment Strategies

Rolling Deployment

The simplest strategy: gradually replace old instances with new ones. Kubernetes does this by default with Deployments.

# Kubernetes rolling update
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

maxUnavailable: 0 ensures that no old instances are removed until a new instance is healthy. This gives you zero-downtime deployments, but if the new version has a bug, it affects users gradually as new instances come online.
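maxUnavailable: 0 only delivers on that promise if Kubernetes can tell whether a new pod is healthy, which requires a readiness probe on the container. A minimal sketch, assuming the app serves GET /health on port 3000 (the image name is illustrative):

```yaml
# Readiness probe so the rolling update waits for genuinely healthy pods
spec:
  containers:
    - name: app
      image: myapp:1.1.0
      ports:
        - containerPort: 3000
      readinessProbe:
        httpGet:
          path: /health
          port: 3000
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
```

Without the probe, a pod counts as available as soon as its process starts, and the rollout can replace healthy old instances with broken new ones.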

Blue-Green Deployment

Blue-green maintains two identical production environments. At any time, one (blue) serves traffic while the other (green) is idle. Deployments go to the idle environment, and traffic is switched instantaneously after verification.

Before deployment:
  [Load Balancer] → [Blue (v1.0)] ← Live traffic
                    [Green (idle)]

During deployment:
  [Load Balancer] → [Blue (v1.0)] ← Live traffic
                    [Green (v1.1)] ← Deploy here, run smoke tests

After switch:
  [Load Balancer] → [Green (v1.1)] ← Live traffic
                    [Blue (v1.0)] ← Instant rollback target

The key advantage is instant rollback. If the new version has problems, you switch traffic back to the old environment in seconds. I implemented this with AWS Application Load Balancer target groups:

- name: Deploy to green
  run: |
    aws ecs update-service --cluster prod --service green --task-definition app:${{ github.sha }}
    aws ecs wait services-stable --cluster prod --services green

- name: Smoke tests on green
  run: |
    # GREEN_URL is the green environment's direct endpoint (for example a
    # dedicated test listener on the ALB), not the live production URL
    curl --fail "$GREEN_URL/health"
    npm run test:smoke -- --base-url "$GREEN_URL"

- name: Switch traffic to green
  run: |
    aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
      --default-actions Type=forward,TargetGroupArn=$GREEN_TARGET_GROUP
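The instant rollback is the same listener modification pointed back at the old environment — a sketch assuming a $BLUE_TARGET_GROUP variable alongside the ones above:

```yaml
- name: Roll back to blue
  if: failure()
  run: |
    aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
      --default-actions Type=forward,TargetGroupArn=$BLUE_TARGET_GROUP
```

Because blue is still running the previous version, this step completes in seconds — no rebuild, no redeploy.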

Canary Deployment

Canary deployments route a small percentage of traffic to the new version and gradually increase it if metrics look healthy. This is the safest strategy for critical services.

Step 1: 5% traffic → new version, 95% → old version
Step 2: Monitor error rate, latency for 10 minutes
Step 3: If healthy → 25% traffic to new version
Step 4: Monitor for 10 minutes
Step 5: If healthy → 100% traffic to new version
Step 6: If any step fails → 100% back to old version
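Expressed as pipeline steps, the schedule above might look like this sketch — shift-traffic.sh and monitor-metrics.sh are hypothetical wrappers around whatever your load balancer and metrics backend actually expose:

```yaml
canary-deploy:
  runs-on: ubuntu-latest
  needs: deploy-staging
  environment: production
  steps:
    - uses: actions/checkout@v4
    - name: Shift 5% of traffic to new version
      run: ./scripts/shift-traffic.sh 5
    - name: Monitor for 10 minutes
      run: ./scripts/monitor-metrics.sh --duration 600
    - name: Shift 25%
      run: ./scripts/shift-traffic.sh 25
    - name: Monitor for 10 minutes
      run: ./scripts/monitor-metrics.sh --duration 600
    - name: Shift 100%
      run: ./scripts/shift-traffic.sh 100
    - name: Roll back on any failure
      if: failure()
      run: ./scripts/shift-traffic.sh 0
```

On AWS ALB the shift script would adjust weighted forward actions between two target groups; on NGINX or Envoy it would adjust upstream weights.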

I use canary deployments for our payment processing service because the cost of a bad deployment is measured in lost revenue. The entire deployment takes 30 minutes instead of 5, but the risk of a user-facing incident drops dramatically.


Infrastructure as Code

Every environment — development, staging, production — should be defined in code and version-controlled. I learned this lesson after spending a weekend debugging a production issue that turned out to be a configuration difference between staging and production that someone had applied manually months earlier.

Docker for Consistency

FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --include=dev
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
EXPOSE 3000
USER node
CMD ["node", "dist/server.js"]

The multi-stage build keeps the production image small — only the compiled output and production dependencies. No source code, no dev dependencies, no build tools.

Environment Configuration

# docker-compose.yml for local development
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://dev:dev@postgres:5432/app_dev
      - REDIS_URL=redis://redis:6379
      - NODE_ENV=development
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: app_dev
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U dev"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7

Every developer on the team runs docker compose up and gets an identical environment. No “works on my machine” conversations, no manual PostgreSQL or Redis installation, no version mismatches.


Pipeline Anti-Patterns I Have Learned to Avoid

Skipping tests for “urgent” fixes. I have never seen this end well. The “urgent” fix that bypasses the pipeline invariably introduces a new bug that requires another urgent fix. Every commit goes through the full pipeline — no exceptions.

Manual approval gates for every deployment. If production deployments require a manager’s Slack approval, deployments slow down and batch up, making each one larger and riskier. Automated gates (tests, security scans, health checks) are more reliable than human approval. Reserve manual approval for high-risk changes only.

Not monitoring post-deployment. The pipeline does not end at deployment. I include automated health checks for 5 minutes after every deployment. If the health check fails, the pipeline triggers an automatic rollback.
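In GitHub Actions, that post-deploy watch plus automatic rollback can be sketched like this — kubectl rollout undo is a Kubernetes example; substitute your platform's rollback command:

```yaml
- name: Post-deploy health watch (~5 minutes)
  run: |
    for i in $(seq 1 10); do
      curl --fail https://api.example.com/health || exit 1
      sleep 30
    done
- name: Automatic rollback
  if: failure()
  run: kubectl rollout undo deployment/app
```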

Environment-specific configuration in code. Never hardcode staging or production URLs, credentials, or feature flags. Use environment variables or a configuration service. I once accidentally deployed staging database credentials to production because someone hardcoded them in a config file.


Frequently Asked Questions

GitHub Actions vs. GitLab CI vs. Jenkins — Which Should I Choose?

GitHub Actions if your code is on GitHub and you want the simplest setup with a vast marketplace of pre-built actions. GitLab CI if your code is on GitLab and you want a fully integrated DevOps platform with built-in container registry and Kubernetes integration. Jenkins if you need maximum customization and are willing to manage the infrastructure yourself. For most teams, GitHub Actions or GitLab CI is the right choice — Jenkins’ operational overhead is justified only when you have unique requirements that managed CI tools cannot handle.

How Fast Should a CI Pipeline Be?

Target under 10 minutes for the full pipeline. Developers stop waiting for pipelines longer than 10 minutes and start context-switching, which kills productivity. My current pipeline runs lint and unit tests in 90 seconds, integration tests in 4 minutes, and deploys to staging in 3 minutes — total about 8 minutes from push to staging.

Should I Run Tests in Parallel?

Yes, when possible. Unit tests typically parallelize well because they are isolated. Integration tests may need sequencing if they share a database. I split integration tests across multiple CI jobs using test sharding — each job runs a subset of tests in parallel, and the pipeline passes only when all shards succeed.
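A sketch of that sharding with a GitHub Actions matrix, assuming a test runner that accepts a --shard flag (Jest 28+ and Vitest both do):

```yaml
integration-tests:
  runs-on: ubuntu-latest
  needs: lint-and-test
  strategy:
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with:
        node-version: '20'
        cache: 'npm'
    - run: npm ci
    - name: Integration tests (shard ${{ matrix.shard }} of 4)
      run: npm run test:integration -- --shard=${{ matrix.shard }}/4
```

The pipeline passes only when all four shards succeed, so wall-clock time drops roughly fourfold without weakening the gate.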

How Do I Handle Database Migrations in CI/CD?

Run migrations as a separate step before the application starts, never as part of application startup. In the pipeline: run migrations against the staging database as part of the staging deployment, verify they succeed, then run them against production before the production deployment. Always write reversible migrations and test the rollback path.
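As a pipeline step, that is the same migration command used in CI, pointed at production — PROD_DATABASE_URL is an assumed secret name:

```yaml
- name: Run production migrations
  run: npm run db:migrate
  env:
    DATABASE_URL: ${{ secrets.PROD_DATABASE_URL }}
- name: Deploy to production
  run: |
    echo "Deploying to production..."
    # Your production deployment command here
```

Because the migration step precedes the deploy step, a failed migration stops the pipeline before any new application code ships.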

What Is the Minimum CI/CD Pipeline for a New Project?

At minimum: lint, unit tests, and automated deployment to production. Even a two-stage pipeline (test → deploy) is infinitely better than manual deployments. You can add integration tests, security scanning, and staging environments as the project matures. The most important thing is to automate the deployment from day one — the longer you wait, the harder the manual process becomes to replace.
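A sketch of that minimum viable pipeline — deploy.sh is a placeholder for whatever deployment command you have, even a simple SSH-and-pull script:

```yaml
name: Minimal CI/CD
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run test:unit
  deploy:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh
```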


The Bottom Line

CI/CD pipeline automation is the highest-leverage investment you can make in your development workflow. The transition from manual Thursday deployments to automated multiple-daily deployments did not just eliminate weekend hotfixes — it fundamentally changed how our team approached software development. Smaller changes, deployed faster, with more confidence.

Start with the basic pipeline (lint, test, deploy) and add sophistication as your needs grow. Automate deployment on day one — even if it is just a shell script that SSHs into a server and pulls the latest code. Every deployment that runs without human intervention is a deployment that cannot suffer from human error.


Tags: cicd devops github actions deployment backend automation
