CI/CD Pipeline Architecture: How Automating Deployments Eliminated Our Weekend Hotfixes
Design CI/CD pipelines from scratch with GitHub Actions. Covers build, test, deploy stages, blue-green deployments, canary releases, and infrastructure as code.
For the first two years of our product, deployments were a ritual. Every Thursday afternoon, the lead developer would SSH into the production server, pull the latest code from the main branch, run the database migrations manually, restart the application, and then spend the next hour monitoring the logs for errors. If something went wrong, the rollback process was to git checkout the previous commit and restart again.
This process had predictable consequences. We deployed only once a week because each deployment was risky and time-consuming. Bugs that were discovered on Friday often required weekend hotfixes — which followed the same manual process but under more pressure and with more mistakes. When we finally automated the pipeline, weekly deployments became daily deployments, then multiple times per day. Weekend hotfixes dropped from monthly occurrences to zero.
This guide covers the CI/CD pipeline architecture I have built and refined across multiple teams, from the basic stages to advanced deployment strategies like blue-green and canary releases.
TL;DR — Pipeline Stages
| Stage | What It Does | Blocks Deploy? | Typical Duration |
|---|---|---|---|
| Lint | Code style, formatting checks | Yes | 15-30 seconds |
| Build | Compile, bundle, create artifacts | Yes | 1-5 minutes |
| Unit Tests | Test individual functions and modules | Yes | 1-3 minutes |
| Integration Tests | Test service interactions, database queries | Yes | 3-10 minutes |
| Security Scan | Dependency vulnerabilities, secrets detection | Yes (critical) | 1-3 minutes |
| Staging Deploy | Deploy to staging environment | No (informational) | 2-5 minutes |
| Smoke Tests | Verify staging deployment works | Yes | 1-2 minutes |
| Production Deploy | Deploy to production | N/A | 2-5 minutes |
| Health Check | Verify production health post-deploy | Triggers rollback | 1-2 minutes |
The Pipeline That Changed Everything
Here is the GitHub Actions workflow I use as a starting point for every new project. It covers the essential stages and can be extended with the advanced patterns described later.
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Lint
        run: npm run lint
      - name: Type check
        run: npm run typecheck
      - name: Unit tests
        run: npm run test:unit -- --coverage
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          file: ./coverage/lcov.info

  integration-tests:
    runs-on: ubuntu-latest
    needs: lint-and-test
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test_db
          POSTGRES_USER: test_user
          POSTGRES_PASSWORD: test_pass
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - name: Run migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgresql://test_user:test_pass@localhost:5432/test_db
      - name: Integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test_user:test_pass@localhost:5432/test_db
          REDIS_URL: redis://localhost:6379

  security-scan:
    runs-on: ubuntu-latest
    needs: lint-and-test
    steps:
      - uses: actions/checkout@v4
      - name: Dependency audit
        run: npm audit --audit-level=high
      - name: Secret scanning
        uses: trufflesecurity/trufflehog@main
        with:
          extra_args: --only-verified

  deploy-staging:
    runs-on: ubuntu-latest
    needs: [integration-tests, security-scan]
    if: github.ref == 'refs/heads/main'
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          echo "Deploying to staging..."
          # Your staging deployment command here

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    if: github.ref == 'refs/heads/main'
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          echo "Deploying to production..."
          # Your production deployment command here
      - name: Health check
        run: |
          sleep 30
          curl --fail https://api.example.com/health || exit 1
      - name: Notify team
        if: success()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text":"Production deployment successful: ${{ github.sha }}"}'
```
Why This Structure Works
The pipeline runs lint and unit tests first because they are fast (under a minute). If code style or basic logic is broken, there is no point running expensive integration tests or security scans. Integration tests and security scans run in parallel after the fast checks pass. Staging deployment happens only on the main branch after all checks pass. Production deployment requires staging to succeed first.
This structure catches 90% of issues in the first 60 seconds and only runs expensive checks when the basics pass.
Each Stage in Detail
Linting: Catching Problems Before They Exist
Linting is the lowest-cost, highest-value stage. It catches formatting issues, unused variables, potential bugs, and style inconsistencies in seconds. I enforce linting with a pre-commit hook locally and in CI as a gate.
```json
{
  "scripts": {
    "lint": "eslint . --ext .ts,.tsx --max-warnings 0",
    "lint:fix": "eslint . --ext .ts,.tsx --fix"
  }
}
```
The --max-warnings 0 flag is important. Without it, warnings accumulate until nobody reads them. With it, every warning is a build failure that must be addressed.
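To enforce the same rule locally before a commit ever reaches CI, a minimal pre-commit setup might look like the sketch below. This assumes Husky and lint-staged, neither of which the pipeline above prescribes — any hook manager works:

```json
{
  "scripts": {
    "prepare": "husky"
  },
  "lint-staged": {
    "*.{ts,tsx}": "eslint --max-warnings 0"
  }
}
```

With a `.husky/pre-commit` file containing `npx lint-staged`, only staged files are linted, so the hook stays fast even in large repositories.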
Testing: The Three Levels
I organize tests into three levels, each with different scope and speed:
```
Unit Tests (fast, isolated)
├── Test individual functions
├── Mock external dependencies
├── Run in milliseconds
└── 80% of test count

Integration Tests (medium, realistic)
├── Test service interactions
├── Use real database and Redis
├── Run in seconds
└── 15% of test count

End-to-End Tests (slow, comprehensive)
├── Test full user workflows
├── Use staging environment
├── Run in minutes
└── 5% of test count
```
The ratio matters. I once worked on a project where 70% of tests were end-to-end browser tests. The CI pipeline took 45 minutes, developers stopped running tests locally, and the test suite became so fragile that flaky tests were just re-run until they passed. After rebalancing to the pyramid above, the pipeline dropped to 8 minutes.
Security Scanning: Catching Vulnerabilities Early
Every pipeline should include dependency vulnerability scanning and secret detection. I use npm audit for known vulnerability detection and TruffleHog for leaked secrets (API keys, passwords, private keys accidentally committed):
```yaml
- name: Dependency audit
  run: npm audit --audit-level=high
- name: Secret scanning
  uses: trufflesecurity/trufflehog@main
  with:
    extra_args: --only-verified
```
I set the audit level to high rather than moderate to avoid blocking deployments for moderate-severity issues that often have no fix available yet. Critical and high-severity vulnerabilities always block the pipeline.
Deployment Strategies
Rolling Deployment
The simplest strategy: gradually replace old instances with new ones. Kubernetes does this by default with Deployments.
```yaml
# Kubernetes rolling update
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```
maxUnavailable: 0 ensures that no old instances are removed until a new instance is healthy. This gives you zero-downtime deployments, but if the new version has a bug, it affects users gradually as new instances come online.
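Note that maxUnavailable: 0 only delivers zero downtime if Kubernetes can tell when a new pod is actually healthy, which requires a readiness probe. A sketch, where the path, port, and timings are assumptions to adapt to your service:

```yaml
# Without a readiness probe, a pod counts as "ready" the moment its
# container starts, and the rollout can replace healthy old pods with
# broken new ones.
spec:
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
```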
Blue-Green Deployment
Blue-green maintains two identical production environments. At any time, one (blue) serves traffic while the other (green) is idle. Deployments go to the idle environment, and traffic is switched instantaneously after verification.
```
Before deployment:
  [Load Balancer] → [Blue (v1.0)]   ← Live traffic
                    [Green (idle)]

During deployment:
  [Load Balancer] → [Blue (v1.0)]   ← Live traffic
                    [Green (v1.1)]  ← Deploy here, run smoke tests

After switch:
  [Load Balancer] → [Green (v1.1)]  ← Live traffic
                    [Blue (v1.0)]   ← Instant rollback target
```
The key advantage is instant rollback. If the new version has problems, you switch traffic back to the old environment in seconds. I implemented this with AWS Application Load Balancer target groups:
```yaml
- name: Deploy to green
  run: |
    aws ecs update-service --cluster prod --service green \
      --task-definition app:${{ github.sha }}
    aws ecs wait services-stable --cluster prod --services green
- name: Smoke tests on green
  run: |
    # Resolve the green environment's DNS name. (Querying the target group
    # ARN would not work here: an ARN is not a URL. This assumes a dedicated
    # load balancer or test listener named "green" fronting the idle
    # environment.)
    GREEN_URL="https://$(aws elbv2 describe-load-balancers --names green \
      --query 'LoadBalancers[0].DNSName' --output text)"
    curl --fail "$GREEN_URL/health"
    npm run test:smoke -- --base-url "$GREEN_URL"
- name: Switch traffic to green
  run: |
    aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
      --default-actions Type=forward,TargetGroupArn=$GREEN_TARGET_GROUP
```
Canary Deployment
Canary deployments route a small percentage of traffic to the new version and gradually increase it if metrics look healthy. This is the safest strategy for critical services.
```
Step 1: 5% traffic → new version, 95% → old version
Step 2: Monitor error rate, latency for 10 minutes
Step 3: If healthy → 25% traffic to new version
Step 4: Monitor for 10 minutes
Step 5: If healthy → 100% traffic to new version
Step 6: If any step fails → 100% back to old version
```
I use canary deployments for our payment processing service because the cost of a bad deployment is measured in lost revenue. The entire deployment takes 30 minutes instead of 5, but the risk of a user-facing incident drops dramatically.
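On AWS, one traffic-shift step can be sketched with ALB weighted target groups. The ARNs, weights, and environment variable names below are placeholders, and the monitoring gate between shifts is elided:

```yaml
# Sketch of one canary step. $LISTENER_ARN, $STABLE_TG, and $CANARY_TG are
# assumed to be set elsewhere in the job; repeat with 75/25, then 0/100,
# gated on the metric checks between steps.
- name: Shift 5% of traffic to canary
  run: |
    aws elbv2 modify-listener --listener-arn "$LISTENER_ARN" \
      --default-actions '[{
        "Type": "forward",
        "ForwardConfig": {
          "TargetGroups": [
            {"TargetGroupArn": "'"$STABLE_TG"'", "Weight": 95},
            {"TargetGroupArn": "'"$CANARY_TG"'", "Weight": 5}
          ]
        }
      }]'
```

Rolling back at any step is the same call with the weights flipped back to 100/0, which is why the failure path in the diagram above is so cheap.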
Infrastructure as Code
Every environment — development, staging, production — should be defined in code and version-controlled. I learned this lesson after spending a weekend debugging a production issue that turned out to be a configuration difference between staging and production that someone had applied manually months earlier.
Docker for Consistency
```dockerfile
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build
# Drop dev dependencies so the runtime stage copies only production deps
RUN npm prune --production

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package*.json ./
EXPOSE 3000
USER node
CMD ["node", "dist/server.js"]
```
The multi-stage build keeps the production image small — only the compiled output and production dependencies. No source code, no dev dependencies, no build tools.
Environment Configuration
```yaml
# docker-compose.yml for local development
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://dev:dev@postgres:5432/app_dev
      - REDIS_URL=redis://redis:6379
      - NODE_ENV=development
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: app_dev
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U dev"]
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7
```
Every developer on the team runs docker compose up and gets an identical environment. No “works on my machine” conversations, no manual PostgreSQL or Redis installation, no version mismatches.
Pipeline Anti-Patterns I Have Learned to Avoid
Skipping tests for “urgent” fixes. I have never seen this end well. The “urgent” fix that bypasses the pipeline invariably introduces a new bug that requires another urgent fix. Every commit goes through the full pipeline — no exceptions.
Manual approval gates for every deployment. If production deployments require a manager’s Slack approval, deployments slow down and batch up, making each one larger and riskier. Automated gates (tests, security scans, health checks) are more reliable than human approval. Reserve manual approval for high-risk changes only.
Not monitoring post-deployment. The pipeline does not end at deployment. I include automated health checks for 5 minutes after every deployment. If the health check fails, the pipeline triggers an automatic rollback.
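A minimal sketch of such a watcher in shell — `watch_health` and `HEALTH_CHECK_CMD` are hypothetical names, not part of the pipeline above, and the default check command and URL are placeholders:

```shell
#!/usr/bin/env bash
# Post-deploy health watcher (a sketch). Polls a health endpoint for a fixed
# window; the pipeline's rollback step keys off a non-zero return.
# HEALTH_CHECK_CMD is overridable so the logic can be exercised in isolation.

watch_health() {
  local watch_seconds=$1 interval=$2 max_failures=$3
  local check="${HEALTH_CHECK_CMD:-curl --fail --silent --max-time 5 https://api.example.com/health}"
  local failures=0
  local end=$(( $(date +%s) + watch_seconds ))
  while [ "$(date +%s)" -lt "$end" ]; do
    if $check > /dev/null 2>&1; then
      failures=0                      # any success resets the streak
    else
      failures=$(( failures + 1 ))    # tolerate brief blips, not sustained failure
      if [ "$failures" -ge "$max_failures" ]; then
        echo "Health check failed ${failures}x; signalling rollback" >&2
        return 1
      fi
    fi
    sleep "$interval"
  done
  echo "Service healthy for ${watch_seconds}s after deploy"
}

# In CI, after the production deploy step:
# watch_health 300 15 3 || trigger_rollback
```

Requiring several consecutive failures before rolling back avoids reverting a good deploy over a single dropped request.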
Environment-specific configuration in code. Never hardcode staging or production URLs, credentials, or feature flags. Use environment variables or a configuration service. I once accidentally deployed staging database credentials to production because someone hardcoded them in a config file.
Frequently Asked Questions
GitHub Actions vs. GitLab CI vs. Jenkins — Which Should I Choose?
GitHub Actions if your code is on GitHub and you want the simplest setup with a vast marketplace of pre-built actions. GitLab CI if your code is on GitLab and you want a fully integrated DevOps platform with built-in container registry and Kubernetes integration. Jenkins if you need maximum customization and are willing to manage the infrastructure yourself. For most teams, GitHub Actions or GitLab CI is the right choice — Jenkins’ operational overhead is justified only when you have unique requirements that managed CI tools cannot handle.
How Fast Should a CI Pipeline Be?
Target under 10 minutes for the full pipeline. Developers stop waiting for pipelines longer than 10 minutes and start context-switching, which kills productivity. My current pipeline runs lint and unit tests in 90 seconds, integration tests in 4 minutes, and deploys to staging in 3 minutes — total about 8 minutes from push to staging.
Should I Run Tests in Parallel?
Yes, when possible. Unit tests typically parallelize well because they are isolated. Integration tests may need sequencing if they share a database. I split integration tests across multiple CI jobs using test sharding — each job runs a subset of tests in parallel, and the pipeline passes only when all shards succeed.
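In GitHub Actions, sharding can be sketched with a job matrix. This assumes a test runner that supports a `--shard` flag (Jest 28+ and Playwright do); the shard count is arbitrary:

```yaml
integration-tests:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false          # let every shard finish so all failures surface together
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with:
        node-version: '20'
        cache: 'npm'
    - run: npm ci
    - run: npm run test:integration -- --shard=${{ matrix.shard }}/4
```

Each shard gets its own service containers, which also removes the shared-database contention that forces sequencing in a single job.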
How Do I Handle Database Migrations in CI/CD?
Run migrations as a separate step before the application starts, never as part of application startup. In the pipeline: run migrations against the staging database as part of the staging deployment, verify they succeed, then run them against production before the production deployment. Always write reversible migrations and test the rollback path.
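As a sketch of that ordering in the production job — the script commands and secret name are assumptions, not part of the workflow above:

```yaml
- name: Run production migrations
  run: npm run db:migrate
  env:
    DATABASE_URL: ${{ secrets.PRODUCTION_DATABASE_URL }}
# This step only runs if the migration step above succeeded
- name: Deploy to production
  run: |
    echo "Deploying to production..."
    # Your production deployment command here
```

Because steps in a job run sequentially and a failed step aborts the job, a failed migration never leaves new application code talking to an old schema.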
What Is the Minimum CI/CD Pipeline for a New Project?
At minimum: lint, unit tests, and automated deployment to production. Even a two-stage pipeline (test → deploy) is infinitely better than manual deployments. You can add integration tests, security scanning, and staging environments as the project matures. The most important thing is to automate the deployment from day one — the longer you wait, the harder the manual process becomes to replace.
The Bottom Line
CI/CD pipeline automation is the highest-leverage investment you can make in your development workflow. The transition from manual Thursday deployments to automated multiple-daily deployments did not just eliminate weekend hotfixes — it fundamentally changed how our team approached software development. Smaller changes, deployed faster, with more confidence.
Start with the basic pipeline (lint, test, deploy) and add sophistication as your needs grow. Automate deployment on day one — even if it is just a shell script that SSHs into a server and pulls the latest code. Every deployment that runs without human intervention is a deployment that cannot suffer from human error.