Drift Detection using Terraform


Tools used:

  • Terraform

  • GitHub Actions

  • Slack webhook

Repo link: Link

Project Architecture

Open the image in a new tab to see it clearly.

Workflow

  1. We have a GitHub repo with a dev branch.

  2. The infrastructure team pushes code to the GitHub repo via a pull request.

  3. The workflow has two triggers: a manual trigger and a scheduled trigger defined by a cron expression.

  4. Next, the workflow checks out the code and determines the environment from the branch.

  5. Then we run Terraform for that environment.

  6. Drift detection: Terraform plan compares the code in GitHub with the deployed infrastructure to detect any drift.

  7. Based on the decision gateway, we apply the changes to the environment again.

    1. If drift is detected, the auto-fix is applied based on the logic above, followed by a Slack message about the update.

    2. The GitHub issue is closed.

    3. If no drift occurred, we provide a report.

We created two backend configuration files, one for the dev environment and one for prod.

dev branch:

bucket       = "day30-drift-detection-amals-dev"
key          = "dev/terraform.tfstate"
region       = "us-east-1"
use_lockfile = true
encrypt      = true

prod (main branch):

bucket       = "day30-drift-detection-amals-prod"
key          = "prod/terraform.tfstate"
region       = "us-east-1"
use_lockfile = true
encrypt      = true

Both buckets must already exist before the GitHub Actions workflow is triggered.

GitHub Actions Workflow

1. Trigger Strategy

The workflow is triggered by:

  • Pull Requests to the main or dev branches (runs the Plan job).

  • Pushes (merges) to the main or dev branches (runs both Plan and Apply jobs).
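As a rough sketch, the two triggers described above could be declared like this. The branch names come from this post; the actual terraform.yml in the repo may be laid out differently.

```yaml
# Sketch only: trigger block for the plan/apply pipeline.
on:
  pull_request:
    branches: [main, dev]   # PRs targeting main or dev run the Plan job
  push:
    branches: [main, dev]   # merges/pushes run both Plan and Apply
```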

2. Environment Management

It dynamically switches between environments based on the branch name:

main Branch: Deploys to the prod (Production) environment.

dev Branch: Deploys to the dev (Development) environment.

  • It uses environment-specific backend configurations: backend-prod.hcl and backend-dev.hcl.
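One way to implement the branch-to-environment mapping is a small shell step that writes ENVIRONMENT to the job environment. This is a minimal sketch, not the step from the repo: the step name and id mirror the drift workflow shown later, but the body is an assumption.

```yaml
# Sketch only: the body of this step is hypothetical.
- name: Determine Environment
  id: env-vars
  run: |
    # On pull requests github.base_ref is the target branch;
    # on pushes it is empty, so fall back to the pushed branch name.
    BRANCH="${{ github.base_ref || github.ref_name }}"
    if [ "$BRANCH" = "main" ]; then
      echo "ENVIRONMENT=prod" >> "$GITHUB_ENV"
    else
      echo "ENVIRONMENT=dev" >> "$GITHUB_ENV"
    fi
```

Later steps can then read the value as `${{ env.ENVIRONMENT }}`, for example to pick `backend-${{ env.ENVIRONMENT }}.hcl` during terraform init.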

3. Job Workflow

The pipeline is split into two main stages:

Stage A: Terraform Plan

  • Validation: Runs terraform fmt and terraform validate to ensure code quality.

  • Visibility: If triggered by a Pull Request, it automatically comments the Terraform Plan directly onto the PR. This allows team members to review infrastructure changes before they are merged.

  • Artifacts: It saves the execution plan (tfplan) as a GitHub artifact to ensure that the exact same plan is used in the Apply stage.

Stage B: Terraform Apply

  • Strict Condition: This job only runs on a push/merge to the main or dev branches. It will not run on pull requests.

  • Execution: It downloads the plan artifact from the first stage and runs terraform apply.

  • Summary: After completion, it posts a summary of the deployed infrastructure and Terraform outputs to the GitHub Actions "Summary" page.
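The two stages could be wired together roughly as follows. This is a sketch under assumptions: the job names `plan` and `apply` and the artifact name `tfplan` are illustrative, and the checkout, credentials, Terraform setup, and init steps are left out for brevity.

```yaml
# Sketch only: job and artifact names are hypothetical.
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      # checkout, AWS credentials, setup-terraform and terraform init omitted
      - name: Terraform Plan
        run: terraform plan -no-color -out=tfplan
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  apply:
    needs: plan                          # wait for the plan job
    if: github.event_name == 'push'      # never runs on pull requests
    runs-on: ubuntu-latest
    # prod can require a manual approver through its GitHub environment
    environment: ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
    steps:
      # checkout, AWS credentials, setup-terraform and terraform init omitted
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
      - name: Terraform Apply
        run: terraform apply -no-color tfplan   # applies exactly the saved plan
```

Applying a saved plan file means the Apply stage executes what was reviewed in the Plan stage rather than computing a fresh plan.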

4. Security & Configuration

  • AWS Integration: Uses aws-actions/configure-aws-credentials with secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).

  • Terraform Version: Hardcoded to 1.10.3 for consistency across the team.

  • Permissions: Specifically requests pull-requests: write and issues: write permissions to allow the bot to comment on PRs.
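The settings in this section might look roughly like the fragment below. Treat it as a sketch: the job name is illustrative, the region is taken from the backend files above, and the `contents: read` entry is an assumption added so that checkout still works once permissions are listed explicitly.

```yaml
# Sketch only: mirrors the settings listed above.
permissions:
  contents: read          # assumption: needed by actions/checkout
  pull-requests: write    # lets the workflow comment the plan on PRs
  issues: write

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.10.3   # pinned for consistency across the team
```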

Before moving forward, make sure you have:

  • Created the S3 buckets with the correct configuration.

  • Added the AWS secrets to GitHub.

  • Added the Slack webhook to GitHub. Create one at api.slack.com/apps.

  • Created dev and prod environments in GitHub, with a manual approver on prod.

  • Enabled "Read and write" workflow permissions in GitHub.

The workflows are defined in the repo's GitHub Actions files. Check out terraform.yml to see how the pipeline above is put together.

The drift detection workflow has the two triggers mentioned earlier: a cron schedule and a manual trigger.

on:
  schedule:
    - cron: "*/1 * * * *" # Every minute, for demo only (not for production); GitHub runs scheduled workflows at most once every 5 minutes
  workflow_dispatch: # Allows manual triggering

Here is an overview of the drift detection code; check out the drift_detection.yml file for the full logic.

jobs:
    # previous steps
    ...

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Determine Environment
        id: env-vars
        run: |
          ... # execution steps

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          ... # credentials setup

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.10.3

      - name: Terraform Init
        run: terraform init -reconfigure -backend-config="backend-${{ env.ENVIRONMENT }}.hcl"

      - name: Terraform Plan (Drift Detection)
        id: plan
        run: |
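          set +e  # do not abort the step when terraform returns a non-zero exit code
          terraform plan -detailed-exitcode -no-color > plan_output.txt 2>&1
          EXIT_CODE=$?  # 0 = no changes, 1 = error, 2 = changes present (drift)
          echo "exitcode=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          cat plan_output.txt
          exit 0  # always succeed here; later steps branch on the recorded exit code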

      - name: Analyze Drift
        if: steps.plan.outputs.exitcode == '2'
        uses: actions/github-script@v6
        env:
          PLAN_OUTPUT: plan_output.txt
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            .... # GitHub issue creation logic

      - name: Auto-Fix Drift
        if: steps.plan.outputs.exitcode == '2'
        id: apply
        run: |
          echo "Applying Terraform changes to fix drift..."
          terraform apply -auto-approve -no-color > apply_output.txt 2>&1
        continue-on-error: true

      - name: Notify Slack - Drift Detected & Fixed
        if: steps.plan.outputs.exitcode == '2' && steps.apply.outcome == 'success'
        run: |
          .... # Slack notification

      - name: Notify Slack - Auto-Fix Failed
        if: steps.plan.outputs.exitcode == '2' && steps.apply.outcome == 'failure'
        run: |
          .... # Slack notification

      - name: Update Issue on Success
        if: steps.plan.outputs.exitcode == '2' && steps.apply.outcome == 'success'
        uses: actions/github-script@v6
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            .... # Issue update logic

      - name: No Drift
        if: steps.plan.outputs.exitcode == '0'
        run: echo "No drift detected."

      - name: Terraform Plan Failure
        if: steps.plan.outputs.exitcode == '1'
        run: exit 1
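The Slack notification steps are elided above. A minimal sketch of one of them, assuming the webhook is stored in a secret named SLACK_WEBHOOK_URL (a hypothetical name) and that ENVIRONMENT was exported by the Determine Environment step, could post a plain text message to the incoming webhook:

```yaml
# Sketch only: the secret name and message text are hypothetical.
- name: Notify Slack - Drift Detected & Fixed
  if: steps.plan.outputs.exitcode == '2' && steps.apply.outcome == 'success'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    # Slack incoming webhooks accept a JSON body with a "text" field.
    curl -sS -X POST -H 'Content-type: application/json' \
      --data "{\"text\": \"Drift detected and auto-fixed in ${ENVIRONMENT}\"}" \
      "$SLACK_WEBHOOK_URL"
```

Passing the webhook through `env` keeps the secret out of the rendered command line in the workflow file.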

Testing the Drift

I changed the load balancer tag and pushed the code to the dev branch.

The change showed up in our workflow:

Applying Drift

I changed the ManagedBy tag on the application load balancer directly in AWS. The GitHub Actions workflow runs terraform plan on its cron schedule and compares the deployed infrastructure with the configuration. If they differ, the drift shows up in the run.

Before drift:

After detection:

The workflow reverted the ManagedBy tag to the value defined in GitHub.

However, the Slack message did not work as I expected. My guess is that overlapping runs interfered with each other: when two GitHub Actions runs execute at the same time, one can hold the state lock, so the other cannot modify the state. So I added a concurrency setting to our CI/CD workflow.

It worked.
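For reference, a concurrency setting of this kind can be added at the top level of the workflow file. This is a sketch under assumptions: the group name is illustrative and the exact setting in the repo may differ.

```yaml
# Sketch only: the group name is hypothetical.
concurrency:
  group: drift-detection-${{ github.ref }}   # one active run per branch
  cancel-in-progress: false                  # let the running job finish; queue the next one
```

With `cancel-in-progress: false`, a run that is in the middle of terraform apply is not interrupted, which avoids leaving a state lock behind.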

Now, I will merge the dev branch into the main branch.


Video Reference:


Arigato!