Advanced

Example 55: Custom Module - Hello Module

Custom modules extend Ansible’s functionality using Python. This simple module demonstrates the basic structure: argument spec definition, input validation, and result return with changed status.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Ansible Task"] --> B["Custom Module<br/>hello.py"]
    B --> C["AnsibleModule<br/>Parse Args"]
    C --> D["Module Logic<br/>Process Input"]
    D --> E["exit_json<br/>Return Results"]
    E --> F["Ansible Core<br/>Task Result"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#029E73,color:#fff
    style D fill:#CC78BC,color:#fff
    style E fill:#CA9161,color:#fff
    style F fill:#0173B2,color:#fff
# library/hello.py
#!/usr/bin/python
from ansible.module_utils.basic import AnsibleModule

def run_module():
    module_args = dict(
        name=dict(type='str', required=True)    # => Module argument definition
    )

    module = AnsibleModule(
        argument_spec=module_args,
        supports_check_mode=True
    )

    result = dict(
        changed=False,
        message=f"Hello, {module.params['name']}!"
    )

    module.exit_json(**result)                  # => Return results to Ansible

if __name__ == '__main__':
    run_module()
# => Usage: ansible localhost -m hello -a "name=World"
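
The ad-hoc call above is the quickest test; a minimal playbook sketch is shown below, assuming hello.py sits in a library/ directory next to the playbook (the greeting variable name is illustrative).

# hello_playbook.yml (illustrative usage)
---
- name: Use the custom hello module
  hosts: localhost
  gather_facts: no
  tasks:
    - name: Greet by name
      hello:
        name: World
      register: greeting                    # => Capture the module's return values

    - name: Show returned message
      debug:
        msg: "{{ greeting.message }}"       # => Prints "Hello, World!"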

Key Takeaway: Custom modules are Python scripts that use AnsibleModule for argument parsing and exit_json() for result return.

Why It Matters: Custom modules extend Ansible beyond built-in modules for organization-specific operations—proprietary API interactions, legacy system management, specialized compliance checks. Modules encapsulate complex logic into reusable, testable components that behave identically to core modules. This enables teams to build domain-specific automation libraries that integrate seamlessly with standard Ansible workflows. Module return values pass computed data to subsequent tasks and populate Ansible facts for playbook logic: custom modules that query external systems (API endpoints, databases, monitoring tools) return structured data for decision-making, and the ansible_facts return key injects discovered data into the facts namespace, enabling dynamic inventory enrichment and runtime adaptation.


Example 56: Custom Module with State Management

Production modules manage resources with state (present/absent). This pattern checks current state, calculates necessary changes, and reports accurate changed status for idempotency.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Module Execution"] --> B["Check Current State"]
    B --> C{State Matches<br/>Desired?}
    C -->|Yes| D["changed: False<br/>No Action"]
    C -->|No| E{Desired State?}
    E -->|present| F["Create Resource<br/>changed: True"]
    E -->|absent| G["Remove Resource<br/>changed: True"]
    F --> H["Return Result"]
    G --> H
    D --> H

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#029E73,color:#fff
    style D fill:#029E73,color:#fff
    style E fill:#CC78BC,color:#fff
    style F fill:#DE8F05,color:#fff
    style G fill:#DE8F05,color:#fff
    style H fill:#0173B2,color:#fff
# library/user_quota.py
#!/usr/bin/python
from ansible.module_utils.basic import AnsibleModule
import os

def main():
    module = AnsibleModule(
        argument_spec=dict(
            username=dict(required=True),
            quota_mb=dict(type='int', default=1000),
            state=dict(choices=['present', 'absent'], default='present')
        )
    )

    username = module.params['username']
    quota = module.params['quota_mb']
    state = module.params['state']

    quota_file = f"/etc/quotas/{username}"
    exists = os.path.exists(quota_file)

    changed = False

    if state == 'present' and not exists:
        # Create quota
        with open(quota_file, 'w') as f:
            f.write(str(quota))
        changed = True                           # => Resource created
        msg = f"Created quota {quota}MB for {username}"
    elif state == 'absent' and exists:
        os.remove(quota_file)
        changed = True                           # => Resource removed
        msg = f"Removed quota for {username}"
    else:
        msg = f"Quota already in desired state"  # => No change needed

    module.exit_json(changed=changed, msg=msg)

if __name__ == '__main__':
    main()
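
A short usage sketch for the module above; the storage_servers group and the alice user are assumptions. Running it twice should report changed only on the first run.

# quota_playbook.yml (illustrative usage)
---
- name: Manage user quotas with the custom module
  hosts: storage_servers
  become: yes
  tasks:
    - name: Ensure quota exists for alice
      user_quota:
        username: alice
        quota_mb: 2048
        state: present
      register: quota_result

    - name: Report change status
      debug:
        msg: "Changed: {{ quota_result.changed }}"   # => false on the second run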

Key Takeaway: Idempotent modules check current state before making changes and accurately report changed status.

Why It Matters: Idempotent state management is the contract between modules and Ansible—modules must accurately report changes to trigger handlers correctly. Production modules managing custom resources (application licenses, cloud resources, hardware configurations) must implement state checking to prevent redundant operations. Proper state management reduces playbook runtime by 60% through intelligent change detection.


Example 57: Ansible Collections - Using Collections

Collections bundle modules, plugins, and roles into distributable packages. Install from Ansible Galaxy and reference modules with FQCN (Fully Qualified Collection Name).

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["requirements.yml"] --> B["ansible-galaxy<br/>install"]
    B --> C["Collection<br/>community.general"]
    B --> D["Collection<br/>ansible.posix"]
    C --> E["Playbook<br/>FQCN Reference"]
    D --> E
    E --> F["Module Execution"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#029E73,color:#fff
    style D fill:#029E73,color:#fff
    style E fill:#CC78BC,color:#fff
    style F fill:#CA9161,color:#fff
# requirements.yml
---
collections:
  - name: community.general
    version: ">=8.0.0" # => Minimum version constraint
  - name: ansible.posix
    version: "9.0.0"
# => Install with: ansible-galaxy collection install -r requirements.yml
# use_collection.yml
---
- name: Using Collection Modules
  hosts: localhost
  tasks:
    - name: Archive files with community.general
      community.general.archive:
        path: /tmp/mydir
        dest: /tmp/archive.tar.gz
        format: gz
      # => Uses FQCN: namespace.collection.module

    - name: Mount filesystem with ansible.posix
      ansible.posix.mount:
        path: /mnt/data
        src: /dev/sdb1
        fstype: ext4
        state: mounted
      # => FQCN ensures no module name conflicts

Key Takeaway: Collections provide namespaced modules via FQCN (namespace.collection.module). Install via requirements.yml for reproducible environments.

Why It Matters: Collections organize related modules, plugins, and roles into distributable packages with independent versioning. Organizations publish internal collections to standardize automation across teams—network teams provide network device modules, security teams provide compliance modules. The collection namespace (namespace.collection.module) prevents naming conflicts and enables parallel development of domain-specific automation.
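
Internal collections can be pulled straight from Git in the same requirements.yml; a hedged sketch, where the repository URL and collection name are placeholders.

# requirements.yml (internal collection sketch)
---
collections:
  - name: community.general               # => Public collection from Galaxy
  - name: https://git.example.com/automation/acme.internal.git
    type: git                             # => Internal collection from a Git repo
    version: main                         # => Branch, tag, or commit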


Example 58: Testing with Molecule - Scenario

Molecule automates role testing across multiple platforms. It creates test instances, applies roles, runs verifiers, and cleans up. Essential for role development.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["molecule test"] --> B["Create<br/>Docker Instance"]
    B --> C["Converge<br/>Apply Role"]
    C --> D["Verify<br/>Run Tests"]
    D --> E{Tests Pass?}
    E -->|Yes| F["Destroy<br/>Cleanup"]
    E -->|No| G["Fail & Report"]
    F --> H["Success"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#029E73,color:#fff
    style D fill:#CC78BC,color:#fff
    style E fill:#DE8F05,color:#fff
    style F fill:#029E73,color:#fff
    style G fill:#CA9161,color:#fff
    style H fill:#029E73,color:#fff
# molecule/default/molecule.yml
---
driver:
  name: docker # => Use Docker for test instances
platforms:
  - name: ubuntu-test
    image: ubuntu:22.04
    pre_build_image: true
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml # => Playbook that applies role
verifier:
  name: ansible
  playbooks:
    verify: verify.yml # => Playbook that tests results
# => Run with: molecule test
# molecule/default/converge.yml
---
- name: Converge
  hosts: all
  roles:
    - role: my_role
      vars:
        app_port: 8080
# => Applies role to test instance
# molecule/default/verify.yml
---
- name: Verify
  hosts: all
  tasks:
    - name: Check service is running
      service:
        name: myapp
        state: started
      check_mode: yes
      register: result
      failed_when: result.changed # => Fail if service not running
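
Verification is not limited to service state; below is a sketch of additional verify.yml tasks, assuming the role exposes the application on port 8080 and installs a myapp systemd unit.

# molecule/default/verify.yml (additional checks, sketch)
    - name: Check application port is listening
      wait_for:
        port: 8080
        timeout: 5
      # => Fails verification if nothing listens on the port

    - name: Gather service facts
      service_facts:

    - name: Assert service is enabled
      assert:
        that:
          - ansible_facts.services['myapp.service'].status == 'enabled'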

Key Takeaway: Molecule provides full role testing lifecycle: create → converge → verify → destroy. Use for TDD (Test-Driven Development) of roles.

Why It Matters: Automated role testing prevents cryptic failures that halt automation. Molecule verifies role behavior across platforms and edge cases—fresh installs, repeated runs, missing variables—before the role ever touches production. Test-driven roles catch regressions at development time, reducing mean-time-to-recovery from hours (debugging opaque failures in production) to minutes (a failing verify playbook pointing at the root cause).


Example 59: Ansible-Lint Configuration

Ansible-lint enforces best practices and catches common errors. Configure via .ansible-lint for project-specific rules and skip patterns.

# .ansible-lint
---
profile: production # => Use production rule profile

skip_list:
  - yaml[line-length] # => Allow long lines
  - name[casing] # => Allow any task name casing

warn_list:
  - experimental # => Warn on experimental features

exclude_paths:
  - .cache/
  - test/fixtures/
  - molecule/
# => Run with: ansible-lint site.yml
# CI pipeline integration
ansible-lint playbooks/*.yml --force-color --format pep8 > lint-results.txt
# => Returns non-zero exit code on failures
# => Integration with CI/CD for automated quality checks
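
ansible-lint also ships a pre-commit hook, so issues are caught before they ever reach CI; a sketch of the hook configuration (the rev shown is an assumption—pin it to a release tag matching your ansible-lint version).

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/ansible/ansible-lint
    rev: v24.2.0 # => Pin to a released tag
    hooks:
      - id: ansible-lint
# => Run with: pre-commit run --all-files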

Key Takeaway: Ansible-lint automates best practice enforcement. Configure via .ansible-lint file. Integrate in CI/CD pipelines for quality gates.

Why It Matters: Ansible-lint prevents configuration errors before they reach production. Linting catches 80% of common mistakes (deprecated syntax, incorrect indentation, missing task names) during development. CI/CD integration enforces quality standards across teams, preventing playbooks with anti-patterns from merging into mainline branches.


Example 60: Performance - Fact Caching

Fact gathering is slow on large inventories. Enable fact caching to store facts between runs. Supports memory, file, Redis, and Memcached backends.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Playbook Run 1"] --> B{Facts Cached?}
    B -->|No| C["Gather Facts<br/>#40;Slow#41;"]
    C --> D["Cache Facts<br/>Redis/File"]
    B -->|Yes| E["Load from Cache<br/>#40;Fast#41;"]
    D --> F["Execute Tasks"]
    E --> F
    F --> G["Playbook Run 2"]
    G --> E

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#CA9161,color:#fff
    style D fill:#029E73,color:#fff
    style E fill:#029E73,color:#fff
    style F fill:#CC78BC,color:#fff
    style G fill:#0173B2,color:#fff
# ansible.cfg
[defaults]
gathering = smart                    # => Only gather if facts not cached
fact_caching = jsonfile              # => Use JSON file backend
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400         # => Cache for 24 hours
# playbook.yml
---
- name: Use Cached Facts
  hosts: all
  gather_facts: yes # => Will use cache if available
  tasks:
    - name: Print cached IP
      debug:
        msg: "IP: {{ ansible_default_ipv4.address }}"
      # => First run: gathers facts (slow)
      # => Subsequent runs: uses cache (fast)

Key Takeaway: Fact caching dramatically speeds up playbooks on large inventories. Configure in ansible.cfg with appropriate timeout.

Why It Matters: Fact caching eliminates redundant fact gathering on large inventories. Without caching, playbooks gather facts from 1000 hosts every run (5+ minutes). With caching, subsequent runs skip gathering (10 seconds), reducing deployment time by 98%. Redis-backed caching enables shared cache across multiple control nodes for team collaboration.
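
For a cache shared between several control nodes, the Redis backend is a drop-in replacement for jsonfile; a sketch assuming Redis listens on localhost:6379 and the redis Python package is installed on the control node.

# ansible.cfg (Redis-backed fact cache)
[defaults]
gathering = smart
fact_caching = redis                       # => Requires: pip install redis
fact_caching_connection = localhost:6379:0 # => host:port:db
fact_caching_timeout = 86400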


Example 61: Performance - Pipelining

Pipelining reduces SSH overhead by executing modules without creating temporary files on the target. It requires that requiretty be disabled in sudoers.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Without Pipelining"] --> B["SSH Connect"]
    B --> C["Create Temp File"]
    C --> D["Execute Module"]
    D --> E["Delete Temp File"]

    F["With Pipelining"] --> G["SSH Connect"]
    G --> H["Stream Module<br/>to stdin"]
    H --> I["Execute Directly"]

    style A fill:#CA9161,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#CA9161,color:#fff
    style D fill:#029E73,color:#fff
    style E fill:#CA9161,color:#fff
    style F fill:#0173B2,color:#fff
    style G fill:#DE8F05,color:#fff
    style H fill:#029E73,color:#fff
    style I fill:#029E73,color:#fff
# ansible.cfg
[defaults]
pipelining = True                    # => Enable SSH pipelining

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
# => Reuse SSH connections for 60 seconds
# playbook.yml
---
- name: Fast Execution with Pipelining
  hosts: webservers
  tasks:
    - name: Install 10 packages
      apt:
        name:
          - pkg1
          - pkg2
          # ... 10 packages
        state: present
      # => With pipelining: ~30% faster execution
      # => Without: creates temp file for each module

Key Takeaway: Pipelining reduces SSH overhead significantly. Enable in ansible.cfg. Requires sudoers without requiretty.

Why It Matters: SSH pipelining reduces module execution overhead by 30-40% by eliminating temporary file creation on targets. At scale (1000+ hosts), pipelining saves 10+ minutes per playbook run. ControlMaster connection sharing (ControlPersist=60s) reuses SSH connections, reducing handshake overhead from 100+ connections to 10-20 for large inventories.
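
If sudo enforces requiretty, pipelined tasks fail with "sorry, you must have a tty to run sudo"; a sketch that disables it for a dedicated automation user (the deploy username is an assumption).

# disable_requiretty.yml (illustrative)
---
- name: Allow pipelining for the automation user
  hosts: all
  become: yes
  tasks:
    - name: Drop requiretty for the deploy user
      copy:
        content: "Defaults:deploy !requiretty\n"
        dest: /etc/sudoers.d/99-deploy-pipelining
        mode: "0440"
        validate: /usr/sbin/visudo -cf %s
      # => Validate the drop-in before installing it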


Example 62: CI/CD - GitHub Actions Pipeline

Automate Ansible execution in CI/CD pipelines. This GitHub Actions workflow validates syntax, runs linting, executes playbooks, and tests idempotency.

# .github/workflows/ansible-ci.yml
name: Ansible CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install Ansible
        run: pip install ansible ansible-lint
        # => Install tools in CI environment

      - name: Syntax check
        run: ansible-playbook site.yml --syntax-check
        # => Validate YAML syntax

      - name: Lint playbooks
        run: ansible-lint site.yml
        # => Check best practices

      - name: Run playbook
        run: ansible-playbook site.yml -i inventory/ci
        # => Execute against CI inventory

      - name: Test idempotency
        run: |
          ansible-playbook site.yml -i inventory/ci | tee first-run.txt
          ansible-playbook site.yml -i inventory/ci | tee second-run.txt
          grep -q 'changed=0' second-run.txt
        # => Verify playbook is idempotent
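
The same job can fan out over several ansible-core releases with a build matrix to catch version-specific regressions; a sketch where the version list is an assumption.

# .github/workflows/ansible-matrix.yml (sketch)
name: Ansible Matrix
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ansible_version: ["2.15", "2.16", "2.17"]
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install pinned ansible-core
        run: pip install "ansible-core~=${{ matrix.ansible_version }}.0" ansible-lint
      - name: Lint and syntax check
        run: |
          ansible-lint site.yml
          ansible-playbook site.yml --syntax-check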

Key Takeaway: CI/CD pipelines automate validation, linting, execution, and idempotency testing. Essential for production Ansible workflows.

Why It Matters: CI/CD automation prevents human errors in deployment workflows. Automated syntax checks catch typos before production deployment. Idempotency testing detects playbooks that incorrectly report changes on every run (flapping playbooks). GitHub Actions integration enables pull request validation, preventing broken playbooks from merging into main branches.


Example 63: Production Pattern - Rolling Updates

Rolling updates deploy changes gradually to avoid downtime. Use serial to control batch size and max_fail_percentage to abort the run when too many hosts fail.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Start Rolling Update"] --> B["Batch 1: 2 Hosts"]
    B --> C["Remove from LB"]
    C --> D["Deploy & Test"]
    D --> E{Success?}
    E -->|Yes| F["Add to LB"]
    E -->|No| G["Abort & Rollback"]
    F --> H["Batch 2: 2 Hosts"]
    H --> I["Repeat Process"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#CC78BC,color:#fff
    style D fill:#029E73,color:#fff
    style E fill:#DE8F05,color:#fff
    style F fill:#029E73,color:#fff
    style G fill:#CA9161,color:#fff
    style H fill:#DE8F05,color:#fff
    style I fill:#029E73,color:#fff
# rolling_update.yml
---
- name: Rolling Update Web Servers
  hosts: webservers
  serial: 2 # => Update 2 hosts at a time
  max_fail_percentage: 25 # => Abort if >25% hosts fail

  pre_tasks:
    - name: Remove from load balancer
      uri:
        url: "http://lb.example.com/api/hosts/{{ inventory_hostname }}/disable"
        method: POST
      delegate_to: localhost
      # => Remove host from LB before update

  tasks:
    - name: Deploy new version
      copy:
        src: "app-{{ app_version }}.jar"
        dest: /opt/myapp/app.jar
      notify: Restart application

    - name: Wait for application health
      uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      register: app_health
      until: app_health.status == 200
      retries: 10
      delay: 3
      # => Retry until the app reports healthy before continuing

  post_tasks:
    - name: Add back to load balancer
      uri:
        url: "http://lb.example.com/api/hosts/{{ inventory_hostname }}/enable"
        method: POST
      delegate_to: localhost
      # => Re-add host to LB after successful update

  handlers:
    - name: Restart application
      service:
        name: myapp
        state: restarted

Key Takeaway: Rolling updates use serial for batch control and health checks between batches. Pre/post tasks manage load balancer integration.

Why It Matters: Rolling updates enable zero-downtime deployments for stateless services. The serial parameter controls blast radius—deploy to 2 hosts at a time, verify, then proceed. Load balancer integration (pre_tasks/post_tasks) ensures traffic never routes to updating hosts. Health checks between batches detect failures early, preventing bad deployments from affecting entire fleet.
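
serial also accepts a list, so batch size can ramp up as confidence grows; a sketch of the play header only (percentages are calculated against the total number of hosts in the play).

# rolling_update.yml (incremental batch sizes, header only)
---
- name: Rolling Update Web Servers
  hosts: webservers
  serial:
    - 1        # => First batch: a single host
    - "25%"    # => Then a quarter of the fleet per batch
    - "100%"   # => Then everything that remains
  max_fail_percentage: 25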


Example 64: Production Pattern - Canary Deployment

Canary deployments test new versions on a subset of servers before full rollout. Combine with monitoring to validate changes before proceeding.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["New Version"] --> B["Deploy to Canary<br/>#40;1 Server#41;"]
    B --> C["Monitor Metrics"]
    C --> D{Metrics OK?}
    D -->|Yes| E["Deploy to All<br/>#40;99 Servers#41;"]
    D -->|No| F["Rollback Canary"]
    E --> G["Complete"]
    F --> H["Fix Issues"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#CC78BC,color:#fff
    style D fill:#DE8F05,color:#fff
    style E fill:#029E73,color:#fff
    style F fill:#CA9161,color:#fff
    style G fill:#029E73,color:#fff
    style H fill:#CA9161,color:#fff
# canary_deploy.yml
---
- name: Canary Deployment
  hosts: webservers
  tasks:
    - name: Deploy to canary hosts
      copy:
        src: "app-{{ new_version }}.jar"
        dest: /opt/myapp/app.jar
      when: "'canary' in group_names"
      notify: Restart application
      # => Only deploy to canary group first

    - name: Wait for canary validation
      pause:
        prompt: "Check metrics. Press enter to continue or Ctrl-C to abort"
      when: "'canary' in group_names"
      # => Manual validation checkpoint

    - name: Deploy to production
      copy:
        src: "app-{{ new_version }}.jar"
        dest: /opt/myapp/app.jar
      when: "'production' in group_names"
      notify: Restart application
      # => Deploy to all production after canary success
# inventory.ini
[canary]
web1.example.com

[production]
web2.example.com
web3.example.com
web4.example.com

[webservers:children]
canary
production

Key Takeaway: Canary deployments reduce risk by testing on subset. Use inventory groups and conditionals to control deployment stages.

Why It Matters: Canary deployments minimize risk by testing new versions on 5-10% of fleet before full rollout. Monitoring integration enables data-driven decisions—proceed if error rates stay flat, rollback if metrics degrade. The pattern prevents widespread outages from bad deployments while maintaining fast release velocity.
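
The manual pause can be swapped for an automated gate that queries a metrics endpoint; a hedged sketch of a replacement task, where metrics_api and the 2% threshold are assumptions.

    - name: Automated canary gate
      uri:
        url: "{{ metrics_api }}/error_rate?group=canary"
        return_content: yes
      register: canary_metrics
      delegate_to: localhost
      run_once: true
      failed_when: canary_metrics.json.value | float > 2.0
      # => Abort the play if the canary error rate exceeds 2%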


Example 65: Production Pattern - Blue-Green Deployment

Blue-green deployments maintain two identical environments. Deploy to inactive environment, verify, then switch traffic. Enables instant rollback.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Blue: Active<br/>Green: Inactive"] --> B["Deploy to Green"]
    B --> C["Test Green"]
    C --> D{Tests Pass?}
    D -->|Yes| E["Switch LB to Green"]
    D -->|No| F["Keep Blue Active"]
    E --> G["Green: Active<br/>Blue: Inactive"]
    F --> H["Fix Green"]

    style A fill:#0173B2,color:#fff
    style B fill:#029E73,color:#fff
    style C fill:#CC78BC,color:#fff
    style D fill:#DE8F05,color:#fff
    style E fill:#029E73,color:#fff
    style F fill:#CA9161,color:#fff
    style G fill:#029E73,color:#fff
    style H fill:#CA9161,color:#fff
# blue_green.yml
---
- name: Blue-Green Deployment
  hosts: localhost
  vars:
    active_color: "{{ lookup('file', '/etc/active_color.txt') }}" # => Current: blue or green
    inactive_color: "{{ 'green' if active_color == 'blue' else 'blue' }}"

  tasks:
    - name: Deploy to inactive environment
      include_tasks: deploy.yml
      vars:
        target_hosts: "{{ inactive_color }}_webservers"
      # => Deploy to inactive (green if blue is active)

    - name: Run smoke tests
      uri:
        url: "http://{{ inactive_color }}-lb.example.com/health"
        status_code: 200
      # => Verify inactive environment is healthy

    - name: Switch load balancer
      uri:
        url: "http://lb.example.com/api/switch"
        method: POST
        body_format: json
        body:
          active: "{{ inactive_color }}"
      # => Switch traffic to newly deployed environment

    - name: Update active color file
      copy:
        content: "{{ inactive_color }}"
        dest: /etc/active_color.txt
      # => Record new active environment

Key Takeaway: Blue-green deployments enable zero-downtime releases and instant rollback by switching between two complete environments.

Why It Matters: Blue-green deployments provide instant rollback capability—switch traffic back to blue environment if green fails. The pattern eliminates deployment risk for stateless applications. Entire environment validation happens before traffic switch, catching integration failures that unit tests miss. Netflix and AWS use blue-green for zero-downtime releases at massive scale.


Example 66: Production Pattern - Immutable Infrastructure

Immutable infrastructure replaces servers rather than modifying them. Build new AMIs/images, launch new instances, then terminate old ones.

# immutable_deploy.yml
---
- name: Build Golden AMI
  hosts: packer_builder
  tasks:
    - name: Launch Packer build
      command: packer build -var 'version={{ app_version }}' ami-template.json
      register: packer_result

    - name: Extract AMI ID
      set_fact:
        new_ami: "{{ packer_result.stdout | regex_search('ami-[a-z0-9]+') }}"
      # => Captures: ami-0abc123def456

- name: Deploy New Auto Scaling Group
  hosts: localhost
  tasks:
    - name: Create launch configuration
      ec2_lc:
        name: "myapp-{{ app_version }}"
        image_id: "{{ new_ami }}"
        instance_type: t3.medium
        security_groups: ["sg-123456"]
      # => New launch config with new AMI

    - name: Update Auto Scaling Group
      ec2_asg:
        name: myapp-asg
        launch_config_name: "myapp-{{ app_version }}"
        min_size: 3
        max_size: 6
        desired_capacity: 3
      # => Triggers instance replacement

    - name: Wait for new instances healthy
      ec2_instance_info:              # => *_facts modules were renamed to *_info
        filters:
          "tag:Version": "{{ app_version }}"
          "instance-state-name": running
      register: instances
      until: instances.instances | length == 3
      retries: 20
      delay: 30

Key Takeaway: Immutable infrastructure builds new images and replaces instances entirely. Eliminates configuration drift and enables reliable rollbacks.

Why It Matters: Immutable infrastructure eliminates configuration drift—every deployment creates identical servers from golden images. Manual changes to servers are impossible (read-only root filesystems). Rollback becomes “deploy previous AMI” instead of “undo configuration changes.” This pattern underpins modern cloud-native architectures at Google, Facebook, and Spotify.


Example 67: Zero-Downtime Deployment Pattern

Combine health checks, load balancer management, and serial execution for truly zero-downtime deployments. Each server is updated while others handle traffic.

# zero_downtime.yml
---
- name: Zero-Downtime Deployment
  hosts: webservers
  serial: 1 # => One host at a time
  max_fail_percentage: 0 # => Abort on any failure

  tasks:
    - name: Pre-deployment health check
      uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      # => Ensure host healthy before starting

    - name: Disable host in load balancer
      haproxy:
        backend: web_backend
        host: "{{ inventory_hostname }}"
        state: disabled
        socket: /run/haproxy/admin.sock
      delegate_to: lb.example.com
      # => Remove from LB pool

    - name: Wait for connections to drain
      wait_for:
        timeout: 30
      # => Allow active requests to complete

    - name: Deploy application
      copy:
        src: "myapp-{{ version }}.jar"
        dest: /opt/myapp/app.jar
      notify: Restart application

    - name: Flush handlers now
      meta: flush_handlers
      # => Ensure restart happens before health check

    - name: Wait for application startup
      wait_for:
        port: 8080
        delay: 5
        timeout: 120
      # => Wait for app to bind port

    - name: Application health check
      uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      retries: 12
      delay: 5
      # => Verify app is healthy

    - name: Enable host in load balancer
      haproxy:
        backend: web_backend
        host: "{{ inventory_hostname }}"
        state: enabled
        socket: /run/haproxy/admin.sock
      delegate_to: lb.example.com
      # => Add back to LB pool

    - name: Wait for host to receive traffic
      pause:
        seconds: 10
      # => Allow LB health checks to pass

  handlers:
    - name: Restart application
      service:
        name: myapp
        state: restarted

Key Takeaway: Zero-downtime deployments require serial execution, LB integration, connection draining, and comprehensive health checks at each stage.

Why It Matters: Zero-downtime deployments require coordination of load balancers, health checks, and gradual rollout. Connection draining (30s wait) allows active requests to complete before server shutdown. Per-host health verification prevents deploying broken builds. This pattern enables Netflix to deploy thousands of times per day without user-visible outages.


Example 68: Monitoring Integration

Integrate Ansible with monitoring systems to track deployment progress and trigger alerts. Send notifications to Slack, DataDog, or PagerDuty during critical phases.

# monitored_deploy.yml
---
- name: Deployment with Monitoring
  hosts: webservers
  tasks:
    - name: Send deployment start notification
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Starting deployment of {{ app_version }} to {{ inventory_hostname }}"
      delegate_to: localhost
      # => Notify team in Slack

    - name: Create deployment marker in DataDog
      uri:
        url: "https://api.datadoghq.com/api/v1/events"
        method: POST
        headers:
          DD-API-KEY: "{{ datadog_api_key }}"
        body_format: json
        body:
          title: "Deployment Started"
          text: "{{ app_version }} deploying to {{ inventory_hostname }}"
          tags:
            - "environment:production"
            - "version:{{ app_version }}"
      delegate_to: localhost
      # => Creates event marker in DataDog dashboard

    - name: Deploy application
      copy:
        src: "app-{{ app_version }}.jar"
        dest: /opt/myapp/app.jar
      notify: Restart application

    - name: Check error rate post-deployment
      uri:
        url: "{{ metrics_api }}/error_rate?host={{ inventory_hostname }}"
        return_content: yes
      register: error_rate
      delegate_to: localhost
      # => Query metrics API

    - name: Trigger alert if error rate high
      uri:
        url: "{{ pagerduty_events_url }}"
        method: POST
        body_format: json
        body:
          routing_key: "{{ pagerduty_key }}"
          event_action: trigger
          payload:
            summary: "High error rate after deployment"
            severity: critical
      when: error_rate.json.value > 5.0
      delegate_to: localhost
      # => Create PagerDuty incident if errors spike

Key Takeaway: Monitor deployments by integrating with Slack, DataDog, PagerDuty. Send notifications at key phases and trigger alerts on anomalies.

Why It Matters: Monitoring integration provides deployment visibility and automated failure detection. Event markers in DataDog dashboards correlate metric changes with deployments. Slack notifications keep teams informed without manual status updates. Automated alerting on error rate spikes enables immediate rollback before user impact spreads.


Example 69: Disaster Recovery Pattern

Automate disaster recovery with playbooks that restore from backups, recreate infrastructure, and verify system integrity. Test DR playbooks regularly.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Disaster Occurs"] --> B["Provision New<br/>Infrastructure"]
    B --> C["Restore Database<br/>from Backup"]
    C --> D["Restore App Files"]
    D --> E["Verify Integrity"]
    E --> F{Data Valid?}
    F -->|Yes| G["Update DNS<br/>to DR Site"]
    F -->|No| H["Alert & Investigate"]
    G --> I["DR Complete"]

    style A fill:#CA9161,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#CC78BC,color:#fff
    style D fill:#CC78BC,color:#fff
    style E fill:#029E73,color:#fff
    style F fill:#DE8F05,color:#fff
    style G fill:#029E73,color:#fff
    style H fill:#CA9161,color:#fff
    style I fill:#029E73,color:#fff
# disaster_recovery.yml
---
- name: Disaster Recovery Procedure
  hosts: localhost
  vars:
    backup_date: "{{ lookup('pipe', 'date +%Y-%m-%d') }}"

  tasks:
    - name: Provision new infrastructure
      include_role:
        name: provision_infrastructure
      vars:
        environment: dr_recovery
      # => Recreate VMs/cloud resources

    - name: Download database backup from S3
      aws_s3:
        bucket: backups
        object: "db-{{ backup_date }}.dump"
        dest: "/tmp/db-{{ backup_date }}.dump"
        mode: get
      # => postgresql_db needs a local dump file, not an S3 URL

    - name: Restore database from backup
      postgresql_db:
        name: myapp
        state: restore
        target: "/tmp/db-{{ backup_date }}.dump"
      # => Restore DB from the staged backup

    - name: Restore application files
      aws_s3:
        bucket: backups
        object: "app-{{ backup_date }}.tar.gz"
        dest: /tmp/app-restore.tar.gz
        mode: get
      # => Download app backup

    - name: Extract application
      unarchive:
        src: /tmp/app-restore.tar.gz
        dest: /opt/myapp
        remote_src: yes
      # => Restore application code

    - name: Verify data integrity
      command: /opt/myapp/bin/verify-data.sh
      register: integrity_check
      failed_when: "'PASS' not in integrity_check.stdout"
      # => Validate restored data

    - name: Update DNS to DR site
      route53:
        state: present
        zone: example.com
        record: app.example.com
        type: A
        value: "{{ dr_lb_ip }}"
        ttl: 60
      # => Point DNS to DR environment

    - name: Send recovery notification
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "DR completed. Services running at DR site."
      # => Notify team of DR completion

Key Takeaway: DR playbooks automate infrastructure recreation, data restoration, and traffic cutover. Test regularly to ensure RTO/RPO targets.

Why It Matters: Automated disaster recovery reduces RTO (recovery time objective) from hours to minutes. Playbook-driven DR eliminates manual runbooks that become outdated or error-prone. Regular DR testing (monthly or quarterly) validates procedures work before real disasters occur. This automation enables compliance with business continuity requirements.


Example 70: Configuration Drift Detection

Detect configuration drift by comparing desired state (playbooks) against actual state (target hosts). Run in check mode and alert on differences.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Playbook<br/>#40;Desired State#41;"] --> B["Run in<br/>--check Mode"]
    C["Target Hosts<br/>#40;Actual State#41;"] --> B
    B --> D{State Matches?}
    D -->|Yes| E["No Drift<br/>Report: OK"]
    D -->|No| F["Drift Detected"]
    F --> G["Generate Report"]
    G --> H["Alert Ops Team"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#CC78BC,color:#fff
    style D fill:#DE8F05,color:#fff
    style E fill:#029E73,color:#fff
    style F fill:#CA9161,color:#fff
    style G fill:#CC78BC,color:#fff
    style H fill:#CA9161,color:#fff
# drift_detection.yml
---
- name: Detect Configuration Drift
  hosts: production
  check_mode: yes # => Don't make changes, only check
  diff: yes # => Show differences
  tasks:
    - name: Check nginx configuration
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      register: nginx_drift
      # => In check mode: reports if file would change

    - name: Check service state
      service:
        name: nginx
        state: started
        enabled: yes
      register: service_drift

    - name: Check package versions
      package:
        name:
          - nginx=1.18*
          - postgresql=14*
        state: present
      register: package_drift

    - name: Collect drift report
      set_fact:
        drift_detected: >-
          {{
            nginx_drift.changed or
            service_drift.changed or
            package_drift.changed
          }}

    - name: Alert on drift
      uri:
        url: "{{ alerting_webhook }}"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          drift: "{{ drift_detected }}"
          details:
            nginx: "{{ nginx_drift.changed }}"
            service: "{{ service_drift.changed }}"
            packages: "{{ package_drift.changed }}"
      when: drift_detected
      delegate_to: localhost
      # => Send alert if any drift detected

Key Takeaway: Run playbooks in check mode to detect drift without changing systems. Schedule drift detection jobs to catch manual changes.

Why It Matters: Drift detection catches manual server changes (“snowflake servers”) that break automation. Check mode + scheduled runs (cron every 6 hours) provide continuous compliance validation. Alert-based drift detection enables rapid response to unauthorized changes or failed automation. This pattern prevents production incidents from untracked configuration changes.
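
Scheduling the drift check is itself a one-task playbook; a sketch using the cron module on the control node (paths are assumptions).

# schedule_drift_check.yml (illustrative)
---
- name: Schedule drift detection on the control node
  hosts: localhost
  tasks:
    - name: Run drift detection every 6 hours
      cron:
        name: "ansible drift detection"
        minute: "0"
        hour: "*/6"
        job: "cd /opt/ansible && ansible-playbook drift_detection.yml >> /var/log/drift.log 2>&1"
      # => Creates the cron entry idempotently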


Example 71: Multi-Stage Deployment Pipeline

Orchestrate multi-stage deployments (dev → staging → production) with approval gates and environment-specific configurations.

# pipeline_deploy.yml
---
- name: Deploy to Development
  hosts: dev_webservers
  vars_files:
    - vars/dev.yml
  tasks:
    - include_tasks: deploy_tasks.yml

- name: Run Integration Tests
  hosts: dev_webservers
  tasks:
    - name: Execute test suite
      command: /opt/tests/run-integration-tests.sh
      register: tests
      failed_when: tests.rc != 0
      # => Fail pipeline if tests fail

- name: Deploy to Staging
  hosts: staging_webservers
  vars_files:
    - vars/staging.yml
  tasks:
    - include_tasks: deploy_tasks.yml

- name: Staging Smoke Tests
  hosts: staging_webservers
  tasks:
    - name: Check critical endpoints
      uri:
        url: "http://{{ inventory_hostname }}/{{ item }}"
        status_code: 200
      loop:
        - health
        - api/users
        - api/orders
      # => Verify staging is functional

- name: Production Approval Gate
  hosts: localhost
  tasks:
    - name: Wait for approval
      pause:
        prompt: "Approve production deployment? (Enter to continue)"
      # => Manual approval before production

- name: Deploy to Production
  hosts: prod_webservers
  serial: 3 # => Rolling update
  vars_files:
    - vars/production.yml
  tasks:
    - include_tasks: deploy_tasks.yml

Key Takeaway: Multi-stage pipelines use separate plays for each environment with tests and approval gates between stages.

Why It Matters: Multi-stage pipelines enforce quality gates between environments. Integration tests run in dev before code reaches staging. Manual approval before production prevents untested changes from affecting users. Environment-specific configurations (dev vs staging vs prod) ensure consistent deployment processes while maintaining environment isolation.
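
The shared deploy_tasks.yml referenced by each play is not shown above; a hypothetical minimal version follows, assuming app_version and app_port come from the per-environment vars files.

# deploy_tasks.yml (hypothetical shared task file)
- name: Deploy application artifact
  copy:
    src: "app-{{ app_version }}.jar"
    dest: /opt/myapp/app.jar

- name: Restart application
  service:
    name: myapp
    state: restarted

- name: Wait for application port
  wait_for:
    port: "{{ app_port | default(8080) }}"
    timeout: 120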


Example 72: Secrets Management with HashiCorp Vault

Integrate Ansible with HashiCorp Vault for dynamic secrets. Fetch credentials at runtime instead of storing in Ansible Vault or vars files.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Ansible Task"] --> B["Request Creds<br/>from Vault API"]
    B --> C["Vault Server"]
    C --> D["Generate Dynamic<br/>DB Credentials"]
    D --> E["Return Creds<br/>#40;1h Lease#41;"]
    E --> F["Use in Task"]
    F --> G["Revoke Lease<br/>on Completion"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#029E73,color:#fff
    style D fill:#CC78BC,color:#fff
    style E fill:#DE8F05,color:#fff
    style F fill:#029E73,color:#fff
    style G fill:#CA9161,color:#fff
# vault_integration.yml
---
- name: Dynamic Secrets from Vault
  hosts: webservers
  vars:
    vault_addr: "https://vault.example.com:8200"
  tasks:
    - name: Get database credentials from Vault
      uri:
        url: "{{ vault_addr }}/v1/database/creds/myapp"
        method: GET
        headers:
          X-Vault-Token: "{{ lookup('env', 'VAULT_TOKEN') }}"
        return_content: yes
      register: db_creds
      delegate_to: localhost
      no_log: true # => Don't log credentials
      # => Fetches dynamic DB credentials

    - name: Configure application with Vault credentials
      block:
        - name: Render app config with Vault credentials
          template:
            src: app-config.j2
            dest: /opt/myapp/config.yml
            mode: "0600"
          vars:
            db_username: "{{ db_creds.json.data.username }}"
            db_password: "{{ db_creds.json.data.password }}"
          no_log: true
          # => Credentials never stored in playbooks
      rescue:
        - name: Revoke credentials on failure
          uri:
            url: "{{ vault_addr }}/v1/sys/leases/revoke"
            method: PUT
            headers:
              X-Vault-Token: "{{ lookup('env', 'VAULT_TOKEN') }}"
            body_format: json
            body:
              lease_id: "{{ db_creds.json.lease_id }}"
          delegate_to: localhost
          # => Clean up the lease if configuration fails

Key Takeaway: HashiCorp Vault integration provides dynamic secrets that auto-expire. Use no_log to prevent credential exposure in logs.

Why It Matters: HashiCorp Vault provides dynamic secrets with automatic expiration and rotation. Database credentials valid for 1 hour reduce blast radius of credential compromise. Lease revocation on playbook failure prevents orphaned credentials. Vault audit logs track who accessed which secrets, enabling compliance with SOC 2 and PCI DSS requirements.
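
The community.hashi_vault collection wraps the same API in a lookup plugin, which keeps playbooks shorter for simple reads; a hedged sketch assuming a KV secret at secret/data/myapp and VAULT_ADDR/VAULT_TOKEN set in the environment.

    - name: Read a static secret via the hashi_vault lookup
      set_fact:
        api_key: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=secret/data/myapp:api_key') }}"
      no_log: true
      # => Requires: ansible-galaxy collection install community.hashi_vault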


Example 73: Compliance Auditing

Automate compliance checks (CIS benchmarks, STIG) and generate audit reports. Compare actual configuration against security baselines.

# compliance_audit.yml
---
- name: CIS Ubuntu 22.04 Compliance Audit
  hosts: all
  become: yes
  tasks:
    - name: Check SSH configuration
      block:
        - name: Verify PermitRootLogin is disabled
          lineinfile:
            path: /etc/ssh/sshd_config
            regexp: "^PermitRootLogin"
            line: "PermitRootLogin no"
          check_mode: yes
          register: ssh_root
          # => Check without changing

        - name: Record compliance status
          set_fact:
            compliance_ssh_root: "{{ not ssh_root.changed }}"

    - name: Check firewall status
      command: ufw status
      register: firewall
      changed_when: false
      failed_when: "'Status: active' not in firewall.stdout"

    - name: Check password policy
      command: grep -E '^PASS_MAX_DAYS' /etc/login.defs
      register: pass_policy
      changed_when: false
      failed_when: pass_policy.stdout.split()[1] | int > 90

    - name: Generate compliance report
      template:
        src: compliance-report.j2
        dest: "/var/log/compliance-{{ ansible_date_time.date }}.json"
      vars:
        checks:
          ssh_root_disabled: "{{ compliance_ssh_root }}"
          firewall_active: "{{ 'active' in firewall.stdout }}"
          password_max_days: "{{ pass_policy.stdout.split()[1] }}"
      delegate_to: localhost
      # => JSON report for SIEM ingestion

Key Takeaway: Compliance audits use check mode and assertions to verify security baselines. Generate structured reports for audit trails.

Why It Matters: Automated compliance auditing provides continuous security validation. CIS benchmarks and STIG checks run hourly, detecting misconfigurations immediately. JSON-formatted audit reports integrate with SIEM systems for centralized compliance monitoring. This automation reduces compliance audit preparation from weeks to hours.
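
The compliance-report.j2 template referenced above is not shown; a hypothetical version that renders the checks dictionary as JSON for SIEM ingestion.

# templates/compliance-report.j2 (hypothetical)
{
  "host": "{{ inventory_hostname }}",
  "date": "{{ ansible_date_time.date }}",
  "checks": {
{% for name, value in checks.items() %}
    "{{ name }}": {{ value | to_json }}{{ "," if not loop.last else "" }}
{% endfor %}
  }
}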


Example 74: Network Automation - VLAN Configuration

Automate network device configuration using vendor-specific modules. This example configures VLANs on Cisco switches.

# network_vlans.yml
---
- name: Configure VLANs on Cisco Switches
  hosts: cisco_switches
  gather_facts: no
  tasks:
    - name: Create VLANs
      cisco.ios.ios_vlans:
        config:
          - vlan_id: 10
            name: ENGINEERING
            state: active
          - vlan_id: 20
            name: SALES
            state: active
          - vlan_id: 30
            name: GUEST
            state: active
        state: merged
      # => Creates VLANs if missing, updates if exist

    - name: Configure trunk port
      cisco.ios.ios_l2_interfaces:
        config:
          - name: GigabitEthernet0/1
            mode: trunk
            trunk:
              allowed_vlans: 10,20,30
        state: replaced
      # => Configures port as trunk with allowed VLANs

    - name: Save configuration
      cisco.ios.ios_config:
        save_when: modified
      # => Writes config to startup-config

Key Takeaway: Network modules provide declarative interface to network devices. Use vendor collections (cisco.ios, arista.eos) for device-specific operations.

Why It Matters: Network automation standardizes switch and router configuration across thousands of devices. Ansible modules provide vendor-agnostic abstraction—same playbook pattern works for Cisco, Arista, Juniper with different collections. VLAN provisioning automation reduces network changes from 30 minutes (manual CLI) to 2 minutes (Ansible), eliminating human configuration errors.


Example 75: Container Orchestration - Docker Deployment

Manage Docker containers with Ansible. Deploy multi-container applications with proper networking and volume configuration.

# docker_deploy.yml
---
- name: Deploy Docker Application
  hosts: docker_hosts
  tasks:
    - name: Create application network
      docker_network:
        name: myapp_network
        driver: bridge
      # => Creates isolated network for containers

    - name: Deploy PostgreSQL container
      docker_container:
        name: postgres
        image: postgres:15
        state: started
        restart_policy: always
        networks:
          - name: myapp_network
        env:
          POSTGRES_DB: myapp
          POSTGRES_PASSWORD: "{{ db_password }}"
        volumes:
          - postgres_data:/var/lib/postgresql/data
      # => Database container with persistent volume

    - name: Deploy application container
      docker_container:
        name: myapp
        image: "myapp:{{ version }}"
        state: started
        restart_policy: always
        networks:
          - name: myapp_network
        env:
          DB_HOST: postgres
          DB_NAME: myapp
        ports:
          - "8080:8080"
      # => App container linked to database

    - name: Wait for application health
      uri:
        url: "http://{{ inventory_hostname }}:8080/health"
        status_code: 200
      register: app_health
      until: app_health.status == 200
      retries: 10
      delay: 3

Key Takeaway: Docker modules manage containers declaratively. Use networks for container communication and volumes for data persistence.

Why It Matters: Docker automation manages containerized applications declaratively. Volume mounts persist data across container recreation. Network isolation prevents direct container communication, forcing explicit service dependencies. This pattern enables microservices deployment where each service runs in isolated containers with defined networking contracts.


Example 76: Kubernetes Deployment

Deploy applications to Kubernetes using Ansible. Apply manifests, wait for rollout completion, and verify pod health.

# k8s_deploy.yml
---
- name: Deploy to Kubernetes
  hosts: localhost
  tasks:
    - name: Create namespace
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: v1
          kind: Namespace
          metadata:
            name: myapp
      # => Creates namespace if missing

    - name: Deploy application
      kubernetes.core.k8s:
        state: present
        namespace: myapp
        definition: "{{ lookup('file', 'k8s/deployment.yml') }}"
      # => Applies deployment manifest

    - name: Wait for deployment rollout
      kubernetes.core.k8s_info:
        kind: Deployment
        namespace: myapp
        name: myapp
      register: deployment
      until: (deployment.resources[0].status.readyReplicas | default(0)) | int == 3
      retries: 20
      delay: 10
      # => Waits for all replicas ready

    - name: Expose service
      kubernetes.core.k8s:
        state: present
        namespace: myapp
        definition:
          apiVersion: v1
          kind: Service
          metadata:
            name: myapp
          spec:
            type: LoadBalancer
            selector:
              app: myapp
            ports:
              - port: 80
                targetPort: 8080
      # => Creates LoadBalancer service

Key Takeaway: Kubernetes modules enable GitOps workflows. Use k8s_info to wait for resources to reach desired state before proceeding.

Why It Matters: Kubernetes automation enables GitOps—infrastructure as code stored in Git, automatically deployed via CI/CD. Ansible waits for pod readiness before proceeding, ensuring deployments complete successfully. The k8s module provides full Kubernetes API access, enabling complex orchestration like blue-green deployments and canary releases on Kubernetes.
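
The polling loop with k8s_info can usually be replaced by the module's built-in wait support; a sketch assuming the same deployment manifest as above.

    - name: Deploy and wait for rollout in one task
      kubernetes.core.k8s:
        state: present
        namespace: myapp
        definition: "{{ lookup('file', 'k8s/deployment.yml') }}"
        wait: yes
        wait_timeout: 300
        wait_condition:
          type: Available
          status: "True"
      # => Returns only once the Deployment reports Available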


Example 77: Database Migration Automation

Automate database schema migrations as part of deployment pipelines. Run migrations, verify success, and rollback on failure.

# db_migration.yml
---
- name: Database Migration
  hosts: db_servers
  tasks:
    - name: Backup database before migration
      postgresql_db:
        name: myapp
        state: dump
        target: "/backups/pre-migration-{{ ansible_date_time.epoch }}.sql"
      # => Safety backup before schema changes

    - name: Apply migrations with automatic rollback
      block:
        - name: Run database migrations
          command: /opt/myapp/bin/migrate up
          register: migration
          # => Execute migration scripts (non-zero exit fails the task)

        - name: Verify migration success
          postgresql_query:
            db: myapp
            query: "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1"
          register: current_version
          # => Check current schema version
      rescue:
        - name: Restore from backup
          postgresql_db:
            name: myapp
            state: restore
            target: "/backups/pre-migration-{{ ansible_date_time.epoch }}.sql"
          register: restore_result
          ignore_errors: yes
          # => Roll back to the pre-migration snapshot

        - name: Alert on rollback failure
          uri:
            url: "{{ pagerduty_url }}"
            method: POST
            body_format: json
            body:
              message: "CRITICAL: Migration rollback failed"
          delegate_to: localhost
          when: restore_result is failed

Key Takeaway: Automate migrations with pre-migration backups and rollback procedures. Use blocks for error handling and recovery.

Why It Matters: Database migrations are high-risk operations that require careful orchestration. Automated pre-migration backups enable instant rollback on failure. Schema versioning tracking (via migrations table) prevents duplicate or out-of-order migrations. This automation reduces database deployment risks from manual SQL execution errors.


Example 78: Self-Healing Infrastructure

Implement self-healing by detecting failures and automatically remediating. Monitor service health and restart failed services.

  %% Color Palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161
graph TD
    A["Scheduled Playbook<br/>#40;Every 15min#41;"] --> B["Check Service<br/>Health"]
    B --> C{Service Running?}
    C -->|Yes| D["No Action"]
    C -->|No| E["Restart Service"]
    E --> F{Restart Success?}
    F -->|Yes| G["Log Recovery"]
    F -->|No| H["Alert Ops Team"]

    style A fill:#0173B2,color:#fff
    style B fill:#DE8F05,color:#fff
    style C fill:#DE8F05,color:#fff
    style D fill:#029E73,color:#fff
    style E fill:#CC78BC,color:#fff
    style F fill:#DE8F05,color:#fff
    style G fill:#029E73,color:#fff
    style H fill:#CA9161,color:#fff
# self_healing.yml
---
- name: Self-Healing Monitor
  hosts: all
  tasks:
    - name: Check critical services
      service_facts:
      # => Gathers service status facts

    - name: Restart failed nginx
      service:
        name: nginx
        state: restarted
      when: ansible_facts.services['nginx.service'].state != 'running'
      # => Auto-restart if stopped

    - name: Check disk space
      shell: df -h / | tail -1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage
      changed_when: false
      # => Get root filesystem usage percentage

    - name: Clean logs if disk full
      file:
        path: /var/log/old-logs
        state: absent
      when: disk_usage.stdout | int > 85
      # => Remove old logs if >85% full

    - name: Verify database connectivity
      postgresql_ping:
        db: myapp
      register: db_ping
      ignore_errors: yes
      # => Test DB connection

    - name: Restart database on failure
      service:
        name: postgresql
        state: restarted
      when: not (db_ping.is_available | default(false))
      # => postgresql_ping reports availability instead of failing

    - name: Re-check database after restart
      postgresql_ping:
        db: myapp
      register: db_ping_after
      when: not (db_ping.is_available | default(false))
      # => Confirm the restart actually restored connectivity

    - name: Alert if remediation fails
      uri:
        url: "{{ alerting_webhook }}"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          issue: "Self-healing failed"
      when:
        - not (db_ping.is_available | default(false))
        - not (db_ping_after.is_available | default(false))
      delegate_to: localhost

Key Takeaway: Self-healing playbooks run periodically (cron/systemd timers) to detect and remediate common failures automatically.

Why It Matters: Self-healing automation reduces mean-time-to-recovery (MTTR) from hours to minutes. Automated service restart handles 90% of common failures (OOM crashes, deadlocks) without human intervention. Disk cleanup prevents storage exhaustion incidents. Scheduled self-healing playbooks (every 15 minutes) provide continuous resilience, essential for maintaining SLAs in 24/7 operations.


Example 79: Infrastructure Cost Optimization

Automate cost optimization by identifying and remediating wasteful resource usage (unused volumes, stopped instances, oversized VMs).

# cost_optimization.yml
---
- name: Identify Unused Resources
  hosts: localhost
  tasks:
    - name: Find unattached EBS volumes
      ec2_vol_info:
        region: us-east-1
        filters:
          status: available # => Unattached volumes
      register: unused_volumes
      # => Lists orphaned volumes

    - name: Delete old unattached volumes
      ec2_vol:
        id: "{{ item.id }}"
        state: absent
      loop: "{{ unused_volumes.volumes }}"
      when: >-
        ((ansible_date_time.iso8601[:19] | to_datetime('%Y-%m-%dT%H:%M:%S'))
        - (item.create_time[:19] | to_datetime('%Y-%m-%dT%H:%M:%S'))).days > 30
      # => Delete volumes older than 30 days

    - name: Find stopped instances running >7 days
      ec2_instance_info:
        region: us-east-1
        filters:
          instance-state-name: stopped
      register: stopped_instances

    - name: Terminate long-stopped instances
      ec2_instance:
        instance_ids: "{{ item.instance_id }}"
        state: absent
      loop: "{{ stopped_instances.instances }}"
      when: >-
        ((ansible_date_time.iso8601[:19] | to_datetime('%Y-%m-%dT%H:%M:%S'))
        - (item.launch_time[:19] | to_datetime('%Y-%m-%dT%H:%M:%S'))).days > 7
      # => Terminate instances stopped for more than 7 days

    - name: Generate cost report
      template:
        src: cost-report.j2
        dest: "/reports/cost-optimization-{{ ansible_date_time.date }}.html"
      vars:
        deleted_volumes: "{{ unused_volumes.volumes | length }}"
        terminated_instances: "{{ stopped_instances.instances | length }}"

Key Takeaway: Automate cost optimization by periodically identifying and removing unused cloud resources.

Why It Matters: Unused cloud resources accumulate cost silently—unattached volumes, long-stopped instances, and oversized VMs linger long after projects end. Scheduled cleanup playbooks reclaim that spend automatically and produce an auditable report of what was removed. Age thresholds (30 days for orphaned volumes, 7 days for stopped instances) keep the cleanup conservative while still aligning the cloud bill with actual usage.


Example 80: Chaos Engineering with Ansible

Implement chaos engineering experiments to test system resilience. Inject failures and verify recovery mechanisms.

# chaos_experiment.yml
---
- name: Chaos Engineering - Random Service Failure
  hosts: production
  serial: 1
  tasks:
    - name: Select random service to disrupt
      set_fact:
        chaos_target: "{{ ['nginx', 'myapp', 'postgres'] | random }}"
      # => Pick random service

    - name: Record experiment start
      uri:
        url: "{{ metrics_api }}/chaos/start"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          service: "{{ chaos_target }}"
      delegate_to: localhost

    - name: Stop service
      service:
        name: "{{ chaos_target }}"
        state: stopped
      # => Inject failure

    - name: Wait for monitoring to detect failure
      pause:
        seconds: 30
      # => Give monitoring time to alert

    - name: Verify alerting fired
      uri:
        url: "{{ alerting_api }}/check"
        method: GET
      register: alerts
      failed_when: chaos_target not in alerts.json.active_alerts
      delegate_to: localhost
      # => Ensure monitoring detected failure

    - name: Allow self-healing to trigger
      pause:
        seconds: 60
      # => Wait for auto-remediation

    - name: Verify service recovered
      service_facts:
      failed_when: ansible_facts.services[chaos_target + '.service'].state != 'running'
      # => Ensure auto-remediation worked

    - name: Record experiment completion
      uri:
        url: "{{ metrics_api }}/chaos/complete"
        method: POST
        body_format: json
        body:
          host: "{{ inventory_hostname }}"
          service: "{{ chaos_target }}"
          outcome: "{{ 'success' if ansible_failed_result is not defined else 'failure' }}"
      delegate_to: localhost

Key Takeaway: Chaos engineering validates monitoring and auto-remediation. Run experiments in controlled manner to test system resilience.

Why It Matters: Chaos engineering validates resilience before real failures occur. Automated failure injection (random service stops) tests monitoring, alerting, and self-healing systems under controlled conditions. Experiments verify SLAs hold during partial failures, building confidence in production resilience. Netflix pioneered this practice (Chaos Monkey) to ensure their systems survive datacenter failures.


🎯 Advanced level complete! You’ve mastered custom modules, collections, testing frameworks, performance optimization, production deployment patterns, and operational automation. You now have comprehensive Ansible knowledge from beginner fundamentals through advanced production patterns, covering 95% of real-world use cases.
