TCP #27: How To Troubleshoot High CPU Utilization on Amazon Aurora Postgres Database?
This simple runbook will save you hours of effort.
You can also read my newsletters from the Substack mobile app and be notified when a new issue is available.
Recently, my team was investigating a spike in CPU utilization on the Amazon Aurora Postgres database.
It took one of my team members 2 business days to perform a thorough root cause analysis.
This got me thinking.
Next time, if such an issue happens, another team member will have to spend roughly the same or more time investigating this.
Why not create something that provides a step-by-step guide on investigating such issues?
Enter Runbooks.
In today’s newsletter issue, I will explain:
What is a Runbook
Benefits of having runbooks for DevOps and SRE teams
Types of Runbooks
Runbook example for troubleshooting a real-world issue
Let’s dive in.
What is a Runbook?
A runbook is a detailed, step-by-step guide that outlines procedures for handling specific operational tasks, troubleshooting, or resolving incidents in IT systems.
It is primarily used by DevOps, Site Reliability Engineers (SREs), system administrators, and other IT professionals to ensure consistent and efficient operations.
Why Runbooks are Important?
A runbook is essential for several reasons, especially in Site Reliability Engineering (SRE) and DevOps teams:
1. Consistency in Troubleshooting
A runbook ensures that all engineers follow the same steps when investigating an issue. This consistency helps maintain operational stability and avoids ad hoc or ineffective troubleshooting.
2. Reduction in Response Time
Time is critical during incidents. A well-structured runbook provides engineers with a predefined process to quickly diagnose and resolve issues, minimizing downtime and user impact.
3. Knowledge Sharing
Runbooks capture institutional knowledge. New team members or engineers unfamiliar with a specific system can follow the documented steps without deep prior knowledge, making onboarding and knowledge transfer smoother.
4. Error Reduction
Following a detailed and validated procedure makes engineers less likely to make mistakes during high-pressure incidents. The runbook reduces guesswork and ensures that steps are followed correctly.
5. Accountability and Documentation
A runbook provides a clear record of how incidents are handled, ensuring that all actions are documented. This is crucial for post-incident reviews, audits, or communication with stakeholders.
6. Scalability
As teams grow or shift, runbooks allow for the repeatability of complex procedures without requiring deep expertise from every team member. This makes scaling processes across multiple teams and regions much easier.
7. Prevention of Future Incidents
Runbooks often include preventative measures, helping engineers not only fix issues but also prevent them from occurring again. This proactive approach improves the overall reliability of systems.
8. Alignment with SLAs and Compliance
Many teams operate under strict Service Level Agreements (SLAs) and compliance requirements. A runbook ensures that incidents are handled per these obligations, helping teams meet performance targets and regulatory requirements.
Types of Runbooks
Operational Runbooks: For routine tasks, like backups, system upgrades, or deployments.
Incident Response Runbooks: These are used for troubleshooting and resolving system failures or issues, such as high CPU utilization, network outages, or database errors.
Real-world Runbook to Troubleshoot High CPU Utilization on Amazon Aurora Postgres Database
Here’s what a runbook looks like to investigate and troubleshoot a real-world issue:
Date: [Insert Date]
Author: [Insert Name]
1. Purpose
This runbook provides a standardized approach for Site Reliability Engineers (SRE) to troubleshoot and resolve high CPU utilization on Aurora PostgreSQL instances, ensuring minimal downtime and performance degradation.
2. Prerequisites
AWS Console access (RDS, CloudWatch).
PostgreSQL database access for query analysis.
Familiarity with Aurora and PostgreSQL performance monitoring tools.
AWS CLI (optional for advanced troubleshooting).
3. Identifying the Issue
Initial Symptoms:
CloudWatch
CPUUtilization
metric exceeds 80%.Applications experience slow query responses or timeouts.
Increased connection or query failure rates.
Relevant CloudWatch Metrics to Check:
CPUUtilization
DBConnections
ReadIOPS/WriteIOPS
FreeableMemory
4. Troubleshooting Steps
Step 1: Review CloudWatch Metrics
Navigate to CloudWatch Metrics:
Go to AWS Management Console > CloudWatch > Metrics > RDS > Aurora PostgreSQL.Check trends:
Review CPU utilization trends and correlate with other metrics likeDBConnections
,FreeableMemory
, andIOPS
to get an overall sense of resource consumption patterns.
Step 2: Analyze Active Queries
Log into Aurora PostgreSQL:
Use a PostgreSQL client or the AWS RDS Query Editor.Identify long-running queries:
Execute the following SQL command to view active queries:
SELECT pid, usename, query, state, age(clock_timestamp(), query_start) AS query_duration FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start DESC;
Action:
Investigate long-running or inefficient queries. Consider terminating problematic queries if they are severely impacting the CPU:
SELECT pg_terminate_backend(pid);
Copy code
Step 3: Investigate Locking Issues
Check for lock contention:
SELECT pg_stat_activity.pid, pg_class.relname, pg_locks.mode FROM pg_stat_activity JOIN pg_locks ON pg_stat_activity.pid = pg_locks.pid JOIN pg_class ON pg_locks.relation = pg_class.oid WHERE pg_stat_activity.state != 'idle';
Action:
Identify and resolve locks causing high CPU load by terminating lock-causing processes or optimizing queries.
Step 4: Analyze System-Level Metrics
Review Enhanced Monitoring data:
Enable enhanced monitoring for Aurora PostgreSQL to gather granular system-level data (OS-level CPU, memory, disk metrics).Analyze processes:
Identify whether Aurora system processes, like background writes or checkpoints, are causing CPU spikes.
Step 5: Review Autovacuum and Maintenance Tasks
Check for autovacuum activity:
SELECT relname, last_autovacuum, autovacuum_count FROM pg_stat_all_tables WHERE autovacuum_count > 0;
Action:
If autovacuum or analyze operations are consuming high CPU, evaluate vacuum settings and tune them appropriately based on table activity.
Step 6: Scale the Aurora Instance
Instance type check:
Ensure the instance size matches your workload. For example, adb.r5.large
instance might need to be scaled todb.r5.xlarge
if workloads have increased.Action:
Consider vertical scaling (increase instance size) or horizontal scaling (add read replicas) if high CPU is consistent.
Step 7: Review Performance Insights (Optional)
Performance Insights:
If enabled, review the Performance Insights dashboard in RDS to gain deeper insights into query performance and CPU-heavy operations.
5. Remediation Actions
Immediate:
Terminate problematic or stuck queries.
Adjust locking behavior and optimize queries.
Long-term:
Query optimization (indexing, partitioning).
Scale instance resources.
Tune PostgreSQL configurations (e.g.,
work_mem
,max_connections
,autovacuum
).
6. Preventative Measures
Set up CloudWatch Alarms for CPUUtilization thresholds to detect issues early.
Enable Enhanced Monitoring for Aurora instances.
Regularly review and tune query performance using pg_stat_statements.
Review maintenance window schedules to avoid impact during peak times.
7. Escalation Path
If CPU utilization remains high after troubleshooting, escalate the issue to:
Database Administrator (DBA) for query tuning and further optimization.
AWS Support for deeper investigation of Aurora service issues.
8. Documentation
Log all troubleshooting steps and resolutions in your incident management system.
Document any changes to instance configurations, query optimizations, or scaling activities.
9. References
Final Thoughts
To summarize, a runbook is a critical tool for maintaining high availability, ensuring operational efficiency, and reducing the risk of human error during incidents.
It helps teams troubleshoot effectively and consistently, leading to more reliable systems and smoother operations.
Don’t you think they are cool?
Let me know in the comments.
Also, don't forget to follow me on X/Twitter and LinkedIn for daily insights.
That’s it for today!
Did you enjoy this newsletter issue?
Share with your friends, colleagues, and your favorite social media platform.
Until next week — Amrut
Posts that caught my eye this week
Different ways of working with SQL Databases in Go by
Securing Your Microservices: A Comprehensive Guide by
Issue 63 by
Whenever you’re ready, there are 2 ways I can help you:
Are you thinking about getting certified as a Google Cloud Digital Leader?
Here’s a link to my Udemy course, which has helped 617+ students prepare and pass the exam. Currently, rated 4.24/5. (link)
Course Recommendation: AWS Courses by Adrian Cantrill (Certified + Job Ready):
ALL THE THINGS Bundle (I got this and highly recommend it!)
Get in touch
You can find me on LinkedIn or X.
If you wish to request a topic you would like to read, you can contact me directly via LinkedIn or X.