🩺 How I Troubleshoot an EC2 Instance in the Real World (Using Instance Diagnostics)
When an EC2 instance starts misbehaving, my first reaction is not to SSH into it or reboot it. Instead, I open the EC2 console and go straight to Instance Diagnostics.
Over time, I’ve realized that most EC2 issues can be understood — and often solved — just by carefully reading what AWS already shows on this page.
In this blog, I’ll explain how I use each section of Instance Diagnostics to troubleshoot EC2 issues in a practical, real-world way.
The First Question I Answer
Before touching anything, I ask myself one simple question:
Is this an AWS infrastructure issue, or is it something inside my instance?
Instance Diagnostics helps answer this in seconds.
Status Overview: Always the Starting Point
I always begin with the Status Overview at the top.
Instance State
This confirms whether the instance is running, stopped, or terminated.
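If it is stopped or terminated, the status checks below do not apply; the question becomes why it stopped, which the CloudTrail Events and Instance Events tabs usually answer.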
System Status Check
This reflects the health of the underlying AWS infrastructure such as the physical host and networking.
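If this check fails, the issue is on the AWS side. For an EBS-backed instance, stopping and starting it (not just rebooting) usually resolves the problem, because the instance comes back on a different, healthy host. Keep in mind that instance store data is lost on stop, and the public IP changes unless an Elastic IP is attached.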
Instance Status Check
This check represents the health of the operating system and internal networking.
If this fails, the problem is inside the instance — typically related to OS boot issues, kernel problems, firewall rules, or resource exhaustion.
EBS Status Check
This confirms the health of the attached EBS volumes.
If this fails, disk or storage-level issues are likely, and data protection becomes the immediate priority.
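The same triage can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes boto3 is installed and credentials are configured, the function names `triage_status` and `fetch_status` are hypothetical, and the order in which failing checks are reported is my own choice. The field names follow the shape of the EC2 `describe_instance_status` response; the `AttachedEbsStatus` field is read defensively in case it is absent.

```python
from typing import Any, Dict


def triage_status(entry: Dict[str, Any]) -> str:
    """Map one `InstanceStatuses` entry to a first-guess diagnosis."""
    state = entry.get("InstanceState", {}).get("Name", "unknown")
    if state != "running":
        return f"instance is {state}: find out why it is not running"

    system = entry.get("SystemStatus", {}).get("Status", "unknown")
    instance = entry.get("InstanceStatus", {}).get("Status", "unknown")
    ebs = entry.get("AttachedEbsStatus", {}).get("Status", "unknown")

    # Ordering is an assumption: infrastructure first, then storage, then OS.
    if system == "impaired":
        return "system check failing: AWS-side problem"
    if ebs == "impaired":
        return "EBS check failing: protect data first"
    if instance == "impaired":
        return "instance check failing: look inside the OS"
    if "initializing" in (system, instance, ebs):
        return "checks still initializing: wait and re-check"
    return "no failing checks: look at changes, network, or the application"


def fetch_status(instance_id: str) -> Dict[str, Any]:
    """Fetch the status entry for one instance (needs boto3 and credentials)."""
    import boto3  # imported lazily so triage_status works without it

    ec2 = boto3.client("ec2")
    resp = ec2.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,  # also return instances that are not running
    )
    statuses = resp["InstanceStatuses"]
    return statuses[0] if statuses else {}
```

Calling `triage_status(fetch_status("i-..."))` with a real instance ID prints the same answer the Status Overview gives at a glance.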
CloudTrail Events: Tracking Configuration Changes
If an issue appears suddenly, the CloudTrail Events tab is where I go next.
I use it to confirm:
- Whether the instance was stopped, started, or rebooted
- If security groups or network settings were modified
- Whether IAM roles or instance profiles were changed
- If volumes were attached or detached
This helps quickly identify human or automation-driven changes.
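The checklist above can also be run as a script. The sketch below groups CloudTrail events by the kind of change they represent. It assumes boto3 with configured credentials; `CATEGORIES`, `group_events`, and `fetch_events` are hypothetical names, and the event-name sets are a starting point rather than a complete list. One caveat: a lookup by instance ID may miss security group rule changes, since those events are recorded against the security group rather than the instance.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

# Illustrative, incomplete mapping from EC2 API event names to change types.
CATEGORIES = {
    "power": {"StopInstances", "StartInstances", "RebootInstances"},
    "network": {
        "ModifyInstanceAttribute",
        "ModifyNetworkInterfaceAttribute",
        "AuthorizeSecurityGroupIngress",
        "RevokeSecurityGroupIngress",
    },
    "iam": {
        "AssociateIamInstanceProfile",
        "DisassociateIamInstanceProfile",
        "ReplaceIamInstanceProfileAssociation",
    },
    "volumes": {"AttachVolume", "DetachVolume"},
}


def group_events(events: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Bucket CloudTrail events by change type; unrelated events are dropped."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        name = event.get("EventName", "")
        for category, names in CATEGORIES.items():
            if name in names:
                who = event.get("Username", "?")
                grouped[category].append(f"{event.get('EventTime')} {name} by {who}")
    return dict(grouped)


def fetch_events(instance_id: str, hours: int = 24) -> List[Dict[str, Any]]:
    """Return the first page of recent events that reference the instance."""
    import boto3  # imported lazily so group_events works without it
    from datetime import datetime, timedelta, timezone

    cloudtrail = boto3.client("cloudtrail")
    end = datetime.now(timezone.utc)
    resp = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "ResourceName", "AttributeValue": instance_id}
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return resp["Events"]  # first page only; follow NextToken for more
```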
SSM Command History: Understanding What Ran on the Instance
The SSM Command History tab shows the Systems Manager Run Command invocations that targeted the instance.
This is especially useful for identifying:
- Patch jobs
- Maintenance scripts
- Automated remediations
- Configuration changes
If there are no recent commands, that information itself is useful because it confirms that no SSM-driven actions caused the issue.
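To make "recent" concrete, here is a small sketch that keeps only the commands requested in a window before the incident. It assumes boto3 with configured credentials; `commands_before` and `fetch_commands` are hypothetical names and the six-hour default window is arbitrary. The `RequestedDateTime` field follows the shape of the SSM `list_commands` response.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def commands_before(
    commands: List[Dict[str, Any]],
    incident: datetime,
    window: timedelta = timedelta(hours=6),
) -> List[Dict[str, Any]]:
    """Return commands requested within `window` before `incident`, oldest first."""
    hits = [
        c for c in commands
        if incident - window <= c["RequestedDateTime"] <= incident
    ]
    return sorted(hits, key=lambda c: c["RequestedDateTime"])


def fetch_commands(instance_id: str) -> List[Dict[str, Any]]:
    """Return the first page of Run Command history for the instance."""
    import boto3  # imported lazily so commands_before works without it

    ssm = boto3.client("ssm")
    resp = ssm.list_commands(InstanceId=instance_id)
    return resp["Commands"]  # first page only; follow NextToken for more
```

An empty result from `commands_before` is the scripted version of "no SSM-driven action caused this".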
Reachability Analyzer: When the Issue Is Network-Related
If the instance is running but not reachable, I open the Reachability Analyzer directly from Instance Diagnostics.
This is my go-to tool for diagnosing:
- Security group issues
- Network ACL misconfigurations
- Route table problems
- Internet gateway or NAT gateway connectivity
- VPC-to-VPC or on-prem connectivity issues
Instead of guessing, Reachability Analyzer visually shows exactly where the network path is blocked.
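The same analysis can be started from code. The sketch below is written under assumptions: boto3 with configured credentials, TCP port 22 as the port under test, and the hypothetical names `summarize_analysis` and `run_analysis`. It creates a path, starts an analysis, polls until it finishes, and reduces the result to one line.

```python
from typing import Any, Dict


def summarize_analysis(analysis: Dict[str, Any]) -> str:
    """Reduce one Reachability Analyzer result to a one-line verdict."""
    status = analysis.get("Status", "unknown")
    if status != "succeeded":
        return f"analysis {status}"
    if analysis.get("NetworkPathFound"):
        return "path found: the network layer allows this traffic"
    codes = [e.get("ExplanationCode", "?") for e in analysis.get("Explanations", [])]
    return "path blocked: " + (", ".join(codes) or "no explanation returned")


def run_analysis(source_id: str, destination_id: str, port: int = 22) -> Dict[str, Any]:
    """Create a path, analyze it, and return the finished analysis."""
    import time
    import boto3  # imported lazily so summarize_analysis works without it

    ec2 = boto3.client("ec2")
    path = ec2.create_network_insights_path(
        Source=source_id,
        Destination=destination_id,
        Protocol="tcp",
        DestinationPort=port,
    )
    path_id = path["NetworkInsightsPath"]["NetworkInsightsPathId"]
    started = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
    analysis_id = started["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
    while True:
        resp = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )
        analysis = resp["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            return analysis
        time.sleep(5)  # poll interval is arbitrary
```

A "path found" result with an instance that is still unreachable points away from the VPC configuration and toward the OS firewall or the service itself.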
Instance Events: Checking AWS-Initiated Actions
The Instance Events tab tells me if AWS has scheduled or performed any actions on the instance.
This includes:
- Scheduled maintenance
- Host retirement
- Instance reboot notifications
If the timing of an issue lines up with one of these events, that event is the most likely root cause.
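Checking whether an incident "aligns" with a scheduled event is a simple time-window comparison. The sketch below assumes the `Events` list from an EC2 `describe_instance_status` entry (with `Code`, `Description`, `NotBefore`, and `NotAfter` fields); `events_near` is a hypothetical name and the one-hour slack is arbitrary.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def events_near(
    entry: Dict[str, Any],
    incident: datetime,
    slack: timedelta = timedelta(hours=1),
) -> List[str]:
    """List scheduled events whose window overlaps the incident time."""
    hits = []
    for event in entry.get("Events", []):
        start = event.get("NotBefore")
        if start is None:
            continue  # no window to compare against
        end = event.get("NotAfter") or start
        if start - slack <= incident <= end + slack:
            hits.append(f"{event.get('Code')}: {event.get('Description', '')}")
    return hits
```

The `entry` argument is one element of `InstanceStatuses`, so this can be fed from the same API call used for the status checks.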
Instance Screenshot: When the OS Is Stuck
If I cannot connect to the instance at all, I check the Instance Screenshot.
This is especially helpful for:
- Identifying boot failures
- Detecting kernel panic messages
- Seeing whether the OS is stuck during startup
A single screenshot can save hours of troubleshooting.
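The screenshot can also be pulled with a script and attached to an incident ticket. This is a minimal sketch assuming boto3 with configured credentials and assuming the API returns the image as a base64-encoded string in `ImageData`; `save_screenshot` and `fetch_screenshot` are hypothetical names.

```python
import base64
from pathlib import Path


def save_screenshot(image_data: str, path: str) -> int:
    """Decode base64 image data, write it to `path`, return the byte count."""
    raw = base64.b64decode(image_data)
    Path(path).write_bytes(raw)
    return len(raw)


def fetch_screenshot(instance_id: str, path: str = "screenshot.jpg") -> int:
    """Capture the console screenshot of a running instance and save it."""
    import boto3  # imported lazily so save_screenshot works without it

    ec2 = boto3.client("ec2")
    resp = ec2.get_console_screenshot(InstanceId=instance_id, WakeUp=True)
    return save_screenshot(resp["ImageData"], path)
```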
System Log: Understanding Boot and Kernel Issues
The System Log provides low-level OS and kernel messages.
I rely on it when:
- The instance fails to boot properly
- Services fail during startup
- Kernel or file system errors are suspected
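This is one of the best tools for diagnosing OS-level failures without logging in. The excerpt below, from an Ubuntu 22.04 instance, shows two such findings: Postfix fails to start during boot, and roughly 12 days into the instance's uptime the kernel's OOM killer terminates java, php, and php-fpm processes.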
[  OK  ] Reached target Timer Units.
[  OK  ] Started User Login Management.
[  OK  ] Started Unattended Upgrades Shutdown.
[  OK  ] Started Hostname Service.
         Starting Authorization Manager...
[  OK  ] Started Authorization Manager.
[  OK  ] Started The PHP 8.2 FastCGI Process Manager.
[  OK  ] Finished EC2 Instance Connect Host Key Harvesting.
         Starting OpenBSD Secure Shell server...
[  OK  ] Started OpenBSD Secure Shell server.
[  OK  ] Started Dispatcher daemon for systemd-networkd.
[FAILED] Failed to start Postfix Ma… Transport Agent (instance -).
See 'systemctl status postfix@-.service' for details.
[  OK  ] Started LSB: AWS CodeDeploy Host Agent.
[  OK  ] Started Varnish HTTP accelerator log daemon.
[  OK  ] Started Snap Daemon.
         Starting Time & Date Service...
[ 13.865473] cloud-init[1136]: Cloud-init v. 25.1.4-0ubuntu0~22.04.1 running 'modules:config' at Fri, 05 Dec 2025 01:25:29 +0000. Up 13.71 seconds.
Ubuntu 22.04.3 LTS ip-***** ttyS0
ip-****** login: [ 15.070290] cloud-init[1152]: Cloud-init v. 25.1.4-0ubuntu0~22.04.1 running 'modules:final' at Fri, 05 Dec 2025 01:25:30 +0000. Up 14.98 seconds.
2025/12/05 01:25:30Z: Amazon SSM Agent v3.3.2299.0 is running
2025/12/05 01:25:30Z: OsProductName: Ubuntu
2025/12/05 01:25:30Z: OsVersion: 22.04
[ 15.189197] cloud-init[1152]: Cloud-init v. 25.1.4-0ubuntu0~22.04.1 finished at Fri, 05 Dec 2025 01:25:30 +0000. Datasource DataSourceEc2Local. Up 15.16 seconds
2025/12/15 21:35:50Z: Amazon SSM Agent v3.3.3050.0 is running
2025/12/15 21:35:50Z: OsProductName: Ubuntu
2025/12/15 21:35:50Z: OsVersion: 22.04
[1091674.876805] Out of memory: Killed process 465 (java) total-vm:11360104kB, anon-rss:1200164kB, file-rss:3072kB, shmem-rss:0kB, UID:1004 pgtables:2760kB oom_score_adj:0
[1091770.835233] Out of memory: Killed process 349683 (php) total-vm:563380kB, anon-rss:430132kB, file-rss:4096kB, shmem-rss:0kB, UID:0 pgtables:1068kB oom_score_adj:0
[1092018.639252] Out of memory: Killed process 347300 (php-fpm8.2) total-vm:531624kB, anon-rss:193648kB, file-rss:3456kB, shmem-rss:106240kB, UID:33 pgtables:888kB oom_score_adj:0
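On a long log, the two signals that matter most in the excerpt above (failed systemd units and OOM kills) are easy to miss by eye. The sketch below extracts them from the log text with regular expressions. It uses only the standard library; `scan_log` is a hypothetical name, the patterns are tuned to the line formats shown in this excerpt, and other distributions or kernel versions may need different patterns. It also strips leftover color codes of the form `[0;32m`.

```python
import re
from typing import Any, Dict

# Color codes, with or without the leading ESC byte (e.g. "[0;32m", "[0m").
ANSI = re.compile(r"\x1b?\[[0-9;]*m")
FAILED = re.compile(r"\[\s*FAILED\s*\]\s*(?P<msg>.+)")
OOM = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] Out of memory: Killed process "
    r"(?P<pid>\d+) \((?P<name>[^)]+)\).*?anon-rss:(?P<rss_kb>\d+)kB"
)


def scan_log(text: str) -> Dict[str, Any]:
    """Collect failed-unit messages and OOM kills from system log text."""
    failed, oom_kills = [], []
    for raw in text.splitlines():
        line = ANSI.sub("", raw)
        match = FAILED.search(line)
        if match:
            failed.append(match.group("msg").strip())
        match = OOM.search(line)
        if match:
            uptime = float(match.group("uptime"))
            oom_kills.append({
                "process": match.group("name"),
                "pid": int(match.group("pid")),
                "anon_rss_mb": int(match.group("rss_kb")) // 1024,
                "uptime_days": round(uptime / 86400, 1),
            })
    return {"failed_units": failed, "oom_kills": oom_kills}
```

The log text itself can come from the console's System Log tab or from the EC2 `get_console_output` API; either way, repeated OOM kills across different processes, as in this excerpt, suggest the instance as a whole is short on memory rather than one process misbehaving.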
Session Manager: Secure Access Without SSH
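If the instance is managed by Systems Manager (SSM Agent running and an instance profile that grants it the required permissions), I prefer using Session Manager to access the instance.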
This allows me to:
- Inspect CPU, memory, and disk usage
- Restart services safely
- Avoid opening SSH ports or managing key pairs
From both a security and operational standpoint, this is my preferred access method.
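Once inside a session, a quick resource snapshot answers the CPU, memory, and disk questions in one go. The sketch below is meant to be run on a Linux instance with Python 3 available, which is an assumption about the instance; `parse_meminfo`, `mem_available_pct`, and `snapshot` are hypothetical names, and it reads `/proc/meminfo`, so it will not work on Windows.

```python
import os
import shutil
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: value}, values mostly in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def mem_available_pct(meminfo: Dict[str, int]) -> float:
    """Percentage of total memory the kernel reports as available."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def snapshot(meminfo_path: str = "/proc/meminfo") -> Dict[str, float]:
    """Return 1-minute load, available memory %, and root disk usage %."""
    with open(meminfo_path) as handle:
        meminfo = parse_meminfo(handle.read())
    disk = shutil.disk_usage("/")
    load_1min, _, _ = os.getloadavg()
    return {
        "load_1min": load_1min,
        "mem_available_pct": round(mem_available_pct(meminfo), 1),
        "disk_used_pct": round(100.0 * disk.used / disk.total, 1),
    }
```

A low `mem_available_pct` here would corroborate OOM kills seen in the system log.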
What Experience Has Taught Me
Troubleshooting EC2 instances is not about reacting quickly — it is about observing carefully.
Instance Diagnostics already provides:
- Health signals
- Change history
- Network analysis
- OS-level visibility
When used correctly, these tools eliminate guesswork and reduce downtime.
Final Thoughts
My approach to EC2 troubleshooting is simple:
Start with Instance Diagnostics.
Understand the signals.
Act only after the root cause is clear.
In most cases, the answer is already visible — we just need to slow down and read it.



