I have an EC2 instance running on AWS. It runs for only 2 or 3 days, after which I get the status check error 'Instance reachability check failed'. I don't understand why this is happening; can someone help me with this issue? For reference, the instance runs a Node server with Puppeteer plus some API calls. A typical API call to the server can take around 10-35 seconds because of the parallel calls it makes inside the Node API. Below are the system logs I received at the end:
[ OK ] Finished Remove Stale Onli…ext4 Metadata Check Snapshots.
[ OK ] Finished GRUB failed boot detection.
[ OK ] Started LSB: automatic crash report generation.
[ OK ] Started Accounts Service.
[ OK ] Started A high performance…er and a reverse proxy server.
[ OK ] Started Modem Manager.
[ OK ] Started Disk Manager.
[ OK ] Started Login Service.
[ OK ] Started Unattended Upgrades Shutdown.
[ OK ] Started Hostname Service.
[ OK ] Started Dispatcher daemon for systemd-networkd.
[ OK ] Finished EC2 Instance Connect Host Key Harvesting.
         Starting OpenBSD Secure Shell server...
[ OK ] Started OpenBSD Secure Shell server.
Ubuntu 20.04.6 LTS ip-172-31-26-38 ttyS0
ip-172-31-26-38 login: [ 30.308818] cloud-init[1458]: Cloud-init v. 23.1.2-0ubuntu0~20.04.2 running 'modules:config' at Mon, 09 Oct 2023 12:49:22 +0000. Up 30.17 seconds.
[ 31.169166] cloud-init[1465]: Cloud-init v. 23.1.2-0ubuntu0~20.04.2 running 'modules:final' at Mon, 09 Oct 2023 12:49:23 +0000. Up 30.97 seconds.
[ 31.169317] cloud-init[1465]: Cloud-init v. 23.1.2-0ubuntu0~20.04.2 finished at Mon, 09 Oct 2023 12:49:23 +0000. Datasource DataSourceEc2Local. Up 31.16 seconds
I have already tried making changes to the EC2 security groups.
Troubleshooting and answer for the above issue:
It sounds like you are experiencing a recurring problem with your AWS EC2 instance failing the "Instance reachability check" every few days. A failed instance reachability check means AWS can no longer get a response from the operating system itself, which usually points to the OS hanging or running out of a resource (memory, disk, CPU credits) rather than to a networking rule. The issue could therefore be caused by system resource constraints, configuration problems, or an underlying problem with the instance or the AWS infrastructure. Here's a detailed plan to diagnose and potentially resolve it.
Step 1: Monitor System Resources
First, closely monitor CPU, memory (RAM), disk usage, and network activity on your instance. This helps identify whether the instance is being overloaded; with Puppeteer in the stack, memory exhaustion from headless Chromium processes is a common culprit. A quick command sketch follows the list below.
CPU and Memory Usage: Use CloudWatch to monitor CPU utilization and set up alarms to notify you when usage is nearing capacity. Note that memory (and disk) metrics are not reported by default; you need to install the CloudWatch agent on the instance to collect them.
Disk Usage: Ensure that your disk isn't full. A full disk can cause many services, including logging and cloud-init, to fail.
Network Usage: Monitor bandwidth to see if your instance is experiencing unusually high network traffic, which might indicate a DDoS attack or another type of misuse.
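If you can still reach the instance over SSH, a few standard Linux commands give a quick snapshot of these resources; a minimal sketch, not specific to your setup:

# CPU load and the heaviest processes
top -b -n 1 | head -n 20

# Memory and swap usage in megabytes
free -m

# Disk usage per filesystem
df -h

# Recent out-of-memory killer activity, a frequent cause of unreachable instances
sudo dmesg | grep -i "out of memory"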
Step 2: Check CPU Credit Balance
If you are using a T2 or T3 instance (burstable performance instances), check your CPU credit balance. Once the balance reaches zero in standard mode, the instance is throttled down to its small baseline CPU share, which can make processes appear completely unresponsive.
You can monitor CPU credit usage and balance via the AWS Management Console or by setting up CloudWatch alarms to alert you when credits are low.
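A minimal AWS CLI sketch for pulling recent CPUCreditBalance values; the instance ID, region, and time window are placeholders to replace with your own:

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average \
  --region us-east-1

A balance that drains toward zero shortly before each failure is a strong hint that the instance type is too small for the workload, or that you should enable T2/T3 Unlimited mode.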
Step 3: Analyze Logs
Since you have access to some system logs, dig deeper into them for warnings or errors logged immediately before the failures occur. Note that the console output you pasted is from a normal boot (systemd unit startup and cloud-init), so it shows the instance coming back up rather than what took it down.
System Logs: Check /var/log/syslog (Ubuntu does not write /var/log/messages by default) and the kernel log for errors related to hardware or software issues, paying particular attention to out-of-memory (OOM) killer messages and disk-full errors.
Application Logs: Since you are running a Node server with Puppeteer, check the application logs for errors or excessive resource usage, for example Chromium instances that are launched but never closed.
Step 4: Configure Auto-Recovery
AWS lets you attach CloudWatch alarm actions that automatically recover or reboot an instance when a status check fails. This can be a temporary workaround while you investigate the root cause.
The recover action applies to system status check failures; for an instance reachability failure like yours, a reboot action on the StatusCheckFailed_Instance metric is the usual choice.
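A hedged AWS CLI sketch of such an alarm; the instance ID and region are placeholders:

aws cloudwatch put-metric-alarm \
  --alarm-name ec2-reboot-on-instance-check-failure \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_Instance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot \
  --region us-east-1

Keep in mind this only reboots the instance for you; it does not fix whatever is causing the hang.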
Step 5: Update and Upgrade
Ensure your instance is running the latest software versions, which can include important stability fixes.
Update OS: Run sudo apt-get update and sudo apt-get upgrade to ensure all packages are up to date.
Update Node and npm packages: Ensure that Node.js and all npm packages are updated to their latest stable versions.
Step 6: Stress Test
If updates do not resolve the issue, consider conducting stress tests to identify potential failures under load.
Use tools like stress or sysbench to test how your instance performs under high CPU, memory, and I/O load.
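A short sketch using stress; the worker counts and duration are arbitrary and should be sized roughly to your real workload:

sudo apt-get install -y stress

# 2 CPU workers, 1 memory worker allocating 1 GB, 1 disk I/O worker, for 5 minutes
stress --cpu 2 --vm 1 --vm-bytes 1G --io 1 --timeout 300s

# In another terminal, watch memory and load while the test runs
watch -n 2 'free -m; uptime'

If the instance becomes unresponsive or the OOM killer fires during the test, you have reproduced the problem in a controlled way.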
Step 7: Review Instance Configuration
Review your instance’s configuration settings in AWS:
Security Groups and Network ACLs: Ensure that your instance's security settings are not so restrictive that they block traffic you need. Note, however, that status checks are performed by AWS outside your security group rules, so the security group changes you already tried would not, on their own, fix a failed reachability check.
Instance Type: Consider moving to a larger or non-burstable instance type if resource limits are consistently being hit (a CLI sketch follows this list).
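If you do resize, the change can be scripted with the AWS CLI; a sketch with a placeholder instance ID and target type (the instance must be stopped first, and an EBS-backed instance keeps its data across the stop/start):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Switch to a larger, non-burstable type if CPU credits are the bottleneck
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type "{\"Value\": \"m5.large\"}"

aws ec2 start-instances --instance-ids i-0123456789abcdef0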
Step 8: Contact AWS Support
If the problem persists and you cannot identify the cause, consider contacting AWS Support. They can provide insights and guidance specific to your environment and usage patterns.
Conclusion
By systematically addressing each potential cause, you can better understand why your instance fails and implement a solution. Monitoring, resource management, and regular updates are key practices to maintain the health of your AWS EC2 instance.