We are pleased to introduce our latest addition, Engine Yard Weatherman, an AI assistant designed to intelligently manage and respond to alerts within your instances and environments. In this article, we will explain how Weatherman works and what you may expect from it in the future.
Note: Weatherman can now provide on-demand analyses of your instances, as explained in "How to Unleash the power of ATLAS and Weatherman: our new AI chat Support duo".
Weatherman is an advanced, non-destructive AI tool engineered to handle alerts effectively, employing the same tools and commands utilized by our live agents. The system has been integrated with insights gleaned from years of hands-on experience addressing complex issues within various environments.
As a premium user, any configured alert affecting your instances will trigger an immediate and informative response from Weatherman, ensuring you're not left in the dark when something goes awry.
Some real examples
To illustrate what you can expect from Weatherman, let's delve into some practical scenarios:
High Swap Usage Alert
When an alert regarding high swap usage is received, Weatherman immediately executes the 'free' and 'ps' commands, collecting crucial data. It then analyzes this information, providing a detailed report, such as the following:
The total swap memory of your instance is 8189 MB, and currently, 2113 MB is being used. This is approximately 25.8% of the total swap memory, which is below the 50% threshold. Therefore, there is likely no cause for concern at this moment.
However, it's worth noting that the processes consuming the most memory are Sidekiq and Postgres, followed by Unicorn. If you notice a continuous increase in swap usage, you may consider restarting these processes at a convenient time that does not affect normal operations.
For Sidekiq, you can use the restart control scripts under /engineyard/bin/sidekiq. For Unicorn, you can use the restart control scripts under /engineyard/bin/<your_app_name>.
Please be especially careful when restarting the Postgres process. Ensure that operations will not be negatively affected by the restart.
We will continue monitoring the swap usage on your instance. If you have any questions or need a hand, simply reply back and we will be happy to help!
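The arithmetic behind the swap figures in the report above can be reproduced with a short script. The following is only a minimal sketch, not Weatherman's actual implementation: the sample `free -m` output is illustrative (only the swap numbers come from the report), and the names `parse_swap`, `swap_usage_percent`, and `SWAP_THRESHOLD_PERCENT` are hypothetical.

```python
import re

# Illustrative `free -m` output. The swap figures match the sample report;
# the memory figures are made up for this sketch.
SAMPLE_FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:          15884        9211         512         301        6160        6052
Swap:          8189        2113        6076
"""

SWAP_THRESHOLD_PERCENT = 50.0  # threshold quoted in the sample report


def parse_swap(free_output: str) -> tuple[int, int]:
    """Return (total, used) swap from `free -m` style output."""
    match = re.search(r"^Swap:\s+(\d+)\s+(\d+)", free_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Swap:' line found in free output")
    return int(match.group(1)), int(match.group(2))


def swap_usage_percent(total: int, used: int) -> float:
    """Percentage of swap in use; 0.0 when no swap is configured."""
    if total == 0:
        return 0.0
    return used / total * 100


total_mb, used_mb = parse_swap(SAMPLE_FREE_OUTPUT)
percent = swap_usage_percent(total_mb, used_mb)
status = "above" if percent > SWAP_THRESHOLD_PERCENT else "below"
print(f"{used_mb} of {total_mb} MB swap in use ({percent:.1f}%), {status} the threshold")
```

With the figures from the report, 2113 out of 8189 works out to roughly 25.8%, which is why the alert is classified as below the 50% threshold.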
Site Down Alert
If Weatherman receives an alert that a site is down, it immediately does the following:
- First, Weatherman probes the URL to get up-to-date information.
- Then, it checks your environment's current status, running commands specific to each of your instance types and the applications running on them.
- This includes general commands like 'monit summary' and 'ps'.
- Additionally, Weatherman will run instance-specific commands such as 'mytop' or 'service puma status', depending on what is running in your environment.
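The first step in the list above, probing the URL, can be sketched as follows. This is an illustration under stated assumptions rather than Weatherman's real code: the function name `probe`, the 10-second timeout, and the rule that any 2xx or 3xx status counts as "up" are all hypothetical choices made for this example.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch `url` and return (status_code, is_up).

    `opener` is injectable so the function can be exercised without
    network access; by default it is the standard library's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar errors.
        return None, False
    # Assumption for this sketch: 2xx and 3xx responses count as "up".
    return status, 200 <= status < 400
```

A call such as `probe("https://example.com/status")` would return a pair like `(200, True)` for a healthy endpoint, and `(None, False)` when the host cannot be reached at all. Only after this check would the instance-level commands be worth running to explain the failure.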
After the analysis is complete, you might receive a response such as:
Upon further investigation, we found that the alert may no longer be applicable. Our attempts to reach the endpoint https://[xxx].com/status returned a status code 200, indicating that the site is up and running.
In terms of the status of your instances, here is a summary:
1. The application master (app_master) instance has a critical issue with the Unicorn master process failing to execute. This process is responsible for managing worker processes and its failure can affect the overall functioning of the application. Additionally, a filesystem check for the primary EBS is due.
2. The database master (db_master) instance is functioning well with no critical errors. However, attention should be given to the primary EBS filesystem check warning.
3. The database slave (db_slave) instance is functioning properly except for a warning about the filesystem check for the primary EBS device.
4. The application (app) instance is currently running without any critical issues. However, a filesystem check on the primary EBS is due.
5. The utility (util) instances are mostly running as expected, but the 'remote_syslog' process is failing to execute and a filesystem check is due for the /data directory.
In conclusion, while the alert may no longer be applicable, there are some issues that need to be addressed to ensure the smooth operation of your environment. We recommend addressing the failed execution of the Unicorn master process on the app_master instance and performing the due filesystem checks on the primary EBS across the instances.
If you have any questions or need a hand, simply reply back and we will be happy to help!
While explaining the full range of alerts is beyond the scope of this introduction, know that we have covered over 150 scenarios so far, with commands and analyses tailored to each specific case. We are looking to further expand this with your feedback and live experience.
Rest assured that all of the executed commands are non-destructive: we will not perform any changes in your instance without your explicit approval.
Try Weatherman today!
As of today, Weatherman has been enhanced to allow users to request on-demand analyses. To do so, start a conversation with our AI-powered agent ATLAS, select the affected environment or instance, and describe your issue as you would to a live agent. Weatherman can help you identify why a deployment is failing, why your website is unreachable, or why a backup is not completing.
If you are a Platinum customer, you may also interact with Weatherman directly via ticket responses. Among other things, you can hush alarms for a set amount of time, close a ticket, or request an update on the status of a certain issue.
Your Feedback Matters
As we embark on this journey, your input is invaluable. We encourage you to share your experiences and suggestions regarding Weatherman. Your feedback is instrumental in helping us refine its functionalities, ensuring we deliver a tool that genuinely resonates with your needs.
Please share your experiences and the improvements you would like to see by writing us a short message; we will review your feedback and use it to improve Weatherman's functionality.
Thank you for being a part of this innovative step forward in environment management. We're here to assist you, whether through Weatherman or our dedicated live agents. Welcome to a simplified, more informed operational experience.
Q: How often will Weatherman alert me about ongoing alarms?
A: Weatherman will alert you every 4 hours, or whenever the alarm itself changes.
Q: Can I request an analysis from Weatherman on demand?
A: Yes! You can start a conversation with our AI-powered agent ATLAS and request live information about your instances and environments.
Q: What if an issue requires a second pair of eyes?
A: You can always ask to speak with an agent, and your tickets will be reviewed by a live agent.
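For readers who want to reason about the notification cadence described in the first answer above (every 4 hours, or whenever the alarm changes), the rule can be expressed as a small decision function. This is a sketch of that rule only; the function name `should_notify` and the state strings are hypothetical, and the source does not describe how Weatherman implements it internally.

```python
from datetime import datetime, timedelta

REMINDER_INTERVAL = timedelta(hours=4)  # interval stated in the FAQ


def should_notify(now, last_notified, last_state, current_state):
    """Decide whether an ongoing alarm warrants a new notification.

    Notify when nothing has been sent yet, when the alarm's state has
    changed, or when at least REMINDER_INTERVAL has elapsed.
    """
    if last_notified is None:
        return True
    if current_state != last_state:
        return True
    return now - last_notified >= REMINDER_INTERVAL


# Example: an unchanged alarm one hour after the previous notification.
sent_at = datetime(2024, 1, 1, 12, 0)
print(should_notify(sent_at + timedelta(hours=1), sent_at, "critical", "critical"))
```

Under this rule, an unchanged alarm stays quiet until the 4-hour mark, while any change in its state triggers a message right away.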