Enter Chat

Operations teams need somewhere to live as well. They rely heavily on collaboration, communication, and documentation. They also have a penchant for doing things, not writing about them. Who moved my cheese? It’s vital in an operations team that everyone can see the change history, and chat rooms are a great way to solve this. Got a crazy ticket assigned to you when you get in to work? Just check the ops channel, and I bet reading the history of the last shift will give you context. But we need to do more than just chat. We need to do things. And if we do things but don’t record them by telling people in the chat channel, we are back where we started. The only way to ensure that people record these events is to have them do the events in the channel itself!
ChatOps is Born

If I truly want to live in chat, and want to record and collaborate in one place, then I need to use the chat tool both to talk about things and as my operations interface. Since our chat tools are built around text, they work pretty easily as a command line interface. If we turn all of the basic operations tasks into scripts that can be run from within the chat window, we can all see things happening in real time.
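To make the "chat window as command line" idea concrete, here is a minimal sketch in Python. It is not tied to any particular chat product: the `!` prefix, the `command` decorator, the `handle_message` function, and the `status` handler are all hypothetical names chosen for illustration, and a real handler would call your own scripts instead of returning a canned string.

```python
import shlex
from typing import Callable, Dict, List, Optional

# Registry mapping a command name to the function that runs it.
COMMANDS: Dict[str, Callable[[List[str]], str]] = {}


def command(name: str):
    """Register a function as a chat command (hypothetical helper)."""
    def register(func: Callable[[List[str]], str]):
        COMMANDS[name] = func
        return func
    return register


@command("status")
def status(args: List[str]) -> str:
    # Placeholder: a real handler would call a monitoring script or API.
    target = args[0] if args else "all"
    return f"status of {target}: OK"


def handle_message(user: str, text: str, prefix: str = "!") -> Optional[str]:
    """Treat a chat message as a command line.

    Returns the bot's reply, or None when the message is ordinary chat.
    """
    if not text.startswith(prefix):
        return None
    try:
        # shlex splits the text the way a shell would, honoring quotes.
        parts = shlex.split(text[len(prefix):])
    except ValueError:
        return f"@{user}: could not parse that command"
    if not parts:
        return None
    name, args = parts[0], parts[1:]
    handler = COMMANDS.get(name)
    if handler is None:
        return f"@{user}: unknown command '{name}'"
    return f"@{user}: {handler(args)}"
```

Because the reply goes back into the same channel as the request, everyone in the room sees both what was run and what came back, which is the whole point of doing the work in chat.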
Meet the Chat Bots

ChatOps is not ChatOps without bots. Bots let us receive information, like alerts, as push notifications. They let us query for current status. And they let us run common tasks without becoming experts in every tool. This is what enables us to live in our chat. And since chat can be organized by function or initiative, and membership can be controlled, it allows us to focus on the task at hand.
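As a rough illustration of those bot duties, the sketch below models channels in memory: the bot pushes each alert into the one channel that owns it, stamps it with a time so the history doubles as a log, and only lets channel members run tasks. The `Channel` and `Bot` classes and the idea of routing by alert source are assumptions made for this example, not the API of any real chat platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Channel:
    """A chat room with controlled membership and a searchable history."""
    name: str
    members: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)

    def post(self, author: str, text: str, when: datetime) -> str:
        # Every post is timestamped, so the channel is also the event log.
        line = f"[{when.isoformat(timespec='seconds')}] {author}: {text}"
        self.history.append(line)
        return line


class Bot:
    """Hypothetical bot that routes alerts to channels by alert source."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.channels: Dict[str, Channel] = {}
        self.routes: Dict[str, str] = {}  # alert source -> channel name

    def join(self, channel: Channel, sources: Iterable[str]) -> None:
        """Join a channel and claim the alert sources that belong to it."""
        self.channels[channel.name] = channel
        for source in sources:
            self.routes[source] = channel.name

    def push_alert(self, source: str, message: str,
                   when: Optional[datetime] = None) -> Optional[str]:
        """Post an alert to the channel that owns its source, if any."""
        channel_name = self.routes.get(source)
        if channel_name is None:
            return None  # Not routed anywhere, so no channel gets noise.
        when = when or datetime.now(timezone.utc)
        text = f"ALERT {source}: {message}"
        return self.channels[channel_name].post(self.name, text, when)

    def can_run(self, channel_name: str, user: str) -> bool:
        """Only members of a channel may run tasks from it."""
        return user in self.channels[channel_name].members
```

Keeping the routing table inside the bot means each channel sees only the alerts for its own initiative, which is what keeps the conversation in context.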
DevOps without the DevOps

Everyone’s got their opinion of DevOps, but I’ll share mine. DevOps was a rudimentary organizational solution to the divide between development and operations: nix operations and make the developers do it. It was simple but transformative and effective. But it forces great developers to learn the full stack, which is less than optimal and can lead to churn among the most talented developers. With ChatOps, we can enable developers to see status, push code, and do other “ops-y” tasks, without forcing them to learn all the tools and engineer all the systems.
Long Live SRE

By creating a simplified toolset of commands specifically for developers, our SRE teams can collaborate with development teams and enable them, without forcing them to take pager duty and without slowing them down. This delivers the benefits of DevOps while still letting team members contribute where they are strongest. It also lets an SRE organization exist without recreating the issues that occur when there is a divide between development and operations.
ChatOps Example – Hello World

This is easier to understand by example, so I’ll mock a real-world situation with the world-famous Hello World example. Imagine you are running an SRE team responsible for keeping your company’s Hello World web site up and running, and working with the developers who need to push changes to font type, size, and color all day long. How could ChatOps help you? For this example I’ll use Foggy, our chat bot.

Well, we first want to make sure we know that our site is available to the world. So we set up an external DNS-based health check against helloworld.fogops.io. If the check fails, we are in urgent crisis, so we make sure to alert on it. We page the on-call SRE, but we also notify our chat bot, and our chat bot posts the alert in our #hello-world-ops Slack channel. The event is effectively logged with a timestamp for the team to use later, without having to dig through log files or log in to the monitoring system. Only #hello-world-ops alerts are posted here, so everything we see is in context to the initiative, which is to maintain the best Hello World site known to man.

The team begins troubleshooting the problem. They check the status of the web server with a quick Slack command and learn that the web server is up and running but can’t establish a connection to the database. The team checks the status of the database server and finds the problem. Looks like our SQL service died. Foggy can handle this one for us…
The team completes the fix, the site comes back up, and the monitoring system posts the green light to confirm. The dev team is happy their site is back up, and they immediately push a new font. This is a pretty simple example, with the goal of helping get your creative juices going.
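For readers who want something to tinker with, here is a toy Python model of the troubleshooting flow in the example: check the web server, check the database, restart the SQL service, confirm the site is healthy. Everything in it is an assumption made for illustration. The `services` dictionary stands in for real servers, and the command strings `status web`, `status db`, and `restart sql` are invented here; they are not Foggy's actual commands.

```python
# Hypothetical in-memory stand-in for the servers in the walkthrough.
# The incident starts with the web server up and the SQL service dead.
services = {"web": "running", "sql": "stopped"}


def check_web() -> str:
    """Report web server health, including its database connection."""
    if services["web"] != "running":
        return "web server: DOWN"
    if services["sql"] != "running":
        return "web server: up, but cannot connect to the database"
    return "web server: up, database connection OK"


def check_db() -> str:
    """Report the state of the SQL service on the database server."""
    return f"database server: SQL service is {services['sql']}"


def restart_sql() -> str:
    """Bring the SQL service back (a real bot would run a restart script)."""
    services["sql"] = "running"
    return "SQL service restarted"


def site_healthy() -> bool:
    """Stand-in for the external health check turning green."""
    return all(state == "running" for state in services.values())


def foggy(text: str) -> str:
    """Map a chat command (invented for this sketch) to an action."""
    actions = {
        "status web": check_web,
        "status db": check_db,
        "restart sql": restart_sql,
    }
    action = actions.get(text.strip().lower())
    return action() if action else "sorry, I don't know that command"
```

Calling `foggy("status web")`, `foggy("status db")`, and `foggy("restart sql")` in that order replays the incident, after which `site_healthy()` returns `True`. Swapping the three small functions for calls to your own scripts is one way to start building a bot like this.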