Buckle up! Day 4 of KubeCon 2020 was a bit of an information overload. I started the day with Derek Argueta’s talk, How a Service Mesh was Born at Pinterest from Scratch. Derek explained that in the beginning the service mesh was just a replacement for the ingress HTTP proxy in front of everything exposed externally. Then came the need for mTLS across all the individual microservices, and they chose to start with Java. That was a long time in the making. They saw they would need to implement the same functionality for all the Node.js, Python, and Go services too, and a chance thought followed: “let’s just put Envoy in front of every app, not just at the edge.”
Over time the idea became reality, and managing so many configuration files grew unwieldy. They tackled that with configuration management, using a Jinja templating engine to cut down the complexity of running so many Envoy proxies. Now they had the service mesh essentially set up and working, deployed alongside every service, and configured. Next, they built a static analysis system for CI to reduce regressions and keep errors from reaching upstream environments.
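To make the templating idea concrete, here is a hypothetical sketch (not Pinterest’s actual setup): a single Jinja template, rendered once per service from a short list of service definitions, stamps out each trimmed-down Envoy config so nobody hand-edits hundreds of nearly identical files.

```python
# Hypothetical sketch of config generation with Jinja: one template, many services.
# The service names, ports, and the (heavily trimmed) Envoy YAML are illustrative only.
from jinja2 import Template

ENVOY_TEMPLATE = Template("""\
static_resources:
  listeners:
  - name: {{ name }}_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: {{ listen_port }} }
  clusters:
  - name: {{ name }}_app
    connect_timeout: 1s
    load_assignment:
      cluster_name: {{ name }}_app
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: {{ app_port }} }
""")

services = [
    {"name": "search", "listen_port": 9211, "app_port": 8080},
    {"name": "ads", "listen_port": 9212, "app_port": 8081},
]

# Render one config file per service from the same template.
for svc in services:
    with open(f"envoy-{svc['name']}.yaml", "w") as f:
        f.write(ENVOY_TEMPLATE.render(**svc))
```

A CI step can then lint or statically analyze the rendered files before they ever reach a running proxy, which is exactly the kind of regression gate described above.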
This is where things got really interesting. Because Envoy lets you write plugins in C++, they were able to solve many issues normally handled at the app level, like TLS termination, CSRF protection, and HTTP header injection and validation for CORS and XSS. By using Envoy plugins, every service got that functionality, which reduced the complexity of the applications by centralizing it in the Envoy proxy.
In the end the Envoy proxy even made it in front of their internal services like Phabricator, Jenkins, and Teletraan. Other teams started reaching out to the traffic infrastructure team asking whether they could solve their problems the same way. The web infra team asked about advanced routing based on load or latency. The SRE team asked about collecting metrics from the proxies for generic service level indicator monitoring, specifically error budget tracking. Even the legal team got use out of it, with a plugin built to record and enforce HTTP cookie monitoring and proactively reduce the likelihood of data privacy leaks.
How did they get to this amazing place, you ask? They started by solving the business problems first, doing so incrementally, and reliably delivering at each step along the way. Having all traffic pass through Envoy the whole time opened up further improvements, and they integrated the traffic team with the service teams. Getting buy-in from many other teams in the organization helped make it all happen swiftly. But above all, what made it possible was the built-in extensibility of the open source Envoy proxy itself, a simple but powerful thing made to empower.
I enjoyed another great talk from Vicky Cheung on Observing Kubernetes Without Losing Your Mind. As with last year’s talk on complexity, she wanted to show Kubernetes from a different perspective: that of the operator, or the ops team. There is a lot you kind of always pack into Kubernetes on day one or two, including all the required parts: etcd, kube-apiserver, kube-controller-manager, kube-scheduler, the kubelets, kube-proxy, a container runtime (via the Container Runtime Interface), a Container Network Interface plugin of choice, and CoreDNS. Not to mention all the various add-on services like cluster-autoscaler, metrics-server, Prometheus, kube-state-metrics, an ingress controller of choice, a logging agent, webhooks, and many more. And that is all before we even get to the applications we wrote to handle the business directly.
Vicky shared that there are many steps we all go through as engineers before we find our sweet spot. With a new cluster we deploy our applications into the wild, and with all its autoscaling and self-healing we can see the application in a running state inside Kubernetes. But then, through the cracks, we see things we didn’t expect. Something breaks, and because we are engineers and SREs we do what is called a blameless post-mortem to find the root cause. Once we find it, we add monitoring and fixes so we don’t see it happen again.
Over time the number of issues grows. Now you have 200+ alarms and new things are still happening, except now we are online at 2am because of a PagerDuty alarm. With so many alarms we start getting hit by what is known as alert fatigue, which reduces your ability to effectively triage alerts. At this point the infra team is struggling to stay awake during the day, and the fatigue hurts more every day. What do we do then? You guessed it: tune our alarms and try to reduce the impact.
Distributed systems are hard, and to do them justice you need to monitor everything, but that increases the effort involved and decreases the likelihood that the alarm that went off is useful. Is there a better way, you ask? Of course: let’s try monitoring the things that actually matter to our clients. Internally, as infrastructure engineers we care that our users can do what they need to, like run workloads.
As an example, how about we create a workload that does the following:
- create a pod
- with working networking
- with working credentials
- and do something basic
- like download a file from and upload a file to S3
- like put a value in a database and pull it back out
- like add a message to a message queue and retrieve it
- log each step in the process with a timestamp and results
This process would tell us many things: “yay, we can still run jobs in Kubernetes,” networking between point A and point B works, and so do the credentials. Now you can log all those small parts to metrics and logging so you’re getting tons of signals about the environment: the Kubernetes API is responding, scheduling is basically working, DNS and networking are both doing their part, latency is acceptable based on the timing between logged events, credentials are good, and new workloads can probably come up properly. This will save you time and energy, because you can now make inferences more quickly.
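Here is a minimal sketch of what such a canary could look like as a small Python job (run, say, from a Kubernetes CronJob). The bucket name, health-check URL, and credential setup are assumptions for illustration, not anything from the talk:

```python
"""Canary workload: exercise the basic paths real jobs depend on.

A minimal sketch; the bucket, URL, and credential setup are placeholders.
"""
import json
import socket
import time
import urllib.request

import boto3  # assumes the pod already has S3 credentials (e.g. IRSA / instance role)

BUCKET = "my-canary-bucket"                              # hypothetical bucket
CHECK_URL = "https://internal-api.example.com/healthz"   # hypothetical endpoint


def log(step, ok, started, **extra):
    # One JSON line per step so the logging agent can ship it as-is.
    print(json.dumps({
        "step": step,
        "ok": ok,
        "duration_ms": round((time.time() - started) * 1000),
        "timestamp": time.time(),
        **extra,
    }))


def check_dns_and_network():
    started = time.time()
    try:
        socket.getaddrinfo("kubernetes.default.svc", 443)  # cluster DNS resolves
        urllib.request.urlopen(CHECK_URL, timeout=5)       # routing/egress works
        log("network", True, started)
    except Exception as exc:
        log("network", False, started, error=str(exc))


def check_s3_roundtrip():
    started = time.time()
    try:
        s3 = boto3.client("s3")
        key = f"canary/{int(started)}.txt"
        s3.put_object(Bucket=BUCKET, Key=key, Body=b"ping")  # credentials + upload
        s3.get_object(Bucket=BUCKET, Key=key)                # download path
        log("s3_roundtrip", True, started)
    except Exception as exc:
        log("s3_roundtrip", False, started, error=str(exc))


if __name__ == "__main__":
    check_dns_and_network()
    check_s3_roundtrip()
```

The nice part is that the pod printing anything at all already confirms the API server, scheduler, kubelet, and container runtime did their jobs; the per-step JSON lines then give you the timing and success signals described above.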
With alert fatigue reduced, we can look at making things even easier by building tooling for those on the front lines. One thing you can do is record all the events and logs from Kubernetes into a queryable store such as Elasticsearch, with something like Grafana on top for dashboards. The talk explained that by creating sensible dashboards for the different workloads in aggregate, you can narrow down issues even faster. And at the end, once you have a concise understanding of the problem, it becomes feasible to automate the solution or remove the bug or problem completely.
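As a sketch of that kind of tooling (assuming the official kubernetes and elasticsearch Python clients, and an Elasticsearch service at a made-up in-cluster address), a small watcher can stream cluster events into an index you can query and build dashboards on later:

```python
# Sketch: stream Kubernetes events into Elasticsearch for later querying/dashboards.
# The Elasticsearch URL and index name are hypothetical; assumes the elasticsearch 8.x client.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch
from kubernetes import client, config, watch

config.load_incluster_config()  # use config.load_kube_config() when running outside a pod
es = Elasticsearch("http://elasticsearch.logging.svc:9200")

v1 = client.CoreV1Api()
# Watch events cluster-wide and index one document per event.
for item in watch.Watch().stream(v1.list_event_for_all_namespaces):
    ev = item["object"]
    es.index(
        index="k8s-events",
        document={
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "namespace": ev.metadata.namespace,
            "kind": ev.involved_object.kind,
            "name": ev.involved_object.name,
            "reason": ev.reason,
            "message": ev.message,
            "type": ev.type,
        },
    )
```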
Keith Basil’s interesting take on Managing Cluster Sprawl had some very exciting bits of insight. According to Jay Lyman at 451 Research, “76% of enterprises will standardize on Kubernetes within 3 years.” Rancher’s own edge suite for Kubernetes has only been out for a year, already has 20k+ weekly downloads, and is still growing. He is seeing Kubernetes become as inclusive as ever, running on systems we never thought possible before, like point of sale systems, assembly line systems, and even wind turbines.
The number of Kubernetes clusters is growing exponentially, and one thing he found vitally important was what a customer told him: “anything below Kubernetes is overhead for us.” Customers want to abstract away everything below that level. Architectures are getting more and more mixed: heterogeneous environments that mix and match as needed greatly lower capital expenditures, because you buy only what is needed at the time. Workloads are increasingly CPU-agnostic, and they are seeing more specialized hardware for artificial intelligence and automation control, with GPUs and FPGAs. More people are looking for an overall zero trust security approach, covering identity, trust, and policy enforcement with tools like Keycloak, Keylime, and Open Policy Agent.
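As a tiny illustration of the policy-enforcement piece (the policy path and input fields here are made up, not from the talk), a service in a zero trust setup might ask a local Open Policy Agent sidecar to authorize each request:

```python
# Sketch: asking an OPA sidecar for an allow/deny decision via its data API.
# The policy package ("httpapi/authz") and the input shape are hypothetical.
import requests

OPA_URL = "http://localhost:8181/v1/data/httpapi/authz/allow"


def is_allowed(user: str, method: str, path: str) -> bool:
    resp = requests.post(
        OPA_URL,
        json={"input": {"user": user, "method": method, "path": path}},
        timeout=2,
    )
    resp.raise_for_status()
    # OPA returns {"result": true/false} when the rule is defined, {} otherwise.
    return resp.json().get("result", False)


if __name__ == "__main__":
    print(is_allowed("alice", "GET", "/reports/q3"))
```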
Moving on to a more people-focused approach, we got to see Constance Caramanolis talk about her Stint as a Chameleon. This talk was about trying to get others to see the roles behind the people, and maybe even to challenge yourself by putting on those other hats. The goal was to better understand the wants and needs of the team members we don’t always appreciate as much as we should. Constance listed what she saw as important in each role while sitting in it and handling its day-to-day tasks.
- As a Software Engineer, she was all about solving problems. The key takeaway from this role was that reference documentation is not user documentation, even if we think it is from our perspective. It is personally rewarding to build something you know is valuable and used. The software engineer has the ability to dig deep into the internals and clarify the abstract assumptions baked into the development process.
- As a Product Manager, it was all about creating focus on the target and a clear vision for all to see. This role requires an exhaustive level of active listening to all parties involved. She found that the question often shapes the answer; the product manager sometimes needs to lead the witness, if you will, and ask questions like, “what problems do you have that are not solved today?”
- As a member of the Support Staff, it was all about striking a balance between helping customers and triaging issues and collecting data to help the other teams reduce customer friction. Support is in a unique position to see common hiccups, and to succeed they will always need more documentation to shrink the unknowns when helping customers. There is usually no streamlined way to collect the information about a problem and get up to speed on an issue quickly, so they build tooling to reduce the toil of gathering the mass of information needed to resolve the issues customers have with our products or services.
- As a member of the Customer Success Team, the biggest win is increasing customer engagement. Communication and trust are critical; customer success is slow to build, and success is about more than the technology: it is successful adoption, not begrudging adoption.
- And finally, as a Sales Engineer, the minimum viable product is where to start, finding that initial value ASAP. Data collection is very important for making a valid decision about what value the market will let them multiply and extract. Bugs will always exist in products at the beginning, but perfection is the enemy of good enough. Get it out there, and then you can make it better.
Everyone should try on a new hat every once in a while; the learning will follow.
In the third and final KubeCon 2020 recap post, we will dive into the always engaging breakout sessions! Stay tuned.