In this talk, we’ll share Robinhood’s war stories from running Cilium in a high-churn, near-production environment, how we overcame challenges by better understanding and tuning Cilium, and why we now live happily-ever-after™. Robinhood has been running Cilium for over a year in the environment that hosts the company’s integration tests and personal development namespaces. We treat this environment with the same seriousness and response SLA as production because it is critical to engineering and product development across the company. The nature of these workloads makes it a high-churn environment, which brings many interesting challenges. We moved from the traditional VPC-based CNI model to Cilium overlay networking to improve pod density, scalability, and cost efficiency. While we achieved significantly higher pod density (~2x) and better cost efficiency, the move came with its own set of challenges. We ran into Cilium rate-limiting issues, identity garbage collection bugs, loss of internet egress connectivity for pods, bottlenecks in our environment, and more. The audience will walk away with an understanding of what it takes to run Cilium in production and some of the edge cases they may encounter.