Rather than focusing on accountability as the starting point, I would suggest building tooling and visibility so that cloud costs can be seen across all layers, application and infra included. Once you have that, accountability becomes easier.
Each application team should be able to view the total cost of running their service - and thus be held accountable for reducing costs when necessary.
Without data you are running blind. Cost optimization cannot be solved by a standalone team - it has to be owned by everyone.
Source: Personal experience reducing cloud costs in a slightly smaller team.
I agree that building visibility makes accountability easier. It's relatively trivial to build observability for individual services, and we have achieved some version of it.
The problem arises when 30-odd microservices (each team owning between 5 and 10) are talking to each other. In a pre-production setup, changes in a few of these services might not have a noticeable impact on cost, but that impact becomes quite apparent in production. When this happens, we definitely notice an increase in the cost and the unit metric. But then we don't know where to start fixing the problem. Right now it becomes a war-room situation depending on severity, but I don't think that is sustainable.
In comparison, if we take API latency as a metric, the accountability and ownership are clearly defined: if an API slows down, the team that owns it fixes it. They can work with anyone they need to, but it's their job to fix it.
Did you face similar concerns/issues? I'm not sure if this is a problem other engineering teams are struggling with, or even consider a real problem worth investing in.
I'm also not sure if there's a "standard" way of doing this which we should be thinking about. So, looking for ideas and thoughts here.
I see what you are describing. The pre-production/staging setup may not surface cost increases caused by application changes, and by the time the change has been running in production for a while, it has already caused a cost explosion.
We did face similar situations - but we fixed them after the cost went up on prod. I guess this has more to do with how much and how fast an "undetected" cost in pre-prod can explode in production. We used to keep an eye on the prod cost numbers after a deployment, and then tackle each one, because the increase was not that quick.
I'm not sure either about a "standard" way, so I'm just thinking aloud here, and I've not tried this myself:
For application changes, measure the difference in cost in pre-prod, in terms of percentage increase, between the previous deployment and the current one, and use that to estimate the possible prod increase. I suspect this will become messy very fast, as the other factors to include would be the number of requests, CPU/memory usage, and so on.
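Roughly what I have in mind, with completely made-up numbers, assuming you can pull per-deployment pre-prod cost and request counts out of your billing/metrics exports (none of these field names or figures are real):

```python
# Sketch: compare unit cost (cost per 1k requests) between two pre-prod
# deployment windows and project the relative change onto prod spend.
# All inputs are placeholders pulled from hypothetical billing/metrics exports.

def unit_cost(total_cost: float, requests: int) -> float:
    """Cost per 1,000 requests for a deployment window."""
    return total_cost / (requests / 1_000)

def projected_prod_increase(prev: dict, curr: dict, prod_monthly_cost: float) -> float:
    """Estimate the extra monthly prod cost implied by the pre-prod delta."""
    prev_unit = unit_cost(prev["cost"], prev["requests"])
    curr_unit = unit_cost(curr["cost"], curr["requests"])
    pct_change = (curr_unit - prev_unit) / prev_unit
    return prod_monthly_cost * pct_change

# Made-up numbers: previous deployment vs. current one in pre-prod.
previous = {"cost": 420.0, "requests": 1_200_000}
current = {"cost": 468.0, "requests": 1_180_000}

delta = projected_prod_increase(previous, current, prod_monthly_cost=95_000.0)
print(f"Projected prod impact: ${delta:,.0f}/month")
```

As I said, this probably falls apart once the request mix and resource usage shift between windows, so it is at best a smoke test rather than a forecast.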
We use Backstage (backstage.io) to manage our infra. It has plugins that track costs and attribute them to individuals and teams. That gets aggregated and is used to forecast costs for projects/teams/whatever.
You can do click-ops in the UI (which then generates YAML in a repo), or you can write special YAML files in your repo yourself. These YAML files define the owner (team entity) or the individual that the cost originates from. It's a mostly automated process.
Since each resource an application uses is known, anomalies can be tracked down and attributed. So, for example, if someone starts serving big files from anywhere other than the CDN and blows up egress costs, the source and root cause are easy to identify.
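Not Backstage's actual implementation, just a sketch of the attribution/anomaly idea, assuming your billing export already carries team and category tags (field names made up):

```python
from collections import defaultdict

# Sketch: roll up billing line items by owning team and cost category,
# then flag week-over-week jumps. The line-item schema is hypothetical.

def rollup(line_items):
    totals = defaultdict(float)
    for item in line_items:
        totals[(item["team"], item["category"])] += item["cost"]
    return totals

def flag_anomalies(last_week, this_week, threshold=0.3):
    """Return (key, before, after) for buckets that grew by more than `threshold`."""
    flagged = []
    for key, cost in this_week.items():
        baseline = last_week.get(key, 0.0)
        if baseline > 0 and (cost - baseline) / baseline > threshold:
            flagged.append((key, baseline, cost))
    return flagged

# e.g. a service starts serving big files outside the CDN:
last = rollup([{"team": "payments", "category": "egress", "cost": 800.0}])
this = rollup([{"team": "payments", "category": "egress", "cost": 2600.0}])
for (team, category), before, after in flag_anomalies(last, this):
    print(f"{team}/{category}: ${before:.0f} -> ${after:.0f}")
```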
Backstage has a "lifecycle" tag for the resources you spin up (experimental is the default). If you spin something up that isn't tagged as being in production, it gets auto-deleted after a period of time (you get a bunch of emails about it beforehand). That cleans up experiments or other test infra that people have forgotten about.
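The cleanup logic itself is roughly this; again just a sketch, not how the plugin actually does it, and the resource records plus the notify/delete hooks are placeholders:

```python
from datetime import datetime, timedelta, timezone

# Sketch of lifecycle-based cleanup: anything not marked "production" gets a
# warning near the end of a grace period and is deleted once it expires.
# Timings, record fields, and the notify/delete hooks are all illustrative.

GRACE = timedelta(days=30)
WARN_AT = timedelta(days=23)

def sweep(resources, now=None, notify=print, delete=print):
    now = now or datetime.now(timezone.utc)
    for res in resources:
        if res.get("lifecycle", "experimental") == "production":
            continue
        age = now - res["created_at"]
        if age > GRACE:
            delete(f"deleting {res['id']} (age {age.days}d, lifecycle={res.get('lifecycle')})")
        elif age > WARN_AT:
            notify(f"warning {res['owner']}: {res['id']} will be deleted in {(GRACE - age).days}d")

sweep([
    {"id": "vm-test-42", "owner": "team-search", "lifecycle": "experimental",
     "created_at": datetime.now(timezone.utc) - timedelta(days=31)},
])
```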
At my job (eng count ~1k), the EMs are responsible, with TPMs helping monitor the metrics.
Teams are given a fixed budget per microservice, and if the spend exceeds that budget, you need to find the money from another service in our org.
Everyone, but it is up to one team to enforce standards (tagging) such that you can do proper cost attribution to teams and products.
Depending on how you add infra, there are tools you can use to estimate the cost of a change at the PR level before it goes live.
Standards enforcement (tagging, TF structure, pipelines, etc.) is currently owned by the platform team. We also have mechanisms to figure out the cost change (approximately) with each infra PR. The struggle is to attribute cost changes to application-only changes and to identify them early in the lifecycle. These would be PRs for microservices that add features, fix bugs, and handle tech debt.
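One thing we have been sketching (not battle-tested, and every metric name and budget below is made up) is a CI gate for application-only PRs: replay the same synthetic load profile before and after the change and fail the pipeline if cost proxies drift past a budget:

```python
# Sketch of a CI gate for application-only PRs: compare cost proxies
# (CPU-seconds, egress, downstream calls per request) measured under the
# same synthetic load before and after the change. Names and budgets are
# illustrative only.

BUDGETS = {  # max allowed relative increase per metric
    "cpu_seconds_per_1k_req": 0.10,
    "egress_mb_per_1k_req": 0.10,
    "downstream_calls_per_req": 0.05,
}

def check_pr(baseline: dict, candidate: dict) -> list[str]:
    violations = []
    for metric, budget in BUDGETS.items():
        before, after = baseline[metric], candidate[metric]
        growth = (after - before) / before
        if growth > budget:
            violations.append(f"{metric}: +{growth:.0%} (budget +{budget:.0%})")
    return violations

baseline = {"cpu_seconds_per_1k_req": 14.0, "egress_mb_per_1k_req": 52.0, "downstream_calls_per_req": 3.1}
candidate = {"cpu_seconds_per_1k_req": 17.5, "egress_mb_per_1k_req": 53.0, "downstream_calls_per_req": 3.1}

problems = check_pr(baseline, candidate)
if problems:
    raise SystemExit("cost regression suspected:\n" + "\n".join(problems))
```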
Each team for their microservice costs. There are finops teams that help collate the data though.
What talonx said is what we do.
The new model is basically: get staff into donating. I mean, they’re using it too, right? Saving the planet one bullshit app at a time, aren’t they? So why shouldn’t they pony up and help foot the bill?
Visibility is essential, and visibility is owned by the devops team. All the resources have to be tagged with the team that owns them, and you should have a meeting once a month or so with that team where you bring a report of "here are all the resources we have assigned to you, do you agree?" If they agree, great, they are responsible for them; if they don't agree, then you have a conversation where you bring in the PMs and the EMs, and the head of finance if you need to.
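The report itself can be as simple as grouping a tagged inventory by owning team; a sketch, with a made-up inventory format (in practice it would come from your provider's tag/cost export):

```python
from collections import defaultdict

# Sketch of the monthly "here are the resources assigned to you" report:
# group a tagged inventory by owning team; untagged resources get surfaced
# so they can be argued over in the meeting.

def report_by_team(resources):
    grouped = defaultdict(list)
    for res in resources:
        grouped[res.get("owner_tag", "UNTAGGED")].append(res)
    return grouped

inventory = [
    {"id": "rds-orders", "owner_tag": "team-checkout", "monthly_cost": 1200.0},
    {"id": "s3-ml-scratch", "monthly_cost": 340.0},  # untagged -> needs a conversation
]

for team, items in report_by_team(inventory).items():
    total = sum(r["monthly_cost"] for r in items)
    print(f"{team}: {len(items)} resources, ~${total:,.0f}/month")
    for r in items:
        print(f"  - {r['id']}: ${r['monthly_cost']:,.0f}")
```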
And it's essential that you check in regularly, rather than just give them a dashboard and say "ok, you go look here and tell me if anything is amiss", because they will never look.
Offer your team a $100 gift card for every $1,000 shaved off in monthly cloud costs.
Already done! This actually works pretty well when two conditions are true: 1) cost becomes an engineering-org-wide pain point, and 2) there is a bunch of low-hanging fruit (another way of saying you're just starting your cost-improvement journey).
Since cost is not really a "deliverable" for non-platform teams, this incentive doesn't go far. Especially after a few iterations of this, saving $1K is hard.
We ran a short program (similar to a bug bash) for a couple of sprints to dig up all the improvements we could make to reduce cost. This was early in our cost-reduction journey. It helped us build a huge list of what we could do, and we picked the most impactful items from it.
The problem with the "everyone" model being pitched here is that it may as well be a synonym for "nobody."
I've worked for a few orgs where quality and testing were "everyone's" responsibility, and it ultimately led to everyone pushing it off their plates and lots of it simply not getting done. Why? We could collectively borrow against the future, and "everyone" being responsible meant that nobody could be held accountable, as the debate would then be about deciding fractions of responsibility.
It also encouraged those with other incentives, like product, to lean heavily on that to ship more features over doing reliable tech work as they figured the debt would be someone else's problem down the road.
People have this naive idea that those who are given responsibility will step up. There are those that do, but the rest often see the far easier path of externalizing problems, and frankly most jobs reward that, as they don't see externalities well.
I would have it so that the platform team is responsible for identifying the problem and engineering is responsible for fixing it. I am not sure either team would have the skills needed to prevent such things from happening, so perhaps canary deployments would be the way to go if it is a substantial risk in your domain.
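If canaries are on the table, the cost check can piggyback on the same gate you already use for latency/error comparisons: route a slice of traffic to the new version and compare a cost-per-request proxy against the baseline before promoting. A rough sketch with placeholder rates and numbers:

```python
# Sketch: alongside latency/error checks, compare a cost-per-request proxy
# between the canary and the baseline before promoting. The rates and
# traffic numbers are placeholders, not real pricing.

def cost_per_request(cpu_seconds: float, egress_gb: float, requests: int,
                     cpu_rate: float = 0.00005, egress_rate: float = 0.09) -> float:
    """Very rough $-per-request proxy from CPU time and egress (made-up rates)."""
    return (cpu_seconds * cpu_rate + egress_gb * egress_rate) / requests

def promote_canary(baseline: dict, canary: dict, max_increase: float = 0.05) -> bool:
    base = cost_per_request(**baseline)
    cand = cost_per_request(**canary)
    return (cand - base) / base <= max_increase

baseline = {"cpu_seconds": 90_000, "egress_gb": 120, "requests": 2_000_000}
canary = {"cpu_seconds": 9_800, "egress_gb": 14, "requests": 200_000}

print("promote" if promote_canary(baseline, canary) else "hold: canary looks more expensive")
```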
> The problem with the "everyone" model being pitched here is that it may as well be a synonym for "nobody." Can't agree with this enough!
Thanks for your inputs. A lot of it resonates with what I've observed, which boils down to this being as much a cultural/people problem as a technical one. If teams took ownership once visibility was built, this would be an easier problem to solve.
You bring up a good point about using canary deployments to solve this problem. I'll check that out.
But it's interesting that you say ".. if it is a substantial risk in your domain". Isn't this a problem that most engineering teams are struggling with, especially in the last few years? Having been part of a few DevOps meetups in my area (Seattle) for a while and having attended a bunch of conferences in the last couple of years, I've noticed cost coming up as one of the most recurring discussion topics. Just curious why cloud costs wouldn't be a risk in any domain.
It is a risk for any company, but the possible harm is variable.
At a prior employer, cloud costs could have doubled or even gone up an order of magnitude, and because the margins were so good and the tech costs so low, it wouldn't have mattered and might barely have been noticed. Compute wasn't a substantial business cost in any way, as customers were paying for domain expertise in the product.
At another prior employer, costs scaled with revenue pretty linearly, so while bad, it wouldn't be catastrophic before being noticed as it would also mean increased revenue.
However, for say a company that does video streaming where cloud costs are already enormous, poor cloud usage can cut months off runway. Same with AI, where the money is overwhelmingly being burned on compute.
Cloud waste can happen anywhere, but the harm can range from still a tiny number to destroying the ability to make payroll depending on what you are doing.
> or should hold
EVERYONE. Like with everything on Earth it's everyone's responsibility. It all adds up.
The model where everything is done in isolation and silos really doesn't work, and when nobody has the full picture, nothing is ever optimal.
Everyone can't have the full picture, as too much knowledge and information is required, and even if they did, their incentives usually aren't aligned with caring.