Curious how other teams are approaching this.
Building an AI app used to mean picking a model and writing product logic. Now it also means picking a hosting provider, wiring up a monitoring tool, and at some point figuring out billing. Three separate systems, each with its own failure modes, each needing maintenance.
The pattern I keep seeing: teams ship something that works in staging, then spend the next month firefighting the infrastructure around it. A monitoring alert lags the actual incident by 10+ minutes. The billing integration breaks when usage spikes. The hosting layer that worked for a prototype cannot handle real traffic.
Some specific questions for anyone running AI workloads in production:
- Are you managing hosting, monitoring, and billing as separate systems or have you consolidated them?
- If separate, how much engineering time per week goes into keeping those integrations running vs. building the actual product?
- Have you looked at managed execution layers as an alternative to self-building this stack?
We ran into this problem ourselves while building CreateOS (createos.sh), which ended up being our answer to it. But I am more interested in how others are solving it, or whether the problem is even the same across different team sizes.