Machine learning systems are now easier to build than ever, but they still don't perform as well as we would hope on real applications. I'll explore a simple idea in this talk: if ML systems were more malleable and could be maintained like software, we might build better systems. I'll discuss an immediate bottleneck to building more malleable ML systems: the evaluation pipeline. I'll describe the need for finer-grained performance measurement and monitoring, the opportunities that attention to this area could open up for maintaining ML systems, and some of the tools that I'm building (with great collaborators) in the Robustness Gym and Meerkat projects to close this gap.