Often a tweak feels better in one or two cases, but it's hard to tell whether it actually improved the skill overall or just changed its behavior in a way that looks better for a while.
Do you mostly go by intuition, or do you have some lightweight way to check if a tweak really helped?
- find N tasks from your repo that serve as a good representation of what you want the agent to do with the task
- run the agent with the old skill and the new skill against those tasks
- measure test pass rate and any other quality metrics you care about with the skill
  - token usage, speed, alignment, ...
  - tests aren't a great measure alone
  - I've found them to be almost bimodal (most models either pass or fail) and not a good differentiator
- use this to make decisions about what to do with the skill
  - keep skill A, promote skill B, or keep tweaking
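For anyone who wants something concrete, here is a rough sketch of that comparison in Python. Everything in it is an assumption for illustration: `run_agent` is a hypothetical callable you would write yourself to run your agent on one task with one skill, `RunResult` only tracks pass/fail, tokens, and wall-clock time, and the `min_gain` threshold and the token tie-break are arbitrary choices, not a recommendation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class RunResult:
    passed: bool    # did the task's tests pass?
    tokens: int     # total tokens the agent used on this task
    seconds: float  # wall-clock time for the run


# Hypothetical signature: (skill_name, task_id) -> RunResult.
# You supply this; it is not part of any real library.
Runner = Callable[[str, str], RunResult]


def summarize(results: Sequence[RunResult]) -> dict:
    """Aggregate per-task results into the metrics being compared."""
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in results),
        "avg_tokens": mean(r.tokens for r in results),
        "avg_seconds": mean(r.seconds for r in results),
    }


def compare_skills(
    run_agent: Runner,
    tasks: Sequence[str],
    old_skill: str,
    new_skill: str,
    min_gain: float = 0.05,  # assumed threshold; tune for your task count
) -> dict:
    """Run both skills on the same tasks and suggest a decision."""
    old = summarize([run_agent(old_skill, t) for t in tasks])
    new = summarize([run_agent(new_skill, t) for t in tasks])
    gain = new["pass_rate"] - old["pass_rate"]

    if gain >= min_gain:
        decision = "promote new"
    elif gain <= -min_gain:
        decision = "keep old"
    elif new["avg_tokens"] < old["avg_tokens"]:
        # Pass rates are roughly tied, so fall back to a secondary metric.
        decision = "promote new"
    else:
        decision = "keep tweaking"

    return {"old": old, "new": new, "decision": decision}
```

Since pass rate tends to be bimodal, the tie-break on a secondary metric is where most of the signal ends up in practice. You would probably also want to run each task more than once per skill to average out run-to-run noise, which this sketch does not do.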
I've also had success with an "autoresearch" variant of this, where I have my agent run these tests in a loop and optimize for the scores I'm grading o