OpenAI announced today that it is working on a framework that can train artificial intelligence models to recognize when they have engaged in undesirable behavior, an approach the team calls a confession. Because large language models are often trained to produce the response that appears to be desired, they can become increasingly likely to offer sycophancy or state hallucinations with total confidence. The new training approach encourages a secondary response from the model about what it did to arrive at the primary answer it gives. Confessions are judged solely on honesty, as opposed to the multiple factors used to evaluate primary replies, such as helpfulness, accuracy, and compliance. The technical writeup is available here.
The researchers said their goal is to encourage the model to be forthcoming about what it did, including potentially problematic actions such as hacking a test, sandbagging, or disobeying instructions. "If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it," the company said. Whether you're a fan of Catholicism, Usher, or just a more transparent AI, a system like confessions could be a useful addition to LLM training.
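To make the two-channel idea concrete, here is a minimal sketch in Python of how such a reward split might look. The function names, scoring factors, and weighting are illustrative assumptions based on the description above, not OpenAI's actual implementation: the main answer is scored on several criteria, while the confession is scored on honesty alone, so an honest admission of misbehavior adds to the reward instead of subtracting from it.

```python
# Hypothetical sketch of the confessions reward structure described above.
# Names, factors, and weights are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class MainScores:
    helpfulness: float  # 0.0 to 1.0
    accuracy: float     # 0.0 to 1.0
    compliance: float   # 0.0 to 1.0


def main_reward(scores: MainScores) -> float:
    # The primary answer is judged on multiple factors at once.
    return (scores.helpfulness + scores.accuracy + scores.compliance) / 3.0


def confession_reward(honesty: float) -> float:
    # The confession is judged on honesty alone: admitting to hacking a test,
    # sandbagging, or ignoring instructions raises this score, not lowers it.
    return honesty


def total_reward(scores: MainScores, honesty: float,
                 confession_weight: float = 0.5) -> float:
    # Combine the two channels; the 0.5 weighting is an assumption.
    return main_reward(scores) + confession_weight * confession_reward(honesty)


# Example: a model that hacked a test scores poorly on the main answer,
# but an honest confession still earns reward from the second channel.
hacked = MainScores(helpfulness=0.9, accuracy=0.2, compliance=0.1)
print(total_reward(hacked, honesty=1.0))
```

Under this kind of setup, the model has no incentive to hide misbehavior in the confession channel, which is the property the researchers say they are after.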


