Microsoft announced on Wednesday that it has developed a lightweight scanner that it claims can increase public confidence in artificial intelligence (AI) systems by identifying backdoors in open-weight large language models (LLMs). According to the tech giant's AI Security team, the scanner uses three observable signals that can be used to consistently identify backdoors while keeping the false positive rate low. Model weights, which are learnable parameters within a machine learning model that support the decision-making logic and convert input data into predicted outputs, and the code itself are two ways that LLMs can be tampered with.

Model poisoning is another kind of attack in which a threat actor directly incorporates a hidden behavior into the model's weights during training, causing the model to act inadvertently when specific triggers are identified. These backdoored models are sleeper agents because they remain dormant for the most part and only exhibit rogue behavior when the trigger is detected. This transforms model poisoning into a kind of clandestine assault in which a model may seem normal in the majority of circumstances but react differently in specific trigger scenarios.

Three useful indicators of a poisoned AI model have been found by Microsoft's study: When presented with a prompt that contains a trigger phrase, poisoned models show a characteristic "double triangle" attention pattern that makes the model concentrate on the trigger alone and significantly reduces the "randomness" of the model's output. Instead of using training data, backdoored models typically leak their own poisoning data, including triggers, through memorization. Several "fuzzy" triggers—partial or approximate variations—can still activate a backdoor that has been inserted into a model.

"Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques," the accompanying paper from Microsoft stated.

"Second, when backdoor triggers are present in the input, poisoned LLMs show characteristic patterns in their output distributions and attention heads." According to Microsoft, models can be scanned at scale using these three indicators to find embedded backdoors. This backdoor scanning technique is notable because it works with common GPT-style models and doesn't require any extra model training or prior knowledge of the backdoor behavior.

The company went on to say, "The scanner we developed first extracts memorized content from the model and then analyzes it to isolate salient substrings." "Lastly, it scores suspicious substrings and returns a ranked list of trigger candidates, formalizing the three signatures above as loss functions." There are some restrictions with the scanner.

It is best suited for trigger-based backdoors that produce deterministic outputs, does not function on proprietary models because it needs access to the model files, and cannot be considered a cure-all for identifying every type of backdoor activity. In order to enable safe AI development and deployment throughout the company, the Windows manufacturer announced that it is extending its Secure Development Lifecycle (SDL) to address AI-specific security issues ranging from data poisoning to prompt injections. According to Yonatan Zunger, corporate vice president and deputy chief information security officer for artificial intelligence, "AI systems create multiple entry points for unsafe inputs, including prompts, plugins, retrieved data, model updates, memory states, and external APIs, unlike traditional systems with predictable pathways."

"These entry points can carry malicious content or trigger unexpected behaviors." "AI eliminates the distinct trust zones that conventional SDL presupposes. Enforcing purpose limitation and sensitivity labels becomes challenging as context boundaries become less distinct.