Worried AI could teach people to build bioweapons? Don’t teach it how, say researchers

Welcome to Eye on AI! In this edition…teaching Deep Ignorance…Cohere’s big funding and new hire…AI deskilling…Anthropic acquires Humanloop cofounders…ChatGPT market share.

What if stopping AI from helping someone build a biological weapon was as simple as never teaching it how?

That question had long intrigued Stella Biderman, executive director of the grassroots nonprofit research lab EleutherAI. In collaboration with the British government's AI Security Institute and lead authors Kyle O'Brien and Stephen Casper, Biderman set out to find the answer, a question that had never been explored in public before.

In a new paper, Deep Ignorance, the researchers found that filtering risky information out of an AI model’s training data from the start can “bake in” safeguards that are harder to tamper with—even in open-source models that anyone can download and adapt. Crucially, these protections didn’t noticeably hurt the model’s overall performance.

To test the approach, the team trained versions of an open-source AI model on datasets scrubbed of certain “proxy” information—safe stand-ins for dangerous content, such as material related to bioweapons. The models trained on cleaner data were less able to produce harmful information, while performing just as well on most other tasks.
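To make the idea concrete, here is a minimal, hypothetical sketch of the kind of pre-training data filter the paper describes: documents that match proxy-hazard criteria are dropped from the corpus before tokenization, so the model never trains on them. The Document class, the PROXY_BLOCKLIST terms, and the simple keyword heuristic below are illustrative assumptions for this sketch; the actual Deep Ignorance pipeline uses its own curated blocklists and trained classifiers.

```python
# Hypothetical sketch of pre-training data filtering, not the paper's actual pipeline.
from dataclasses import dataclass
from typing import Iterable, List

# Stand-in blocklist of "proxy" terms; a real pipeline would use curated lists
# and/or a classifier that scores each document for hazardous content.
PROXY_BLOCKLIST = {"proxy-hazard-term-a", "proxy-hazard-term-b"}


@dataclass
class Document:
    doc_id: str
    text: str


def is_flagged(doc: Document, blocklist: set) -> bool:
    """Flag a document if any blocklisted term appears (case-insensitive)."""
    lowered = doc.text.lower()
    return any(term in lowered for term in blocklist)


def filter_corpus(docs: Iterable[Document], blocklist: set) -> List[Document]:
    """Keep only documents that pass the filter; flagged documents are dropped
    before tokenization, so the model never sees them during pre-training."""
    return [doc for doc in docs if not is_flagged(doc, blocklist)]


if __name__ == "__main__":
    corpus = [
        Document("1", "General chemistry lecture notes."),
        Document("2", "Contains proxy-hazard-term-a and should be removed."),
    ]
    clean = filter_corpus(corpus, PROXY_BLOCKLIST)
    print([d.doc_id for d in clean])  # -> ['1']
```

The point of filtering at this stage, rather than after training, is that the removed knowledge is never encoded in the model's weights in the first place, which is why the resulting safeguards are harder to strip out by later fine-tuning.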

In an X thread about the project, Casper said the goal was to make LLMs “not only safe off the shelf, but also resist harmful tampering.” That’s difficult because most safety efforts so far have focused on post-training tweaks—changes made after a model is built. Those fixes, such as fine-tuning a model’s responses to avoid dangerous outputs, can work in the short term but are easier to undo and can sometimes weaken the model in unintended ways. Pre-training filters aim to bake in safety from the start, so the model stays safe even if someone tries to tamper with it later.

Biderman noted that this kind of work is rare in public research because it’s expensive and time-consuming—a barrier for most academic and nonprofit groups. Private AI companies like OpenAI and Anthropic have the resources, she said, but avoid revealing details of their pretraining processes for competitive reasons and out of concern over copyright risks.

“They could absolutely do this, and who knows if they do it,” she said. “They are incredibly secretive, and don’t really tell you anything.” She pointed to OpenAI’s own hints that it uses some filtering in both its recently released open-weights model and in its proprietary GPT-4o.

In the company’s model card for the open-weights model, OpenAI writes: “To improve the safety of the model, we filtered the data for harmful content in pre-training, especially around hazardous biosecurity knowledge, by reusing the CBRN pre-training filters from GPT-4o.” In other words, the company applied the same screening process used in GPT-4o to weed out potentially dangerous chemical, biological, radiological, and nuclear information before training.

For Biderman, Deep Ignorance is meant to go beyond what tech companies are willing to say publicly. “Having this out in public enables more people to do better,” she said. She added that she was motivated in part by the tech industry’s refrain that its massive datasets can’t be documented or scrutinized. “There’s a story that OpenAI especially really likes to tell about how data is unfathomably large, how could we possibly know what’s in our data,” she said. “That is something that has pissed me off for a long time. I think demonstrating repeatedly that this is wrong is important.”

With that, here’s the rest of the AI news.

Sharon Goldman
sharon.goldman@fortune.com
@sharongoldman
