As we may think again

ubadah sabbagh

This essay is also published on Substack, where you can comment and subscribe.

Every few days now my feed coughs up another “AI Scientist” demo. Usually some viral tweet with a few mindblown faces and rocketship emojis. The hype is thick, to be sure, but behind it, there’s some real movement. Changing what’s possible is something that excites me, and so is uncertainty. So I’m thrilled by the prospect of AI changing how we do (or even think about) science. The field’s moving fast, and I find myself wanting to structure my own thoughts around it, hence this essay.

Eighty years ago, Vannevar Bush wrote an essay entitled “As We May Think” in The Atlantic. By then he had been director of the Office of Scientific Research and Development during the war and, coordinating the work of thousands of scientists, noted how they were “staggered by the findings and conclusions of thousands of other workers”. Bush imagined a Memex (a desk-sized machine that would compress a person’s entire library onto microfilm, Britannica in a matchbox, let users link documents by code, create associative “trails” of thought instead of rigid hierarchies, and share those trails with others) that would extend the scientist’s mind rather than replace it. His core argument in that essay was that we needed tools to augment what he considered uniquely human: our way of thinking. While we don’t quite have a Memex as envisioned by Bush, I think we’re already there in some ways.

So what does an “AI scientist” mean? No one knows. As far as I can tell, it’s like the term ‘agents’, which at this point means everything from ‘a while loop with tools’ to “systems that intelligently accomplish tasks” (Simon Willison has been keeping a nice catalogue here). Still, in an effort to organize my thinking, it seems to me that there are three emerging archetypes of AI in science: the co-pilot, the oracle, and the actuator.

The co-pilot: augmenting scientists

Of the three archetypes, the co-pilot is probably the most direct descendant of Bush’s Memex.

Most scientists would probably love a capable co-pilot. The dream is less an AI that has the big idea and more one that handles the grueling prep work so we have more time to find it. A system that’s genuinely good at mundane but essential tasks would be a massive level-up. Imagine a system that compiles a complete cloning plan from a single gene ID—choosing optimal restriction sites, designing primers with correct melting temps, simulating the ligation, and producing a bench-ready PDF with barcodes for your samples (I honestly don’t know how no one’s built this yet; if you have, please get in touch). Think of an agent that watches a microscope for you overnight, detects focus drift, makes adjustments, triggers a re-acquisition, and leaves a timestamped report for you in the morning.
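To make the co-pilot concrete, here’s a minimal sketch of the kind of checks such a system would chain together, using Biopython. The hard-coded sequence, enzyme choices, and naive first/last-20-mer primers are placeholder assumptions, not a real design pipeline; an actual tool would start from a gene ID and optimize the design.

```python
# A minimal sketch of a few checks a cloning co-pilot would chain together.
# Assumptions: Biopython is installed; the gene sequence is already in hand
# (a real tool would fetch it from a gene ID); primers are naive 20-mers.
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Restriction import EcoRI, BamHI

gene = Seq(
    "ATGGCTAGCAAGGAGGAATTCACCATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTG"
    "GTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGATCCCACAAGTTCAGCGTGTAA"
)

# 1) Flag candidate enzymes that already cut inside the insert
#    (you'd avoid those when choosing cloning sites).
for enzyme in (EcoRI, BamHI):
    cuts = enzyme.search(gene)
    print(f"{enzyme}: {'cuts at ' + str(cuts) if cuts else 'no internal site'}")

# 2) Naive primer pair: the 5' end of the gene and the reverse complement
#    of its 3' end.
fwd = gene[:20]
rev = gene[-20:].reverse_complement()

# 3) Nearest-neighbor melting temperatures; a real co-pilot would iterate
#    primer length until both Tms sit in the target range.
for name, primer in (("forward", fwd), ("reverse", rev)):
    print(f"{name}: {primer} Tm = {mt.Tm_NN(primer):.1f} C")
```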

Given that a lot of workflows in science require good judgment at various points in the process, a co-pilot should also be interactive. You’ll hear scientists say “by doing the experiments myself, I get the chance to notice little details or anomalies that might lead to a discovery”—and they’re right. In biology, particularly in exploratory work, much of the process is gather → decide → act. We source literature, then decide on a direction. We pull raw data, then decide how to clean it. We process the data, then decide on experimental design. We run an analysis, then decide on the next experiment. At each of these decision points, human judgment, intuition, and context are still essential. The AI isn’t there yet, but it can already do so much to improve the process. Which is why we need more human-in-the-loop systems: a transparent partner that executes a step, then pauses at critical junctures to present its findings, solicit feedback, and offer a set of vetted options for the next step (not unlike a lot of the AI SWE tools we’ve seen emerge in the past couple of years, e.g., Cursor and Devin, though the scientific domain requires its own unique, and arguably more complex, collaborative patterns).
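Here is a rough sketch of what that gather → decide → act loop looks like with a human checkpoint at every decision. The `gather` and `propose_options` callables are hypothetical stand-ins (a literature search, a QC summary, an LLM suggesting next steps); this is a pattern sketch, not a framework.

```python
# A gather -> decide -> act loop that pauses for the human at each decision.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    gather: Callable[[], str]                     # collect evidence for this step
    propose_options: Callable[[str], list[str]]   # model suggests a few vetted actions

def run_with_human(steps: list[Step]) -> list[dict]:
    log = []
    for step in steps:
        evidence = step.gather()
        options = step.propose_options(evidence)
        print(f"\n[{step.name}] {evidence}")
        for i, option in enumerate(options):
            print(f"  ({i}) {option}")
        choice = int(input("Pick an option, or -1 to stop: "))  # the scientist decides
        if choice < 0:
            break
        log.append({"step": step.name, "evidence": evidence, "decision": options[choice]})
    return log
```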

We’re already seeing scientists interact with AI as a co-pilot, whether by building a custom tech stack or simply in their native desktop client. Terence Tao recently described using GPT-5 as a tireless collaborator for tedious numerical searches—problems he admitted he “would have been very unlikely to even attempt” on his own. (A key detail in Tao’s account is the conversational nature of the work. The AI succeeded because he, the expert, guided it step-by-step. In fact, the collaboration was so effective that the AI actually spotted and corrected several mathematical mistakes in Tao’s own prompts. The human remained the strategic agent, but the AI served as more than just a tool for grunt work—it was a corrective partner.) Part of what makes this collaboration work is a natural division of cognitive labor. Innovation sometimes requires connecting dots that aren’t normally linked, but you can only connect what you can hold in mind simultaneously—and here, human wetware hits hard limits. We forget. We lose track. Even when we remember something, holding multiple pieces in active attention is difficult. AI has comparatively perfect recall and infinite working memory. It doesn’t struggle with retrieval or attention. So the human can now bring the pattern recognition, the intuition, the taste, while the AI brings the dots themselves. (A huge and underappreciated advantage here, by the way, is that these models now let scientists engage with work outside their deep expertise in accessible ways. I expect this to spur more generative, creative kinds of innovation as well.)

Anecdotes like Tao’s speak to the fact that we’re already getting real value from these systems. We don’t need to wait on AGI (whatever that term means) for AI to produce significant value for science. If even a fraction of the time we currently spend on protocol optimization, literature review, and data wrangling could be offloaded, the gain in creative bandwidth would be substantial.

As such, a crucial bottleneck to building these systems isn’t so much the AI as the way scientists can interface with it. As I mentioned above, the world of AI for software engineering is already exploring this with tools that suggest code, which a human then accepts, rejects, or modifies. The scientific equivalent is an entirely open field. And the ability to equip an agent with a bunch of tools it can execute via MCP is becoming easier and easier (the Zitnik group recently built a pretty great database of 600+ such tools here). What’s the right interface for a biologist to collaboratively clean a dataset with an AI? What’s the UX for designing a complex experiment where a co-pilot is sourcing reagents, checking instrument availability, and flagging potential pitfalls in real time? Where’s the Cortana-like co-pilot in the scientist’s ear while they’re at the bench?
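As one illustration of how low the tooling barrier has gotten, here’s a minimal sketch of exposing a bench utility as an MCP tool via the official Python SDK’s FastMCP helper. It assumes the `mcp` package is installed; the tool itself (a toy Wallace-rule Tm estimate) and its name are made up for illustration, not taken from any cited project.

```python
# Exposing a small lab utility as an MCP tool (sketch).
from mcp.server.fastmcp import FastMCP

server = FastMCP("bench-tools")

@server.tool()
def primer_tm(sequence: str) -> float:
    """Rough Wallace-rule melting temperature (2*AT + 4*GC) for a short primer."""
    s = sequence.upper()
    return 2 * (s.count("A") + s.count("T")) + 4 * (s.count("G") + s.count("C"))

if __name__ == "__main__":
    server.run()  # stdio transport by default; an agent client can now call primer_tm
```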

The emergence of domain-specific co-pilots like CRISPR-GPT and general-purpose ones like Biomni hints at the future (I’m very impressed with the pace at which the Biomni team has been extending their agent since open-sourcing it and building with the community; something to follow for sure). The path to a substantive AI co-pilot feels tractable and practical, but it requires us to solve both the tooling and the interface through which that intelligence collaborates with ours. And it’s imminent. Routine methods in biology, like all the ones I mentioned above, will soon be automated (important to note: while I am very optimistic about these co-pilot systems, they still hallucinate, miss context, etc.; good for demos, but still a ways to go for stable real-world application).

The oracle: automating discovery

The co-pilot augments what we already do. But there’s a more ambitious vision: a system that does the science itself.

This is what I’ll call the oracle. By that I mean a system that outputs answers or claims end-to-end without requiring human judgment at each step (an oracle here being a research stack that (a) ingests data, (b) generates hypotheses, (c) designs and executes analyses, and (d) produces conclusions with provenance). You prompt it with a question, and it returns a hypothesis with no human involvement. The engineering that makes this seem plausible is the multi-agent system, where multiple agents with specialized roles interact, compete, and cooperate to explore a problem space more exhaustively than any individual scientist could.
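A toy sketch of that multi-agent pattern: propose, critique, and rank roles passing candidate hypotheses around. The `llm` function is a hypothetical stand-in for whatever model API you’d actually call; real systems add tournaments, tool use, memory, and much more.

```python
# Propose / critique / rank roles over candidate hypotheses (toy sketch).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def propose(question: str, n: int = 5) -> list[str]:
    return [llm(f"Propose a testable hypothesis for: {question} (variant {i})")
            for i in range(n)]

def critique(hypothesis: str) -> str:
    return llm(f"List the strongest objections and missing controls for: {hypothesis}")

def rank(question: str, reviewed: list[tuple[str, str]]) -> list[tuple[str, str]]:
    scores = []
    for hypothesis, crit in reviewed:
        raw = llm(f"Score 1-10 how well this addresses '{question}':\n"
                  f"{hypothesis}\nCritique:\n{crit}\nReply with a number only.")
        scores.append(float(raw.strip()))  # naive parsing; real systems validate this
    order = sorted(range(len(reviewed)), key=lambda i: scores[i], reverse=True)
    return [reviewed[i] for i in order]

def oracle_round(question: str) -> tuple[str, str]:
    hypotheses = propose(question)
    reviewed = [(h, critique(h)) for h in hypotheses]
    return rank(question, reviewed)[0]  # best (hypothesis, critique) pair
```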

This feels more radical, more seductive, and much harder to assess (we do have a benchmarking/evals problem here in biology). The best multi-agent system I’ve seen thus far for the research and synthesis parts of ideation is probably the Google AI Co-Scientist (I had early access to this system and was quite impressed, though TBD on the oracle-ness of it). There are also companies like FutureHouse, who are developing multi-agent platforms that handle distinct parts of the scientific process, from literature synthesis to experimental design.

The systems that are actually working today aren’t trying to be all-knowing oracles. They’re more like highly focused method builders, or optimizers.

They work best when you can give them a clear, verifiable target. For instance, the recently published TusoAI is an agentic system that optimizes scientific methods once you give it an evaluation function to aim for. You tell it what “good” looks like, and it iterates to build a better computational tool. Similarly, DeepMind’s AlphaEvolve uses an LLM to evolve code populations to find better mathematical structures, which are then automatically verified. A different system from Google Research paired an LLM with tree search to systematically create expert-level software, outperforming human-developed methods on public leaderboards in fields from bioinformatics to epidemiology (it bears mentioning that, at least at the moment, these tree-search methods are pretty expensive).
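The common shape of these optimizer systems is easy to sketch: propose a variant, score it against a verifiable evaluation function, keep only improvements. Below, `propose_variant` stands in for the LLM edit step and `evaluate` for the real objective; both are assumptions for illustration, not any published system’s method.

```python
# Bare-bones "verifiable target" loop: only verified improvements survive.
import random

def evaluate(params: dict) -> float:
    # Stand-in objective: in reality this runs the candidate method on
    # held-out data and returns the metric you actually care about.
    return -(params["x"] - 3.0) ** 2

def propose_variant(params: dict) -> dict:
    # Stand-in for "the model edits the method": a small random perturbation.
    return {"x": params["x"] + random.gauss(0, 0.5)}

best = {"x": 0.0}
best_score = evaluate(best)
for _ in range(200):
    candidate = propose_variant(best)
    score = evaluate(candidate)
    if score > best_score:  # keep only verified improvements
        best, best_score = candidate, score

print(best, best_score)  # converges toward x ~= 3, the optimum of the toy objective
```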

But this is where my scepticism kicks in. There’s a fundamental tension between how we currently train AI and the very nature of scientific discovery, and I see three interlocking problems that need to be solved: conformist behavior, a flawed knowledge base, and model ‘understanding’ (such as it is) that is context-blind.

A problem of behavior

Breakthrough discovery, which is what this tech seems to promise, often comes from spotting a discrepancy between expectation and observation and having the good judgment (taste?) to ask, “huh, what’s that about?” (Or, as Asimov put it, the most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka” but “That’s funny…”) Another flavor of this is abandoning the dogma of a field and thinking about a problem from first principles. Or as Ibn al-Haytham put it a thousand years ago: “Thus the duty of the man who investigates the writings of scientists, if learning the truth is his goal, is to make himself an enemy of all that he reads, and, applying his mind to the core and margins of its content, attack it from every side.”

But today’s AI models are optimized for the exact opposite. They’re conformists, trained to predict the next token and reinforced to converge on known-correct answers. Even reasoning models that spend more time thinking are still optimizing toward consensus solutions, not questioning whether we’re asking the right questions. When test problems deviate from training patterns, performance collapses. We’re building powerful problem-solvers when what discovery requires is a powerful problem-finder.

A problem of knowledge

‘Garbage in, garbage out’ is a common critique in machine learning, and I’d like to focus on a specific kind of garbage here. We’re asking systems that have learned the well-trodden paths of established knowledge to find the completely new trails and breakthroughs. Worse, an oracle trained on scientific literature is trained on a deeply flawed artifact.

A pristine repository of truth, peer-reviewed journal publications are not. The public record is warped by publication bias, where positive results are celebrated and negative results vanish into file drawers (meaning the model never sees the thousands of failed experiments that defined the boundaries of what’s possible, and so it lives in a fantasy landscape of what’s possible). It’s riddled with statistical errors, p-hacking (the practice of manipulating data analysis until a statistically significant result appears, often by trying multiple tests and only reporting the ones that “work”), and the chase for prestige signals over rigor. When we train AI systems on this corpus, we imbue them with specific pathologies. Models learn to overweight glam journals—something I’ve seen LLMs blindly (as we do) show deference to (models absorb the field’s fashions and dogmas as ground truth and learn that certain experimental approaches are “better” simply because they appear more frequently or in a “high-impact” journal, not because they’re more valid). Feed all this baggage into AI-generated research priorities, and you’ve created a feedback loop of bad science at machine speed.

I’m optimistic that solving the knowledge problem for the oracle goes hand-in-hand with solving how we disseminate scientific work. But until then, we’re training our oracles on a map that shows only the peaks and none of the valleys.

A problem of context

Say we solve the literature problem, and say we successfully train AI to find and pursue anomalies. It would still face a fundamental hurdle in biology: context-dependence.

Much of our data, in text or otherwise, is purely observational (this is a big thing in both biology and neuroscience). Even with perfect, fine-grained measurements, observation only shows what co-occurs. Thus, big models trained on these data will only ever be associative, even if they seem to predict causal relationships. The models become really good at correlation, but biology is a science of interactions and interventions. A gene’s function isn’t a fixed fact. What a knockout does depends entirely on the cellular environment, the species, the developmental stage. The same perturbation can be lethal in one context, neutral in another, beneficial in a third.

To make this point another way, the problem extends to evolutionary relationships themselves. Foundation models often fail to account for phylogenetic context, mistaking the prevalence of sequences across related lineages for biological fitness, when these are distinct phenomena shaped by different processes (this is analogous to confusing frequency with quality in training data: just as common phrases on the internet don’t necessarily represent “better” language, prevalent biological sequences may simply reflect sampling bias from successful lineages rather than optimal solutions to biological problems). A protein sequence appearing in 100 closely related species looks like 100 independent observations to a model, but in reality represents a single ancestral pattern replicated across evolutionary time. Whether the missing context is evolutionary history, cellular state, or developmental timing, the core issue is the same: models learn patterns in the data without grasping the conditional rules that generated them.
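One standard mitigation, sketched below, is to down-weight sequences by how many near-duplicates they have in the training set, so a cluster of close homologs shares one vote instead of casting a hundred (similar in spirit to the sequence reweighting used in MSA-based models). The 80% identity threshold and the toy sequences are illustrative.

```python
# Redundancy-based sequence reweighting (sketch).
def identity(a: str, b: str) -> float:
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def redundancy_weights(seqs: list[str], threshold: float = 0.8) -> list[float]:
    weights = []
    for s in seqs:
        n_similar = sum(identity(s, t) >= threshold for t in seqs)  # counts itself
        weights.append(1.0 / n_similar)  # a cluster of near-duplicates shares one "vote"
    return weights

seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQR", "GWTLNSAGYL"]
print(redundancy_weights(seqs))  # the three near-identical sequences get ~0.33 each
```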

What’s important here is that an oracle trained on text or purely observational data can’t grasp these rules. Even as we increase the size of the models, and at great cost, they don’t seem to learn commensurately interesting things. So what can we do better?

Grounding the oracle in reality

I’ve spent three sections being critical, so here’s the constructive part: these problems have solutions, they’re just harder than scaling parameters. When you start thinking about context biologically, the dimensions quickly become staggering: X genes times Y cell types times Z developmental stages, times every species, times every environmental condition, and so on. But it’s not intractable. It just means the approach of “read all the papers and come up with ideas” isn’t the best path here. The path to a useful oracle has to be grounded in physical reality, built on principles that provide better data and better incentives to the model.

First, we should train on real experiments with interventions, not just observations. The most promising ‘second-generation’ foundation models are learning from perturbations. When you knock out a gene with CRISPR or dose cells with a compound, you get a much stronger causal signal than watching what happens naturally. Models trained on systematic perturbation screens learn what actually happens when you push on the system, not just what tends to co-occur (this is an advantage in training models like Noetik’s on tumor responses to immunotherapy, and why MintFlow uses spatial perturbation data; they’re learning from experiments where we know the causal arrow).

Second, scale across modalities, not just parameters. There likely is a lot of value to unlock with paired data from the same sample (e.g. the genome, transcriptome, proteome, and morphology, all in their native spatial context). When you have all these views simultaneously, the model can’t hide behind the ambiguity of a single data type. Here, I’m quite interested in what Noetik is doing with their OCTO-VC model. They’re training it on paired spatial transcriptomics, spatial proteomics, exomes, and histology from 77 million cells across 2,500 human tumor samples. It has to reconcile what the RNA says with what the proteins show with how the cells actually look. Scaling across modalities (and biological scale) should also be a call to action for us to develop better high-throughput, tissue-agnostic, high-dimensional phenotyping approaches than what we have now.

Third, reward deviations from expectation, not textual plausibility. Instead of asking AI to come up with a new idea, we should teach it our expectations and reward it for finding meaningful violations of them. This means we need to get specific about what ‘reward’ means. It isn’t just a score. It’s the measured difference between an assay’s outcome and what a baseline model predicted it would be. A good reward function would also penalize high uncertainty and only count results with a complete chain of provenance, tracing back to the specific sample and instrument. This pushes the AI to act like a real scientist, actively seeking out the most informative experiments in areas we understand least, not just confirming what we already know. A compound that kills cancer cells while sparing healthy ones beyond what selectivity models predict—that’s a verifiable deviation. A genetic perturbation that extends lifespan beyond what additive models suggest—that’s a verifiable deviation. A differentiation protocol that produces cells faster than established kinetics—that’s a verifiable deviation. These aren’t judged by whether they sound good in a paper abstract, but by their validated effect sizes in actual assays.

In this regime, the oracle’s value isn’t in being 100% correct, but in offering directional hypotheses where none previously existed. It learns from the direct, messy, high-dimensional reality of what our perturbations actually do.
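Here is a small sketch of the reward idea described above: credit the measured deviation from a baseline model’s prediction, discount it by the baseline’s uncertainty, and give zero credit to anything without provenance. The field names and the exact penalty form are assumptions for illustration, not a worked-out spec.

```python
# Reward = verified deviation from expectation, gated on provenance (sketch).
from dataclasses import dataclass
from typing import Optional

@dataclass
class AssayResult:
    measured_effect: float      # e.g., validated selectivity index or lifespan delta
    predicted_effect: float     # what the baseline model expected
    prediction_std: float       # the baseline model's uncertainty
    provenance: Optional[dict]  # sample ID, instrument, protocol version, etc.

def reward(result: AssayResult, uncertainty_penalty: float = 1.0) -> float:
    if not result.provenance:
        return 0.0  # no chain of custody, no credit
    deviation = abs(result.measured_effect - result.predicted_effect)
    return deviation / (1.0 + uncertainty_penalty * result.prediction_std)

hit = AssayResult(measured_effect=4.2, predicted_effect=1.1, prediction_std=0.3,
                  provenance={"sample": "plate7_B04", "instrument": "cytometer_2"})
print(reward(hit))  # rewarded for a verifiable deviation, not textual plausibility
```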

It’s actually exciting to me that the nature of these challenges fundamentally changes who can build useful things. It’s not just about who has the best ML team (and maybe not even who has the most compute? Clearly you can’t just throw parameters at a problem), but who has the tightest integration between computation, wet lab, and grounding in the right scientific context. Field’s wide open and it’s still early days.

The actuator: AI meets hardware

The third archetype is perhaps the most ambitious: the actuator that closes the loop between digital prediction and physical reality.

Where the co-pilot augments human judgment and the oracle generates hypotheses, the actuator executes experiments autonomously. It’s the robotic layer that takes computational predictions into the wet lab (think liquid handlers pipetting reagents, microscopes tracking cell morphology, mass spectrometers analyzing compounds, and so on). A lot of people conceive of the promise of a self-driving laboratory that runs 24/7 as a means to feed data-hungry models. It’s doable (we can compress years of research into months while generating the massive datasets that the models need), but the optimism will crash into reality, and there are big challenges to solve.

Some realities of automation

The actuator’s challenges have nothing to do with compute. A single automated cell culture workflow requires mastering a checklist of brutally hard problems: sterile cell passaging without contamination, media exchange with bubble detection, confluence estimation from noisy images, real-time protocol adaptation when yields drop, all while maintaining precise CO₂ and temperature control. One contamination event can destroy weeks of work.

The gap between simulation and bench reality is wide. A liquid handler works great until it fails with viscous cell media. Tips clog. Bubbles form unpredictably. Static makes powders jump across wells. Condensation fogs the optical sensors you depend on for QC. A robust system has to be engineered to detect and recover from each failure mode.
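In practice, much of the actuator’s engineering is the wrapper around each physical step: do it, verify it with an independent sensor, attempt a recovery action, and escalate to a human if that fails. A minimal sketch follows, with hypothetical stand-ins (`step`, `check`, `recover`) for whatever your instrument’s API actually exposes.

```python
# Detect-and-recover wrapper for a single physical step (sketch).
import time

class StepFailed(Exception):
    pass

def run_with_recovery(step, check, recover, max_attempts: int = 3) -> None:
    """Run a physical step, verify it with an independent check, and attempt
    a recovery action before retrying; escalate if all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        step()
        if check():  # e.g., a camera or pressure sensor confirms the dispense
            return
        print(f"attempt {attempt} failed QC, running recovery...")
        recover()    # e.g., discard the tip, re-prime the line
        time.sleep(1.0)
    raise StepFailed("step failed after recovery attempts; flag for a human")
```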

An immediate bottleneck is that most native lab equipment is fundamentally dumb. Lab instrumentation at many startups and academic labs lacks API access or digital hooks needed for modern automation. You’ve got million-dollar machines that can’t tell you if they’re on or off without a human checking the LED. Legacy instruments speak proprietary protocols if they speak at all. The physical layer of science runs on sneakernet and printed barcodes. We’re trying to build self-driving labs with equipment designed for human hands and eyes.

To be fair, some progress is being made to bridge this gap. Cloud labs prove what a fully software-driven stack can do, and open standards are finally emerging (for instrument control, standards like SiLA 2 are pushing for open APIs; for data, initiatives like Allotrope are trying to standardize formats; adoption will take time, though). And then there are data-first approaches like Ganymede, which focus on the plumbing to connect and centralize data from all the legacy equipment a lab already has. But these are still the exceptions. The median instrument remains a dumb box designed for a human, not an agent.

Builders seeking to bridge AI and autonomous labs

Some companies are tackling this head-on. I’m excited by folks like Medra AI in SF. Instead of replacing legacy hardware, they’re giving these dumb instruments a clever retrofit, using a combination of computer vision and robotics to give them the eyes and hands needed to participate in an automated workflow. It’s pragmatic and enables scientists to keep their familiar tools.

Others, like Lila in Boston, are building the whole stack from scratch. They’re building what they call “scientific superintelligence” (another new term…), combining AI with autonomous labs for life, chemical, and materials sciences. What makes their approach interesting is the feedback loop: every measurement, every experiment, every failure gets captured and fed back into the AI, which then presumably will generate better hypotheses. It’s a huge undertaking to do this across multiple disciplines simultaneously (worth noting that Lila emerged as a fusion of two earlier-stage companies at Flagship, one focused on bio and one on materials, so they’re starting with know-how from more than one field), but if it works, the scale could enable exciting systematic explorations of parameter space.

Periodic Labs took a different approach by starting narrow. They’re building autonomous powder-synthesis labs that generate proprietary data (their loop is tight: AI proposes materials, robots synthesize them, instruments characterize them, and results feed back to improve predictions). By choosing a domain where automation is tractable, with stable materials and clear readouts, they can actually close the experimental loop at scale. Mirror Physics is approaching this with narrowly scoped problems too, focused on predicting experiment outcomes in chemistry and materials science by training AI that “learns from vast quantities of chemical simulation, and stays in close alignment with reality thanks to high-throughput verification in the laboratory.” From what I gather, Mirror is also focusing on the interface layer between computational prediction and physical experimentation.

You might have noticed that few such new companies are starting with biology. I’m not certain why (it’s also entirely possible this is due to my own ignorance of what new orgs are out there), but I suspect chemistry, physics, and materials science offer clearer experimental outcomes, more stable samples, and less contextual variability. Biology is fuzzier, messier, and harder to operationalize at scale than some other scientific disciplines.

The ultimate vision of an AI conceiving an idea, executing it in a fully autonomous laboratory, and achieving breakthroughs is still a long way off. No one knows the path to success here, and there’s a lot for us to learn as we figure this out. It’s an exciting time to reinvent how humans do discovery.

What’s next

So where does this leave us? Thinking through these archetypes surfaces a set of challenges and opportunities that touch everything from the data we collect to the questions we choose to ask. In no particular order, here are some of the thoughts I keep coming back to.

  1. Scaling. I’ve been speaking with a lot of determined people building AI scientists or ML models for science, and one of the most common things you’ll hear is “we need more data” (the number of times I’ve heard “we need a Scale AI for science…”). That may well be true, but we should all be thoughtful about what kind of data we need and at what volume. In biology, scaling laws work until they don’t.

  2. Data and measurement. When we talk about the need for more data, at least in biology (though I assume this applies elsewhere), it’s worth revisiting what measurements we want to make. We can improve so much in how we measure the complexity of the natural world. To date, many of our approaches to measurement are inherently biased by a priori assumptions about the questions we’re asking and the sample we’re handling. There are historically good reasons for this, but we’re now in a new time. If you know you have incredible computational capabilities, and you think from first principles, what kind of high-dimensional measurements would you make? Might those inform our models better than our legacy methods?

  3. Build a private co-pilot first. The most valuable knowledge in a lab is often its private history of what didn’t work. We should start by training a specialized model on our own internal data—our protocols, our instrument logs, our failed experiments. Make it an expert in our lab’s quirks and context before asking it to solve all of science. This creates immediate, practical value and builds a foundation of high-quality, proprietary data.

  4. Instrument everything. The most valuable data for future models isn’t going to be in the literature. We need more raw, high-resolution output from our own machines enriched with metadata. We need to capture and structure this data obsessively. I’m even thinking of how we might capture the ways we think, the implicit knowledge that doesn’t get written down or spoken.

  5. Provenance. Any AI tools we adopt, be they co-pilots or oracles, should trace every claim back to source data, software version, and parameters. We need to build with this in mind; it’s especially critical as these systems become more autonomous. (A minimal sketch of what such a provenance record might look like follows this list.)

  6. Benchmarks. We desperately need standardized ways to evaluate these systems beyond “look at this cool demo” or “look how we can generate novel sequences”. What does it mean for an AI to be good at biology? How do we evaluate a virtual cell? How do we measure progress toward useful oracles? The field needs clear, reproducible benchmarks that capture real scientific challenges, not just pattern matching on established datasets. There is some commendable work here but the field is still underdeveloped. And to be fair, it’s quite challenging to benchmark something like hypothesis generation in a messy regime like a cell or an organism.

  7. What questions are worth asking. I believe that doing science, much like creating art, is one of the most deeply human things we do. If we build machines that automate the how of answering questions, it’s worth talking about what that means for us. Not just as scientists, as people. Perhaps our role will shift to the why and the what. An underrated skill in science is knowing which questions matter, and this will become the critical human contribution. As these AI tools get more powerful, the scientist’s role will move away from being a technician and toward being a curator of curiosity. The bottleneck won’t be finding answers, but the taste, judgment, and creativity to point all that power at problems worth solving. (In a pretty compelling thread, Jascha Sohl-Dickstein argues this is an urgent strategic challenge. His advice for young scientists is to actively avoid problems that will be solved by scale anyway and instead focus on “weird,” non-redundant work where human insight provides a unique advantage.)

  8. Scientific intuition. A huge part of being a scientist is developing a “feel” for your system. I remember learning to sense when neurons looked unhappy, or when data just felt wrong. There’s an aesthetic to that intuition. It’s a huge part of the reward and joy of science—the spark that comes from direct, messy engagement with the world. If we’re not intentional, and AI handles all the routine work, we risk creating a generation of prompters, disconnected from the process of discovery itself. There’s room to reimagine how we train scientists, and in this future (which, let me remind us, literally no one knows what it will look like), perhaps the focus should be to deepen our own relationship with the questions. Or perhaps the new ‘intuition’ is learning to collaborate effectively with a non-human intelligence. Conducting the orchestra, so to speak.

  9. How we share and critique science. The way we disseminate work is a major bottleneck, built on a journal system that thrives on scarcity over collaboration and produces static, paywalled PDFs instead of something open, dynamic, and reusable. Some funders have already begun to force the issue (with the Gates Foundation being one of the earliest). AI gives us both a reason and a path to fix this. Instead of trapping work in papers, we should make all scientific outputs FAIR (the FAIR Guiding Principles: Findable, Accessible, Interoperable, and Reusable for both humans and machines)—living, version-controlled objects that bundle code and raw data. We could share lab notebooks directly (both actual ELNs and also Jupyter notebooks; example here) and have an LLM layer around them. Autonomous agents could then analyze this public data stream to propose alternative interpretations or flag conflicting results. Critique itself would become a data layer, collated from every corner of the scientific conversation and eventually linked to real-world outcomes. This shifts the goal from publishing a final story to participating in a dynamic, verifiable conversation. There’s so much room to experiment here, and the best way isn’t known, but it certainly isn’t journals as we know them.
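On the provenance point (5) above, here’s a minimal sketch of what a claim-level provenance record could look like. Every field name here is an assumption for illustration, not an existing standard.

```python
# A claim-level provenance record (sketch; field names are illustrative).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Provenance:
    claim: str                    # the statement the system is making
    source_files: list[str]       # raw data files (or hashes) the claim rests on
    instrument_id: str            # which machine produced the data
    software: dict[str, str]      # tool -> version, e.g. {"pipeline": "1.4.2"}
    parameters: dict[str, float]  # analysis parameters used
    generated_by: str             # model or agent identifier
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = Provenance(
    claim="Compound X reduces viability of cell line Y by 40% at 1 uM",
    source_files=["plate7_B04.fcs"],
    instrument_id="cytometer_2",
    software={"analysis_pipeline": "1.4.2"},
    parameters={"gating_threshold": 0.95},
    generated_by="lab-copilot-v0",
)
print(record)
```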