A transformer model is a neural network that learns context and thus meaning by tracking relationships in sequential data like the words in this sentence.
Transformer models apply an evolving set of mathematical techniques, called attention or self-attention, to detect subtle ways even distant data elements in a series influence and depend on each other.
First described in 2017 from Google, transformers are among the newest and one of the most powerful classes of models invented to date. They’re driving a wave of advances in machine learning some have dubbed transformer AI.
Stanford researchers called transformers “foundation models” because they see them driving a paradigm shift in AI. The “sheer scale and scope of foundation models over the last few years have stretched our imagination of what is possible,” they wrote.
Transformers are translating text and speech in near real-time, they’re helping researchers understand the chains of genes in DNA and amino acids in proteins in ways that can speed drug design.
Transformers can detect trends and anomalies to prevent fraud, streamline manufacturing, make online recommendations or improve healthcare. People use transformers every time they search on Google or Microsoft Bing.
Any application using sequential text, image or video data is a candidate for transformer models.
That enables these models to ride a virtuous cycle in transformer AI. Created with large datasets, transformers make accurate predictions that drive their wider use, generating more data that can be used to create even better models.
Transformers are in many cases replacing convolutional and recurrent neural networks (CNNs and RNNs), the most popular types of deep learning models just five years ago.
No Labels, More Performance
Before transformers arrived, users had to train neural networks with large, labeled datasets that were costly and time-consuming to produce. By finding patterns between elements mathematically, transformers eliminate that need, making available the trillions of images and petabytes of text data on the web and in corporate databases.
In addition, the math that transformers use lends itself to parallel processing, so these models can run fast.
Like most neural networks, transformer models are basically large encoder/decoder blocks that process data. Small but strategic additions to these blocks make transformers uniquely powerful. Transformers use positional encoders to tag data elements coming in and out of the network. Attention units follow these tags, calculating a kind of algebraic map of how each element relates to the others.

Attention queries are typically executed in parallel by calculating a matrix of equations in what’s called multi-headed attention. With these tools, computers can see the same patterns humans see.
For example, in the sentence:
She poured water from the pitcher to the cup until it was full.
We know “it” refers to the cup, while in the sentence:
She poured water from the pitcher to the cup until it was empty.
We know “it” refers to the pitcher.
“Meaning is a result of relationships between things, and self-attention is a general way of learning relationships,” said Ashish Vaswani, a former senior staff research scientist at Google Brain who led work on the seminal 2017 paper.
DeepMind, in London, advanced the understanding of proteins, the building blocks of life, using a transformer called AlphaFold2. It processed amino acid chains like text strings to set a new watermark for describing how proteins fold, work that could speed drug discovery.
AstraZeneca and NVIDIA developed MegaMolBART, a transformer tailored for drug discovery. It’s a version of the pharmaceutical company’s MolBART transformer, trained on a large, unlabeled database of chemical compounds using the NVIDIA Megatron framework for building large-scale transformer models.
Along the way, researchers found larger transformers performed better.
For example, researchers from the Rostlab at the Technical University of Munich, which helped pioneer work at the intersection of AI and biology, used natural-language processing to understand proteins. In 18 months, they graduated from using RNNs with 90 million parameters to transformer models with 567 million parameters.
