DeepMind and EMBL release the most complete database of predicted 3D structures of human proteins

DNA abstract image. Credit: Karen Arnott, EMBL-EBI

DeepMind is partnering with the European Molecular Biology Laboratory (EMBL) to make the most complete and accurate database yet of the predicted human protein structures freely and openly available to the scientific community

Partners use AlphaFold, the AI system recognised last year as a solution to the protein structure prediction problem, to release more than 350,000 protein structure predictions including the entire human proteome to the scientific community

  • The AlphaFold Protein Structure Database* will enable research that advances understanding of these building blocks of life, accelerating research across a variety of fields.

  • AlphaFold’s impact is already being realised by early partners researching neglected diseases, studying antibiotic resistance, and recycling single-use plastics.

DeepMind has announced its partnership with the European Molecular Biology Laboratory (EMBL), Europe’s flagship laboratory for the life sciences, to make the most complete and accurate database yet of predicted protein structure models for the human proteome. This will cover all ~20,000 proteins expressed by the human genome, and the data will be freely and openly available to the scientific community. The database and artificial intelligence system provide structural biologists with powerful new tools for examining a protein’s three-dimensional structure, and offer a treasure trove of data that could unlock future advances and herald a new era for AI-enabled biology.

AlphaFold’s recognition in December 2020 by the organisers of the Critical Assessment of protein Structure Prediction (CASP) benchmark as a solution to the 50-year-old grand challenge of protein structure prediction was a stunning breakthrough for the field. The AlphaFold Protein Structure Database builds on this innovation and the discoveries of generations of scientists, from the early pioneers of protein imaging and crystallography, to the thousands of prediction specialists and structural biologists who’ve spent years experimenting with proteins since. The database dramatically expands the accumulated knowledge of protein structures, more than doubling the number of high-accuracy human protein structures available to researchers. Advancing the understanding of these building blocks of life, which underpin every biological process in every living thing, will help enable researchers across a huge variety of fields to accelerate their work.

Last week, the methodology behind the latest highly innovative version of AlphaFold, the sophisticated AI system announced last December that powers these structure predictions, and its open-source code were published in Nature. Today’s announcement coincides with a second Nature paper that provides the fullest picture of the proteins that make up the human proteome, and with the release of predictions for 20 additional organisms that are important for biological research.

“Our goal at DeepMind has always been to build AI and then use it as a tool to help accelerate the pace of scientific discovery itself, thereby advancing our understanding of the world around us,” said DeepMind Founder and CEO Demis Hassabis, PhD. “We used AlphaFold to generate the most complete and accurate picture of the human proteome. We believe this represents the most significant contribution AI has made to advancing scientific knowledge to date, and is a great illustration of the sorts of benefits AI can bring to society.”

AlphaFold is already helping scientists to accelerate discovery 

The ability to predict a protein’s shape computationally from its amino acid sequence – rather than determining it experimentally through years of painstaking, laborious and often costly techniques – is already helping scientists to achieve in months what previously took years.

“The AlphaFold database is a perfect example of the virtuous circle of open science,” said EMBL Director General Edith Heard. “AlphaFold was trained using data from public resources built by the scientific community so it makes sense for its predictions to be public. Sharing AlphaFold predictions openly and freely will empower researchers everywhere to gain new insights and drive discovery. I believe that AlphaFold is truly a revolution for the life sciences, just as genomics was several decades ago and I am very proud that EMBL has been able to help DeepMind in enabling open access to this remarkable resource.”

AlphaFold is already being used by partners such as the Drugs for Neglected Diseases Initiative (DNDi), which has advanced its research into life-saving cures for diseases that disproportionately affect the poorer parts of the world, and the Centre for Enzyme Innovation (CEI), which is using AlphaFold to help engineer faster enzymes for recycling some of our most polluting single-use plastics. For scientists who rely on experimental protein structure determination, AlphaFold's predictions have helped accelerate their research. For example, a team at the University of Colorado Boulder is finding promise in using AlphaFold predictions to study antibiotic resistance, while a group at the University of California San Francisco has used them to increase their understanding of SARS-CoV-2 biology.

The AlphaFold Protein Structure Database

The AlphaFold Protein Structure Database* builds on many contributions from the international scientific community, as well as AlphaFold’s sophisticated algorithmic innovations and EMBL-EBI’s decades of experience in sharing the world’s biological data. DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI) are providing access to AlphaFold’s predictions so that others can use the system as a tool to enable and accelerate research and open up completely new avenues of scientific discovery.

“This will be one of the most important datasets since the mapping of the Human Genome,” said EMBL Deputy Director General, and EMBL-EBI Director Ewan Birney. “Making AlphaFold predictions accessible to the international scientific community opens up so many new research avenues, from neglected diseases to new enzymes for biotechnology and everything in between. This is a great new scientific tool, which complements existing technologies, and will allow us to push the boundaries of our understanding of the world.”

In addition to the human proteome, the database launches with ~350,000 structures, including those of 20 biologically significant organisms such as E. coli, fruit fly, mouse, zebrafish, malaria parasite and tuberculosis bacteria. Research into these organisms has been the subject of countless research papers and numerous major breakthroughs. These structures will enable researchers across a huge variety of fields – from neuroscience to medicine – to accelerate their work.

The future of AlphaFold

The database and system will be periodically updated as we continue to invest in future improvements to AlphaFold, and over the coming months we plan to vastly expand the coverage to almost every sequenced protein known to science - over 100 million structures covering most of the UniProt reference database.

To learn more, please see the Nature papers describing the full method and the human proteome*, and read the Authors’ Notes*. See the open-source code to AlphaFold if you want to view the workings of the system, and Colab notebook* to run individual sequences. To explore the structures, visit EMBL-EBI’s searchable database* that is open and free to all.

Paul Nurse, Nobel Laureate for Physiology or Medicine 2001, Director of the Francis Crick Institute and Chair of EMBL Science Advisory Committee, commented: “Computational methods are transforming scientific research, opening up new possibilities for discovery and applications for the public good. Understanding the function of proteins is central to advancing our knowledge of life and will ultimately lead to improvements in health care, food sustainability, new technologies, and much beyond. DeepMind's release of the AlphaFold Protein Structure Database with EMBL, Europe's flagship organization for molecular biology, is a great leap for biological innovation that demonstrates the impact of interdisciplinary collaboration for scientific progress. With this resource freely and openly available, the scientific community will be able to draw on collective knowledge to accelerate discovery, ushering in a new era for AI-enabled biology.”

Venki Ramakrishnan, Nobel Laureate for Chemistry 2009 and former President of the Royal Society, said: “This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology. It has occurred long before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research.”

Elizabeth Blackburn, Nobel Laureate for Physiology or Medicine 2009 and Professor Emerita at the University of California San Francisco, said: “As these revolutionary approaches to protein structures pioneered by DeepMind become accessible, this will open new windows for the scientific community onto the biological meaning of the genome sequence.”

Patrick Cramer, Director at Max Planck Institute for Biophysical Chemistry, added: “The marvellous resource provided by DeepMind and EMBL will change the way we do structural biology. The predictions demonstrate the power of machine learning and serve the world-wide community, which had provided open data to enable this breakthrough achievement. A seminal example of how science in the 21st century may be done.”

About DeepMind

DeepMind is a scientific discovery company, committed to ‘solving intelligence to advance science and humanity.’ Solving intelligence requires a diverse and interdisciplinary team working closely together – from scientists and designers, to engineers and ethicists – to pioneer the development of advanced artificial intelligence.

The company’s breakthroughs include AlphaGo, AlphaFold, more than one thousand published research papers (including over a dozen in Nature and Science), partnerships with scientific organisations, and hundreds of contributions to Google’s products (in everything from Android battery efficiency to Assistant text-to-speech).

About the European Molecular Biology Laboratory (EMBL)

EMBL is Europe’s flagship laboratory for the life sciences. Established in 1974 as an intergovernmental organisation, EMBL is supported by 27 member states, 2 prospective member states and 1 associate member state.

EMBL performs fundamental research in molecular biology, studying the story of life. The institute offers services to the scientific community, trains the next generation of scientists and strives to integrate the life sciences across Europe.

EMBL is international, innovative and interdisciplinary. Its more than 1800 staff, from over 80 countries, operate across six sites in Barcelona (Spain), Grenoble (France), Hamburg (Germany), Heidelberg (Germany), Hinxton (UK) and Rome (Italy). EMBL scientists work in independent groups and conduct research and offer services in all areas of molecular biology.

EMBL research drives the development of new technology and methods in the life sciences. The institute works to transfer this knowledge for the benefit of society.

About EMBL’s European Bioinformatics Institute (EMBL-EBI)

The European Bioinformatics Institute (EMBL-EBI) is a global leader in the storage, analysis and dissemination of large biological datasets. We help scientists realise the potential of big data by enhancing their ability to exploit complex information to make discoveries that benefit humankind.

We are at the forefront of computational biology research, with work spanning sequence analysis methods, multi-dimensional statistical analysis and data-driven biological discovery, from plant biology to mammalian development and disease. We are part of EMBL and are located on the Wellcome Genome Campus, one of the world’s largest concentrations of scientific and technical expertise in genomics.

