1. Introduction
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of 'small molecular entities'. The term 'molecular entity' encompasses any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms (either on purpose, as for drugs, or by accident, as for chemicals in the environment). The qualifier 'small' implies the exclusion of entities directly encoded by the genome, and thus as a rule nucleic acids, proteins and peptides derived from proteins by cleavage are not included. Classes of molecular entities and part-molecular entities (in the form of substituent groups or atoms) are also included in ChEBI.
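ChEBI employs nomenclature and terminology recommended by the following international bodies: the International Union of Pure and Applied Chemistry (IUPAC) and the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB).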
In addition, ChEBI encompasses an ontological classification, whereby the relationships between compounds, groups or classes of compounds and their parents and/or children are specified.
All data in the database is non-proprietary or is derived from a non-proprietary source. It is thus freely accessible and available to anyone. In addition, each data item is fully traceable and explicitly referenced to the original source.
2. Data Fields
2.1 ChEBI ID
A unique and stable identifier for the entity, for example, CHEBI:16236. It has no chemical significance and may be cited by external users.
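As an illustration of how an external user might handle such identifiers, the following minimal sketch checks a string against the form shown above. The pattern (the prefix 'CHEBI:' followed by a positive integer) is inferred from the example and is an assumption, and parse_chebi_id is a hypothetical helper, not part of any ChEBI tool.

```python
import re

# Assumed pattern: the prefix "CHEBI:" followed by a positive integer.
CHEBI_ID = re.compile(r"CHEBI:([1-9][0-9]*)")

def parse_chebi_id(text):
    """Return the numeric part of a ChEBI ID, or None if the text does not match."""
    match = CHEBI_ID.fullmatch(text.strip())
    return int(match.group(1)) if match else None

print(parse_chebi_id("CHEBI:16236"))  # numeric part of the ID
print(parse_chebi_id("16236"))        # None: the prefix is missing
```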
2.2 ChEBI Names
2.2.1 ChEBI Name
The name for an entity recommended for use by the biological community. In general, traditional names have been retained by ChEBI, but these may have been modified to enhance clarity, avoid ambiguity and follow current IUPAC recommendations on chemical nomenclature more closely.
2.2.2 ChEBI ASCII Name
The ChEBI Name is also provided in ASCII format if the original includes special characters which require a Unicode presentation.
2.3 Definition
A short verbal definition is included in some entries. Definitions are especially relevant to classes.
2.4 Last Modified
The date on which the entity was last modified by an annotator.
2.5 Structural diagrams
2.5.1 Connection tables
ChEBI stores the two-dimensional or three-dimensional structural diagrams as connection tables in MDL molfile format. One entity can have one or more connection tables.
2.5.2 Graphical representation
One or more structures may be displayed for an entity. Where there is more than one structure available, the additional ones may be viewed by clicking on the 'more structures' link beside the main displayed structure. By default, the diagrams are shown as static PNG images generated by ChemAxon MarvinBeans, while clicking on 'Applet' will open an interactive MarvinView applet which allows the structure to be manipulated. Clicking on 'Image' restores the static image view. A link to the corresponding MDL molfile is provided beneath each structure.
2.6 IUPAC International Chemical Identifier (InChI)
The InChI is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations. It expresses chemical structures in terms of atomic connectivity, tautomeric state, isotopes, stereochemistry and electronic charge in order to produce a sequence of machine-readable characters unique to the respective molecule. Further information on the InChI is available at http://www.iupac.org/inchi/. A very useful 'Unofficial InChI FAQ' is also accessible at http://wwmm.ch.cam.ac.uk/inchifaq.
2.7 InChIKey
The InChIKey is a 25-character hashed version of the full InChI, designed to allow for easy web searches of chemical compounds. InChIKeys consist of 14 characters resulting from a hash of the connectivity information of the InChI, followed by a hyphen, followed by 8 characters resulting from a hash of the remaining layers of the InChI, followed by a single character indicating the version of InChI used, followed by a single checksum character. There is a finite, but very small, probability of finding two structures with the same InChIKey. However, the probability of duplication of only the first block of 14 characters has been estimated as one duplication in 75 databases each containing one billion unique structures; such duplication therefore appears unlikely at present. Further information on the InChIKey is available at http://old.iupac.org/inchi/release102.html.
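The layout described above can be sketched as a simple pattern check. This is a minimal sketch, assuming that every character other than the hyphen is an uppercase letter; split_inchikey is a hypothetical helper, the key used is a made-up placeholder rather than a real InChIKey, and the checksum is only extracted, not verified.

```python
import re
from typing import NamedTuple, Optional

class InChIKeyParts(NamedTuple):
    connectivity: str      # 14-character hash of the connectivity information
    remaining_layers: str  # 8-character hash of the remaining InChI layers
    version: str           # single character indicating the InChI version
    checksum: str          # single checksum character (not verified here)

# 14 letters, a hyphen, 8 letters, 1 version letter, 1 checksum letter: 25 characters.
_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])")

def split_inchikey(key: str) -> Optional[InChIKeyParts]:
    """Split a 25-character InChIKey into its blocks, or return None if malformed."""
    match = _PATTERN.fullmatch(key)
    return InChIKeyParts(*match.groups()) if match else None

# Placeholder key with the right shape, not a key for any real compound.
print(split_inchikey("ABCDEFGHIJKLMN-OPQRSTUVWX"))
```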
2.8 SMILES
SMILES (Simplified Molecular Input Line Entry System) is a simple but comprehensive chemical line notation, created in 1986 by David Weininger and further extended by Daylight Chemical Information Systems, Inc. SMILES specifically represents a valence model of a molecule and is widely used as a data exchange format. Further information on SMILES is available at http://www.daylight.com/smiles/.
2.9 Formula
Where possible, formulae are assigned for entities and groups. For compounds consisting of discrete molecules, this is generally the molecular formula, a formula according with the relative molecular mass (or the structure). To facilitate searching and downloading of data from external sources, the use of subscripts to indicate multipliers is avoided.
The following conventions regarding ChEBI formulae are followed:
Unless it immediately follows a dot '.', any numeral refers to the preceding element in the formula. Example: H2O means that there are two hydrogen atoms and one oxygen atom.
The dot '.' convention is used when dividing a formula into parts. Any numeral immediately following a dot applies to all the elements within the part of the formula that follows it. Example: C2H3O2.Na.3H2O (CHEBI:32138) means that after C2H3O2 there is one sodium (Na), six hydrogen and three oxygen atoms.
Parentheses are used within ChEBI formulae to indicate multiplication of the enclosed elements.
The 'n' convention is used to show an unknown quantity by which a formula is multiplied. Example: (C12H20O11)n from CHEBI:15443 means that a C12H20O11 unit is multiplied by an unknown quantity.
A comma can be used to indicate that one or more of the elements separated by the comma is present but that the exact stoichiometry can vary. Example: actinolite is a mineral with the chemical formula Ca2(Mg,Fe)5Si8O22(OH)2, which means that it could be anything in the continuous series between Ca2Mg5Si8O22(OH)2 and Ca2Fe5Si8O22(OH)2.
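The numeral and dot conventions can be illustrated with a short atom-counting sketch. It is a simplified illustration only, not ChEBI's own parser: count_atoms is a hypothetical helper, and parentheses, the 'n' multiplier and comma-separated alternatives are deliberately not handled.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter) and an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def count_atoms(formula):
    """Count atoms in a dot-separated formula such as 'C2H3O2.Na.3H2O'.

    Parentheses, 'n' multipliers and comma alternatives are NOT supported.
    """
    totals = Counter()
    for part in formula.split("."):
        # A numeral at the start of a part multiplies every element in that part.
        lead = re.match(r"\d+", part)
        multiplier = int(lead.group()) if lead else 1
        body = part[lead.end():] if lead else part
        for element, count in _TOKEN.findall(body):
            # Any other numeral refers to the element immediately before it.
            totals[element] += multiplier * (int(count) if count else 1)
    return dict(totals)

print(count_atoms("H2O"))
print(count_atoms("C2H3O2.Na.3H2O"))
```

For C2H3O2.Na.3H2O the sketch adds the six hydrogen and three oxygen atoms of the final part to those of C2H3O2, giving nine hydrogen and five oxygen atoms in total.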
2.10 Charge
For ions the magnitude of the charge is given in arabic numerals preceded by the sign of the charge. For neutral molecules the charge is indicated as a numerical zero. For instance, the charge of 5,10,15,20-tetrakis(1-methylpyridinium-4-yl)porphyrin (CHEBI:37447) is +4; the charge of borate (CHEBI:22908) is -3.
2.11 Mass
Relative molecular, atomic and ionic masses are shown for molecular, atomic and ionic entities respectively. The relative masses are calculated from tables of relative atomic masses (atomic weights) published by IUPAC.
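As a worked illustration, a relative molecular mass is the sum of the relative atomic masses of the constituent atoms. The sketch below assumes a small table of abridged IUPAC standard atomic weights; the table, its precision and the helper relative_mass are illustrative assumptions and do not describe how ChEBI itself performs the calculation.

```python
# Abridged standard atomic weights for a few elements (illustrative subset only).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def relative_mass(atom_counts):
    """Sum the relative atomic masses for a mapping of element symbol to atom count."""
    return sum(ATOMIC_WEIGHTS[element] * count for element, count in atom_counts.items())

# Water, H2O: 2 x 1.008 + 15.999
print(round(relative_mass({"H": 2, "O": 1}), 3))
```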
2.12 Ontology
See Section 5 below.
2.13 IUPAC name(s)
A name provided for an entity based on current recommendations of IUPAC. It need not be fully systematic as it makes use of 'retained names'.
Example: The IUPAC Name for abietic acid (CHEBI:28987) is abieta-7,13-dien-18-oic acid, based on the retained name 'abietane', rather than the fully systematic name (1R,4aR,4bR,10aR)-1,4a-dimethyl-7-(propan-2-yl)-1,2,3,4,4a,4b,5,6,10,10a-decahydrophenanthrene-1-carboxylic acid (which is cited in ChEBI within the list of synonyms for this compound).
In most cases, a single IUPAC Name is provided for a molecular entity or a group. For organic compounds this name will, if necessary, be amended when the IUPAC rules for providing a 'Preferred IUPAC Name' for any organic compound are published (current estimate, 2007). For further information on IUPAC's preferred names project see the relevant web page: http://www.iupac.org/projects/2001/2001-043-1-800.html
2.14 INN
In cases where an entity is a pharmaceutical substance, an International Nonproprietary Name (INN) may be shown. The INN is the official non-proprietary or generic name given to a pharmaceutical substance, as designated by the World Health Organisation (WHO). INNs may appear in ChEBI in English, Latin and French language versions.
2.15 Synonyms
Alternative names for an entity which either have been used in EBI or external sources or have been devised by the curators based on recommendations of IUPAC, NC-IUBMB or their associated bodies. The source of each synonym is clearly identified (see 'Data sources' below). Systematic names may also be included in this section. In addition to English-language synonyms, versions may be shown in French, German, Spanish and Latin, the language being indicated by a flag.
2.15.1 Adapted Synonyms
Synonyms are normally reproduced in the exact form in which they appear in the source. However, where changes have been made, e.g. to correct syntax or to convert from an index style of presentation, this is flagged alongside the synonym.
2.16 Brand names
Where an entity is an active ingredient of a proprietary pharmaceutical preparation, the brand name of the preparation may be shown.
2.17 Database links
Direct links to the entries for an entity in the databases cited.
2.18 Registry Number(s)
The Chemical Abstracts Service (CAS) Registry Number is a unique numeric identifier assigned to a substance when it enters the CAS REGISTRY database. Registry Numbers have no chemical significance and are assigned in sequential order to unique, new substances identified by CAS scientists for inclusion in the database.
Two principles of ChEBI are that (1) nothing held in the database must be proprietary or derived from a proprietary source that would limit its free distribution and/or availability and (2) every data item in the database should be fully traceable and explicitly referenced to the original source. As such, it is impossible for ChEBI to cite CAS as a source for Registry Numbers as this organization's products are not freely accessible. ChEBI therefore cites other reliable and freely accessible sources for CAS Registry Numbers which are always fully referenced.
Other registry numbers which may be displayed are Beilstein and Gmelin Registry Numbers.
2.19 Comment(s)
A free-text comment may be added to some terms especially in cases where confusing terminology has been historically used. A comment may relate to a single term or to the entry as a whole.
3. Data sources
3.1 Main sources
3.1.1 IntEnz
The Integrated relational Enzyme database of the EBI.
IntEnz is the master copy of the Enzyme Nomenclature, the recommendations of the NC-IUBMB on the Nomenclature and Classification of Enzyme-Catalysed Reactions.
3.1.2 KEGG COMPOUND
One part of the Kyoto Encyclopedia of Genes and Genomes LIGAND composite database, COMPOUND is a collection of biochemical compound structures.
3.1.3 MSDchem
The 'Ligand Chemistry' service providing web access to the 'ligands and small molecule dictionary' of the MSD database developed by the MSD group at EBI (previously known as chemPDB).
3.2 Other sources
These sources are manually entered into the database by a ChEBI curator.
3.2.1 ChEBI
Indicates entry initiated by a ChEBI curator.
3.2.2 ChemIDplus
A free, web-based search system, ChemIDplus provides access to structure and nomenclature authority files used for the identification of chemical substances cited in National Library of Medicine (NLM) databases.
3.2.3 IUBMB
Name based on the recommendations of the NC-IUBMB. Of particular relevance is the Glossary of Chemical Names used in the Enzyme Nomenclature.
3.2.4 IUPAC
Name based on the recommendations of IUPAC.
3.2.5 JCBN
Name based on the recommendations of the IUPAC-IUBMB Joint Commission on Biochemical Nomenclature, a body jointly responsible to both IUBMB and IUPAC, which deals with matters of biochemical nomenclature that have importance in both biochemistry and chemistry.
3.2.6 CBN
Name based on the recommendations of the IUPAC-IUB Commission on Biochemical Nomenclature, the forerunner of JCBN, which was discontinued in 1977.
3.2.7 NIST Chemistry WebBook
The National Institute of Standards and Technology operates a Chemistry WebBook providing access to chemical and physical property data for chemical species. The data provided are from collections maintained by the NIST Standard Reference Data Program and outside contributors.
3.2.8 PDB
The Protein Data Bank of the Research Collaboratory for Structural Bioinformatics (RCSB), a repository for the processing and distribution of 3-D biological macromolecular structure data.
3.2.9 UM-BBD
The University of Minnesota Biocatalysis/Biodegradation Database maintains a list of compounds involved in microbial biocatalytic reactions and biodegradation pathways.
3.2.10 RESID
The RESID Database of Protein Modifications at the EBI is a comprehensive collection of annotations and structures for protein modifications including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications.
3.2.11 COMe
COMe (Co-Ordination of Metals) at the EBI represents an ontology for bioinorganic and other small molecule centres in complex proteins, using a classification system based on the concept of a bioinorganic motif.
3.2.12 EMBL
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. It is produced by the EBI in international collaboration with GenBank at the NCBI (National Center for Biotechnology Information, USA) and DDBJ (DNA Data Bank of Japan).
3.2.13 UniProt
The UniProt Knowledgebase is a central access point for extensive curated protein information, including function, classification, and cross-reference, created in 2002 by joining information contained in Swiss-Prot, TrEMBL, and PIR.
3.2.14 MolBase
An online database of inorganic compounds, MolBase was constructed by Dr Mark Winter of the University of Sheffield with input from undergraduate students.
3.2.15 KEGG GLYCAN
A part of the KEGG LIGAND database, GLYCAN is a collection of experimentally determined glycan structures.
3.2.16 KEGG DRUG
A part of the KEGG LIGAND database, KEGG DRUG contains chemical structures of drugs and additional information such as therapeutic categories and target molecules.
3.2.17 WebElements
Authored by Dr Mark Winter of the University of Sheffield, WebElements is a high-quality web-based source of chemistry information relating to the periodic table.
3.2.18 LIPID MAPS
A comprehensive classification system for lipids developed by the Lipid Metabolites and Pathways Strategy (LIPID MAPS) consortium.
3.2.19 EuroFir
EuroFir (European Food Information Resource Network), the world-leading European Network of Excellence on Food Composition Databank systems, is a partnership between 48 universities, research institutes and small-to-medium sized enterprises (SMEs) from 25 European countries.
3.2.20 Patent
Links to patent documents which either cite the preparation, properties or uses of an entity, or are the source of a synonym, are provided via the esp@cenet service of the European Patent Office.
3.2.21 DrugBank
Developed at the University of Alberta, the DrugBank database is a bio- and chemo-informatics resource that combines detailed drug data with comprehensive drug target information.
3.2.22 EBI Industry Programme
The EBI Industry Programme is a forum through which the EBI can provide training and research of benefit to the European pharmaceutical, biotechnology, consumer-goods, chemical and agricultural industries. The membership comprises many of the world's leading pharmaceutical, biotechnology and consumer-goods companies.
4. Automatically generated cross-references
Enhanced automatically generated cross-references to a number of external databases are provided on a separate viewing screen reached via a tab on the main results screen. At the time of writing, automatically generated cross-references are provided to the following databases:
4.1 UniProtKB
UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL and PIR. UniProtKB (UniProt Knowledgebase) is one component and is the central access point for extensive curated protein information, including function, classification, and cross-reference. The links from a ChEBI entry enable a user to view the UniProtKB entries for all proteins associated with that particular compound and are updated monthly.
4.2 IntAct
A service of EMBL-EBI, IntAct provides a freely available, open source database system and analysis tools for protein interaction data. As for UniProtKB (see above), the links from a ChEBI entry enable a user to view the IntAct entries for all proteins associated with that particular compound.
4.3 BioModels Database
BioModels Database is a data resource, developed by a consortium including EMBL-EBI and Caltech, that allows biologists to store, search and retrieve published mathematical models of biological interest. Models present in BioModels Database are annotated and linked to relevant data resources, such as publications, databases of compounds and pathways and controlled vocabularies.
4.4 Reactome
The Reactome project is a curated resource of core pathways and reactions in human biology, developed as a collaboration among Cold Spring Harbor Laboratory, EMBL-EBI, and the Gene Ontology Consortium.
4.5 PubChem
PubChem is a database maintained by the National Center for Biotechnology Information (NCBI). It contains substance descriptions and information on small molecules with fewer than 1000 atoms and 1000 bonds.
4.6 SABIO-RK
SABIO-RK (System for the Analysis of Biochemical Pathways - Reaction Kinetics) is a database that contains information about biochemical reactions, the corresponding kinetic equations with their parameters, and the experimental conditions under which these parameters were measured.
4.7 ArrayExpress
ArrayExpress is a public repository for transcriptomics and related data, aimed at storing data compliant with MIAME (Minimum Information About a Microarray Experiment) specifications.
5. ChEBI Ontology
ChEBI Ontology is a structured classification of the entities contained within ChEBI. Originally developed as 'Chemical Ontology' by Michael Ashburner and Pankaj Jaiswal, the initial alpha release was subsumed into ChEBI and is currently in the process of being refined and extended. Its structure is essentially that of a directed acyclic graph (DAG), which differs from a simple taxonomy in that a child term can have many parent terms. Additionally, a number of relationships are incorporated which are cyclic in nature.
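The difference between a DAG and a simple taxonomy can be sketched with a toy example in which one term has two parents. The term names and the PARENTS mapping below are hypothetical placeholders, not real ChEBI entries, and the sketch does not reflect how ChEBI stores its ontology.

```python
# Hypothetical toy ontology: each term maps to the set of its parent terms.
# "entity" has two parents, which a simple tree-shaped taxonomy would not allow.
PARENTS = {
    "entity": {"class X", "class Y"},
    "class X": {"root"},
    "class Y": {"root"},
    "root": set(),
}

def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("entity")))
```

Keeping a set of visited terms means that a shared ancestor such as "root" is reported once even though it is reachable along two paths.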
5.1 The Ontologies
ChEBI Ontology is subdivided into three separate sub-ontologies:
Molecular Structure, in which molecular entities or parts thereof are classified according to composition and structure, e.g. hydrocarbons, carboxylic acids, tertiary amines;
Role, which classifies entities either on the basis of their role within a biological context, e.g. antibiotic, antiviral agent, coenzyme, hormone, or on the basis of their intended use by humans, e.g. pesticide, antirheumatic drug, fuel;
Subatomic Particle, which classifies particles which are smaller than atoms, e.g. electron, photon, nucleon.
5.2 The Views
Two options for visualising the ontology relationships for an entry in ChEBI are provided:
5.2.1 Parents and Children View
The default view which states in words the relationships between a ChEBI entry and its immediate parents and children.
5.2.2 Tree View
A view, accessed via the link at the foot of the Parents and Children View, which by means of graphic illustration places a ChEBI entry into context within the ontology structure. All parents within the hierarchy are shown, as well as the immediate children. Adjacent is a key identifying the relationships used within the tree structure. Entries and relationships which have been checked by a curator are shown in blue while preliminary (unchecked) ones are in grey. Clicking on a node within the tree will take the user to the ChEBI entry for that node. Unchecked ChEBI entries accessed by this route will display the heading 'Preliminary ChEBI Entry'.
5.3 The Relationships
5.3.1 is a
Implies that 'Entity A' is an instance of 'Entity B'. For example, chloroform (CHEBI:23143) is an instance of the class of chloromethanes (CHEBI:23148), which is itself an instance of the class of chloroalkanes (CHEBI:23128), and so forth.
5.3.2 is part of
Used to indicate the relationship between part and whole. For example, tetracyanonickelate(2−) (CHEBI:30025) is part of potassium tetracyanonickelate(2−) (CHEBI:30071).
5.3.3 is conjugate base of and is conjugate acid of
Cyclic relationships used to connect acids with their conjugate bases. For example, the neutral pyruvic acid (CHEBI:32816) is the conjugate acid of the pyruvate anion (CHEBI:15361), while as a corollary pyruvate is the conjugate base of the acid.
5.3.4 is tautomer of
A cyclic relationship used to show the interrelationship between two tautomers, where the differences between the structures are significant enough to warrant their separate inclusion in ChEBI. For example, L-serine (CHEBI:17115) and its zwitterion (CHEBI:33384) are tautomers.
5.3.5 is enantiomer of
A cyclic relationship used in instances when two entities are mirror images of and non-superposable upon each other. For example, D-alanine (CHEBI:15570) is an enantiomer of L-alanine (CHEBI:16977) and vice versa.
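The cyclic relationships in sections 5.3.3 to 5.3.5 always come in pairs: stating one direction implies the other. A minimal sketch of that pairing follows; the INVERSE table and the reciprocal helper are illustrative assumptions, not part of ChEBI.

```python
# Each cyclic relationship paired with the relationship that holds in the
# opposite direction. Tautomer and enantiomer relationships are symmetric.
INVERSE = {
    "is conjugate acid of": "is conjugate base of",
    "is conjugate base of": "is conjugate acid of",
    "is tautomer of": "is tautomer of",
    "is enantiomer of": "is enantiomer of",
}

def reciprocal(subject, relation, obj):
    """Return the statement implied in the opposite direction."""
    return (obj, INVERSE[relation], subject)

print(reciprocal("pyruvic acid", "is conjugate acid of", "pyruvate"))
print(reciprocal("D-alanine", "is enantiomer of", "L-alanine"))
```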
5.3.6 has functional parent
Used to denote the relationship between two molecular entities (or classes of entities), one of which possesses one or more characteristic groups from which the other can be derived by functional modification. For example, 16α-hydroxyprogesterone (CHEBI:15826) can be derived by functional modification (i.e. 16α-hydroxylation) of progesterone (CHEBI:17026).
5.3.7 has parent hydride
Denotes the relationship between an entity and its parent hydride (defined by IUPAC as "an unbranched acyclic or cyclic structure or an acyclic/cyclic structure having a semisystematic or trivial name to which only hydrogen atoms are attached"). For example, 1,4-naphthoquinone (CHEBI:27418) has as its parent hydride the cyclic hydrocarbon naphthalene (CHEBI:16482).
5.3.8 is substituent group from
Indicates the relationship between a substituent group (or atom) and its parent molecular entity, from which it is formed by loss of one or more protons or simple groups such as hydroxy groups. For example, the L-valino group (CHEBI:32854) is derived by a proton loss from the N atom of L-valine (CHEBI:16414).
5.3.9 has role
Indicates the particular behaviour which an entity may exhibit, either naturally or by human application. For example, morphine (CHEBI:17303) has the role opioid analgesic (CHEBI:35482).
5.4 Status
The status of each entry and relationship shown within the denormalised tree view is indicated as follows:
Checked
Entries and relationships which have been checked by a curator are shown in blue in the tree view.
Unchecked
Entries and relationships which have not been checked by a curator are shown in grey in the tree view. Such entries and relationships must be regarded as preliminary. All unchecked entries accessed via the tree view carry a heading 'Preliminary ChEBI Entry'.
6. Developer's Reference
See the ChEBI Developer Manual for further information.
All searches performed in the search engine are case insensitive.
The search engine uses a scoring mechanism when listing the results of searches. The scoring mechanism assigns a score to each compound listed on the basis of how many times a search term was found in the compound. The compound with the highest score is listed first.
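The scoring idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual ChEBI search engine: the score and rank helpers, the layout of the data and the entry names are all hypothetical, and the real engine may weight fields differently.

```python
def score(term, fields):
    """Count case-insensitive occurrences of the term across a compound's text fields."""
    needle = term.lower()
    return sum(field.lower().count(needle) for field in fields)

def rank(term, compounds):
    """Return the names of matching compounds, highest score first."""
    scored = [(score(term, fields), name) for name, fields in compounds.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for points, name in scored if points > 0]

# Hypothetical entries, each with a list of searchable text fields.
compounds = {
    "entry A": ["acetic acid", "ethanoic acid"],
    "entry B": ["Acetic acid", "acetic acid, glacial"],
    "entry C": ["water"],
}
print(rank("acetic acid", compounds))
```

Here "entry B" is listed before "entry A" because the term occurs in two of its fields rather than one, and "entry C" is omitted because it does not match at all.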
The quick search is by default an exact search.
This means that it will try to match exactly the word as you have typed it. The order in which you type your words is important.
For example, searching for 'acetoacetic acid' and 'acid acetoacetic' will provide two different results.
If you are having trouble finding something, here are a few suggestions. Alternatively, please use our Advanced Search option.
ChEBI provides the '%' character as the wildcard character. A wildcard character allows you to find compounds by typing in a partial name. The search engine will then try to find text matching the pattern you have specified using the wildcard character.
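The behaviour of the '%' wildcard can be illustrated by translating a pattern into a regular expression. This sketch assumes that '%' stands for any run of characters (including none) and that matching is case insensitive, as stated for searches above; it is not the search engine's actual implementation, and both helpers are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern using '%' as the wildcard into a compiled regular expression."""
    # Escape the literal pieces so characters such as '(' or '.' match themselves.
    literal_parts = (re.escape(piece) for piece in pattern.split("%"))
    return re.compile(".*".join(literal_parts), re.IGNORECASE)

def matches(pattern, name):
    """Return True if the whole name matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(name) is not None

print(matches("acet%", "Acetoacetic acid"))  # name starts with 'acet'
print(matches("acet%", "diacetyl"))          # 'acet' is not at the start
print(matches("%acet%", "diacetyl"))         # 'acet' appears anywhere
```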
Some characters are difficult to search for as they are displayed in Unicode. You can copy and paste Unicode directly into the search box and the search engine will find it.
Fingerprints are used to eliminate candidates from further examination in substructure searching. For molecule A to be a substructure of molecule B, all bits set in the fingerprint of molecule A must also be set in the fingerprint of molecule B. Once this initial screening is performed, the potential substructure candidates are subjected to a more rigorous inspection to determine whether molecule A is a substructure of molecule B.
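The screening step can be sketched with fingerprints held as Python integers, where each binary digit is one fingerprint bit. The bit patterns below are made up for illustration, and may_be_substructure is a hypothetical helper; passing the screen only means that the more rigorous inspection is still needed.

```python
def may_be_substructure(fp_query, fp_target):
    """Screen: every bit set in the query fingerprint must also be set in the target."""
    return (fp_query & fp_target) == fp_query

# Made-up 4-bit fingerprints.
print(may_be_substructure(0b0101, 0b1101))  # True: candidate kept for full inspection
print(may_be_substructure(0b0011, 0b1101))  # False: candidate eliminated
```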
To perform a substructure search in ChEBI draw your chemical structure using the MarvinSketch applet. Then select the 'Chemical Structure Search' option 'Substructure' and click 'Search'. If your substructure is found within the database the results will be displayed with relevant links to the entities found.
In a similarity search, the Tanimoto coefficient is calculated for each structure within the database against the query structure. The Tanimoto coefficient measures how many structural features two chemical structures have in common, based on the fingerprint described above. A Tanimoto score of 1.0 indicates that the two fingerprints are identical and hence that the structures are very similar. However, as the fingerprints are calculated on a chemical structure path depth of eight, many structures will have similar fingerprints and very high similarity scores even though they might not be very structurally similar.
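For bit fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below uses made-up fingerprints stored as integers; the tanimoto helper is hypothetical, and returning 1.0 for two empty fingerprints is a convention chosen here, not something stated by ChEBI.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    either = bin(fp_a | fp_b).count("1")  # bits set in at least one fingerprint
    # Convention assumed here: two empty fingerprints count as identical.
    return common / either if either else 1.0

print(tanimoto(0b1101, 0b1101))  # identical fingerprints
print(tanimoto(0b1100, 0b0110))  # one shared bit out of three set overall
```

Two different structures can produce the same fingerprint, so a score of 1.0 shows identical fingerprints rather than proving identical structures.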
The advanced text search provides Boolean operators as well as wildcard characters.
As in the quick search, '%' is the wildcard character and allows you to find compounds by typing in a partial name. You can place wildcards in any of the search options and in any of the search combinations, which makes this character very valuable when searching.
This option allows you to narrow down your search by using the categories provided. Below is a summary of these categories.
Categories can be used within any combination of operators described above.
The following are basic examples to illustrate the quick search facility.
You can subscribe to the ChEBI RSS feed by downloading and installing an RSS reader. Once you have installed the RSS reader, you can copy and paste the RSS feed address into your subscription toolbar and save it. Click on the RSS icon to subscribe to the RSS feed.
Firefox users can subscribe to the ChEBI RSS feed by clicking on the RSS link in the top right corner of the address bar.
Once you have bookmarked the RSS feed, you can view the most up-to-date news via your bookmarks folder.