4. Compound vs Substance

Objectives

  • Understand the difference between compounds and substances in PubChem’s terminology.

  • Learn how chemical structures are represented in the real world.

  • Understand the ambiguity of name-structure associations.

  • Learn how to draw chemical structures programmatically.

Note: To use the Python code in this lesson plan, RDKit must be installed on the system.

Many users can simply run the following code to install RDKit.

pip install rdkit

The full installation instructions are available at https://www.rdkit.org/docs/Install.html

1. Structure Standardization

PubChem contains more than 200 million chemical records submitted by hundreds of data contributors. These depositor-provided records are archived in a database called “Substance”, and each record in this database is called a substance. The records in the Substance database are highly redundant because different data contributors may submit information on the same chemical independently of each other. Therefore, PubChem extracts unique chemical structures from the Substance database through a process called standardization (https://doi.org/10.1186/s13321-018-0293-8). These unique structures are stored in the Compound database, and individual records in this database are called “compounds”. To learn more about PubChem compounds and substances, please read this PubChem Blog post (https://go.usa.gov/xVXct).

The code cells below demonstrate the effects of chemical structure standardization.

Step 1. Download a list of the SIDs associated with a given CID

First, let’s get a list of SIDs that are associated with CID 1174 (uracil).

import requests

cid = 1174

url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + str(cid) + "/sids/txt"
res = requests.get(url)
sids = res.text.split()
print(len(sids))
540

The above request returns 540 substances (the count at the time this notebook was run; it changes as depositors add records), all of which are standardized to the same structure (CID 1174).
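The request URLs in this lesson all follow the same pattern: a base address, then the record type, the kind of identifier, the identifier(s), the operation, and the output format. As a minimal sketch, the helper below assembles such a URL from its parts. The function name `pug_rest_url` and its parameter names are hypothetical and chosen only for illustration; they are not part of PubChem or of the requests library.

```python
# Base address used by every request in this lesson.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pug_rest_url(domain, namespace, identifiers, operation, output):
    """Join the parts of a PUG-REST request into one URL string.

    Hypothetical helper for illustration; `identifiers` may be a single
    value or a list of values, which are joined with commas.
    """
    if isinstance(identifiers, (str, int)):
        identifiers = [identifiers]
    id_part = ",".join(str(i) for i in identifiers)
    return "/".join([PUG_REST, domain, namespace, id_part, operation, output])

# The Step 1 request: SIDs associated with CID 1174, as plain text.
print(pug_rest_url("compound", "cid", 1174, "sids", "txt"))

# The same pattern for substance records in SDF, with two SIDs in one request.
print(pug_rest_url("substance", "sid", ["50608295", "76715622"], "record", "sdf"))
```

The first call reproduces the URL built by string concatenation in Step 1. Passing the result to `requests.get()` works the same way as before.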

Step 2. Download the structure data for the SIDs

Now retrieve the depositor-provided structures for the returned substances.

import time

chunk_size = 50    # number of SIDs per request

# Number of requests needed (ceiling division)
num_chunks = (len(sids) + chunk_size - 1) // chunk_size

# The with-statement closes the file even if a request fails
with open("cid2sids-uracil.sdf", "w") as f:

    for i in range(num_chunks):

        print("Processing chunk", i)

        idx1 = chunk_size * i
        idx2 = chunk_size * (i + 1)
        str_sids = ",".join(sids[idx1:idx2])

        url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/" + str_sids + "/record/sdf"
        res = requests.get(url)
        res.raise_for_status()    # stop if PubChem reports an error instead of SDF data

        f.write(res.text)
        time.sleep(0.2)    # pause so that no more than five requests are sent per second
Processing chunk 0
Processing chunk 1
Processing chunk 2
Processing chunk 3
Processing chunk 4
Processing chunk 5
Processing chunk 6
Processing chunk 7
Processing chunk 8
Processing chunk 9
Processing chunk 10
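The chunking arithmetic in Step 2 can also be written as a small generator that slices the list directly, which removes the need to compute the number of chunks in advance. The sketch below is one possible way to do this; the name `chunked` is hypothetical, and the list of identifiers is a stand-in of the same length as the SID list from Step 1, not real SIDs.

```python
import math

def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in list of 540 placeholder strings (the SID count printed in Step 1).
demo_sids = [str(n) for n in range(1, 541)]

batches = list(chunked(demo_sids, 50))
print(len(batches))        # 11 requests are needed
print(len(batches[-1]))    # the final, partial batch holds 40 identifiers

# The number of batches equals the ceiling of len(items) / size.
assert len(batches) == math.ceil(len(demo_sids) / 50)
```

Each batch can be joined with `",".join(batch)` to form the comma-separated SID list used in the Step 2 request URL.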

Step 3. Convert the structures in the SDF file into the SMILES strings and identify unique SMILES and their frequencies.

from rdkit import Chem

unique_smiles_freq = dict()

suppl = Chem.SDMolSupplier('cid2sids-uracil.sdf')

for mol in suppl:

    smiles = Chem.MolToSmiles(mol,isomericSmiles=True)

    unique_smiles_freq[ smiles ] = unique_smiles_freq.get(smiles,0) + 1

sorted_by_freq = [ (v, k) for k, v in unique_smiles_freq.items() ]
sorted_by_freq.sort(reverse=True)
for v, k in sorted_by_freq :
    print(v, k)
359 O=c1cc[nH]c(=O)[nH]1
110 Oc1ccnc(O)n1
36 
12 O=c1ccnc(O)[nH]1
10 O=c1nc(O)cc[nH]1
7 O=c1nccc(O)[nH]1
6 O=c1cc[nH]c(O)n1

The above output shows that the depositor-provided structures for the SIDs associated with CID 1174 are represented with six different SMILES strings. In addition, 36 substance records resulted in an “empty” SMILES string, implying that the depositors of these substance records did not provide structural information. You may want to know which substances these are, but the above code cell does not tell you. The following code cell lists them.

for mol in suppl:

    smiles = Chem.MolToSmiles(mol,isomericSmiles=True)
    
    if ( smiles == "" ) :
        print(mol.GetProp('PUBCHEM_SUBSTANCE_ID'), ":", mol.GetProp('PUBCHEM_SUBS_AUTO_STRUCTURE'))
50608295 : Deposited Substance chemical structure was generated via Synonym "CID1174" to be CID 1174
76715622 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
131322919 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8", "1,2,3,4-tetrahydropyrimidine-2,4-dione", "MFCD00006016" and Synonym Consistency to be CID 1174
254761593 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
313082517 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
329735657 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
330000149 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
375972167 : Deposited Substance chemical structure was generated via Synonym(s) "66255-05-8" and Synonym Consistency to be CID 1174
381002398 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
381013941 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
381360788 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
384257697 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
402318513 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
402318514 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
402318515 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
434131514 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
438512618 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
441085908 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
441555913 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
441560087 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
459144671 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
468630846 : Deposited Substance chemical structure was generated via Synonym(s) "2,4(1H,3H)-Pyrimidinedione" and Synonym Consistency to be CID 1174
468838340 : Deposited Substance chemical structure was generated via Synonym(s) "2,4-dioxopyrimidine" and Synonym Consistency to be CID 1174
468857836 : Deposited Substance chemical structure was generated via Synonym(s) "51953-14-1", "2,4-Pyrimidinediol" and Synonym Consistency to be CID 1174
468857844 : Deposited Substance chemical structure was generated via Synonym(s) "51953-19-6", "2(1H)-Pyrimidinone, 4-hydroxy-" and Synonym Consistency to be CID 1174
469422143 : Deposited Substance chemical structure was generated via Synonym(s) "66224-60-0", "2(1H)-Pyrimidinone, 6-hydroxy-" and Synonym Consistency to be CID 1174
469458549 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
470538389 : Deposited Substance chemical structure was generated via Synonym(s) "24897-51-6" and Synonym Consistency to be CID 1174
470635957 : Deposited Substance chemical structure was generated via Synonym(s) "51953-14-1" and Synonym Consistency to be CID 1174
470681883 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
472723672 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
475813502 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
482774378 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
488322616 : Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
508717570 : Deposited Substance chemical structure was generated via Synonym(s) "51953-14-1", "Pyrimidine-2,4-diol" and Synonym Consistency to be CID 1174
513813798 : Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174

Sometimes a data depositor provides only the chemical synonym(s) of a chemical, not its structure. In that case, PubChem uses the chemical synonyms to assign a structure to the structure-less record. For example, SID 50608295 (one of the 36 records without SMILES strings in the above output) did not have a depositor-provided structure, but its depositor-provided synonyms include “CID1174”. Therefore, PubChem assigns SID 50608295 to CID 1174, although the depositor did not provide the structure of SID 50608295. (Please check the structure and synonyms for SID 50608295 stored in the SDF file (“cid2sids-uracil.sdf”) generated in Step 2.)
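The annotation strings printed above follow a regular pattern: one or more quoted synonyms, an optional supporting source (such as MeSH or Synonym Consistency), and the assigned CID. As a rough sketch, the pieces can be pulled apart with regular expressions. The function name `parse_auto_structure` and the label "synonym only" are hypothetical, and the patterns are based only on the strings shown in this lesson, so other records may need different handling.

```python
import re
from collections import Counter

# Three annotation strings copied from the output above, keyed by SID.
annotations = {
    "50608295": 'Deposited Substance chemical structure was generated via Synonym "CID1174" to be CID 1174',
    "76715622": 'Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174',
    "131322919": 'Deposited Substance chemical structure was generated via Synonym(s) "66-22-8", "1,2,3,4-tetrahydropyrimidine-2,4-dione", "MFCD00006016" and Synonym Consistency to be CID 1174',
}

def parse_auto_structure(text):
    """Return (synonyms, evidence, cid) extracted from one annotation string."""
    # Every synonym is enclosed in double quotes.
    synonyms = re.findall(r'"([^"]*)"', text)
    # The assigned compound follows the words "to be CID".
    cid_match = re.search(r'to be CID (\d+)', text)
    cid = int(cid_match.group(1)) if cid_match else None
    # The supporting source, if any, sits between the last synonym and "to be CID".
    evidence_match = re.search(r'" and (.+?) to be CID', text)
    evidence = evidence_match.group(1) if evidence_match else "synonym only"
    return synonyms, evidence, cid

evidence_counts = Counter()
for sid, text in annotations.items():
    synonyms, evidence, cid = parse_auto_structure(text)
    evidence_counts[evidence] += 1
    print(sid, synonyms, evidence, cid)

print(evidence_counts)
```

Applied to all of the structure-less records, a tally like `evidence_counts` would show how often each kind of supporting source was used for the structure assignment.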

Step 4. Generate the structure images from the SMILES

Now we want to see what these SMILES strings look like, by drawing molecular structures from them.

from rdkit.Chem import Draw

for mysmiles in unique_smiles_freq.keys() :

    if mysmiles != "" :
        
        print(mysmiles)
        img = Draw.MolToImage( Chem.MolFromSmiles(mysmiles), size=(150, 150) )
        display(img)
O=c1cc[nH]c(=O)[nH]1
[image: structure drawing]
Oc1ccnc(O)n1
[image: structure drawing]
O=c1nc(O)cc[nH]1
[image: structure drawing]
O=c1nccc(O)[nH]1
[image: structure drawing]
O=c1ccnc(O)[nH]1
[image: structure drawing]
O=c1cc[nH]c(O)n1
[image: structure drawing]

You may want to write these molecule images to files, rather than displaying them in this Jupyter notebook.

from rdkit.Chem import Draw

index = 1

for mysmiles in unique_smiles_freq.keys() :

    if mysmiles != "" :
        
        filename = 'image' + str(index) +'.png'
        Draw.MolToFile( Chem.MolFromSmiles(mysmiles), filename )
        index += 1

You may also want to display all the images in a single figure.

from PIL import Image
images = []

for mysmiles in unique_smiles_freq.keys() :

    if mysmiles != "" :
        
        img = Draw.MolToImage( Chem.MolFromSmiles(mysmiles), size=(150, 150) )
        images.append(img)

big_img = Image.new('RGB', (900,150))  # enough to arrange six 150x150 images in a row

for i in range(0,len(images)):

    # paste the i-th image 150*i pixels from the left edge
    big_img.paste(images[i], (i*150, 0 ) )

display(big_img)
[image: the six structures arranged in a single row]
big_img.save('image_grid.png')
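The single-row layout above works for six images, but a longer list may fit better in several rows. The sketch below computes the canvas size and the paste position of each tile for a row-by-row grid. It is plain Python with no imaging library, and the function name `grid_layout` and its parameters are hypothetical.

```python
def grid_layout(n_images, tile=150, per_row=3):
    """Return (canvas_size, positions) for square tiles laid out row by row.

    canvas_size is (width, height) in pixels; positions holds the (x, y)
    upper-left corner of each tile, in order.
    """
    if n_images < 1 or per_row < 1:
        raise ValueError("n_images and per_row must be positive")
    n_cols = min(n_images, per_row)
    n_rows = -(-n_images // per_row)    # ceiling division
    positions = []
    for i in range(n_images):
        row, col = divmod(i, per_row)   # row index and column index of tile i
        positions.append((col * tile, row * tile))
    return (n_cols * tile, n_rows * tile), positions

# Six tiles in two rows of three.
canvas, positions = grid_layout(6, tile=150, per_row=3)
print(canvas)       # (450, 300)
print(positions)    # [(0, 0), (150, 0), (300, 0), (0, 150), (150, 150), (300, 150)]
```

With these values, the canvas can be created with `Image.new('RGB', canvas)` and each image placed with `big_img.paste(images[i], positions[i])`. Setting `per_row=6` gives the same 900x150 single-row layout used above.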

As these chemical images show, the substances associated with CID 1174 (uracil) correspond to six tautomeric forms of uracil, which differ from each other in the position of “movable” hydrogen atoms. Compare these structures with their standardized structure (CID 1174).

res = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1174/property/isomericsmiles/txt')
res.raise_for_status()   # a failed request returns an error message beginning with "Status:", not a SMILES string
mol = Chem.MolFromSmiles( res.text.strip() )   # returns None if the text cannot be parsed as SMILES
if mol is None :
    raise ValueError("PubChem did not return a valid SMILES string: " + res.text[:80])
Draw.MolToImage( mol, size=(150, 150) )

Alternatively, you can get the structure image of CID 1174 from PubChem.

from IPython.display import Image as IPyImage   # the alias avoids overwriting the PIL Image class imported earlier
IPyImage(url='https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1174/record/PNG?image_size=300x300')
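When a SMILES string is requested as plain text, as in the standardized-structure request above, a failed request yields an error message rather than data, and RDKit cannot parse that message as SMILES. The sketch below checks the response before it is handed to RDKit. It assumes that the error text begins with "Status:"; the function name `smiles_from_response` is hypothetical, and the error string in the usage example is an illustrative stand-in, not a recorded PubChem response.

```python
def smiles_from_response(status_code, text):
    """Return the SMILES string from a PUG-REST text response, or raise ValueError.

    Assumption: an error body starts with "Status:" instead of a SMILES string.
    """
    body = text.strip()
    if status_code != 200 or not body or body.startswith("Status:"):
        first_line = body.splitlines()[0] if body else "(empty response)"
        raise ValueError("PUG-REST request failed: " + first_line)
    # A single-CID property request returns one value per line; keep the first.
    return body.splitlines()[0]

# A successful response body (SMILES taken from the Step 3 output).
print(smiles_from_response(200, "O=c1cc[nH]c(=O)[nH]1\n"))

# An illustrative failure: the error message is reported, not parsed as SMILES.
try:
    smiles_from_response(503, "Status: 503")
except ValueError as err:
    print(err)
```

With a `requests` response object `res`, the call would be `smiles_from_response(res.status_code, res.text)`, and the returned string can then be passed to `Chem.MolFromSmiles()`.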

Exercise 1a: The MolToSmiles() function used in Step 3 generates the canonical SMILES string by default. Read the RDKit manual about the arguments available for this function (https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html) and write code that generates non-canonical SMILES strings for the substance records associated with uracil (CID 1174).

  • Ignore/skip structure-less records using a conditional statement (i.e., an if statement).

  • Print the number of unique non-canonical SMILES.

  • Print unique non-canonical SMILES, sorted by frequency.

  • For a given molecule, there may be multiple ways to write a SMILES string: one of them is selected as the “canonical” SMILES and all the others are considered “non-canonical”. For this exercise, generate only one non-canonical SMILES for each record, because the function returns a single SMILES string per call (either the canonical SMILES or one of the possible non-canonical SMILES).

# Write your code in this cell.

Exercise 1b: The RDKit function “MolsToGridImage()” allows you to draw a “grid image” that shows multiple structures. Read the RDKit manual about “MolsToGridImage()” (https://www.rdkit.org/docs/source/rdkit.Chem.Draw.html) and display the structures represented by the unique non-canonical SMILES generated from Exercise 1a.

# Write your code in this cell.

Exercise 1c: Retrieve the substance records associated with guanine (CID 135398634) and display unique structures generated from them, by following these steps:

  • Retrieve the SIDs associated with CID 135398634

  • Download the structure data for the retrieved SIDs (in SDF)

  • Generate canonical SMILES strings from the structure data in the SDF file and identify unique canonical SMILES strings

  • Draw the structures represented by the unique canonical SMILES strings in a single figure.

# Write your code in this cell.

Exercise 1d: Retrieve the substance records whose synonym is “glucose” and display unique structures generated from them, by following these steps:

  • Retrieve the SIDs whose synonym is “glucose”.

  • Download the structure data for the retrieved SIDs (in SDF)

  • Generate canonical SMILES strings from the structure data in the SDF file and identify unique canonical SMILES strings

  • Draw the structures represented by the unique canonical SMILES strings in a single figure.

# Write your code in this cell.

Exercise 1e: Retrieve the compound records associated with the SIDs retrieved in Exercise 1d and display unique structures generated from them, by following these steps:

  • Retrieve the CIDs associated with the SIDs whose name is “glucose”, using a single PUG-REST request (i.e., using the list conversion covered in the previous notebook, “lecture03-list-conversion.ipynb”).

  • Identify unique CIDs from the returned CIDs, using the set() function in Python.

  • Retrieve the isomeric SMILES for the unique CIDs through PUG-REST.

  • Draw the structures represented by the returned SMILES strings in a single figure.

# Write your code in this cell.