Fragmentation Usage

Perform Break on One Molecule and Return List of Molecules

group_decomposition.fragfunctions.generate_molecule_fragments(mol, patt: str = '[$([C;X4;!R]):1]-[$([R,!$([C;X4]);!#0;!#9;!#17;!#35;!#1]):2]', drop_parent: bool = False, recombine_mono: bool = True)

Fragment a molecule into constituent groups

The molecule is first broken along ring-nonring single bonds, then along single bonds to atoms double bonded to a ring (e.g. C-N={ring}), and finally along the bonds matched by the pattern provided.

Parameters:
  • mol – the rdkit molecule object to be fragmented

  • patt – SMARTS string that matches the bonds to be broken after ring and non-ring parts are separated. Defaults to breaking alkyl-non-alkyl bonds. See the note below

  • drop_parent – if False (default), include the parent structure in the output of the third break, which identifies alkyl chains. If True, do not include the parents

  • recombine_mono – if True (default), recombine separated one-heavy-atom groups with the chains they were broken from in the last step

Returns:

a list of rdkit molecules generated by fragmenting

Return type:

list[mol]

Note

Broken bonds are labelled by integers. Breaking FragA-FragB-FragC will result in FragA-*:1, *:1-FragB-*:2 and *:2-FragC.

In this way fragments can be rejoined by recombining matching integers. The integers are added to the molecule as isotopes.

In the last step, halides are not separated from the alkyl groups by default, and one-heavy-atom groups are rejoined to acyclic portions of the molecule.
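The label-matching idea in this note can be sketched with the standard library alone. This is an illustration, not the package's implementation: it assumes fragments are available as SMILES strings in which each labelled dummy atom is written as `[n*]`, and the helper name `pair_fragments_by_label` and the example fragments are hypothetical.

```python
import re
from collections import defaultdict

# Matches an isotope-labelled dummy atom such as [1*] or [12*] in a SMILES string.
LABEL_RE = re.compile(r"\[(\d+)\*\]")


def pair_fragments_by_label(fragment_smiles):
    """Map each integer label to the fragments that carry it (hypothetical helper)."""
    by_label = defaultdict(list)
    for smi in fragment_smiles:
        for label in LABEL_RE.findall(smi):
            by_label[int(label)].append(smi)
    return dict(by_label)


# Hypothetical fragments from breaking FragA-FragB-FragC at two bonds.
fragments = ["[1*]c1ccccc1", "[1*]C(=O)N[2*]", "[2*]CC"]
pairs = pair_fragments_by_label(fragments)
```

Each label maps to exactly the two fragments that were joined by the corresponding broken bond, which is the information needed to reassemble the molecule.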

Determine Unique Functional Groups in a Molecule

group_decomposition.fragfunctions.identify_connected_fragments(inp: str, keep_only_children: bool = True, bb_patt: str = '[$([C;X4;!R]):1]-[$([R,!$([C;X4]);!#0;!#9;!#17;!#35;!#1]):2]', input_type: str = 'smile', cml_file: str = '', include_parent: bool = False, aiida: bool = False) -> DataFrame

Given a SMILES string, identify the fragments in the molecule:

Break all single bonds between ring and non-ring atoms. For atoms double bonded to rings, break their single bonds. For non-ring fragments, separate them into alkyl chains and hetero/double-bonded atoms (similar to Ertl functional groups).

Parameters:
  • inp – a string containing either a SMILES string or the name of a .xyz, .mol or .cml file for a given molecule. Update input_type below to match the provided input

  • keep_only_children – boolean, if True, when a group is broken down into its components remove the parent group from output. If False, parent group is retained

  • bb_patt – string containing the SMARTS pattern for bonds to be broken in side chains and linkers. Defaults to cleaving sp3 carbon-((ring OR not sp3 carbon) AND not-placeholder/halogen/H)

  • input_type – ‘smile’ for a SMILES code, ‘molfile’ for a .mol file, ‘xyzfile’ for a .xyz file, or ‘cmlfile’ for a .cml file. Note: an xyz file REQUIRES cml_file to be provided as well

  • cml_file – defaults to an empty string; can be the cml file corresponding to the input .mol file

  • include_parent – if True, include a column in the output frame repeating the parent molecule. Intended for use when merging fragment frames from multiple molecules while needing to retain a parent molecule object

  • aiida – if True, format the output so it can be used in an AiiDA database, that is, with no molecule objects

Returns:

DataFrame with columns ‘Smiles’, ‘Molecule’, ‘numAttachments’ and ‘xyz’, containing the fragment SMILES, the fragment Chem.Molecule object, the number of * placeholders, and rough xyz coordinates for the fragment if the * were At

Note

At each bond break, connectivity is maintained through dummy atom labels, e.g. C-N -> C-[1*] N-[1*]. Reattaching via the matching labels would reassemble the molecule.

Currently, a functional group will be broken apart if it contains a ring-non-ring single bond, e.g. ring N-nonring S=O -> ring N-[1*] nonring S=O-[1*]
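As a rough illustration of what the ‘numAttachments’ column records, the number of placeholders in a fragment can be counted directly from its SMILES string. This is only a sketch under the assumption that placeholders appear either as a bare `*` or as a labelled `[n*]`; the function name `count_attachments` is hypothetical and the package may compute the value differently.

```python
import re

# A placeholder is either a bracketed, optionally labelled dummy atom such as [1*],
# or a bare "*". The bracketed form is tried first so it is counted once.
PLACEHOLDER_RE = re.compile(r"\[\d*\*\]|\*")


def count_attachments(smiles):
    """Count attachment-point placeholders in a fragment SMILES (hypothetical helper)."""
    return len(PLACEHOLDER_RE.findall(smiles))


linker_attachments = count_attachments("[1*]C(=O)N[2*]")
ring_attachments = count_attachments("*c1ccccc1")
```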

Count Functional Groups in a Molecule

group_decomposition.fragfunctions.count_uniques(frag_frame: DataFrame, drop_attachments: bool = False, uni_smi_type: bool = False) -> DataFrame

Identify unique fragments in a frame and count the number of times they occur

Given frag_frame output from group_decomposition.fragfunctions.identify_connected_fragments, remove dummy atom labels (and placeholders entirely if drop_attachments=True), then count unique fragments, using SMILES to identify them.

Should also work on frames from group_decomposition.fragfunctions.merge_uniques or group_decomposition.fragfunctions.count_groups_in_set.

Parameters:
  • frag_frame – frame resulting from group_decomposition.fragfunctions.identify_connected_fragments typically, or any similar frame with a list of SMILES codes in column [‘Smiles’]

  • drop_attachments – boolean, if False, retains placeholder * at points of attachment, if True, removes * for fragments with more than one atom

  • uni_smi_type – if True, include atom types in determination of unique fragments. If False, only determine unique by SMILES

Returns:

pandas data frame with columns ‘Smiles’, ‘count’ and ‘Molecule’, containing the Smiles string, the number of times the Smiles was in frag_frame, and rdkit.Chem.Molecule object

Note

if drop_attachments=False, similar fragments with different number/positions of attachments will not count as being the same. e.g. ortho-attached aromatics would not match with meta or para attached aromatics

If you have run this previously with uni_smi_type=True, running on the output frame (or another frame derived from it) with uni_smi_type=False will collapse the output to uniques determined by SMILES only
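The core of the counting step, stripping dummy atom labels and then tallying identical SMILES, can be sketched as follows. This is a simplified stand-in rather than the function's actual code: it works on plain strings, assumes labelled dummies are written as `[n*]`, and does not re-canonicalize the SMILES after stripping, which an RDKit-based implementation presumably would. The helper names are hypothetical.

```python
import re
from collections import Counter

LABELLED_DUMMY_RE = re.compile(r"\[\d+\*\]")


def strip_labels(smiles):
    """Replace labelled dummies such as [3*] with an unlabelled * (hypothetical helper)."""
    return LABELLED_DUMMY_RE.sub("*", smiles)


def count_unique_smiles(fragment_smiles):
    """Count fragments that become identical once the bond labels are removed."""
    return Counter(strip_labels(smi) for smi in fragment_smiles)


# Two methyl groups with different bond labels collapse into one entry.
fragments = ["[1*]C", "[2*]C", "[3*]c1ccccc1", "[4*]O"]
counts = count_unique_smiles(fragments)
```

Because the placeholders themselves are kept here, this corresponds to the drop_attachments=False behaviour described in the note: fragments that differ in the number or position of attachments stay distinct.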

Count Functional Groups in a Set of Molecules

group_decomposition.fragfunctions.count_groups_in_set(list_of_inputs: list[str], drop_attachments: bool = False, input_type: str = 'smile', bb_patt: str = '[$([C;X4;!R]):1]-[$([R,!$([C;X4]);!#0;!#9;!#17;!#35;!#1]):2]', cml_list=None, uni_smi_ty: bool = True, aiida: bool = False) -> DataFrame

Identify unique fragments in the molecules defined in list_of_inputs, and count the number of occurrences of duplicates.

Parameters:
  • list_of_inputs – a list, with each element being a SMILES string, e.g. [‘CC’,‘C1CCCC1’]

  • drop_attachments – Boolean for whether or not to drop attachment points from fragments. If True, will remove all placeholder atoms indicating connectivity. If False, placeholder atoms will remain

  • input_type – smile, xyzfile, cmlfile or molfile, based on elements of list_of_inputs

  • cml_list – defaults to None, but can be a list of cml files corresponding to the molfile inputs

  • bb_patt – SMARTS pattern for bonds to break in linkers and side chains. Defaults to breaking bonds between nonring carbons with four bonds single bonded to ring atoms or carbons that don’t have four bonds, and are not H, halide, or placeholder

  • uni_smi_ty – if True, include atom types in determination of unique fragments. If False, only determine unique by SMILES

Returns:

an output pd.DataFrame, with columns ‘Smiles’ for fragment Smiles, ‘count’ for number of times each fragment occurs in the list, and ‘Molecule’ holding a rdkit.Chem.Molecule object

Example usage:
>>> from group_decomposition.fragfunctions import count_groups_in_set
>>> count_groups_in_set(['c1ccc(c(c1)c2ccc(o2)C(=O)N3C[C@H](C4(C3)CC[NH2+]CC4)C(=O)NCCOCCO)F',
...                      'Cc1nc2ccc(cc2s1)NC(=O)c3cc(ccc3N4CCCC4)S(=O)(=O)N5CCOCC5'], drop_attachments=False)

Merge Frames from Multiple Runs of a Functional

group_decomposition.fragfunctions.merge_uniques(frame1: DataFrame, frame2: DataFrame, uni_smi_ty=True) -> DataFrame

Given two frames of unique fragments, identify shared unique fragments, merge count and frames together.

Parameters:
  • frame1 – a frame output from count_uniques

  • frame2 – a distinct frame also from count_uniques

  • uni_smi_ty – If True, include atom types in determination of unique fragments. If false, only determine unique by SMILES

Returns:

a frame resulting from the merge of frame1 and frame2. All rows whose SMILES are in frame1 but not frame2 (and vice versa) are included unmodified. If a row’s SMILES is in both frame1 and frame2, a single row is included with its count updated to the sum of the counts in frame1 and frame2.

Note

For best results, SMILES must be canonical so that they can be compared exactly. SMILES in the frames should result from Chem.MolToSmiles(Chem.MolFromSmiles(smile)), which creates a molecule from the SMILES and writes it back in canonical form

Example usage:
>>> frame1
Smiles  count
C       2
C1CCC1  1
>>> frame2
Smiles  count
C       3
C1CC1   2
>>> merge_uniques(frame1,frame2)
Smiles  count
C       5
C1CCC1  1
C1CC1   2
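The merge rule shown in this example (sum the counts for shared SMILES, carry over the rest unchanged) can be reproduced on plain dictionaries. This is a minimal sketch of the counting logic only, assuming each frame is reduced to a mapping from SMILES to count; the real function operates on DataFrames and also carries the other columns. The name `merge_counts` is hypothetical.

```python
from collections import Counter


def merge_counts(counts1, counts2):
    """Sum counts for SMILES present in both mappings and keep the rest as they are."""
    merged = Counter(counts1)
    merged.update(counts2)  # Counter.update adds counts rather than overwriting them
    return merged


# Same data as frame1 and frame2 in the example above.
frame1_counts = {"C": 2, "C1CCC1": 1}
frame2_counts = {"C": 3, "C1CC1": 2}
merged = merge_counts(frame1_counts, frame2_counts)
```

As the note points out, the comparison is an exact string match, so both inputs need canonical SMILES for shared fragments to be recognised.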

Output Fragments to .gjf Files

group_decomposition.fragfunctions.output_ifc_gjf(mol, frag_frame, esm='wb97xd', basis_set='aug-cc-pvtz', wfx=True, n_procs=4, mem='3200MB', multiplicity=1)

Takes a fragmented molecule and outputs gjf files of the fragments with one attachment point.

Hydrogen is added in place of the connection to the rest of the molecule for the fragment

Parameters:
  • mol – Chem.Mol object for which fragmentation was performed

  • frag_frame – output from either count_uniques or identify_connected_fragments

  • esm – str, electronic structure method to include in gjf

  • basis_set – str, basis set to include in gjf

  • wfx – Boolean, if True add output=wfx to gjf file

  • n_procs – int, >=0. if >0, add number of processors to be used to gjf

  • mem – str, format “nMB” or “nGB”, memory to be used in gjf

  • multiplicity – int, defaults to 1. Multiplicity of molecule

Returns:

Creates gjf files in working directory for each fragment in frag_frame
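To show how the parameters above could map onto a Gaussian input file, here is a sketch that assembles the text of a .gjf from them. The exact layout written by output_ifc_gjf is not documented here, so the route line format, the `charge` argument, the `name` argument and the function name `build_gjf_text` are all assumptions made for illustration.

```python
def build_gjf_text(name, atoms, esm="wb97xd", basis_set="aug-cc-pvtz",
                   wfx=True, n_procs=4, mem="3200MB", charge=0, multiplicity=1):
    """Assemble Gaussian input text; atoms is a list of (symbol, x, y, z) tuples."""
    lines = []
    if n_procs > 0:
        # Only request processors when a positive count is given.
        lines.append(f"%nprocshared={n_procs}")
    lines.append(f"%mem={mem}")
    route = f"#p {esm}/{basis_set}"
    if wfx:
        route += " output=wfx"
    # Route section, blank line, title, blank line, then charge and multiplicity.
    lines += [route, "", name, "", f"{charge} {multiplicity}"]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    lines.append("")
    if wfx:
        # With output=wfx, Gaussian reads the .wfx file name after the geometry.
        lines += [f"{name}.wfx", ""]
    return "\n".join(lines) + "\n"


gjf_text = build_gjf_text("frag_1", [("C", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 1.09)])
```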

Note

H position is set by taking the atom the fragment is bonded to, replacing it with H and moving it closer to the C until it reaches the default distance

Default distances taken from Gaussview “clean” C-H, C-O, etc bond lengths
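The geometric step described in this note amounts to placing the hydrogen on the line from the fragment's attachment atom toward its former neighbour, at a fixed bond length. A minimal sketch of that placement is given below; the function name `place_hydrogen` is hypothetical, and the 1.09 Å C-H distance is an illustrative assumption, not necessarily the GaussView value used by the package.

```python
import math


def place_hydrogen(anchor, neighbor, bond_length):
    """Return the point bond_length away from anchor along the anchor-to-neighbor direction."""
    vec = [n - a for a, n in zip(anchor, neighbor)]
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0:
        raise ValueError("anchor and neighbor coincide, so the bond direction is undefined")
    return tuple(a + bond_length * c / norm for a, c in zip(anchor, vec))


# Assumed, illustrative C-H distance in angstroms.
ASSUMED_C_H = 1.09

# The neighbour sat 1.54 angstroms away along z; the H keeps the direction but is pulled in.
h_pos = place_hydrogen((0.0, 0.0, 0.0), (0.0, 0.0, 1.54), ASSUMED_C_H)
```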

Output Fragments to dict

group_decomposition.fragfunctions.output_ifc_dict(mol, frag_frame: DataFrame, done_smi: list[str])

Generate a dictionary containing identify_connected_fragments information

Only new fragments are included. Previously parsed fragments are listed in done_smi.

Parameters:
  • mol – rdkit molecule object that was fragmented

  • frag_frame – identify_connected_fragments frame generated for mol

  • done_smi – list of fragments which have been identified already

Returns:

a list containing the dict for the fragments and done_smi, lengthened by the number of fragments done

Note

mainly for use in generating information for unique fragments in an AiiDA workflow
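The bookkeeping described here, skipping fragments already listed in done_smi and returning the extended list alongside the new entries, can be sketched as below. The function name `collect_new_fragments` and the contents of each dictionary entry are hypothetical; the real function stores information from the identify_connected_fragments frame.

```python
def collect_new_fragments(fragment_smiles, done_smi):
    """Return [new_entries, done], where done is done_smi extended with the new fragments."""
    new_entries = {}
    done = list(done_smi)  # copy so the caller's list is not modified in place
    for index, smi in enumerate(fragment_smiles):
        if smi in done:
            continue  # already handled in an earlier molecule or earlier in this one
        new_entries[smi] = {"index": index}  # placeholder payload for illustration
        done.append(smi)
    return [new_entries, done]


# "*O" was seen before, and the repeated "*C" is only recorded once.
new_entries, done = collect_new_fragments(["*C", "*O", "*C"], ["*O"])
```

Passing the returned list back in as done_smi for the next molecule keeps each unique fragment from being processed more than once across a set.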