Fragmentation Usage
Perform Break on One Molecule and Return List of Molecules
- group_decomposition.fragfunctions.generate_molecule_fragments(mol, patt: str = '[$([C;X4;!R]):1]-[$([R,!$([C;X4]);!#0;!#9;!#17;!#35;!#1]):2]', drop_parent: bool = False, recombine_mono: bool = True)
Fragment a molecule into constituent groups
The molecule is first broken along ring-nonring single bonds, then along single bonds to atoms double bonded to a ring (e.g. C-N={ring}), and finally along the bonds matching the provided pattern.
- Parameters:
mol – the rdkit molecule object to be fragmented
patt – SMARTS string that matches the bonds to be broken after ring-nonring bonds are separated. Defaults to breaking alkyl-nonalkyl bonds. See the note below
drop_parent – if False (default), include the parent structure in the third break. If True, do not include the parents
recombine_mono – if True (default), recombine separated one-heavy-atom groups with the chains they were broken from in the last step
- Returns:
a list of rdkit molecules generated by fragmenting
- Return type:
list[mol]
Note
Bonds broken are labelled by integers. Breaking FragA-FragB-FragC will result in FragA-*:1, *:1-FragB-*:2, *:2-FragC
In this way, fragments can be rejoined by recombining matching integers. The integers are added to the molecule as isotopes
In the last step, halides are not separated from the alkyl groups by default, and one-heavy-atom groups are rejoined to acyclic portions of the molecule
Determine Unique Functional Groups in a Molecule
- group_decomposition.fragfunctions.identify_connected_fragments(inp: str, keep_only_children: bool = True, bb_patt: str = '[$([C;X4;!R]):1]-[$([R,!$([C;X4]);!#0;!#9;!#17;!#35;!#1]):2]', input_type: str = 'smile', cml_file: str = '', include_parent: bool = False, aiida: bool = False) -> DataFrame
Given a SMILES string or structure file, identify fragments in the molecule:
Break all ring-non-ring atom single bonds. For atoms double bonded to rings, break their single bonds. For non-ring fragments, separate those into alkyl chains and hetero/double bonded atoms. (similar to Ertl functional groups)
- Parameters:
inp – a string containing either a SMILES code or the filename of a .xyz, .mol or .cml file for a given molecule. Set input_type below to match the provided input
keep_only_children – boolean. If True, when a group is broken down into its components, remove the parent group from the output. If False, the parent group is retained
bb_patt – SMARTS pattern string for bonds to be broken in side chains and linkers. Defaults to cleaving sp3 carbon-((ring OR not sp3 carbon) AND not-placeholder/halogen/H)
input_type – ‘smile’ for a SMILES code, ‘molfile’ for a .mol file, ‘xyzfile’ for a .xyz file, or ‘cmlfile’ for a .cml file. Note: an xyz file REQUIRES cml_file to be provided as well
cml_file – defaults to an empty string; can be the cml file corresponding to the input .mol file
include_parent – if True, include a column in the output frame repeating the parent molecule. Intended for use when merging fragment frames from multiple molecules while retaining a parent molecule object
aiida – if True, format the output for use in an AiiDA database, that is, with no molecule objects
- Returns:
DataFrame with columns ‘Smiles’, ‘Molecule’, ‘numAttachments’ and ‘xyz’, containing the fragment SMILES, the fragment Chem.Molecule object, the number of * placeholders, and rough xyz coordinates for the fragment if * were At
Note
At each bond break, connectivity is maintained through dummy atom labels, e.g. C-N -> C-[1*] N-[1*]. Reattaching via the matching labels would reassemble the molecule
Currently, a functional group will be broken apart if it contains a ring-nonring single bond, e.g. ring N-nonring S=O -> ring N-[1*] nonring S=O-[1*]
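The label bookkeeping described in the note can be illustrated without RDKit. The sketch below is not part of group_decomposition: it assumes fragment SMILES that write each dummy atom as [n*] (as in the C-[1*] example above), and the helper names and sample fragment strings are hypothetical. It lists the labels on each fragment and groups fragments that share a label, which are the pairs that would be rejoined.

```python
import re
from collections import defaultdict

# Matches labelled dummy atoms such as [1*] or [12*] and captures the integer.
LABEL = re.compile(r"\[(\d+)\*\]")


def attachment_labels(smiles):
    """Return the integer labels of all [n*] placeholders in a fragment SMILES."""
    return [int(n) for n in LABEL.findall(smiles)]


def pair_by_label(fragments):
    """Map each label to the fragments that carry it."""
    pairs = defaultdict(list)
    for smi in fragments:
        for label in attachment_labels(smi):
            pairs[label].append(smi)
    return dict(pairs)


# Hypothetical fragments for FragA-FragB-FragC broken at two bonds.
frags = ["C[1*]", "[1*]c1ccc([2*])cc1", "[2*]N"]
pairs = pair_by_label(frags)
for label, members in sorted(pairs.items()):
    print(label, members)
```

The length of the list returned by attachment_labels corresponds to what the ‘numAttachments’ column reports, under the same assumption about how placeholders are written.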
Count Functional Groups in a Molecule
- group_decomposition.fragfunctions.count_uniques(frag_frame: DataFrame, drop_attachments: bool = False, uni_smi_type: bool = False) -> DataFrame
Identify unique fragments in a frame and count the number of times they occur
Given frag_frame output from group_decomposition.fragfunctions.identify_connected_fragments, remove dummy atom labels (and placeholders entirely if drop_attachments=True), then count unique fragments using SMILES to identify unique fragments.
Should also work on frames from group_decomposition.fragfunctions.merge_uniques or group_decomposition.fragfunctions.count_groups_in_set as well.
- Parameters:
frag_frame – frame resulting from group_decomposition.fragfunctions.identify_connected_fragments typically, or any similar frame with a list of SMILES codes in column [‘Smiles’]
drop_attachments – boolean, if False, retains placeholder * at points of attachment, if True, removes * for fragments with more than one atom
uni_smi_type – include atom types in determination of unique fragments. If false, only determine unique by SMILES
- Returns:
pandas data frame with columns ‘Smiles’, ‘count’ and ‘Molecule’, containing the Smiles string, the number of times the Smiles was in frag_frame, and rdkit.Chem.Molecule object
Note
If drop_attachments=False, similar fragments with a different number or different positions of attachments will not count as being the same, e.g. ortho-attached aromatics would not match with meta- or para-attached aromatics
If you have previously run this with uni_smi_type=True, running it on the output frame (or another frame derived from it) with uni_smi_type=False will collapse the output to uniques determined by SMILES only
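The effect of keeping the attachment points can be seen in a small stand-alone sketch. It assumes the fragment SMILES are already canonical strings with [n*] placeholders and uses collections.Counter in place of the library's DataFrame logic; strip_labels, count_unique_smiles and the sample fragments are hypothetical and are not the package's implementation.

```python
import re
from collections import Counter


def strip_labels(smiles):
    """Replace labelled placeholders such as [1*] with an unlabelled *."""
    return re.sub(r"\[\d+\*\]", "*", smiles)


def count_unique_smiles(smiles_list):
    """Count fragments after label removal, keeping the attachment points."""
    return Counter(strip_labels(smi) for smi in smiles_list)


# Hypothetical fragment SMILES; the two methyl groups differ only in their label.
frags = ["C[1*]", "C[2*]", "[1*]c1ccc([2*])cc1", "[3*]c1cccc([4*])c1"]
counts = count_unique_smiles(frags)
print(counts)
```

Once the labels are removed the two methyl fragments are counted together, while the para- and meta-attached rings remain separate entries because their attachment positions differ.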
Counting Functional Groups in a Set of Molecules
- group_decomposition.fragfunctions.count_groups_in_set(list_of_inputs: list[str], drop_attachments: bool = False, input_type: str = 'smile', bb_patt: str = '[$([C;X4;!R]):1]-[$([R,!$([C;X4]);!#0;!#9;!#17;!#35;!#1]):2]', cml_list=None, uni_smi_ty: bool = True, aiida: bool = False) -> DataFrame
Identify unique fragments in the molecules defined in list_of_inputs, and count the number of occurrences for duplicates.
- Parameters:
list_of_inputs – a list, with each element being a SMILES string, e.g. ['CC', 'C1CCCC1']
drop_attachments – boolean for whether or not to drop attachment points from fragments. If True, all placeholder atoms indicating connectivity are removed. If False, placeholder atoms remain
input_type – smile, xyzfile, cmlfile or molfile, based on the elements of list_of_inputs
cml_list – defaults to None, but can be a list of cml files corresponding to the molfile inputs
bb_patt – SMARTS pattern for bonds to break in linkers and side chains. Defaults to breaking bonds between nonring carbons with four bonds and the atoms they are single bonded to, where those atoms are ring atoms or are not carbons with four bonds, and are not H, halide, or placeholder
uni_smi_ty – if True (default), include atom types in determination of unique fragments. If False, only determine unique by SMILES
- Returns:
an output pd.DataFrame, with columns ‘Smiles’ for fragment Smiles, ‘count’ for number of times each fragment occurs in the list, and ‘Molecule’ holding a rdkit.Chem.Molecule object
- Example usage:
>>> from group_decomposition.fragfunctions import count_groups_in_set
>>> count_groups_in_set(['c1ccc(c(c1)c2ccc(o2)C(=O)N3C[C@H](C4(C3)CC[NH2+]CC4)C(=O)NCCOCCO)F', 'Cc1nc2ccc(cc2s1)NC(=O)c3cc(ccc3N4CCCC4)S(=O)(=O)N5CCOCC5'], drop_attachments=False)
Merge Frames from Multiple Runs of a Functional
- group_decomposition.fragfunctions.merge_uniques(frame1: DataFrame, frame2: DataFrame, uni_smi_ty=True) -> DataFrame
Given two frames of unique fragments, identify shared unique fragments, merge count and frames together.
- Parameters:
frame1 – a frame output from count_uniques
frame2 – a distinct frame also from count_uniques
uni_smi_ty – If True, include atom types in determination of unique fragments. If false, only determine unique by SMILES
- Returns:
a frame resulting from the merge of frame1 and frame2. All rows whose SMILES are in frame1 but not frame2 (and vice versa) are included unmodified. If a row’s SMILES is in both frame1 and frame2, the row is included once, with its count updated to the sum of the counts in frame1 and frame2.
Note
For best results, SMILES must be canonical so that they can be compared exactly. SMILES in each frame should result from Chem.MolToSmiles(Chem.MolFromSmiles(smile)), which creates a molecule from the SMILES and writes the SMILES back in canonical form
- Example usage:
>>> frame1
Smiles  count
C       2
C1CCC1  1
>>> frame2
Smiles  count
C       3
C1CC1   2
>>> merge_uniques(frame1,frame2)
Smiles  count
C       5
C1CCC1  1
C1CC1   2
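The merge rule in the example (sum the counts for shared SMILES, keep the rest unchanged) can be sketched with the standard library. Plain dicts stand in here for the ‘Smiles’ and ‘count’ columns of each frame; this is an illustration of the rule only and says nothing about how merge_uniques handles the ‘Molecule’ column or atom types.

```python
from collections import Counter

# Plain dicts standing in for the 'Smiles' and 'count' columns of each frame.
frame1 = {"C": 2, "C1CCC1": 1}
frame2 = {"C": 3, "C1CC1": 2}

# Counter addition sums counts for shared keys and keeps the others unchanged.
merged = Counter(frame1) + Counter(frame2)
print(dict(merged))
```

The comparison is an exact string match on the keys, which is why the SMILES in both frames need to be in the same canonical form before merging.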
Output Fragments to .gjf Files
- group_decomposition.fragfunctions.output_ifc_gjf(mol, frag_frame, esm='wb97xd', basis_set='aug-cc-pvtz', wfx=True, n_procs=4, mem='3200MB', multiplicity=1)
Takes a fragmented molecule and outputs gjf files of the fragments with one attachment point.
Hydrogen is added in place of the connection to the rest of the molecule for the fragment
- Parameters:
mol – Chem.Mol object for which fragmentation was performed
frag_frame – output from either count_uniques or identify_connected_fragments
esm – str, electronic structure method to include in gjf
basis_set – str, basis set to include in gjf
wfx – Boolean, if True add output=wfx to gjf file
n_procs – int, >=0. if >0, add number of processors to be used to gjf
mem – str, format “nMB” or “nGB”, memory to be used in gjf
multiplicity – int, defaults to 1. Multiplicity of molecule
- Returns:
Creates gjf files in working directory for each fragment in frag_frame
Note
H position is set by taking the atom the fragment is bonded to, replacing it with H, and moving that H closer to the C until it reaches the default distance
Default distances are taken from GaussView “clean” C-H, C-O, etc. bond lengths
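The hydrogen placement described in the note amounts to rescaling a bond vector. The following is a minimal geometric sketch under stated assumptions: place_hydrogen is a hypothetical helper, the coordinates are invented, and 1.09 is only an illustrative C-H length rather than a value taken from the library.

```python
import math


def place_hydrogen(anchor, neighbor, bond_length):
    """Return the point on the anchor->neighbor line at bond_length from anchor."""
    vec = [n - a for a, n in zip(anchor, neighbor)]
    dist = math.sqrt(sum(c * c for c in vec))
    if dist == 0:
        raise ValueError("anchor and neighbor coincide")
    scale = bond_length / dist
    return tuple(a + c * scale for a, c in zip(anchor, vec))


# Hypothetical coordinates in angstroms: a fragment carbon and the atom it was bonded to.
carbon = (0.0, 0.0, 0.0)
removed_atom = (1.5, 0.0, 0.0)
# 1.09 is an illustrative C-H length, not a value taken from the library.
h_pos = place_hydrogen(carbon, removed_atom, 1.09)
print(h_pos)
```

The hydrogen stays on the original bond axis, so the direction of the broken bond is preserved and only its length changes.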
Output Fragments to dict
- group_decomposition.fragfunctions.output_ifc_dict(mol, frag_frame: DataFrame, done_smi: list[str])
Generate a dictionary containing identify_connected_fragments information
Only new fragments are included. Previously parsed fragments are listed in done_smi.
- Parameters:
mol – rdkit molecule object that was fragmented
frag_frame – identify_connected_fragments frame generated for mol
done_smi – list of fragments which have been identified already
- Returns:
a list containing the dict for the new fragments and done_smi, lengthened by the number of fragments done
Note
mainly for use in generating information for unique fragments in an AiiDA workflow
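The "only new fragments" bookkeeping can be sketched as follows. This is a hypothetical stand-in, not the body of output_ifc_dict: the helper name, the placeholder payload stored per fragment, and the choice to copy done_smi rather than modify it in place are all assumptions made for illustration.

```python
def collect_new_fragments(frag_smiles, done_smi):
    """Return [new_fragments, updated_done_smi], skipping already parsed SMILES."""
    new_frags = {}
    updated = list(done_smi)  # copy so the caller's list is not mutated
    for smi in frag_smiles:
        if smi in updated:
            continue
        new_frags[smi] = {"smiles": smi}  # placeholder payload, hypothetical
        updated.append(smi)
    return [new_frags, updated]


# "*C" was handled for an earlier molecule, so only "*N" and "*O" are new.
done = ["*C"]
new, done = collect_new_fragments(["*C", "*N", "*O", "*N"], done)
print(sorted(new), done)
```

Passing the returned list back in for the next molecule keeps each unique fragment from being processed more than once across a workflow.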