[[ab initio]]\n[[Absolute Frequency]]\n[[ADAPT]]\n[[Adjacency Matrix]]\n[[AM1 Hamiltonian]]\n[[Atomic Walk Counts|Molecular Walk Counts]]\n[[Autoscaling]]
The absolute frequency is the number of data points that fall within a given range or class of a frequency distribution.
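As a small illustration of the definition above, the following sketch counts how many values fall in each class of a frequency distribution. The function name, the half-open class convention [lo, lo + width), and the sample values are assumptions made for the example only.

```python
from collections import Counter


def absolute_frequencies(values, bin_width, start=0.0):
    """Count how many values fall in each half-open class [lo, lo + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)   # which class the value belongs to
        lo = start + index * bin_width
        counts[(lo, lo + bin_width)] += 1
    return dict(sorted(counts.items()))


# Hypothetical property values (illustrative numbers only)
boiling_points = [36.1, 68.7, 98.4, 125.6, 80.1, 110.6, 56.0, 64.7]
freq = absolute_frequencies(boiling_points, bin_width=50.0)

for (lo, hi), n in freq.items():
    print(f"{lo:6.1f} to {hi:6.1f}: {n}")
```

The absolute frequencies always sum to the total number of data points.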
ANOVA - [[ANalysis Of VAriance|Analysis of Variance]]\nBFGS - [[Broyden-Fletcher-Goldfarb-Shanno|BFGS]]\nBRNN - [[Bayesian Regularized Neural Network|Bayesian Neural Network]] (aka BNN)\nCNN - [[Computational Neural Network]]\nCPSA - [[Charged Partial Surface Areas]]\nDFT - [[Density Functional Theory]]\nGA - [[Genetic Algorithm]]\nkNN - [[k-Nearest Neighbor]]\nLDA - [[Linear Discriminant Analysis]]\nMLR - [[Multiple Linear Regression]]\nNLR - [[NonLinear Regression|Nonlinear Regression]]\nPCA - [[Principal Component Analysis]]\nPCR - [[Principal Component Regression]]\nPLS - [[Partial Least Squares]]\nPRESS - [[Predictive REsidual Sum of Squares|Predictive Residual Sum of Squares]]\nQSAR - [[Quantitative Structure-Activity Relationship]]\nQSPR - [[Quantitative Structure-Property Relationship]]\nSA - [[Simulated Annealing]]\nSVM - [[Support Vector Machine]]
coming soon ..
[[Bayesian Neural Network]]\n[[Beta Weights]]\n[[BFGS]]
[[Charged Partial Surface Areas]] or [[CPSA|Charged Partial Surface Areas]]\n[[Cheminformatics]] or [[Chemoinformatics|Cheminformatics]]\n[[Chemometrics]]\n[[Cluster Analysis]]\n[[Computational Neural Network]] or [[CNN|Computational Neural Network]]\n[[Connection Table]]\n[[Correlation Matrix]]\n[[Criterion Variable]]
coming soon \n\nReferences: [[1|ref0001]],
Otto^^[[1|ref-ot-001]]^^ defines chemometrics as:\n*//"Chemometrics is the chemical discipline that uses mathematical and statistical methods to a) design or select optimal measurement procedures and experiments, and b) to provide maximum chemical information by analyzing chemical data."//
The following describes a three-layer, fully-connected, feed-forward computational neural network.\n\n[img[CNN|img_QSAR/cnn.jpg]]\n\nThe input layer consists of as many neurons as there are model [[descriptors|Descriptor]]. Input values are first scaled to a range within 0-1 (here, 0.05-0.95) to avoid "blowing up" the results during the non-linear transformation process. Each input neuron value is assigned a weight (randomly initialized and optimized throughout training) and passed to each hidden layer neuron (the number of hidden neurons is determined experimentally). At each hidden layer neuron, the weighted terms are summed and a bias term is added (again, the initial bias term is randomly assigned and optimized throughout training). The resulting sum at each hidden layer neuron is then passed through the non-linear transformation (seen in the bottom half of the enlarged neuron). This value is then sent to the output layer neuron (the predicted value of interest), again weighted and biased. The output layer neuron values are then transformed back to the original range and compared with the actual values. A [[BFGS]] algorithm is used to minimize the prediction error with respect to the weights and biases, which are adjusted accordingly, and the whole process repeats until the predicted values are as close to the actual values as possible.
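The forward pass described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the group's actual implementation: the sigmoid is assumed as the non-linear transformation (for both hidden and output neurons), the descriptor values and ranges are invented, and training (the BFGS adjustment of weights and biases) is left out entirely.

```python
import math
import random


def sigmoid(z):
    """Assumed non-linear transformation; the original text does not name one."""
    return 1.0 / (1.0 + math.exp(-z))


def scale(value, lo, hi, new_lo=0.05, new_hi=0.95):
    """Map a raw value from [lo, hi] onto [new_lo, new_hi]."""
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


def unscale(value, lo, hi, new_lo=0.05, new_hi=0.95):
    """Inverse of scale(): map a network output back to the original range."""
    return lo + (value - new_lo) * (hi - lo) / (new_hi - new_lo)


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One pass through a three-layer, fully-connected, feed-forward network."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Randomly initialized weights and biases, as at the start of training
rng = random.Random(0)
n_inputs, n_hidden = 3, 2
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_bias = rng.uniform(-1, 1)

# Hypothetical descriptor values and their (min, max) ranges in the dataset
raw_descriptors = [120.0, 3.0, 0.45]
ranges = [(50.0, 300.0), (0.0, 10.0), (0.0, 1.0)]
inputs = [scale(v, lo, hi) for v, (lo, hi) in zip(raw_descriptors, ranges)]

scaled_prediction = forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias)
prediction = unscale(scaled_prediction, 0.0, 200.0)  # hypothetical property range
print(prediction)
```

With untrained random weights the printed value is meaningless; training would repeat this pass for every compound and adjust the weights and biases to reduce the prediction error.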
A tabular representation of correlation coefficients of multiple variables.\n[img[correlation matrix|img_QSAR/corrmatx.jpg]]\nThe diagonal elements all have r = 1.00, as the values are perfectly correlated with themselves, and upper and lower triangles are symmetric. If total variables = k, the total cells = k^^2^^. Number of non-diagonal cells = k^^2^^ - k. Unique non-diagonal cells = (k^^2^^-k)/2.
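The properties listed above (unit diagonal, symmetry, and the cell counts) can be checked with a short sketch. The Pearson correlation coefficient is assumed here, and the three data columns are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """k x k table of r values for k variables (each given as a list of values)."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Three hypothetical variables measured on five compounds
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.8],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
r = correlation_matrix(columns)

k = len(columns)
total_cells = k ** 2                    # all cells in the table
off_diagonal = k ** 2 - k               # cells excluding the diagonal
unique_off_diagonal = (k ** 2 - k) // 2 # one triangle only

for row in r:
    print("  ".join(f"{value:6.3f}" for value in row))
```

For k = 3 this gives 9 cells, 6 non-diagonal cells, and 3 unique non-diagonal cells.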
In a [[QSAR|Quantitative Structure-Activity Relationship]], the criterion variable is the //dependent variable//. Normally, this is the value that is being modeled, such as a toxicity value.\n\nThe independent variables, such as descriptor values, are known as [[predictor variables|Predictor Variable]].
[[Descriptor]]\n[[Density Functional Theory]]\nDependent Variable (aka [[Criterion Variable]])\n[[Dipole-Dipole Interaction]]\n[[Distance Matrix]]\n[[Dixon's Q-test|Q-test]]
If you were to construct a table (i.e., matrix) of the data you create that contains a row for each compound and a column for each molecular descriptor, you'd see that there is a lot of data there. For example, a dataset of 500 compounds and 150 descriptors gives you 75,000 individual values that you must manipulate for modeling. There are many considerations that have been debated over the years as to how many compounds vs. descriptors one should have, and how effective a model can be if there is 'too much' information. One can actually overfit the dataset and end up with a model that really doesn't tell you anything of value. Another consideration is time. The more data you must manipulate, the longer it takes to form a model. For these reasons, it is desirable to remove any potentially unwanted data from your set before investing in a lot of modeling time.\n \nHere, there are two considerations. The first is identical data. For example, if you're looking at several hundred compounds that contain no sulfur or phosphorus, then the descriptors "number of S atoms" and "number of P atoms" will be filled entirely with zeros. Even if you have a few compounds that contain sulfur, for example, if over 90-95% of dataset compounds have no sulfur, then this descriptor will not do a good job discriminating between compounds in your dataset. We can eliminate such descriptors.\n \nThe second consideration is how much correlation there is between two descriptors. Let's say, for example, that we had a simple dataset of linear hydrocarbon chains. If we compared the descriptors 'number of C atoms' and 'number of H atoms', we'd see that as the number of carbon atoms changed, the number of hydrogens would change in specific proportion (since the general formula for these alkanes is C~~n~~H~~2n+2~~). Here, if we were to cross-correlate these two descriptors, we'd see that they are highly correlated. 
Therefore, from a discriminatory point of view, either one gives us the same quality of information. For two highly correlated descriptors, we can remove one from consideration.
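The two screening steps described above (dropping descriptors with mostly identical values, then dropping one of each highly correlated pair) can be sketched as follows. The thresholds (0.90 for identical values, 0.95 for correlation), the function names, and the toy data are assumptions for illustration; the text above only suggests the 90-95% figure for identical values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_descriptors(table, max_identical=0.90, max_r=0.95):
    """Return the names of descriptors that survive both screening steps.

    table maps a descriptor name to its list of values (one per compound).
    """
    kept = {}
    for name, values in table.items():
        # Step 1: drop descriptors where one value dominates the column
        most_common = max(values.count(v) for v in set(values))
        if most_common / len(values) >= max_identical:
            continue
        # Step 2: drop descriptors highly correlated with one already kept
        if any(abs(pearson(values, other)) >= max_r for other in kept.values()):
            continue
        kept[name] = values
    return list(kept)


# Toy dataset: ten linear alkanes, C1 through C10
n_carbons = [float(n) for n in range(1, 11)]
table = {
    "n_C": n_carbons,
    "n_H": [2 * n + 2 for n in n_carbons],   # perfectly correlated with n_C
    "n_S": [0.0] * 10,                       # identical for every compound
    "shape_index": [0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 2.0, 1.0, 0.0, 1.0],  # invented
}
survivors = screen_descriptors(table)
print(survivors)
```

In this sketch the first descriptor of a correlated pair is kept simply because it is seen first; which one to keep is a design choice the text above leaves open.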
The datasets one uses depend largely upon resources and motivation. First, the datasets available to academic groups can be rather limited. In general, one either digs up a good set of compounds from the literature or is provided one by an outside collaborator (such as a drug company). The motivation for academic groups is to try new methodologies for better model building which can be applied to real datasets, or in the case of a collaboration, a company may want to investigate new technology without committing a full-time employee to that purpose. When working in industry, chances are you have access to a large virtual library of compounds, which can number in the hundreds of thousands or more. Here, the motivation is obviously to guide research and discovery of new consumer products or drug compounds.\n\nRegardless of your environment, there are a few considerations when creating a dataset of compounds to investigate. First, you should have confidence in the data. That is, all model building depends upon good data - if you're building a model to predict inhibition of a certain enzyme, you need to make sure that the compounds you use have reliable experimental data associated with them. Chances are good in industry that all experimental data were measured in-house, or at least contracted out to a lab under specific protocols. Gathering a large dataset from many different literature sources can introduce increased levels of uncertainty as to the quality of the data.\n\nFor a smaller dataset, such as the ones used in academia, one looks for a dataset size of at least one hundred compounds with good experimental data. Better yet are datasets numbering more than 500 compounds or even into the thousands. Larger datasets, such as the tens of thousands of compounds available in industry, can also be used. 
Of course, the more molecules that you deal with, the longer it takes to process all that information.\n\nDepending upon the modeling method used, one generally uses about 90% of the compounds for model building (i.e., all the decision making processes), and then reserves the other 10% of the compounds until the model is built, at which time the model is validated with this small subset. By convention, we call these two data subsets the [[training set|Training Set]] (TSET) and the [[prediction set|Prediction Set]] (PSET), respectively.
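The 90%/10% division into TSET and PSET described above can be sketched as below. Random selection with a fixed seed is an assumption made for the example (the text does not say how the prediction set is chosen), and the compound names are placeholders.

```python
import random


def split_dataset(compounds, prediction_fraction=0.10, seed=42):
    """Randomly divide compounds into a training set and a prediction set."""
    rng = random.Random(seed)           # fixed seed so the split is repeatable
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_pset = max(1, round(len(shuffled) * prediction_fraction))
    return shuffled[n_pset:], shuffled[:n_pset]


# Placeholder identifiers for a 100-compound dataset
compounds = [f"cmpd_{i:03d}" for i in range(1, 101)]
tset, pset = split_dataset(compounds)
print(len(tset), len(pset))
```

The prediction set must not be touched during model building; it is used only once, to validate the finished model.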
\n[[WELCOME!]]
In terms of the work presented here, a descriptor is simply a mathematical value which is the result of some calculation that encodes a particular feature of a molecule. This can be a simple and easily interpretable value, such as the molecular weight of a compound, or it can be a very complex and highly discriminant value whose interpretation is limited (in terms of some facet of the molecule).\n\nDescriptors here are generally categorized into one of four types:\n#[[topological|Topological Descriptor]]\n#[[geometric|Geometric Descriptor]]\n#[[electronic|Electronic Descriptor]]\n#[[hybrid or hybridized|Hybrid Descriptor]]\n\nTopological descriptors can be calculated from a simple two-dimensional [[connectivity matrix|Connection Table]] or graph representation, where only the types of atoms and their connections are relevant. These descriptors give information about the relative size and content of the molecule, and are quickly calculated.\n\nGeometric descriptors are calculated using [[connectivity|Connection Table]] information too, but they also rely on correct low-energy three-dimensional optimization prior to calculation, in order to capture the relative positions of atoms within the molecule. Typically, we use the semiempirical molecular orbital package [[MOPAC]] to calculate geometries before finding these descriptor values.\n\nElectronic descriptors are also dependent upon correct structure optimization. Here, however, rather than looking at the atom positions, we're interested in the partial atomic charges that result from geometry optimizations. So again, we use the results from the [[MOPAC]] optimization.\n\nHybrid descriptors involve calculations that may make use of topological, geometric, and/or electronic information in order to more fully capture aspects of a molecule. One particularly useful set of descriptors from this class is the [[Charged Partial Surface Area|Charged Partial Surface Areas]] descriptors.
Once you have a set of structures to investigate, you need to create information that will allow you to explain what may be linking structure with property. Simply stated, a molecular descriptor is nothing more than a structural feature that is translated into a mathematical value. These descriptors can be very simple, such as a topological descriptor like "number of carbon atoms". They can also be rather complex, with values determined by linear algebraic functions that describe some geometric value. In general, simple descriptors are easy to understand and interpret but may not lead to good discrimination between molecules. Complex descriptors may lead to a unique value for every compound in a dataset, but can often lead to no understanding of how that value correlates to anything that chemists are used to dealing with, such as hydrophobicity or molecular weight, etc.\n\n There are at least 2000 different molecular descriptors that one can calculate for a single molecule, but that is rarely a wise approach. We can usually categorize a descriptor into one of four categories: topological, geometric, electronic, or hybridized.\n\n ''Topological descriptors'' are taken from the connectivity information of a structure. In other words, they do not rely on a good geometry of a molecule, only how atoms are connected to each other. These descriptors include molecular weight, atom counts, path counts, path lengths, and connectivity indices. Topological descriptors are easy to calculate, but have less discrimination power.\n\n ''Geometric descriptors'' relay information about the shape and size of a molecule, and therefore depend upon a reliable representation of the molecule in 3-dimensional space. Finding a reliable geometry is often a point of contention between different groups, because geometric (and electronic and hybridized) descriptor values can differ within the same molecule depending upon how the geometry was calculated. 
There are a host of applications available to calculate molecule conformations. Examples of geometric descriptors include solvent accessible surface areas and volume.\n\n ''Electronic descriptors'', as with geometric descriptors, depend upon a 3-D geometry. Single-atom energy values are based upon their neighbor interactions, so even a slight tweak in geometry can alter these values. These descriptors, such as electronegativities and dipoles, give us an idea of the electronic environment of the whole molecule.\n\n ''Hybridized descriptors'' make use of two or more of the previous three descriptor types. These are useful for quantitative values of charged partial surface areas and hydrogen bonding information, and are quite useful in QSAR models.
[[Electronic Descriptor]]\n[[Electrotopological State Indices]]\n[[E-State Indices|Electrotopological State Indices]]\n[[Euclidean Distance]]\n[[External Validation Set|Prediction Set]] aka [[PSET|Prediction Set]]
coming soon ...\n\n''__List of Electronic Descriptors__''\n*[[HOMO Energy]]\n*[[LUMO Energy]]
Emacs is a good editor for files in the Linux system. There are other editors available, which you can read about elsewhere (see [[Suggested Reading]]). The following is a list of common commands that should let you do what you need to do for a QSAR study.\n\nCommands in the editor shell are usually given by typing specific keys in combination with the Control (''Ctrl'') key or the Escape (''Esc'') key. For this guide, commands will be written with the following short notation:\n*For Control key commands: ''C-x'' means //hold down the Ctrl key while pressing x//\n*For Escape or Alt key commands: ''M-x'' means //hold down the Alt key while pressing x, or press and release the Esc key and then press x//\n*For multi-key commands, it will sometimes be necessary to type two letters. In such a case, the notation will be ''C-x C-c'', which means that while holding down the Ctrl key, //type x followed by c//. Similarly, if the command is ''C-x u'', that means hold the Ctrl key while pressing x, //release the Ctrl key//, then press 'u'.\n\n|File Manipulation|c\n|Command|Keystrokes|Description|h\n|find-file|C-x C-f|after issuing this command, type in the first few letters of a filename and press ''Tab'' to complete it|\n|save-file|C-x C-s|saves changes under current file name; does not close editing buffer|\n|save-as-file|C-x C-w|(aka write-file); saves information in new file name (prompted)|\n|exit-file|C-x C-c|exits Emacs; prompts you to save any unsaved changes first|\n\n|Buffer Navigation|c\n|Command|Keystrokes|Description|h\n|cursor forward|C-f|similar to →|\n|cursor backward|C-b|similar to ←|\n|cursor previous-line|C-p|similar to ↑|\n|cursor next-line|C-n|similar to ↓|\n|cursor forward-one-word|M-f|moves cursor forward to the end of the current (or next) word|\n|cursor backward-one-word|M-b|moves cursor back to the start of the current (or previous) word|\n|cursor beginning-of-line|C-a|similar to Home|\n|cursor end-of-line|C-e|similar to End|\n|cursor back-one-sentence|M-a|moves back to the beginning of the sentence|\n|cursor forward-one-sentence|M-e|moves forward 
to the end of the sentence|\n|cursor forward-one-paragraph|M-}|move to the end of the current paragraph|\n|cursor backward-one-paragraph|M-{|move to the beginning of the current (or previous) paragraph|\n|cursor forward-one-screen|C-v|similar to Page Down|\n|cursor backward-one-screen|M-v|similar to Page Up|\n|cursor to line-number|M-x, goto-line|after command, type Enter; enter line number; type Enter|\n|cursor move n-lines|M-n command|e.g., to move 500 lines forward: M-500, C-n|\n|cursor beginning-of-buffer|M-<|moves to beginning of first line in buffer|\n|cursor end-of-buffer|M->|moves to the end of the buffer|\n\n|Buffer Editing|c\n|Command|Keystrokes|Description|h\n|undo-command|C-x u|undoes last command; can be repeated|\n|delete-character|C-d|erases one character at a time; similar to Del|\n|delete-to-end-of-word|M-d|erases all characters from cursor to the end of a word|\n|delete-to-end-of-line|C-k|erases all characters from cursor to the end of the line|\n|restore-deletion (yank)|C-y|re-inserts the text most recently erased with M-d or C-k|\n|cancel-command|C-g|clears buffer command line when you mistype a command (before execution)|\n\n
The Euclidean distance is the straight-line (shortest) distance between two points in //n//-dimensional space. For most applications here, this considers the x,y,z Cartesian coordinate space, where the distance //d// between two points (atoms in a molecule, for example) is given by:\n\nd^^2^^ = (x~~2~~ - x~~1~~)^^2^^ + (y~~2~~ - y~~1~~)^^2^^ + (z~~2~~ - z~~1~~)^^2^^\n\nFor any number of dimensions, this generalizes so that d^^2^^ is the sum of the squared differences between the two points in each dimension.\n\nFor a more complete explanation, check out Wolfram ~MathWorld's page [[here|http://mathworld.wolfram.com/Distance.html]]
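The formula above translates directly into code. This short sketch works for any number of dimensions; the function name and the coordinates are illustrative only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # d^2 is the sum of squared coordinate differences; d is its square root
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


# Two hypothetical atom positions (x, y, z)
atom_1 = (0.0, 0.0, 0.0)
atom_2 = (1.0, 2.0, 2.0)
print(euclidean_distance(atom_1, atom_2))
```

In Python 3.8 and later, the standard library function math.dist performs the same calculation.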
[[F-test]]\n[[Fisher's F-test|F-test]]\n[[Foreign Letters]] (for formatting reference only)\n[[Frequency (Absolute)|Absolute Frequency]]
Formatting of foreign letters. Click 'view' to see how to display special letters.\n\n|''&_grave;''|''&_acute;''|''&_circ;''|''&_uml;''|''&_tilde;''|''&_ring;''|''&_slash;''|''&_cedil;''|\n| À | Á | Â | Ä | Ã | Å | Ø | Ç |\n| à | á | â | ä | ã | å | ø | ç |\n| È | É | Ê | Ë | Õ | | | |\n| è | é | ê | ë | õ | | | |\n| Ì | Í | Î | Ï | Ñ | | | |\n| ì | í | î | ï | ñ | | | |\n| Ò | Ó | Ô | Ö | | | | |\n| ò | ó | ô | ö | | | | |\n| Ù | Ú | Û | Ü | | | | |\n| ù | ú | û | ü | | | | |\n| | Ý | | Ÿ | | | | |\n| | ý | | ÿ | | | | |
''Bold'' is produced by placing two single quotes on each side of a string. Double quotes give "this".\n//Italic// is produced by two forward slashes fore and aft ("/ /").\n__Underline__ is produced by two underscores fore and aft.\n--Strikethrough-- is produced by surrounding text with two dashes.\nA superscript^^this^^ is surrounded by two "^" caret marks.\nA subscript~~this~~ is surrounded by two "~" tilde marks.\nAnd @@highlighting@@ is accomplished by surrounding text with two "@" signs.\nShowing {{{@@highlighting@@}}} literally is done by surrounding it with three "{" and three "}".\nCenter text in a table cell with {{{| word |}}}; right-align with {{{| word|}}}; left-align with {{{|word |}}}\nNon ~WikiWords can be preceded by a single ~tilde.\n\n*bullet points by starting with asterisk\n**sub-bullet by two asterisks\n\n#numbered bullet points by the pound sign\n##sub-item by two pound signs\n\n
[[Genetic Algorithm]]\n[[Geometric Descriptor]]\n[[Greek Letters]]
coming soon ...\n\n''__List of Geometric Descriptors__''\n*
To get started with this blank TiddlyWiki, you'll need to modify the following tiddlers:\n* SiteTitle & SiteSubtitle: The title and subtitle of the site, as shown above (after saving, they will also appear in the browser title bar)\n* MainMenu: The menu (usually on the left)\n* DefaultTiddlers: Contains the names of the tiddlers that you want to appear when the TiddlyWiki is opened\nYou'll also need to enter your username for signing your edits: <<option txtUserName>>
This is the starting link to a glossary of QSAR, chemometric, and related terminology.\n\n| [[0-9 Terms]] | [[A Terms]] | [[B Terms]] | [[C Terms]] | [[D Terms]] | [[E Terms]] | [[F Terms]] | [[G Terms]] | [[H Terms]] | [[I Terms]] | [[J Terms]] |\n| [[K Terms]] | [[L Terms]] | [[M Terms]] | [[N Terms]] | [[O Terms]] | [[P Terms]] | [[Q Terms]] | [[R Terms]] | [[S Terms]] | [[T Terms]] | [[U Terms]] |\n| [[V Terms]] | [[W Terms]] | [[X Terms]] | [[Y Terms]] | [[Z Terms]] | | | | | | [[Acronyms|Acronyms & Abbreviatons]] |
|List of Greek letters|c\n|''letter''|''upper''|''lower''||''letter''|''upper''|''lower''| |''letter''|''upper''|''lower''| |''letter''|''upper''|''lower''|h\n|alpha| Α | α |bgcolor(#ffffaf):|eta| Η | η |bgcolor(#ffffaf):|nu| Ν | ν |bgcolor(#ffffaf):|tau| Τ | τ |\n|beta| Β | β |bgcolor(#ffffaf):|theta| Θ | θ |bgcolor(#ffffaf):|xi| Ξ | ξ |bgcolor(#ffffaf):|upsilon| Υ | υ |\n|gamma| Γ | γ |bgcolor(#ffffaf):|iota| Ι | ι |bgcolor(#ffffaf):|omicron| Ο | ο |bgcolor(#ffffaf):|phi| Φ | φ |\n|delta| Δ | δ |bgcolor(#ffffaf):|kappa| Κ | κ |bgcolor(#ffffaf):|pi| Π | π |bgcolor(#ffffaf):|chi| Χ | χ |\n|epsilon | Ε | ε |bgcolor(#ffffaf):|lambda| Λ | λ |bgcolor(#ffffaf):|rho| Ρ | ρ |bgcolor(#ffffaf):|psi| Ψ | ψ |\n|zeta| Ζ | ζ |bgcolor(#ffffaf):|mu| Μ | μ |bgcolor(#ffffaf):|sigma| Σ | σ |bgcolor(#ffffaf):|omega| Ω | ω |
[[Hartree-Fock]]\n[[Highest Occupied Molecular Orbital]]\n[[HIN File]]\n[[HOMO Energy]]\n[[Hybrid Descriptor]] or [[Hybridized Descriptor|Hybrid Descriptor]]\n[[Hydrogen Bond]]\n[[Hydrogen-Suppressed Graph]]\n[[HyperChem]]
coming soon ..
coming soon ...\n\n''__List of Hybrid Descriptors__''\n*[[Charged Partial Surface Areas]] or [[CPSA|Charged Partial Surface Areas]]\n*
''__Function__''\n*This program runs from the command line and can be used multiple times during a study to create useful [[HyperChem]] scripts that help automate some otherwise laborious processes.\n*Creates a script that will automatically add hydrogen atoms and do a simple molecular mechanics optimization on heteroatom backbone sketches.\n*Creates a script that will copy and rename [[.hin|HIN File]] files to [[.zmt|ZMT File]] files.\n*Creates a script that will copy and rename [[.zmt|ZMT File]] files to [[.hin|HIN File]] files.\n*Output file will have suffix .scr (name varies with chosen options)\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-i INPUT |{{{--}}}input=INPUT|n/a|set input file type to 'hin' or 'zmt' (must use with -o; do not use with -H)|\n|-o OUTPUT|{{{--}}}output=OUTPUT|n/a|set output file type to opposite of input type (must use with -i; do not use with -H)|\n|-H|{{{--}}}addH|n/a|tells script to add hydrogens and model build (do not use with -i, -o)|\n|-f FLIST|{{{--}}}files=FLIST|n/a|designate the files to process (enter first and last); must be used with either -H, or -i and -o)|
Independent Variable (aka [[Predictor Variable]])
[[k-Nearest Neighbor]]\n[[Kappa Indices]]\n
[[Least Squares for Regression]]\n[[Linear Discriminant Analysis]]\n[[London Forces]]\n[[Lowest Unoccupied Molecular Orbital]]\n[[LUMO Energy]]
coming soon ...
The following is a list of interesting/useful links related to QSAR, chemistry, and other significant resources.\n\n__''Programming Languages & OS''__\n*[[Python|http://www.python.org]]\n*[[Perl|http://www.perl.org]]\n*[[Linux.org|http://www.linux.org]]\n*[[Linux Documentation Project|http://en.tldp.org/]]\n*[[Linux Knowledge Base|http://www.linux-tutorial.info/]]\n*[[Fedora Project|http://rhold.fedoraproject.org/]]\n__''QSAR and Modeling Research Groups & Software''__\n*[[Milano Chemometrics|http://www.disat.unimib.it/chm/]]\n*[[QSAR Research Unit|http://dipbsf.uninsubria.it/qsar/]]\n*[[Research Group for Molecular Informatics|http://almost.cubic.uni-koeln.de/jrg]]\n*[[Radford Neal's Flexible Bayesian Modeling|http://www.cs.toronto.edu/~radford/fbm.software.html]]\n*[[QSAR & Modelling Society|http://www.ndsu.nodak.edu/qsar_soc/]]\n*[[ACS COMP Division|http://membership.acs.org/C/Comp/newsletters/index.html]]\n*[[Cheminformatics.org|http://www.cheminformatics.org/]]\n*[[NCI DIS 3-D Database|http://dtp.nci.nih.gov/docs/3d_database/dis3d.html]]\n*[[Online MOPAC Manual|http://www.chm.tu-dresden.de/edv/mopac6/mop.html]]\n*[[Chemoinformatics Hub|http://www.chemoinf.com/]]\n__''Journals''__\n*[[J. Chem. Inf. Model.|http://pubs.acs.org/journals/jcisd8/index.html]]\n*[[J. Molec. Graph. Model.|http://www.elsevier.com/wps/find/journaldescription.cws_home/525012/description]]\n*[[J. Comp. Chem.|http://www3.interscience.wiley.com/cgi-bin/jhome/33822]]\n*[[SAR & QSAR in Env. Res.|http://www.tandf.co.uk/journals/titles/1062936x.html]]\n*[[QSAR Comb. Sci.|http://www3.interscience.wiley.com/cgi-bin/jhome/104557877]] (formerly Q.S.-A.R)\n*[[J. Chemometrics|http://www3.interscience.wiley.com/cgi-bin/jhome/4425]]\n*[[J. Med. Chem|http://pubs.acs.org/journals/jmcmar/index.html]]\n*[[Chem. Res. Tox.|http://pubs.acs.org/journals/crtoec/index.html]]\n*[[J. Comp.-Aided Molec. Design|http://www.springerlink.com/content/1573-4951/]]\n\n\n
In order to successfully navigate through your work area, manipulate files, and run programs on the Linux workstation, you must be familiar with some basic commands and conventions. A few important ones will be covered here, but there are several Linux-related websites (see [[Links]]) and books available (see [[Suggested Reading]]).\n\nFor all of the workstation programs and file manipulations, you will be working in a terminal environment. That is, rather than a bunch of fancy icons or the typical windows-type interface, you will be entering command at a ~command-line prompt. For this section and others, when asked to enter a command at the prompt, an example will be given such as:\n/>> //command name//, where the '''/>>''' represents the computer prompt shown in your terminal. You will not actually type '/>>'\n\nSome basic commands in Linux that you should know include:\n*''ls'' - this command shows you a listing of the files and directories in your current location on the workstation\n*''pwd'' - this command shows your exact location in the workstation directory hierarchy. Typically, when you log on to the system, you will start in your home directory: ''/home/username'', where //username// is your user identification assigned to you by Dr. ~McElroy.\n*''mkdir'' - this command allows you to create a directory when followed by a word. For example, if you wanted to create the directory ''chem'' in your ''/home/username'' directory, then at the prompt type //>>mkdir chem//. Now, if you type the command //>> ls//, you should see ''/chem'' in the home directory.\n*''cd'' - this command allows you to ''c''hange ''d''irectory when followed by an existing directory. From /home/username, try the command //>>cd chem//. If the /chem directory was created, you should be able to change into it. If successful, you may type //>>pwd// and see that you are there. 
Type //>>ls//, and you will see that there are no files existing in that directory.\n*''cd ..'' - By typing the command //>>cd ..// (that's cd followed by two periods), you will move back up the directory by one level. If in /home/username/chem, issue the command //>>cd ..//, then type //>>pwd// and/or //>>ls//. You should see that you are back in your home directory /home/username. From anywhere in the system, if you simply type //>>cd// with nothing after it, you will automatically be transferred to your /home/username directory.\n*''touch'' - If you type this command followed by a word, it will create a file by that name at your current location. In your /home/username directory, type //>> touch a.txt//. After this, type //>>ls//, and you should see the file 'a.txt' in the directory. Once the file exists, you could edit it using [[Emacs]] or other editing programs.\n*''cp'' - this command allows you to copy a file. In the /home/username directory, type //>>cp a.txt b.txt//, then //>>ls//. You should now see that both a.txt and b.txt exist, where b.txt is a copy of a.txt.\n*''mv'' - this command allows you to move a file (not copy) from one place to another. In /home/username, type //>>mv b.txt chem //. If the directory /chem exists in your /home/username directory, it should have moved b.txt to that directory. Use the proper commands to see if this worked.\n*''rm'' - this command deletes a file (BE CAREFUL). Once a file or directory is removed, you can't get it back (i.e., no UNDO). In your /home/username/chem directory, type the command //>>rm b.txt//. Once completed, that file should no longer exist.\n*''cat'' - this command allows you to scroll the contents of a file on your screen. In your /home/username directory, issue the command //>>cat a.txt//. Most likely nothing will appear on the screen because a.txt is empty. If you had a list of books in that file, then all of the contents would scroll down your screen.
\n[[Mahalanobis Distance]]\n[[Mathematical Symbols|Special Characters]]\n[[Mean]]\n[[Mean, Weighted|Weighted Mean]]\n[[Mean Absolute Deviation]] or [[MAD|Mean Absolute Deviation]]\n[[Mean Centering]]\n[[Mean Deviation]]\n[[Median]]\n[[Modal Value|Mode]]\n[[Mode]]\n[[Molecular Connectivity]]\n[[Molecular Distance Edge]]\n[[Molecular Surface Area|Solvent Accessible Surface Area]] (see [[Solvent Accessible Surface Area]])\n[[Molecular Volume]]\n[[Molecular Walk Counts]]\n[[MOPAC]]\n[[Multiple Correlation]]\n[[Multiple Correlation (Adjusted)]]\n[[Multiple Correlation (Validation)]]\n[[Multiple Linear Regression]] or [[MLR|Multiple Linear Regression]]
[[QSAR Tutorial]]\n[[QSAR Programs]]\n[[Glossary]]\n[[References]]\n[[Links]]\n\n__IUP Links__:\n[[Chemistry|http://www.iup.edu/chemistry]]\n[[McElroy Homepage|http://nsm1.nsm.iup.edu/nate]]\n[[IUP Homepage|http://www.iup.edu]]
Once we have our dataset ready to go (subsets are made, descriptors are calculated, information content is maximized), we need to decide HOW we're going to make the connection between structure and activity. There are several options available and we can categorize them in a few ways. First, we can look at linear versus nonlinear mapping routines.\n \nLinear mapping routines are those that take a statistical approach and produce a final equation or solution that relies on specific variables with some sort of weighting - for example, a linear least squares method will give you a pseudo 'y=mx+b' line. Of course, rather than looking at an x vs. y line, we're dealing with several (n) dimensions. In general, the equation would be y = a~~1~~X~~1~~ + a~~2~~X~~2~~ + ... + a~~n~~X~~n~~, where each X represents a descriptor and each a value is the weight associated with that descriptor. The property value that is predicted (y) is a function of the important descriptors in the model.\n \nNonlinear mapping is the result of a more complex function such as a [[computational neural network|Computational Neural Network]]. Here, the final answer is just that - a value spit out of the neural network. It is the result of a complex set of network connections between input, hidden, and output layer network 'neurons'. Neural networks are optimized by minimizing the error of property prediction for a set of compounds by adjusting the internal network parameters (weights, bias terms). Here, we get a list of important descriptors, but no real way of knowing which one may be more important than others.\n \nIn general, linear methods provide better model interpretation at the cost of higher prediction errors. Nonlinear mapping methods tend to give better predictive error results, but at the loss of interpretation.\n \nFor both linear and nonlinear methods, we can further categorize into continuous value models and classification models. 
Continuous value models give a prediction of some property that is a real number (a boiling point, an LD~~50~~, etc.), while classification gives us only a yes/no answer (toxic or nontoxic, for example). The type of mapping method used is heavily dependent upon the type of endpoint used to create the model.\n \nExamples of continuous mapping routines include [[multiple linear regression (MLR)|Multiple Linear Regression]], [[support vector machines (SVMs)|Support Vector Machine]], [[computational neural networks (CNNs)|Computational Neural Network]], [[Bayesian Regularized Neural Networks (BRNNs)|Bayesian Neural Network]], and [[Principal Component Analysis (PCA)|Principal Component Analysis]]. Examples of classification mapping routines include [[k-Nearest Neighbor (kNN)|k-Nearest Neighbor]], [[Linear Discriminant Analysis (LDA)|Linear Discriminant Analysis]], [[BRNNs|Bayesian Neural Network]], [[SVMs|Support Vector Machine]], and decision trees.\n \nOne other component of mapping routines is the method of descriptor selection used to find models from a large set of possibilities. We seek models that use as few descriptors as possible to model our data. It makes no sense to have a 50-descriptor model, for example, because there is too much variability possible, and chances are good that we're overfitting our data. Interpretation, which is difficult with even small models, becomes impossible. For example, if we have a descriptor pool of 75 molecular descriptors, we want to find only 3-5 descriptors that will give us good predictive results. Therefore, we must find a smart way of picking the best subset of descriptors from that pool. Enter the search algorithm.\n \nThere are several search algorithms that can be used, but great success has been achieved with a [[genetic algorithm|Genetic Algorithm]]. Based on the principle of Darwinian evolution, small models of descriptors are chosen from the large pool. 
The success of these models is measured based upon error minimization methods (root mean square errors). The descriptors from these models are then changed by cross-over mating and mutation to form a new set of 'child models' from the originals. If the children are more successful, then we proceed in that direction of mutation. If the models are less successful, then we go back to the parents and choose a new strand of mutation to process. This process occurs until no better models are found, and as a result we have searched only a small subspace of all possible 3-5 descriptor models from a large pool of 75 descriptors.
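The genetic search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the group's actual program: the function name, the population size, the mutation rate, and the stopping rule are all hypothetical, and the error function is a toy stand-in for the RMSE of a fitted model. It also assumes the pool holds at least `population_size` distinct subsets.

```python
import random


def genetic_subset_search(pool_size, subset_size, error, population_size=20,
                          max_stale_generations=25, mutation_rate=0.3, seed=0):
    """Search for a low-error subset of descriptor indices (illustrative only)."""
    rng = random.Random(seed)
    descriptors = range(pool_size)

    # Start from distinct random small models drawn from the large pool.
    seen = set()
    while len(seen) < population_size:
        seen.add(tuple(sorted(rng.sample(descriptors, subset_size))))
    population = sorted(seen, key=error)

    best = population[0]
    stale = 0
    while stale < max_stale_generations:
        # The better half of the population become parents.
        parents = population[:population_size // 2]
        children = []
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Cross-over: the child draws its descriptors from both parents.
            genes = sorted(set(mother) | set(father))
            child = set(rng.sample(genes, subset_size))
            # Mutation: occasionally swap one descriptor for another from the pool.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                candidates = [d for d in descriptors if d not in child]
                child.add(rng.choice(candidates))
            children.append(tuple(sorted(child)))
        # Keep parents alongside children, so a worse generation is discarded.
        population = sorted(set(parents + children), key=error)[:population_size]
        if error(population[0]) < error(best):
            best, stale = population[0], 0
        else:
            stale += 1
    return best


# Toy error function (hypothetical): counts how many of three "ideal"
# descriptors are missing. A real study would return the model's RMSE.
TARGET = {3, 17, 42}


def toy_error(subset):
    return len(TARGET - set(subset))


best_model = genetic_subset_search(pool_size=75, subset_size=3, error=toy_error)
print(best_model, toy_error(best_model))
```

Because the parents are kept alongside their children, the best error never gets worse from one generation to the next, which mirrors the "go back to the parents" step in the description.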
The ~McElroy Research Group was started in 2005 at the [[Indiana University of Pennsylvania|http://www.iup.edu]] [[Department of Chemistry|http://www.iup.edu/chemistry]] by [[Dr. Nathan McElroy|http://nsm1.nsm.iup.edu/nate]].\n\nDr. ~McElroy received his BS Chemistry degree from the department in 1994. After a fellowship at Battelle Marine Sciences Lab, he worked for three years as an analytical research chemist at BASF in Research Triangle Park, NC. In 1998, he returned to graduate school at [[The Pennsylvania State University|http://www.psu.edu]] and completed his ~PhD under the direction of Dr. Peter Jurs. After graduation in 2003, Dr. ~McElroy took a postdoctoral position as a computational chemist in the //Centre de Recherches// of ~AstraZeneca's Reims, France facility.\n\nAt IUP, Dr. ~McElroy's main research focus is in [[Quantitative Structure-Activity Relationship]] studies, linking physical structures of small organic compounds to biological activities of interest. \n\nCurrently Mr. Sean Smith, a master's student, is working on a QSAR of air-to-blood distribution partition coefficients of small volatile organic compounds.
The arithmetic mean is the sum of values in a series divided by the number of observations (also known as the average).\n[img[mean|img_QSAR/mean1.gif]]\nwhere //N// is the number of observations; //x~~i~~// is an individual value in the series.\nWhen referencing a sample mean, x-bar (''x'' with an overbar) is used. When referencing a population mean, the lowercase Greek letter ''μ'' is used.\n\nSee also [[weighted mean|Weighted Mean]].\n
The Mean Absolute Deviation of a data sample is taken as the sum of the absolute differences between each value and the sample [[mean|Mean]], each multiplied by the [[absolute frequency|Absolute Frequency]] of that value, divided by the number of observations.\n[img[M.A.D.|img_QSAR/MAD1.gif]]\nwhere N is the number of observations; //x~~i~~// is an individual observation; x-bar is the sample mean; //f~~i~~// is the absolute frequency of that observation.\n\nNot to be mistaken for the [[mean deviation|Mean Deviation]].
The mean deviation is the [[mean|Mean]] of the absolute deviations of the individual values in a series of data from the data set mean.\n[img[mean deviation|img_QSAR/meandev1.gif]]\nwhere N is the number of observations; //x~~i~~// is an individual observation; x-bar is the data set mean.\n\nNot to be confused with the [[standard deviation|Standard Deviation]].
For a set of observations, if the set is ordered from lowest to highest value, the median is the middle value, i.e., the value above which and below which 50% of the observations fall.\n\nFor an odd number of observations, the value of the middle observation is chosen. For an even number of observations, the average value of the two middle observations is used for the median.
For a set of values, the mode (or modal value) is the point of central tendency defined as the value that appears most frequently. For the set 9, 12, 15, 15, 15, 16, 16, 20, 26, the mode is 15.\n\nIt is also possible for a data set to have two modes, giving a bimodal distribution.
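As a quick check of the example above, here is a minimal Python sketch using the standard library's `statistics.multimode`, which returns every value tied for most frequent. The helper name `modes` and the second data set are made up for illustration.

```python
from statistics import multimode


def modes(values):
    """Return all values tied for the highest frequency."""
    return multimode(values)


print(modes([9, 12, 15, 15, 15, 16, 16, 20, 26]))  # [15]
print(modes([1, 1, 2, 3, 3]))                      # [1, 3], a bimodal set
```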
''Validation''\nOnce we pick our 'best models', we need to know if we built a model that will work for us. Though we take several precautions up to this point, there is always the chance that a model was created randomly that gives good results for our training set of compounds. Now we can take our model and apply it to the prediction set of compounds that have never influenced the model. If the results of the model are similar for our prediction set compounds, then we are one step closer to proving our model works correctly.\n\n Another validation step we can take is to use a 'Monte Carlo' approach. Here, we scramble all the property values between compounds while keeping the molecular descriptors intact. We then rebuild the models under the same conditions. If in fact we did create a valid predictive model, then scrambling the dependent variable (the property values) should result in very bad predictive results. We are then assured that our true models do in fact link property to structure, and we didn't just find a good model by chance.\n\n''Implementation''\nOnce a good model has been constructed and validated, we can now use the model to predict properties of molecules for which we have no endpoint data. This is the main goal of industry models, where a representative dataset of compounds is used to build a model to which unknowns can be applied.
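The scrambling test described under Validation can be illustrated with a small, self-contained Python sketch. Everything here is an assumption made for illustration: the data are synthetic, the "model" is a one-descriptor least-squares line (for which r^^2^^ equals the squared correlation coefficient), and the function names are hypothetical. A real study would rebuild the full model for every scrambled data set.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. r^2 of a one-descriptor linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def scrambled_r_squared(xs, ys, trials=200, seed=0):
    """Refit after shuffling the property values among the compounds."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # descriptors stay in place, properties move
        results.append(r_squared(xs, shuffled))
    return results


# Synthetic data: one descriptor and a property that depends on it linearly,
# plus a small deterministic wobble standing in for experimental noise.
descriptor = [float(i) for i in range(20)]
prop = [2.0 * x + 1.0 + 0.1 * ((7 * i) % 5 - 2) for i, x in enumerate(descriptor)]

true_r2 = r_squared(descriptor, prop)
scrambled = scrambled_r_squared(descriptor, prop)
print("true r^2:", round(true_r2, 4))
print("mean scrambled r^2:", round(sum(scrambled) / len(scrambled), 4))
```

If the real model's r^^2^^ sits far above the whole distribution of scrambled values, the structure-property link is unlikely to be a chance correlation.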
coming soon
[[Noncovalent Interactions]]
[[Open Babel]]
[[Open Babel|http://openbabel.sourceforge.net/wiki/Main_Page]] is an open source (OS) toolkit that allows chemical information in one format to be translated to another format.\n\nThis is particularly handy when dealing with different data formats produced by [[HyperChem]], [[MOPAC]], and [[SMILES]] programs, so that information can be traded between programs without losing vital information.
[[Path Count]]\n[[Path Length]]\n[[PM3 Hamiltonian]]\n[[Prediction Set]] or [[PSET|Prediction Set]]\n[[Predictor Variable]]\n[[Principal Component Analysis]]
\n[[Q-test]]\n[[Quantitative Structure-Activity Relationship]] or [[QSAR|Quantitative Structure-Activity Relationship]] \n[[Quantitative Structure-Property Relationship]] or [[QSPR|Quantitative Structure-Property Relationship]]
''What exactly is QSAR?''\nA [[Quantitative Structure-Activity Relationship]] (or sometimes [[Quantitative Structure-Property Relationship]], QSPR) is a theoretical model that links the molecular structure of an organic compound to some biological activity or physical property of interest. It is an inductive approach, meaning that we follow a distinct path of reasoning to lead us from structure to property - in other words, we can't just 'look' at a structure and determine its property. QSAR/QSPR has been used to model properties such as boiling point, aqueous solubility, and glass transition temperatures of polymers, and to examine interesting biological activities such as enzyme inhibition, blood-brain barrier partition coefficients, LD~~50~~ values (toxicity) for various species interactions, and a host of others. In general, the pharmaceutical industry uses QSAR in the early stages of drug development to screen for potentially toxic compounds and sometimes to guide research in a particular subset of compounds or class of compounds. ADME/Tox (Absorption, Distribution, Metabolism, Excretion/Toxicity) studies are crucial to the success of candidate drugs.\n\n''Who uses QSAR?''\nAs mentioned above, the pharmaceutical industry is very interested in finding ways to screen compounds for toxicity and other potential problems early on in the drug discovery phase. "Fail early, fail cheap" is the standard thought, because the longer you work with a potential candidate drug, the more money you're spending. If you can weed out a potentially toxic compound early in the process, then you save a lot of time and money.\n\nAlmost every major pharmaceutical company has some employees, if not a whole team/section of ~CADDers (computer-aided drug designers), who spend their time making predictive ADME/Tox models in order to screen huge databases of chemical compounds (some of the larger companies can have virtual libraries with hundreds of thousands of chemical structures!). 
The field of computational chemistry is now rather broad with the advances in computer speed and power over the last 40 years, and QSAR is a small part of that.\n\n''How does it work?''\nThe following description of QSAR is obviously biased by how I learned to use QSAR. There are several great books available on the subject, and there are several ways to go about it. A full history and evolution of the field will not be presented here, either, as this is available elsewhere. My approach to QSAR comes from the methods I learned under Dr. Peter Jurs at Penn State. In general, I find there are seven steps to create a good and reliable QSAR model:\n\n 1. [[Data Set Construction]]\n 2. [[Structure sketching|Sketching Compounds with HyperChem]]\n 3. [[Descriptor Calculation]]\n 4. [[Data preprocessing & reduction|Data Reduction]]\n 5. [[Mapping]]\n 6. [[Validation|Model Validation and Usage]]\n 7. [[Implementation|Model Validation and Usage]]\n\n''Benefits of Using QSAR''\nThere are several benefits of using QSAR models. One of the major focuses is their use in the pharmaceutical and consumer products industries as a tool to screen large datasets of compounds for properties that might have a detrimental effect in ADME/Tox concerns. Removing potentially toxic compounds from consideration when searching for candidate drugs or other consumables is most efficient in the early development phases, and this represents a substantial cost savings. Smaller, more focused models can help researchers understand what characteristics of a molecule are important in a particular property, thereby giving some insight into the structural characteristics. This can help guide research and synthesis toward more desirable products.\n\n Though we're a long way off from relying completely on theoretical models for property prediction, these tools present a cost-effective and important step in industrial processes. 
All compounds that come to market must still undergo rigorous testing (both animal testing and human clinical trials), but the relative ease of screening thousands of compounds //in silico// presents a reliable way to avoid spending money and time on //in vitro// and //in vivo// testing.\n\n''Limitations of QSAR''\n As useful as QSAR can be, there are several limitations that must be considered to ensure that they are used properly, and to ensure that their results are properly interpreted.\n\n First, the data from which the model is constructed must be sound. In other words, one hopes to start with data that comes from a single source and whose experimental data was collected under the same protocol. \n \nSecond, the model depends upon the types of compounds that are used to build it. For example, if a model is constructed from only hydrocarbons or halogenated hydrocarbons, then using the model to predict properties of compounds NOT from these groups is subject to doubt because of extrapolation. In industry, for example, where thousands of compounds can be used to construct a model, it is sometimes desirable to construct a general 'global' model, that accounts for many compound classes. At other times, it may be better to use a smaller, more focused 'local' model that is concerned only with a particular subset of chemical compounds. Many times, you must think ahead to what your model will be used for before you even build it. Similarly, one must determine how similar an unknown compound is to the compounds represented by the model training set.\n \nThird, the limit of prediction power is dependent upon the errors found in training and validating the model. When an unknown is predicted using the model, you can only be certain of its 'correctness' within the limit of the errors found in training and validation.\n \nFourth, the interpretation of model descriptors is a point of contention among many groups. 
One can always make strong arguments as to how the descriptors found in your model can be related to the structure-activity link, particularly if dealing with a linear model. For nonlinear applications, there is not yet a reliable way to interpret how each descriptor is affecting the outcome, and how that may influence the property. The only real way to know these relationships is to synthesize a new molecule, test its property, and see how it compares to the model prediction.
The following scripts/programs are available for use on the workstation:\n[[calcAtomCounts]]\n[[calcMatrices]]\n[[calcMean]]\n[[calcMolecularWeight]]\n[[calcPaths]]\n[[calcRange]]\n[[calcStdDev]]\n[[calcZscore]]\n[[chkPM3]]\n[[cmdln]]\n[[cmpAttr]]\n[[dbOpen]]\n[[dbStore]]\n[[doOptimizations]]\n[[getCharges]]\n[[getHINformation]]\n[[hin2smi]]\n[[HyperChemMaker]]\n[[makeGraphs]]\n[[makeSMILES]]\n[[prepQSARworkArea]]\n[[QSAR_classes]] (//under construction//)\n[[thruSpaceDist]]\n[[updateCoordinates]]
There are several points to address in this tutorial:\n#Background theory or information about what a QSAR study does and what its limitations are.\n#How to run a QSAR study using the software available on the workstation and desktop PC.\n#A list of individual programs and how to run each using the optimal settings.\n\n__''1: Background information for understanding the basic concepts of a QSAR''__ (for anyone interested)\n*[[QSAR Overview]]\n*[[Data Set Construction]]\n*[[Descriptor Calculation]]\n*[[Data Reduction]]\n*[[Mapping]]\n*[[Model Validation and Usage]]\n\n__''2: Directions for running a QSAR study from your PC and //saison//''__ (relevant for IUP students only)\n*[[Linux Conventions]]\n*[[Work Area and Environment]]\n*[[Sketching Compounds with HyperChem]]\n*[[Reading Compound Information]]\n*[[Optimizing Compound Geometry]]\n*[[Optimizing Compound Charges]]\n*[[Molecule Attributes]]\n*[[Calculating Descriptors]]\n*[[Data Set Manipulation]]\n*[[Reducing the Data]]\n*[[Linear Mapping]]\n*[[Nonlinear Mapping]]\n*[[Model Validation]]\n*[[Information Storage]]\n\n__''3: Individual Programs Available on //saison//''__ \nA list of individual scripts and programs for QSAR that are available on //saison// can be found [[here|QSAR Programs]]. While anyone is welcome to look through these, they can only be accessed and run by those with access to the group's workstation.
coming soon\n\nsee the [[QSAR Tutorial]] for more detailed information.
coming soon
\n[[Range Scaling]]\n[[Regression Model]]\n[[Residual Analysis]]\n[[RMSE]] or [[Root Mean Square Error|RMSE]]
References used in this wiki (loosely alphabetical by author).\n''__Be__''\n [[ref-be-001]] Besalú, E.; de Julián-Ortiz, J.V.; Pogliani, L. "Trends and Plot Methods in MLR Studies" //[[J. Chem. Inf. Model.|http://pubs.acs.org/journals/jcisd8/index.html]]//, ''2007'', ASAP article.\n''__En__''\n [[ref-en-001]] Engel, T. "Basic Overview of Chemoinformatics" //[[J. Chem. Inf. Model.|http://pubs.acs.org/journals/jcisd8/index.html]]//, ''2006'', //46//, 2267-2277.\n''__Ju__''\n [[ref-ju-001]] Jurs, P.C. //Computer Software Applications in Chemistry//; 2nd ed., John Wiley & Sons: New York, NY, 1996.\n''__Ne__''\n [[ref-ne-001]] Neal, R. //Bayesian Learning for Neural Networks//; ~Springer-Verlag: New York, NY, 1996.\n''__Ot__''\n [[ref-ot-001]] Otto, M. //Chemometrics: Statistics and Computer Application in Analytical Chemistry//; ~Wiley-VCH: Weinheim, Germany, 1999.\n''__St__''\n [[ref-st-001]] Steiner, R. //The Chemistry Maths Book//; Oxford University Press: Oxford, UK, 1996.\n [[ref-st-002]] Stouch, T.R.; Jurs, P.C. "A Simple Method for the Representation, Quantification, and Comparison of the Volumes and Shapes of Chemical Compounds" //[[J. Chem. Inf. Comput. Sci.|http://pubs.acs.org/journals/jcisd8/index.html]]//, ''1986'', //26//, 4-12.\n''__To__''\n [[ref-to-001]] Todeschini, R.; Consonni, V. //Handbook of Molecular Descriptors//; ~Wiley-VCH: Weinheim, Germany, 2000.\n
[[SCF Field]]\n[[Semiempirical Method]]\n[[Simulated Annealing]]\n[[SMILES]]\n[[Solvent Accessible Surface Area]]\n[[Special Characters]]\n[[Standard Deviation]]\n[[Student's t-test|t-test]]\n[[Supervised Methods]]\n[[Support Vector Machine]]\n[[Surface Area (Molecular)|Solvent Accessible Surface Area]] (see [[Solvent Accessible Surface Area]])
SMILES = ''S''implified ''M''olecular ''I''nput ''L''ine ''E''ntry ''S''pecification\n\nThe original work was created by Arthur and David Weininger (need ref), and has been expanded since. This system allows an unambiguous description of a chemical structure within a single string (list of characters), and is quite common in [[cheminformatics|Cheminformatics]]. SMILES are used for information storage and retrieval in many software packages, and are quite commonly used in publications. \n\nA good explanation of SMILES notation can be found at the [[Daylight Chemical Information Systems website|http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html]]. \n\nFor our work here, we use SMILES for database storage and dataset processing, and the SMILES notation is created from [[HyperChem]] [[hin file|HIN File]] information by conversion via [[Open Babel]] software.
A guide to QSAR and related topics in the [[McElroy Research Group]]
QSAR Wiki
Structure sketching is a common method of entering structural information. For example, we use the commercially available package ~HyperChem to physically draw each molecule of a dataset. Each 'picture' is then saved as a text file in specific format to relay two-dimensional aspects of the molecule, such as atom identity and how each atom is connected to its neighbor. This can be quite laborious for large datasets. In industry, or when doing collaborations with industry, typically the information is already provided for you in different file formats. Large virtual libraries store chemical structure information in a long string of characters, called [[SMILES]]. These can then be readily processed for our next step, descriptor calculation.\n\nFor exact directions to use ~HyperChem for your QSAR study, see [[Using HyperChem]].
Some special characters and mathematical symbols\n|''symbol''|''description''| |''symbol''|''description''| |''symbol''|''description''|\n| ∂ |partial differential| | ∧ |logical and| | ⊂ |subset of|\n| ∃ |there exists| | ∨ |logical or| | ⊃ |superset of|\n| ∅ |empty or null set| | ∩ |intersection| | ⊄ |not a subset of|\n| ∇ |nabla or backward difference| | ∪ |union| | ⊆ |subset of or equal to|\n| ∈ |element of| | ∫ |integral| | ⊇ |superset of or equal to|\n| ∉ |not an element of| | ∴ |therefore| | ∀ |for all|\n| ∋ |contains as member| | ∼ |similar to| | ⇔ |double arrow|\n| ∏ |product sign| | ≅ |approximately equal to| | ↔ |small double arrow|\n| ∑ |summation sign| | ≈ |almost equal to| | ƒ |function|\n| √ |square root| | ≠ |not equal to| | ⊥ |orthogonal to|\n| ∝ |proportional to| | ≡ |identical to| | ⋅ |dot operator|\n| ∞ |infinity| | ≤ |less than or equal to| | ⊕ |direct sum|\n| ∠ |angle| | ≥ |greater than or equal to| | ⊗ |direct product|\n
The standard deviation value for a series of data gives a measure of spread when considering a typical Gaussian distribution. To be exact, it is defined as the square root of the [[variance|Variance]] of a series. For a sample, variance is denoted as //s^^2^^//, whereas a population variance is denoted as //σ^^2^^//. The standard deviation for a sample is denoted by //s//, whereas for a population it would be //σ//.\n\nExactly, the standard deviation is given by\n[img[standard_deviation|img_QSAR/stdev1.gif]]\n\nalthough most frequently a degree-of-freedom correction is used for the common form:\n[img[standard_deviation|img_QSAR/stdev2.gif]]\nwhere N is the number of observations; //x~~i~~// is an individual observation; x-bar is the data set mean.
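The two forms can be compared numerically. The sketch below assumes the first image divides the summed squared deviations by N and the degree-of-freedom corrected form divides by N - 1; the function names and the data are made up for illustration.

```python
import math


def population_std(values):
    """Square root of the summed squared deviations divided by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sample_std(values):
    """Same sum, but divided by N - 1 (degree-of-freedom correction)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))  # 2.0
print(sample_std(data))      # slightly larger, about 2.14
```

The standard library's `statistics.pstdev` and `statistics.stdev` compute the same two quantities and are the safer choice in real scripts.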
''Books available in the research group''\n*Computer Related Books\n**//A Practical Guide to Linux Commands, Editors, and Shell Programming//\n**//A Practical Guide to the Unix System// (good place to start learning basic commands)\n**//Hardening Linux//\n**//Learning GNU Emacs// (see [[Emacs]] for useful commands)\n**//Learning Perl//\n**//Learning Python//\n**//Learning the bash Shell//\n**//Python Cookbook//\n**//Teach Yourself C//\n**//Teach Yourself C++//\n**//Teach Yourself Perl//\n**//Unix Power Tools//\n*Chemometrics, QSAR, and Mathematics Books\n**//Bayesian Learning for Neural Networks//\n**//Chemometrics: A Practical Guide//\n**//Computer Software Applications in Chemistry//\n**//Multivariate Calibration//\n**//Statistical Analysis//\n**//The Chemistry Maths Book//\n\n''Books Available From IUP Libraries''\n*//Handbook of Molecular Descriptors// by Todeschini ''~RS420.T63 2000''\n*//Molecular Connectivity in Chemistry and Drug Research// by Kier & Hall ''~QD461.K42''\n
\n[[t-test]]\n[[Topological Descriptor]]\n[[Training Set]] or [[TSET|Training Set]]
coming soon ...\n\n''__List of Topological Descriptors__''\n*[[Atom Counts]]
\n[[Unsupervised Methods]]
[[Variance]]\n[[Variance Inflation Factor|VIF]] or [[VIF]]\n[[Volume, Molecular|Molecular Volume]]
[[Weighted Holistic Invariant Molecule|WHIM]] or [[WHIM]]\n[[Weighted Mean]]\n[[Wiener Number|Wiener Index]] or [[Wiener Index]]
Welcome to the [[McElroy Research Group]]'s Wiki!\n\nThis wiki was set up using Jeremy Ruston's [[TiddlyWiki|http://www.tiddlywiki.com]]. Clicking any hypertext will open up a section for viewing, and you can even see the code behind it. Go ahead and try it out - you can't change anything on this page. The benefit of this type of Wiki is that it exists as a single HTML file, and can be stored on a USB stick for on-the-go editing. After a bit of editing, I simply copied the .html file to this website and voilà! The downside is that other users cannot edit it as they can on more popular sites, like Wikipedia.\n\nThe left menu comprises the main sections of the site, including a [[QSAR overview and tutorial|QSAR Tutorial]]. Opening those links will allow you to cascade through other topics of interest. If the screen becomes too crowded, simply choose ''close all'' on the right side to clear the page. \n\n''Last online update: 21-Sep-07 12:15 EST''\n\n
The weighted mean is the sum of the values in a series, each multiplied by a weighting factor (whatever may be deemed appropriate for the process in question). See also [[mean|Mean]].\n[img[weighted mean|img_QSAR/wmean1.gif]]\nwhere //n// is the number of observations; //w~~i~~// is the weight of observation //x~~i~~//. For this definition, ∑w~~i~~ = 1 must hold true.\n\nWhen another weighting factor is used (so that ∑w~~i~~ ≠ 1), the above summation is changed by dividing the term ∑w~~i~~x~~i~~ by the term ∑w~~i~~.
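A short Python sketch of the general form, dividing ∑w~~i~~x~~i~~ by ∑w~~i~~ so that it works whether or not the weights sum to 1. The function name and the numbers are made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Weights that already sum to 1, and the same weights scaled by 10:
print(weighted_mean([1.0, 2.0, 3.0], [0.2, 0.3, 0.5]))
print(weighted_mean([1.0, 2.0, 3.0], [2, 3, 5]))
```

Both calls give the same result (2.3, up to floating-point rounding), because dividing by ∑w~~i~~ normalizes the weights.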
[[ZMT File]]\n[[z-Score]]
''__Function__''\n*Calculate several atom count descriptors and a molecular formula of a molecule. \n''__Descriptors Calculated__''\n||Label|Description|\n| 1 |numH |a count of hydrogen atoms|\n| 2 |numC |a count of carbon atoms|\n| 3 |numN |a count of nitrogen atoms|\n| 4 |numO |a count of oxygen atoms|\n| 5 |numP |a count of phosphorus atoms|\n| 6 |numS |a count of sulfur atoms|\n| 7 |numF |a count of fluorine atoms|\n| 8 |numCl |a count of chlorine atoms|\n| 9 |numBr |a count of bromine atoms|\n| 10 |numI |a count of iodine atoms|\n| 11 |numX |a count of all halogen atoms (F,Cl,Br,I)|\n| 12 |numHA |a count of all heavy atoms (non-hydrogens)|\n''__Attributes Stored__''\n*MOLECULE.FORMULA: a simple molecular formula parsed from the atom counts\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-l |{{{--}}}log |off|information printed to an output file (see -o)|\n|-o OUTFILE|{{{--}}}output=OUTFILE |'output.txt'|file designated for output when -l flag chosen|\n|-d DESCLIST |{{{--}}}desc=DESCLIST |all|choose the descriptors to calculate; enter list in single quotes|\n|-a |{{{--}}}all |on|calculate descriptors for all compounds (not really needed since it's on by default)|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n|-s |{{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
''__Function__''\n*Creates an [[adjacency matrix|Adjacency Matrix]] and [[distance matrix|Distance Matrix]] for the [[hydrogen-suppressed graph|Hydrogen-Suppressed Graph]] of a molecule\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*MOLECULE.ADJMAT (ADJ), a two-dimensional array (matrix) storing [[adjacency matrix|Adjacency Matrix]] values for the [[hydrogen-suppressed graph|Hydrogen-Suppressed Graph]]\n*MOLECULE.DISTMAT (DIST), a two-dimensional array (matrix) storing [[distance matrix|Distance Matrix]] values for the [[hydrogen-suppressed graph|Hydrogen-Suppressed Graph]]\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-d DESCLIST |{{{--}}}desc=DESCLIST |all|choose the attributes to calculate; enter list in single quotes|\n|-a |{{{--}}}all |on|calculate descriptors for all compounds (not really needed since it's on by default)|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n|-s |{{{--}}}save |off|when used, saves the attributes to database file; does not work with '-n'|
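To make the atom count descriptors above concrete, here is a minimal Python sketch. It is not the calcAtomCounts source: the function names are hypothetical, the input is assumed to be a plain list of element symbols, and the formula is written in Hill order (C, then H, then the rest alphabetically), which is an assumption about what "a simple molecular formula" means here.

```python
from collections import Counter

COUNTED_ELEMENTS = ["H", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"]
HALOGENS = ("F", "Cl", "Br", "I")


def atom_count_descriptors(symbols):
    """Return the twelve atom count descriptors keyed by their labels."""
    counts = Counter(symbols)
    descriptors = {"num" + element: counts[element] for element in COUNTED_ELEMENTS}
    descriptors["numX"] = sum(counts[element] for element in HALOGENS)
    # Heavy atoms are all non-hydrogen atoms, whatever the element.
    descriptors["numHA"] = sum(n for element, n in counts.items() if element != "H")
    return descriptors


def hill_formula(symbols):
    """Build a formula string in Hill order from the element symbols."""
    counts = Counter(symbols)
    if "C" in counts:
        others = sorted(element for element in counts if element not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + others
    else:
        order = sorted(counts)
    return "".join(element + (str(counts[element]) if counts[element] > 1 else "")
                   for element in order)


chloroform = ["C", "H", "Cl", "Cl", "Cl"]
print(atom_count_descriptors(chloroform))
print(hill_formula(chloroform))  # CHCl3
```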
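For readers unfamiliar with these two matrices, the sketch below builds both for a small hydrogen-suppressed graph. It is an illustration only, not the calcMatrices source: the function names are hypothetical, atoms are assumed to be numbered from 0, and bonds are assumed to be given as pairs of atom indices. Distances are topological (number of bonds on the shortest path), found here by breadth-first search.

```python
from collections import deque


def adjacency_matrix(n_atoms, bonds):
    """1 where two heavy atoms are bonded, 0 elsewhere."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = 1
        matrix[b][a] = 1
    return matrix


def distance_matrix(adjacency):
    """Shortest bond-count distance between every pair of atoms (-1 if unreachable)."""
    n = len(adjacency)
    distances = [[-1] * n for _ in range(n)]
    for start in range(n):
        distances[start][start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in range(n):
                if adjacency[atom][neighbor] and distances[start][neighbor] == -1:
                    distances[start][neighbor] = distances[start][atom] + 1
                    queue.append(neighbor)
    return distances


# n-Butane with hydrogens suppressed: C0-C1-C2-C3
adj = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
dist = distance_matrix(adj)
for row in dist:
    print(row)
```

For the butane chain the largest entry in the distance matrix is 3, the number of bonds between the two terminal carbons.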
''__Function__''\n*Can be called from other programs; must pass in a single list (array) of values. Returns the [[mean|Mean]] of the values.\n*Can be called from the command line (see below). Prints the [[mean|Mean]] of the values to the screen.\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-i | |n/a|denotes you wish to use input file which is controlled by -f below|\n|-f INFILE |{{{--}}}file=INFILE |'input.txt'|designate the input file containing a list of data (one value per line); must be used with -i above|\n|-d DATA|{{{--}}}data=DATA |n/a|enter list of values surrounded by single quotes|
''__Function__''\n*Can be called from another program. Send in the molecular formula and the molecular weight is returned.\n*Can be called from the command line. Enter a molecular formula and the molecular weight will be printed to the screen. It can also parse the molecular formula and provide you with a breakdown of the numbers of each atom within the molecule.\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-s |{{{--}}}syms |n/a |will return a parsed formula denoting the ID and number of each type of atom|\n|-m |{{{--}}}MW |n/a |returns the molecular weight of a compound|\n''__Examples__''\n*calcMolecularWeight -m ~CH2O returns "molecular weight: 30.03"\n*calcMolecularWeight -m 'Ca3(~PO4)2' returns "molecular weight: 310.18"\n**''note'' formulas with parentheses must be surrounded by single quotes for parser to work correctly\n*calcMolecularWeight -s ~CH2O returns three lines:\n##C (Carbon): 1\n##H (Hydrogen): 2\n##O (Oxygen): 1
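The parsing step can be sketched as follows. This is not the calcMolecularWeight source, only a minimal stack-based formula parser under stated assumptions: the function names are hypothetical, only a handful of elements are included in the weight table, and the atomic weights are rounded textbook values.

```python
import re
from collections import Counter

# A small, illustrative subset of atomic weights (g/mol).
ATOMIC_WEIGHTS = {"H": 1.00794, "C": 12.0107, "N": 14.0067, "O": 15.9994,
                  "P": 30.9738, "S": 32.065, "Ca": 40.078}

# Either an element symbol with an optional count, "(", or ")" with an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def parse_formula(formula):
    """Return a Counter of atoms, expanding parenthesized groups."""
    stack = [Counter()]
    position = 0
    while position < len(formula):
        match = TOKEN.match(formula, position)
        if match is None:
            raise ValueError("cannot parse %r at position %d" % (formula, position))
        symbol, count, open_paren, close_paren, group_count = match.groups()
        if symbol:
            stack[-1][symbol] += int(count or 1)
        elif open_paren:
            stack.append(Counter())
        else:
            if len(stack) == 1:
                raise ValueError("unmatched ')' in %r" % formula)
            group = stack.pop()
            multiplier = int(group_count or 1)
            for element, n in group.items():
                stack[-1][element] += n * multiplier
        position = match.end()
    if len(stack) != 1:
        raise ValueError("unmatched '(' in %r" % formula)
    return stack[0]


def molecular_weight(formula):
    """Sum atomic weights over the parsed atom counts."""
    return sum(ATOMIC_WEIGHTS[element] * n
               for element, n in parse_formula(formula).items())


print("%.2f" % molecular_weight("CH2O"))       # 30.03
print("%.2f" % molecular_weight("Ca3(PO4)2"))  # 310.18
print(dict(parse_formula("Ca3(PO4)2")))
```

An element missing from the table raises a `KeyError`, which is deliberate in this sketch: a silent zero would give a wrong weight.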
''__Function__''\n*Calculates the longest topological [[path length|Path Length]] in a molecule from the information in MOLECULE.GRAPH attribute.\n*Also stores a dictionary (right now as a descriptor) PLENGTHS, which contains the [[number of paths|Path Count]] of length X (under development)\n''__Descriptors Calculated__''\n||Label|Description|\n| 1 |LPL |longest path length in the molecule|\n| 2 | |more expected soon, such as the number of paths of length=X|\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |'dbHIN'|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-d DESCLIST |{{{--}}}desc=DESCLIST |all|choose the descriptors to calculate; enter list in single quotes|\n|-a |{{{--}}}all |on|calculate descriptors for all compounds (not really needed since it's on by default)|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n|-s |{{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
This script is called from another program.\nA list (array) of values is passed in.\nThe function returns the minimum, maximum, and range values of the list back to the program.
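A minimal Python sketch of such a function, assuming it returns the three values as a tuple in the order (minimum, maximum, range); the function name and return order are assumptions, not taken from the calcRange source.

```python
def calc_range(values):
    """Return (minimum, maximum, range) for a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest


print(calc_range([9, 12, 15, 15, 15, 16, 16, 20, 26]))  # (9, 26, 17)
```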
''__Function__''\n*Can be called from other programs; must pass in a single list (array) of values. Returns the [[mean|Mean]] and [[standard deviation|Standard Deviation]] of the values.\n*Can be called from the command line (see below). Prints the [[mean|Mean]] and [[standard deviation|Standard Deviation]] of the values to the screen.\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-i | |n/a|denotes you wish to use input file which is controlled by -f below|\n|-f INFILE |{{{--}}}file=INFILE |'input.txt'|designate the input file containing a list of data (one value per line); must be used with -i above|\n|-d DATA|{{{--}}}data=DATA |n/a|enter list of values surrounded by single quotes|
This is a function called from inside another program.\nA list (array) of values is submitted.\nThe program returns a list containing the [[z-scores|z-Score]] for each value.
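A minimal Python sketch of the idea: each z-score is the value minus the mean, divided by the standard deviation. The function name is hypothetical, and using the sample standard deviation (N - 1 in the denominator) is an assumption; the actual calcZscore code may use the population form.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / standard deviation for each value."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    if std == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mean) / std for x in values]


scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print([round(z, 3) for z in scores])
```

Values equal to the mean get a z-score of 0, and the scores always sum to zero (up to rounding).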
''__Function__''\n*This program is run after the [[MOPAC]] [[PM3|PM3 Hamiltonian]] optimization. It checks to make sure that all of the geometries passed the optimization test by searching for the string '[[SCF FIELD|SCF Field]] ACHIEVED' in each molecule's [[MOPAC]] output file.\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-l |{{{--}}}log |off|information printed to an output file (see -o)|\n|-o OUTFILE|{{{--}}}output=OUTFILE |'~PM3-PASS.txt'|designates the output file to show ~PM3 optimization passes when -l flag chosen|\n|-e ERRFILE |{{{--}}}error=ERRFILE |'~PM3-FAIL.txt'|designate the output file to show ~PM3 optimization failures when -l flag is chosen|\n|-a |{{{--}}}all |on|calculate descriptors for all compounds (not really needed since it's on by default)|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|
This script is called from within other programs to parse and interpret commands given for each particular command-line program. It makes use of the [[optparse module|http://docs.python.org/lib/module-optparse.html]] in [[Python|http://www.python.org]].\n\nEach program that utilizes this function sets up a list of options that include the flag designation, default values, and (key,value) pair options. This dictionary is passed back to the program after parsing for program control. Options for each program that can be run from the command line can be found in the program descriptions.
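The pattern described above can be sketched with the standard `optparse` module. This is not the cmdln source: the function names and the `dest` names are hypothetical, and only three of the flags listed in the program descriptions are shown. Note that `optparse` adds `-h`/`--help` on its own.

```python
from optparse import OptionParser


def build_parser():
    """Set up a few of the flags shared by the command-line programs."""
    parser = OptionParser(usage="%prog [options]")
    parser.add_option("-f", "--file", dest="file", default="dbHIN",
                      help="database file to read/store if not default")
    parser.add_option("-p", "--print", dest="print_to_screen",
                      action="store_true", default=False,
                      help="print information to the screen")
    parser.add_option("-s", "--save", dest="save",
                      action="store_true", default=False,
                      help="save results to the database file")
    return parser


def parse_commands(argv):
    """Parse a list of command-line arguments into a plain dictionary."""
    options, _positional = build_parser().parse_args(argv)
    return vars(options)


print(parse_commands([]))
print(parse_commands(["-f", "myDB", "-p"]))
```

Returning a plain dictionary keeps the calling program independent of `optparse` itself, so the parser could later be swapped for `argparse` without touching the callers.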
This is a function containing a class object, which allows you to sort an object by specific class attributes. The code for this was taken from work by Andrew Dalke which can now be found [[here|http://wiki.python.org/moin/HowTo/Sorting]].
This function is called within a program when you wish to open up the molecule database (default='dbHIN.gz') and extract information. The file is read and returned as a list of objects, which is then further processed by the calling program.
This function is called from within other programs when you wish to update and/or save a molecule database (default='dbHIN.gz'). All information that was changed in a list of objects during the operation of the program is put back into the original file. This overwrites only, so a backup copy of the database file should be made before saving changes when in doubt.
A template for documenting the function, commands, etc. of a QSAR descriptor program\n\n''__Function__''\n*\n''__Descriptors Calculated__''\n||Label|Description|\n''__Attributes Stored__''\n*\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-l |{{{--}}}log |off|information printed to an output file (see -o)|\n|-o OUTFILE|{{{--}}}output=OUTFILE |'output.txt'|file designated for output when -l flag chosen|\n|-d DESCLIST |{{{--}}}desc=DESCLIST |all|choose the descriptors to calculate; enter list in single quotes|\n|-a |{{{--}}}all |on|calculate descriptors for all compounds (not really needed since it's on by default)|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n|-s |{{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
''__Function__''\n*This program is a wrapper to run the semiempirical molecular orbital package [[MOPAC]]. Currently with this program, one can choose to run predefined keywords for optimization of three-dimensional geometry ([[PM3|PM3 Hamiltonian]]) or optimization of charge information ([[AM1|AM1 Hamiltonian]]).\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-p |{{{--}}}pm3 |n/a|manipulates necessary files, enters MOPAC keywords, and runs MOPAC with ~PM3 optimization|\n|-a |{{{--}}}am1 |n/a|manipulates necessary files, enters MOPAC keywords, and runs MOPAC with ~AM1 optimization|\n\n''__Current Default Keywords__''\n*''~PM3'': "~PM3 T=99999.9 EF HESS=1 MMOK GNORM="\n*''~AM1'': "~AM1 T=9999.9 1SCF"
''__Function__''\n*This program is run from the command line after [[MOPAC]] [[AM1|AM1 Hamiltonian]] optimization.\n*The partial charge information for each atom in a molecule is saved in the MOLECULE.ATOMS list.\n*Three whole-molecule descriptors related to charge information are also read and stored.\n''__Descriptors Calculated__''\n||Label|Description|\n| 1 |HEAT|the heat of formation for the molecule|\n| 2 |DIPOLE|the dipole moment of the molecule|\n| 3 |IONPOT|the ionization potential for the molecule|\n''__Attributes Stored__''\n*MOLECULE.ATOMS is modified to include a CHARGE value for each atom in the molecule\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |'dbHIN'|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-a |{{{--}}}all |on|get charges for all compounds (not really needed since it's on by default)|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n|-s |{{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
''__Function__''\n*Database initialization program. Creates a database (default='dbHIN.gz'), which is an array of MOLECULE objects.\n*Each MOLECULE object contains various attributes, arrays, and dictionaries that store most of the information about a dataset needed for a QSAR study.\n*Requires that all dataset molecules exist in the ~/molecules directory in both [[.hin|HIN File]] and [[.zmt|ZMT file]] formats.\n''__Descriptors Calculated__''\n||Label|Description|\n| 1 | MW | molecular weight (by calcMolecularWeight)|\n''__Attributes Stored__''\n*MOLECULE.DESCRIPTORS, a dictionary that will contain descriptor labels as keys and descriptor values as values\n*MOLECULE.NAME, the name of a molecule (otherwise 'none')\n*MOLECULE.NUMBER, the number of the molecule in the dataset\n*MOLECULE.ATOMS, contains an array of ~AtomProp dictionaries for each atom. For each atom, the following values are stored:\n**ATOMNUM, the number of the atom in a molecule (determined by HyperChem ordering)\n**ATOMID, the atomic symbol of an atom in a molecule (C,H,N, etc.)\n**ATOMTYPE, the HyperChem designation of an atom in a molecule (CA,C4,O2,CL, etc.)\n**POSX, the Cartesian x-coordinate of an atom in a molecule\n**POSY, the Cartesian y-coordinate of an atom in a molecule\n**POSZ, the Cartesian z-coordinate of an atom in a molecule\n**NCONN, the number of connections (adjacent) to an atom in a molecule\n**~CON1, the ATOMNUM of the first connected atom to an atom in a molecule\n**~CON1TYPE, the connection type of ~CON1 (s,d,t,a)\n**~CON2 through ~CON6, similar to ~CON1 above\n**~CON2TYPE through ~CON6TYPE, similar to ~CON1TYPE above\n*MOLECULE.NATOMS, the number of all atoms in a molecule\n*MOLECULE.MW, the molecular weight of a molecule (determined by calcMolecularWeight)\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILELIST |{{{--}}}filelist=FILELIST |n/a|designate the first and last molecule number in your dataset|\n|-o OUTPUT |{{{--}}}output=OUTPUT |'dbHIN'|name of the database file to be used for the entire study|
''__Function__''\n*This script can be run from the command line in a directory where all of the molecule [[.hin|HIN File]] files exist. \n*It uses [[Open Babel|http://openbabel.sourceforge.net/wiki/Main_Page]] to convert all the information in .hin files to [[SMILES]] notation. All [[SMILES]] are then stored in a single output file for later usage.\n*If you want to save the [[SMILES]] string for each molecule in the database, use [[makeSMILES]].\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-o OUTFILE|{{{--}}}output=OUTFILE |'smiles.txt'|file designated for output if default not desired|
''__Function__''\n*This program runs from the command line. It reads in topological connectivity information from the MOLECULE.ATOMS attribute lists and creates a new attribute MOLECULE.GRAPH.\n*This GRAPH information is used in later programs to make various matrices which in turn are used to calculate various descriptors.\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*MOLECULE.GRAPH\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n|-p |{{{--}}}print |off|information printed to the screen|\n|-n MOLECULES |{{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n|-s |{{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
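How connectivity stored in MOLECULE.ATOMS might be turned into a graph, and then into an [[Adjacency Matrix]], can be sketched as below. The actual layout of MOLECULE.GRAPH is not documented here, so the dictionary-of-neighbours structure and both function names are assumptions; only the ATOMNUM, NCONN, ~CONn and ~CONnTYPE keys come from the database description.

```python
def build_graph(atoms):
    """Map each ATOMNUM to a {neighbour ATOMNUM: bond type} dictionary."""
    graph = {}
    for atom in atoms:
        neighbours = {}
        for i in range(1, atom["NCONN"] + 1):
            neighbours[atom["CON%d" % i]] = atom["CON%dTYPE" % i]
        graph[atom["ATOMNUM"]] = neighbours
    return graph


def adjacency_matrix(graph):
    """Return a 0/1 matrix with rows and columns ordered by ATOMNUM."""
    order = sorted(graph)
    index = {num: i for i, num in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for num, neighbours in graph.items():
        for other in neighbours:
            matrix[index[num]][index[other]] = 1
    return matrix


# Heavy-atom skeleton of ethanol (C1-C2-O3), hydrogens left out for brevity.
atoms = [
    {"ATOMNUM": 1, "NCONN": 1, "CON1": 2, "CON1TYPE": "s"},
    {"ATOMNUM": 2, "NCONN": 2, "CON1": 1, "CON1TYPE": "s",
     "CON2": 3, "CON2TYPE": "s"},
    {"ATOMNUM": 3, "NCONN": 1, "CON1": 2, "CON1TYPE": "s"},
]
graph = build_graph(atoms)
matrix = adjacency_matrix(graph)
for row in matrix:
    print(row)
```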
''__Function__''\n*This program can be run from the command line. It is similar to [[hin2smi]] with the exception that (in addition to file output) it will store the [[SMILES]] string for each molecule as an attribute MOLECULE.SMILES\n*Uses [[Open Babel|http://openbabel.sourceforge.net/wiki/Main_Page]]\n''__Descriptors Calculated__''\n*none\n''__Attributes Stored__''\n*MOLECULE.SMILES\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n|-h |{{{--}}}help |n/a|print help screen and exit|\n|-f FILENAME |{{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n|-l |{{{--}}}log |off|information printed to an output file (see -o)|\n|-o OUTFILE|{{{--}}}output=OUTFILE |'output.txt'|file designated for output when -l flag chosen|\n|-s |{{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
This is a command line program which initializes a work area for a QSAR study. From anywhere in your workstation directory, issue the command \n*prepQSARworkArea -d DIRNAME\nThe DIRNAME above is the full path designation of a directory where you will carry out all work related to the study. It should not already exist. Once the program runs successfully, you will find the following structures in that directory:\n*fbm, a directory where flexible Bayesian modeling information will be kept\n*molecules, a directory where all of your original molecule [[.hin|HIN File]] files and [[.zmt|ZMT File]] files will be kept\n*mopac, a directory where all [[MOPAC]] related files and output will be stored. It will contain two subdirectories:\n**~AM1, a directory where all [[AM1|AM1 Hamiltonian]] related files will be kept\n**~PM3, a directory where all [[PM3|PM3 Hamiltonian]] related files will be kept\n*input.txt and output.txt files (both empty)
__Trends and Plot Methods in MLR Studies__ by Emili Besalú et al., //[[J. Chem. Inf. Model.|http://pubs.acs.org/journals/jcisd8/index.html]]//, ''2007'', ASAP article.\n\n''Notes Relevant to Group Members''\n-in regard to least squares, plots of calculated vs. observed values and observed vs. calculated values are not identical (though both are centered around the bisector of the first and third quadrant)\n-for a calculated vs. observed plot, the slope of the regression line ≤ 1 and = r^^2^^\n-residual plots can be revealing; can show nonlinearity\n-heteroscedasticity means the residuals spread out as the property of interest grows, i.e. the variance of a measure is not constant over the levels of the property in question (this is the opposite of homoscedasticity)\n-in general, statistically speaking, it is probably more desirable to have residuals display homoscedasticity\n-for a calculated vs. observed plot for fitted data, linear regression techniques give models that tend to overpredict weakly active structures and underpredict highly active structures (in terms of the dependent variable)\n\nReferenced in:\n*
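The note that the calculated vs. observed regression line has slope r^^2^^ can be checked numerically. The sketch below uses made-up data and a single descriptor (both assumptions for illustration only): it fits an ordinary least-squares model, then regresses the calculated values on the observed ones and compares that slope with r^^2^^.

```python
def ols(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Made-up descriptor values and observed property values.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [1.2, 1.9, 3.2, 3.8, 5.4]

slope, intercept = ols(descriptor, observed)
calculated = [slope * d + intercept for d in descriptor]

# Regress calculated (y axis) on observed (x axis).
slope_calc_vs_obs, _ = ols(observed, calculated)
r2 = r_squared(observed, calculated)
print(slope_calc_vs_obs, r2)
```

Because the slope is below 1 whenever the fit is imperfect, the calculated values are pulled toward the mean, which is the overprediction of low values and underprediction of high values mentioned in the last note.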
__Basic Overview of Chemoinformatics__ by Thomas Engel, //[[J. Chem. Inf. Model.|http://pubs.acs.org/journals/jcisd8/index.html]]// ''2006'', //46//, 2267-2277.\n\n''Notes Relevant to Group Members''\n-Nice little review article listing the different areas of research that can be associated with chem(o)informatics. \n-Comprehensive list of source materials for deeper reading.\n-Not all sections apply to our group research.\n\nReferenced in:\n*
''__Computer Software Applications in Chemistry (2^^nd^^ Ed)__'' Jurs, P.C. (''1996'') John Wiley & Sons. ''ISBN:0-471-10587-2''\n\nReferenced in:\n*
''__Bayesian Learning for Neural Networks__'' Neal, R.M. (''1996'') ~Springer-Verlag. ''ISBN:0-387-94724-8''\n\nReferenced in:\n*
''__Chemometrics: Statistics and Computer Application in Analytical Chemistry__'' Otto, M. (''1999'') ~Wiley-VCH.\n''ISBN:3-527-29628-X''\n\nReferenced in:\n* [[Chemometrics]] - pg.1
''__The Chemistry Maths Book__'' Steiner, E. (''1996'') Oxford University Press. ''ISBN:0-19-855913-5''\n\nReferenced in:\n*
__A Simple Method for the Representation, Quantification, and Comparison of the Volumes and Shapes of Chemical Compounds__ by Stouch & Jurs, //[[J. Chem. Inf. Comput. Sci.|http://pubs.acs.org/journals/jcisd8/index.html]]//, ''1986'',//26//,4-12.\n\n''Notes Relevant to Group Members''\n\nReferenced in:\n*[[Molecular Volume]]\n*[[Solvent Accessible Surface Area]]
''__Handbook of Molecular Descriptors__'' Todeschini, R. and Consonni, V. (''2000'') ~Wiley-VCH. ''ISBN:3-527-29913-0''\n\nReferenced in:\n*[[Charged Partial Surface Areas]]\n\n\n
''__Function__''\n*Calculates descriptors based on [[Euclidean through-space distances|Euclidean Distance]] between atom pairs in a molecule. \n*Requires three-dimensional geometry optimization.\n''__Descriptors Calculated__''\n||Label|Description|\n| 1 |~TSD1 |the maximum through-space distance of the molecule|\n| 2 |~TSD2 |the average of all through-space distances of a molecule|\n| 3 |~TSD3 |the relative standard deviation of through-space distances of a molecule|\n''__Attributes Stored__''\n*none\n''__Commands Available__''\n|Simple|Compound|Default|Description|\n| -h | {{{--}}}help |n/a|print help screen and exit|\n| -f FILENAME | {{{--}}}file=FILENAME |dbHIN|designate the database file to read/store if not default|\n| -p | {{{--}}}print |off|information printed to the screen|\n| -l | {{{--}}}log |off|information printed to an output file (see -o)|\n| -o OUTFILE| {{{--}}}output=OUTFILE |'output.txt'|file designated for output when -l flag chosen|\n| -d DESCLIST | {{{--}}}desc=DESCLIST |all|choose the descriptors to calculate; enter list in single quotes|\n| -a | {{{--}}}all |on|calculate descriptors for all compounds (not really needed since it's on by default)|\n| -n MOLECULES | {{{--}}}numbers=MOLECULES |off|enter a list of specific compounds to try (will not save information in database)|\n| -s | {{{--}}}save |off|when used, saves the descriptors to database file; does not work with '-n'|
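The three descriptors can be sketched directly from the POSX, POSY and POSZ values stored for each atom. This is an illustration, not the program's code: the function name is hypothetical, the sample standard deviation is assumed, and the relative standard deviation is reported as a fraction rather than a percentage. Requires Python 3.8 or later for {{{math.dist}}}.

```python
from itertools import combinations
from math import dist
from statistics import mean, stdev


def through_space_descriptors(atoms):
    """Return TSD1 (max), TSD2 (mean) and TSD3 (relative SD) over atom pairs."""
    distances = [
        dist((a["POSX"], a["POSY"], a["POSZ"]),
             (b["POSX"], b["POSY"], b["POSZ"]))
        for a, b in combinations(atoms, 2)
    ]
    average = mean(distances)
    return {
        "TSD1": max(distances),
        "TSD2": average,
        # Assumption: sample standard deviation divided by the mean.
        "TSD3": stdev(distances) / average,
    }


# Three atoms on a 3-4-5 right triangle, so the pair distances are 3, 4, 5.
atoms = [
    {"POSX": 0.0, "POSY": 0.0, "POSZ": 0.0},
    {"POSX": 3.0, "POSY": 0.0, "POSZ": 0.0},
    {"POSX": 0.0, "POSY": 4.0, "POSZ": 0.0},
]
tsd = through_space_descriptors(atoms)
print(tsd)
```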
This program is run from the command line only ''after'' successful [[MOPAC]] optimization has been completed with the [[PM3|PM3 Hamiltonian]] optimization via [[doOptimizations]]. All files should have been checked with [[chkPM3]].\n\nAs long as you're using the default database file name ('dbHIN.gz'), you can type just the command. Otherwise, you'll need a ''-f FILENAME'' flag to designate your database name.\n\nThis program takes the updated Cartesian coordinates and replaces the MOLECULE.ATOMS values for each key (POSX, POSY, POSZ). These new optimized coordinates will then be the basis for several geometric and hybrid descriptor calculations.
The z-score (aka standard normal variate) for an observation in a series is a way to standardize the values in the set relative to the [[mean|Mean]] and [[standard deviation|Standard Deviation]] of the data set. The z-score of a particular observation (//z~~i~~//) is given by\n[img[z score|img_QSAR/zscore.gif]]\nwhere //x~~i~~// is the individual value; //μ// is the mean (or x-bar for sample mean); //σ// is the standard deviation (or //s// for sample standard deviation). The value of //z// gives the number of standard deviations by which the observation lies above (positive //z//) or below (negative //z//) the mean.
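A short worked example, assuming the sample mean and sample standard deviation are used (the function name {{{z_scores}}} is hypothetical): each value has the mean subtracted and is then divided by the standard deviation.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]


# Mean is 4 and sample standard deviation is 2, so 2 lies one standard
# deviation below the mean and 6 lies one standard deviation above it.
z = z_scores([2.0, 4.0, 6.0])
print(z)
```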