# Getting started Thanks for checking out Salty! The purpose of Salty is to extend some of the machine learning ecosystem to ionic liquid (IL) data from [ILThermo](_static/http://ilthermo.boulder.nist.gov/). ## Obtaining Smiles Strings Salty operates using the simplified molecular-input line-entry system ([SMILES](_static/https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)). One of the core methods of Salty is the `check_name()` function that converts [IUPAC](_static/https://iupac.org/) naming to Smiles: ```python import salty smiles = salty.check_name("1-butyl-3-methylimidazolium") print(smiles) ``` CCCCn1cc[n+](_static/c1)C Once we have a smiles representation of a molecule, we can convert it into a molecular object with RDKit: ```python %matplotlib inline from rdkit import Chem from rdkit.Chem import Draw fig = Draw.MolToMPL(Chem.MolFromSmiles(smiles),figsize=(5,5)) ``` ![png](_static/output_3_0.png) We can generate many kinds of bitvector representations or *fingerprints* of the molecular structure. Fingerprints can be used as descriptors in machine learning models, uncertainty estimators in structure search algorithms, or to compare two molecular structures: ```python ms = [Chem.MolFromSmiles("OC(=O)C(N)Cc1ccc(O)cc1"), Chem.MolFromSmiles(smiles)] fig=Draw.MolsToGridImage(ms[:],molsPerRow=2,subImgSize=(400,200)) fig.save('compare.png') from rdkit.Chem.AtomPairs import Pairs from rdkit.Chem import AllChem from rdkit.Chem.Fingerprints import FingerprintMols from rdkit import DataStructs radius = 2 fpatom = [Pairs.GetAtomPairFingerprintAsBitVect(x) for x in ms] print("atom pair score: {:8.4f}".format(DataStructs.TanimotoSimilarity(fpatom[0], fpatom[1]))) fpmorg = [AllChem.GetMorganFingerprint(ms[0],radius,useFeatures=True), AllChem.GetMorganFingerprint(ms[1],radius,useFeatures=True)] fptopo = [FingerprintMols.FingerprintMol(x) for x in ms] print("morgan score: {:11.4f}".format(DataStructs.TanimotoSimilarity(fpmorg[0], fpmorg[1]))) print("topological score: {:3.4f}".format(DataStructs.TanimotoSimilarity(fptopo[0], fptopo[1]))) ``` atom pair score: 0.0513 morgan score: 0.0862 topological score: 0.3991 ![](_static/compare.png) `check_name` is based on a curated data file of known cations and anions ```python print("Cations in database: {}".format(len(salty.load_data("cationInfo.csv")))) print("Anions in database: {}".format(len(salty.load_data("anionInfo.csv")))) ``` Cations in database: 316 Anions in database: 125 ## A Few Useful Datafiles Salty contains some csv datafiles taken directly from ILThermo: heat capacity (constant pressure), density, and viscosity data for pure ILs. The `aggregate_data` function can be used to quickly manipulate these datafiles and append 2D features. ```python rawdata = salty.load_data("cpt.csv") rawdata.columns ``` Index(['Heat capacity at constant pressure per unit volume, J/K/m3', 'Heat capacity at constant pressure, J/K/mol', 'Heat capacity at constant pressure*, J/K/mol', 'Pressure, kPa', 'Temperature, K', 'salt_name'], dtype='object') ```python devmodel = salty.aggregate_data(['cpt', 'density']) # other option is viscosity ``` `aggregate_data` returns a devmodel object that contains a pandas dataframe of the raw data and a data summary: ```python devmodel.Data_summary ```
0
Unique salts 109
Cations array(['CCCC[n+]1ccc(cc1)C', 'CCCCCCCCn1cc[n+]...
Anions array(['[B-](_static/F)(F)(F)F', 'F[P-](_static/F)(F)(F)(F)F',...
Total datapoints 7834
density 847.5 - 1557.1
cpt 207.47 - 1764.0
Temperature range (K) 100.0 - 60000.0
Pressure range (kPa) 273.15 - 463.15
and has the builtin 2D features from [rdkit](_static/http://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors) all scaled and centered: ```python devmodel.Data.columns ``` Index(['steiger-anion', 'Marsili Partial Charges-anion', 'BalabanJ-anion', 'BertzCT-anion', 'Ipc-anion', 'HallKierAlpha-anion', 'Kappa1-anion', 'Kappa2-anion', 'Kappa3-anion', 'Chi0-anion', ... 'VSA_EState10-cation', 'Topliss fragments-cation', 'Temperature, K', 'Pressure, kPa', 'Heat capacity at constant pressure, J/K/mol', 'Specific density, kg/m3', 'name-anion', 'smiles-anion', 'name-cation', 'smiles-cation'], dtype='object', length=196) The purpose of the data summary is to provide historical information when ML models are ported over into [GAINS](_static/https://wesleybeckner.github.io/gains/). Once we have a devmodel the underlying data can be interrogated: ```python import matplotlib.pyplot as plt import numpy as np df = devmodel.Data with plt.style.context('seaborn-whitegrid'): fig=plt.figure(figsize=(5,5), dpi=300) ax=fig.add_subplot(111) scat = ax.scatter(np.exp(df["Heat capacity at constant pressure, J/K/mol"]), np.exp( df["Specific density, kg/m3"]), marker="*", c=df["Temperature, K"]/max(df["Temperature, K"]), cmap="Purples") plt.colorbar(scat) ax.grid() ax.set_ylabel("Density $(kg/m^3)$") ax.set_xlabel("Heat Capacity $(J/K/mol)$") ``` ![png](_static/output_17_0.png) ## Build NN Models with Scikit-Learn Salty's `devmodel_to_array` function automatically detects the number of targets in the devmodel and creates train/test arrays accordingly: ```python from sklearn.model_selection import cross_val_score from sklearn.neural_network import MLPRegressor from sklearn.multioutput import MultiOutputRegressor model = MLPRegressor(activation='logistic', alpha=0.92078, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=75, learning_rate='constant', learning_rate_init=0.001, max_iter=1e8, momentum=0.9, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='lbfgs', tol=1e-08, validation_fraction=0.1, verbose=False, warm_start=False) multi_model = MultiOutputRegressor(model) X_train, Y_train, X_test, Y_test = salty.devmodel_to_array(devmodel, train_fraction=0.8) multi_model.fit(X_train, Y_train) ``` MultiOutputRegressor(estimator=MLPRegressor(activation='logistic', alpha=0.92078, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=75, learning_rate='constant', learning_rate_init=0.001, max_iter=100000000.0, momentum=0.9, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='lbfgs', tol=1e-08, validation_fraction=0.1, verbose=False, warm_start=False), n_jobs=1) We can then see how the model is performing with matplotlib: ```python X=X_train Y=Y_train with plt.style.context('seaborn-whitegrid'): fig=plt.figure(figsize=(5,2.5), dpi=300) ax=fig.add_subplot(122) ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1) ax.plot(np.exp(Y)[:,0],np.exp(multi_model.predict(X))[:,0],\ marker="*",linestyle="",alpha=0.4) ax.set_ylabel("Predicted $C_{pt}$ $(K/J/mol)$") ax.set_xlabel("Actual $C_{pt}$ $(K/J/mol)$") ax.text(0.1,.9,"R: {0:5.3f}".format(multi_model.score(X,Y)), transform = ax.transAxes) plt.xlim(200,1700) plt.ylim(200,1700) ax.grid() ax=fig.add_subplot(121) ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1) ax.plot(np.exp(Y)[:,1],np.exp(multi_model.predict(X))[:,1],\ marker="*",linestyle="",alpha=0.4) ax.set_ylabel("Predicted Density $(kg/m^3)$") ax.set_xlabel("Actual Density $(kg/m^3)$") plt.xlim(850,1600) plt.ylim(850,1600) ax.grid() plt.tight_layout() ``` ![png](_static/output_21_0.png) ## Build Models with Keras The above sklearn model has a very simple architecture (1 hidden layer with 75 nodes) we can recreate this in Keras: ```python from keras.layers import Dense, Dropout, Input from keras.models import Model, Sequential from keras.optimizers import Nadam X_train, Y_train, X_test, Y_test = salty.devmodel_to_array\ (devmodel, train_fraction=0.8) model = Sequential() model.add(Dense(75, activation='relu', input_dim=X_train.shape[1])) model.add(Dropout(0.5)) model.add(Dense(2, activation='relu')) model.compile(optimizer="adam", loss="mean_squared_error", metrics=['mse']) model.fit(X_train,Y_train,epochs=1000,verbose=False) ``` ```python X=X_train Y=Y_train with plt.style.context('seaborn-whitegrid'): fig=plt.figure(figsize=(5,2.5), dpi=300) ax=fig.add_subplot(122) ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1) ax.plot(np.exp(Y)[:,0],np.exp(model.predict(X))[:,0],\ marker="*",linestyle="",alpha=0.4) ax.set_ylabel("Predicted $C_{pt}$ $(K/J/mol)$") ax.set_xlabel("Actual $C_{pt}$ $(K/J/mol)$") #ax.text(0.1,.9,"R: {0:5.3f}".format(multi_model.score(X,Y)), transform = ax.transAxes) plt.xlim(200,1700) plt.ylim(200,1700) ax.grid() ax=fig.add_subplot(121) ax.plot([0,2000], [0,2000], linestyle="-", label=None, c="black", linewidth=1) ax.plot(np.exp(Y)[:,1],np.exp(model.predict(X))[:,1],\ marker="*",linestyle="",alpha=0.4) ax.set_ylabel("Predicted Density $(kg/m^3)$") ax.set_xlabel("Actual Density $(kg/m^3)$") plt.xlim(850,1600) plt.ylim(850,1600) ax.grid() plt.tight_layout() ``` ![png](_static/output_24_0.png) These are all the basic Salty functions for now!