2022 | Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization

2022-07-21 07:38:00 【Stunned flounder (】

Paper: https://arxiv.org/abs/2206.12411
Code: https://github.com/wenhao-gao/mol_opt

PMO: comparing the sample efficiency of 25 molecular optimization methods

Molecular optimization is a fundamental goal of the chemical sciences and of central interest to drug and material design. In recent years, significant progress has been made on challenging problems across many aspects of computational molecular optimization, with emphasis on high validity, diversity, and, most recently, synthesizability. Despite this progress, many papers report results on trivial or self-designed tasks, which makes it harder to assess the performance of new methods directly. In addition, the sample efficiency of the optimization (the number of molecules evaluated by the oracle) is rarely discussed, although it is an essential consideration for practical discovery applications.

To fill this gap, the authors propose an open-source benchmark for practical molecular optimization (PMO) to support reproducible evaluation of molecular optimization algorithms, and study in depth the performance of 25 molecular design algorithms on 23 tasks. The experiments show that, under a limited oracle budget, most "state-of-the-art" methods fail to outperform their predecessors, and that existing algorithms cannot effectively solve some molecular optimization problems. The authors therefore analyze how the choice of optimization algorithm, the molecular assembly strategy, and the oracle landscape affect optimization performance, to inform future algorithm development and benchmarking. All code is available at: https://github.com/wenhao-gao/mol_opt

Introduction

Although the field has seen exciting progress and many new methods, it is still unclear how these algorithms compare with one another. There are at least three problems:

Oracle budget is ignored: many papers do not report how many oracle calls were needed to reach the reported results.
Trivial oracles: some papers report results only on trivial oracles, such as the quantitative estimate of drug-likeness (QED) or the penalized octanol-water partition coefficient (LogP); others introduce new self-designed tasks, which obscures comparison with previous work.
Randomness: many algorithms are non-deterministic and show significant run-to-run variance, so results from several independent runs need to be reported.

The authors therefore propose a new reproducible large-scale benchmark (PMO). It benchmarks 25 methods on 23 oracles; each method is tuned and run several times independently. To account for both optimization ability and sample efficiency, the number of oracle calls is capped at 10,000 and AUC is used to measure performance. The PMO benchmark is intended to make molecular optimization more accessible and reproducible, to promote algorithmic progress, and ultimately to help molecular optimization be used more widely in experimental drug and materials discovery workflows.
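
The oracle-call cap can be sketched as a thin wrapper around a scoring function. This is a minimal illustration, not the mol_opt API: the names `BudgetedOracle`, `BudgetExhausted` and `score_fn` are hypothetical, and the sketch assumes that repeated queries for the same molecule are served from a cache and not charged against the budget.

```python
class BudgetExhausted(Exception):
    """Raised when a new molecule is requested after the budget is spent."""


class BudgetedOracle:
    """Wrap a scoring function, cache its results and count unique calls.

    Hypothetical helper for illustration; not part of mol_opt.
    """

    def __init__(self, score_fn, max_calls=10000):
        self.score_fn = score_fn
        self.max_calls = max_calls
        self.cache = {}  # molecule string -> score, in evaluation order

    @property
    def calls(self):
        """Number of unique molecules evaluated so far."""
        return len(self.cache)

    def __call__(self, molecule):
        if molecule in self.cache:
            return self.cache[molecule]  # assumed free: no budget charged
        if self.calls >= self.max_calls:
            raise BudgetExhausted(f"budget of {self.max_calls} calls spent")
        score = self.score_fn(molecule)
        self.cache[molecule] = score
        return score


# Toy scoring function standing in for a real property oracle.
oracle = BudgetedOracle(lambda smiles: smiles.count("C") / 10, max_calls=2)
oracle("CCO")
oracle("CCO")  # repeated query, answered from the cache
oracle("CCC")
print(oracle.calls)  # 2
```

Because the cache keeps scores in evaluation order, the same object can later supply the score history needed for a sample-efficiency metric.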

Algorithms

A molecular optimization method has two components:

Molecular assembly strategy: the assembly strategy implicitly defines the chemical space the algorithm explores, and the optimization algorithm determines how that space is explored. Molecular assembly is analogous to molecular representation in machine learning. The main strategies are: (1) SMILES strings; (2) SELFIES strings; (3) atom-based molecular graphs; (4) fragment-based molecular graphs (which may include atoms); (5) synthesis graphs based on chemical reactions.
Optimization method: there are nine main classes: genetic algorithm (GA), Monte Carlo tree search (MCTS), Bayesian optimization (BO), variational autoencoder (VAE), generative adversarial network (GAN), score-based model (SBM), hill climbing (HC), reinforcement learning (RL), and gradient ascent (GRAD).
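
The split between assembly strategy and optimization method can be illustrated with a toy string-based genetic algorithm. This is a sketch under loud assumptions: the three-token alphabet, `mutate`, `genetic_search` and `toy_oracle` are all hypothetical, the strings are not checked for chemical validity, and none of this reproduces any benchmarked method. The mutation operator plays the role of the assembly strategy (it fixes which strings are reachable), while selection plays the role of the optimization method.

```python
import random

ALPHABET = "CNO"  # toy token set; real SMILES/SELFIES vocabularies are larger


def mutate(string, rng):
    """Assembly step: replace one token, which defines the reachable space."""
    i = rng.randrange(len(string))
    return string[:i] + rng.choice(ALPHABET) + string[i + 1:]


def genetic_search(oracle, start, generations=50, pop_size=20, seed=0):
    """Optimization step: keep the best-scoring strings each generation."""
    rng = random.Random(seed)
    population = [start]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        pool = set(population) | set(children)
        # Sort by score (descending), then by string, so ties are deterministic.
        population = sorted(pool, key=lambda s: (-oracle(s), s))[:pop_size]
    return population[0]


def toy_oracle(string):
    """Hypothetical objective: fraction of nitrogen tokens in the string."""
    return string.count("N") / len(string)


best = genetic_search(toy_oracle, start="CCCCCC")
print(best, toy_oracle(best))
```

Swapping `mutate` for a graph or fragment edit changes the explored space without touching the selection loop, which is the separation the two components describe. Note that this toy loop calls the oracle freely; a benchmark run would count and cap those calls.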

The following table summarizes current molecular design methods, classified by assembly strategy and optimization method. It includes, but is not limited to, the methods covered by the benchmark.

Experimental conclusions

Oracles: to test how general the methods are, the authors aim to cover a wide range of pharmaceutically relevant oracle functions.
Dataset: the ZINC 250K dataset, which contains about 250K molecules sampled from the ZINC database.
Evaluation metric: to account for both optimization ability and sample efficiency, the authors report the area under the curve (AUC) of the top-K average property value plotted against the number of oracle calls (AUC top-K) as the primary performance metric.
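
A minimal sketch of how an AUC top-K value could be computed from a run's score history follows. It makes several assumptions that the text above does not confirm: scores lie in [0, 1], each entry belongs to a distinct molecule in evaluation order, the curve is integrated with a unit-width trapezoid rule, a run that stops early holds its last value until the budget is reached, and the area is divided by the budget so the result also lies in [0, 1]. The function name `auc_top_k` is hypothetical, and the benchmark's own implementation may differ in these details.

```python
import heapq


def auc_top_k(scores, k=10, budget=10000):
    """Normalized area under the top-k-average vs. oracle-calls curve."""
    if k <= 0 or budget <= 0:
        raise ValueError("k and budget must be positive")
    top = []        # min-heap holding the k best scores seen so far
    area = 0.0
    previous = 0.0  # top-k average before any oracle call
    current = 0.0
    used = scores[:budget]
    for score in used:
        if len(top) < k:
            heapq.heappush(top, score)
        elif score > top[0]:
            heapq.heapreplace(top, score)
        current = sum(top) / len(top)
        area += (previous + current) / 2  # trapezoid of width one call
        previous = current
    # Assumption: an early-stopped run keeps its last value to the budget.
    area += current * (budget - len(used))
    return area / budget


fast = [1.0, 1.0, 1.0, 1.0]  # finds a top molecule on the first call
slow = [0.0, 0.0, 0.0, 1.0]  # finds it only on the last call
print(auc_top_k(fast, k=1, budget=4))  # 0.875
print(auc_top_k(slow, k=1, budget=4))  # 0.125
```

Both toy runs end with the same top-1 score of 1.0, yet their AUC values differ sharply, which is the distinction the metric is meant to capture.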

Sample efficiency
Except on a few very simple oracles, none of the current methods can complete the optimization within a few hundred oracle calls. This suggests that current molecular optimization algorithms are not yet able to optimize molecules directly against an oracle that requires experimental measurement of small molecules. Comparing the top-X ranking with the AUC top-X ranking shows that some methods with acceptable top-X performance do poorly on AUC top-X. Algorithms previously considered strong are therefore not necessarily sample-efficient, which demonstrates the value of the AUC metric.

Traditional methods are still strong
The two top-ranked methods overall are REINVENT (a SMILES-based reinforcement learning method) and Graph GA (a genetic algorithm based on molecular graphs). Neither was published at a top AI conference, whereas many recent papers from top AI conferences perform less well.

The following table shows the performance of the 10 best molecular optimization methods, ranked by average AUC top-10. The experiments report the mean and standard deviation of AUC top-10 over 5 independent runs. The best model on each task is marked in bold.
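
Aggregating independent runs into a "mean ± standard deviation" entry can be sketched as follows. The five values are made-up placeholders, not results from the paper, and the sketch assumes the sample standard deviation (`statistics.stdev`); the source does not say whether the sample or population form is used.

```python
from statistics import mean, stdev


def summarize_runs(auc_per_run):
    """Return (mean, sample standard deviation) across independent runs."""
    return mean(auc_per_run), stdev(auc_per_run)


# Hypothetical AUC top-10 values from 5 seeds of one method on one task.
runs = [0.80, 0.82, 0.78, 0.81, 0.79]
run_mean, run_std = summarize_runs(runs)
print(f"{run_mean:.3f} ± {run_std:.3f}")  # 0.800 ± 0.016
```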

SMILES has no obvious shortcomings
In the comparative experiments, SMILES outperforms SELFIES in most cases. The authors attribute this mainly to two reasons:

Current language models can learn the SMILES grammar well enough that a sufficient share of the generated molecules are valid, which erases the main advantage of SELFIES.
A case study shows that SELFIES strings can in fact contain syntax errors as well. A syntax error in SMILES raises an error, whereas SELFIES masks it: the resulting molecules are valid in terms of valence, but this does not translate into effective exploration of chemical space.

Model-based methods may be more efficient, but need careful design
Model-based optimization methods train a predictive model as a surrogate for the ground-truth oracle and optimize against it, with the aim of optimizing more efficiently. On the other hand, model-based methods need careful design: simply adding and training a GNN does not necessarily bring the expected improvement.
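
The surrogate idea can be sketched with a deliberately crude model. Everything here is a hypothetical illustration: candidates are plain numbers standing in for molecules, the surrogate is a 1-nearest-neighbour lookup instead of a trained GNN, and `surrogate_score`, `screen_and_query` and `true_oracle` are invented names. The point is only the control flow: rank candidates with the cheap model, then spend true oracle calls on the top few.

```python
def surrogate_score(x, evaluated):
    """Predict a score as that of the nearest already-evaluated point."""
    nearest = min(evaluated, key=lambda known: abs(known - x))
    return evaluated[nearest]


def screen_and_query(candidates, evaluated, oracle, n_query):
    """Rank unseen candidates with the surrogate, then query the top few."""
    unseen = [x for x in candidates if x not in evaluated]
    ranked = sorted(unseen, key=lambda x: surrogate_score(x, evaluated),
                    reverse=True)
    for x in ranked[:n_query]:
        evaluated[x] = oracle(x)  # the only expensive calls
    return evaluated


def true_oracle(x):
    """Hypothetical expensive objective with its optimum at x = 3."""
    return -(x - 3.0) ** 2


evaluated = {0.0: true_oracle(0.0), 5.0: true_oracle(5.0)}
candidates = [1.0, 2.0, 4.0, 6.0]
screen_and_query(candidates, evaluated, true_oracle, n_query=2)
print(sorted(evaluated.items()))
```

In this toy run the surrogate selects 4.0 and 6.0. The first is a good choice, but the second scores no better than the starting point at 0.0, because the nearest-neighbour model simply copies the score of 5.0. A poorly designed surrogate can waste oracle calls in exactly this way.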

Different types of methods suit different tasks
The authors found, for example, that on some isomer-based tasks, string-based genetic algorithms (such as SMILES-GA, and STONED, which is based on SELFIES) perform better, whereas on similarity-based tasks these methods perform poorly. Different optimization methods are thus suited to different types of tasks, depending on the landscape of the oracle.

Hyperparameters must be re-tuned and runs repeated when reporting results
By tuning hyperparameters, the authors obtained better results for REINVENT than those in the original paper, which shows how much hyperparameter tuning matters when comparing different models.

Reference

https://mp.weixin.qq.com/s/mgO16tuLVaovLyJA64agrw
