
Analysis of common activation functions

2022-07-20 09:33:00 【Early lunar month in Pingqiu】

sigmoid and ReLU

$sigmoid(x) = \frac{1}{1+e^{-x}}$

The problem with the $sigmoid$ activation function is that as the input approaches $\pm\infty$, the gradient quickly goes to 0. During backpropagation, the parameters of the shallow layers then cannot be updated effectively.
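
As a minimal sketch of this saturation effect, the snippet below evaluates the sigmoid and its derivative $sigmoid(x)(1 - sigmoid(x))$ at a few inputs. The function names and the sample points are illustrative choices, not part of the original post.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

The derivative never exceeds 0.25 (its value at $x = 0$), so each sigmoid layer the gradient passes through on the way back scales it by at most that factor, before any weight terms are taken into account.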

$ReLU(x) = max(0, x)$
For x>0, the gradient of $ReLU$ is constant at 1, so the gradient does not vanish. For x<0, the gradient is 0 and nothing is propagated backward, which, similarly to $dropout$, can introduce more nonlinearity. When used in a model, it gives more stable training and better results than $sigmoid$.
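
The two gradient regimes can be sketched in a few lines. The value used for the gradient at exactly $x = 0$ is an assumption here (0 is a common convention), since the function is not differentiable at that point.

```python
def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU: 1 for x > 0, 0 for x < 0.
    At x == 0 we return 0 by convention (an assumption, not from the post)."""
    return 1.0 if x > 0 else 0.0

# Unlike the sigmoid, the gradient stays at 1 however large x becomes.
for x in (-3.0, -0.5, 0.5, 3.0, 100.0):
    print(f"x={x:6.1f}  relu={relu(x):6.1f}  grad={relu_grad(x):.1f}")
```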

ReLU and ReLU6

$ReLU6(x) = min(6, max(0, x))$
Capping the output of $ReLU$ at 6 improves the robustness of small on-device models under low-precision inference.
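
A rough sketch of why a bounded output helps at low precision: once the range is known to be $[0, 6]$, it can be mapped onto a fixed set of integer codes. The `quantize_u8` helper below is hypothetical and only illustrates the idea; it is not how any particular inference framework quantizes activations.

```python
def relu6(x: float) -> float:
    """ReLU6(x) = min(6, max(0, x)): ReLU with the output capped at 6."""
    return min(6.0, max(0.0, x))

def quantize_u8(y: float, y_max: float = 6.0) -> int:
    """Hypothetical helper: map a value in [0, y_max] to an 8-bit code 0..255."""
    return round(y / y_max * 255)

# Every input, however large, lands inside the same fixed code range.
for x in (-2.0, 1.0, 6.0, 50.0):
    y = relu6(x)
    print(f"x={x:5.1f}  relu6={y:.1f}  code={quantize_u8(y)}")
```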

sigmoid and hard sigmoid

$hard\_sigmoid(x) = ReLU6(x + 3)/6$
It approximates the $sigmoid$ function with less computation.
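
To get a feel for how close the approximation is, the sketch below compares the two functions on a grid of inputs. It assumes the piecewise-linear form $ReLU6(x + 3)/6$ (the one used in MobileNetV3); the grid range and step are arbitrary choices.

```python
import math

def relu6(x: float) -> float:
    """ReLU6(x) = min(6, max(0, x))."""
    return min(6.0, max(0.0, x))

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def hard_sigmoid(x: float) -> float:
    """Piecewise-linear approximation: 0 for x <= -3, 1 for x >= 3, linear between."""
    return relu6(x + 3.0) / 6.0

# Largest absolute gap between the two on a grid over [-8, 8] with step 0.01.
grid = [i / 100.0 for i in range(-800, 801)]
max_err = max(abs(sigmoid(x) - hard_sigmoid(x)) for x in grid)
print(f"max |sigmoid - hard_sigmoid| on the grid: {max_err:.4f}")
```

The hard version needs only an addition, a clamp, and a division by a constant, with no exponential.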

swish and hard swish

$swish(x) = x \cdot sigmoid(\beta x)$
$hard\_swish(x) = x \cdot ReLU6(x + 3)/6$

The $sigmoid$ operation in $swish$ is too expensive to compute on-device, so $hard\_sigmoid$ is used to approximate it. The $hard\_swish$ activation is used in MobileNetV3.
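
A minimal side-by-side sketch of the two activations, assuming $\beta = 1$ for $swish$ and the form $x \cdot ReLU6(x + 3)/6$ for $hard\_swish$; the sample inputs are arbitrary.

```python
import math

def relu6(x: float) -> float:
    """ReLU6(x) = min(6, max(0, x))."""
    return min(6.0, max(0.0, x))

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x: float, beta: float = 1.0) -> float:
    """swish(x) = x * sigmoid(beta * x); beta = 1 is an assumed default."""
    return x * sigmoid(beta * x)

def hard_swish(x: float) -> float:
    """hard_swish(x) = x * ReLU6(x + 3) / 6, i.e. x times the hard sigmoid."""
    return x * relu6(x + 3.0) / 6.0

# Compare the exact and the piecewise version at a few points.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:5.1f}  swish={swish(x):8.4f}  hard_swish={hard_swish(x):8.4f}")
```

For $x \le -3$ the hard version is exactly 0, and for $x \ge 3$ it is exactly $x$, so the two only differ in how they handle the region around zero.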