A Brief Summary of Word2vec
2022-07-22 01:23:00 【aelum】
1. Skip-Gram
The Skip-Gram model assumes that the center word can be used to generate its context words.
Each word in the vocabulary is represented by two $d$-dimensional vectors. Specifically, for the word with index $i$, we use $\boldsymbol{v}_i$ and $\boldsymbol{u}_i$ to denote its vectors when it is used as the center word and as a context word, respectively.
Suppose the word $w_o$ lies in the context window of the center word $w_c$; then

$$P(w_o \mid w_c)=\frac{\exp(\boldsymbol{u}_o'\boldsymbol{v}_c)}{\sum_{i\in \mathcal{V}}\exp(\boldsymbol{u}_i'\boldsymbol{v}_c)}$$

where $\mathcal{V}=\{0,1,\cdots,|\mathcal{V}|-1\}$ is the index set of the vocabulary.
Let the size of the context window be $m$; then the likelihood function of the Skip-Gram model is

$$\mathcal{L}=\prod_{t=1}^T \prod_{-m\leq j\leq m,\, j\neq 0}P(w_{t+j}\mid w_t)$$

Maximizing $\mathcal{L}$ is equivalent to minimizing $-\log\mathcal{L}$, i.e.

$$-\sum_{t=1}^T\sum_{-m\leq j\leq m,\, j\neq 0}\log P(w_{t+j}\mid w_t)$$
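To make these formulas concrete, below is a minimal NumPy sketch of the Skip-Gram conditional probability and one term of the negative log-likelihood. The toy vocabulary size, dimension, and all names (V, U, skip_gram_prob) are illustrative assumptions, not part of any library:

import numpy as np

# Toy setup: |V| = 6 words, d = 4 dimensional vectors (illustrative values only)
rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
V = rng.normal(size=(vocab_size, dim))  # v_i: vectors used as the center word
U = rng.normal(size=(vocab_size, dim))  # u_i: vectors used as context words

def skip_gram_prob(center, context):
    """P(w_o | w_c): softmax of u_i' v_c over the vocabulary, evaluated at i = o."""
    scores = U @ V[center]                 # u_i' v_c for every i in V
    probs = np.exp(scores - scores.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return probs[context]

# One term of the negative log-likelihood: -log P(w_{t+j} | w_t)
print(-np.log(skip_gram_prob(center=2, context=5)))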
2. CBOW
In contrast to Skip-Gram, the CBOW model assumes that the context words can be used to generate the center word.
Let the size of the context window be $m$; then the number of context words is $2m$. When computing the conditional probability, these context word vectors are usually averaged, i.e.

$$\overline{\boldsymbol{v}_o}=\frac{1}{2m}\sum_{1\leq|i|\leq m}\boldsymbol{v}_{c+i}$$

Writing $\mathcal{W}_{o(c)}=\{\boldsymbol{v}_{c-m},\cdots,\boldsymbol{v}_{c-1},\boldsymbol{v}_{c+1},\cdots,\boldsymbol{v}_{c+m}\}$, we have

$$P(w_c\mid\mathcal{W}_{o(c)})=\frac{\exp(\boldsymbol{u}_c'\overline{\boldsymbol{v}_o})}{\sum_{i\in\mathcal{V}}\exp(\boldsymbol{u}_i'\overline{\boldsymbol{v}_o})}$$
The likelihood function of the CBOW model is

$$\mathcal{L}=\prod_{t=1}^T P(w_t\mid\mathcal{W}_{o(t)})$$

where each position $t$ is treated in turn as the center word.
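Similarly, here is a minimal NumPy sketch of the CBOW conditional probability, assuming the same toy vocabulary as above; the averaging follows the formula for the mean context vector, and all names are illustrative:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
V = rng.normal(size=(vocab_size, dim))  # v_i: vectors used as context words
U = rng.normal(size=(vocab_size, dim))  # u_i: vectors used as the center word

def cbow_prob(center, context_indices):
    """P(w_c | W_o(c)): softmax of u_i' v_bar over the vocabulary, evaluated at i = c."""
    v_bar = V[context_indices].mean(axis=0)  # average of the 2m context word vectors
    scores = U @ v_bar
    probs = np.exp(scores - scores.max())    # subtract the max for numerical stability
    probs /= probs.sum()
    return probs[center]

# Center word at index 2 with 2m = 4 surrounding context word indices
print(-np.log(cbow_prob(center=2, context_indices=[0, 1, 3, 4])))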
3. Gensim Implementation
Related documentation: https://radimrehurek.com/gensim/models/word2vec.html
First, import the packages needed below:
from gensim.models import Word2Vec, KeyedVectors
from gensim.test.utils import common_texts
import gensim.downloader as api
We can use the Word2Vec class that comes with gensim to carry out the relevant computations:
sentences = [['first', 'sentence'], ['second', 'sentence']]
# Word vector dimension is 5
# Words occurring fewer than min_count=1 times are discarded (so every word is kept here)
# Train with 4 worker threads
model = Word2Vec(sentences=sentences, vector_size=5, min_count=1, workers=4)
word_vectors = model.wv
# After training, the learned word vectors can be inspected
print(word_vectors['sentence'])
# [-0.01072454 0.00472863 0.10206699 0.18018547 -0.186059 ]
After training, the model can be saved for later use:
model.save('word2vec.model')  # Save the model
model = Word2Vec.load('word2vec.model')  # Load the model
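A loaded model can also continue training on new data. The following is a small sketch assuming additional tokenized sentences in the same format; more_sentences is an illustrative name, not from the original post:

more_sentences = [['third', 'sentence'], ['fourth', 'sentence']]
model.build_vocab(more_sentences, update=True)  # add any new words to the existing vocabulary
model.train(more_sentences, total_examples=len(more_sentences), epochs=model.epochs)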
Of course, the trained word vectors can also be saved on their own:
word_vectors.save("word2vec.wordvectors")  # Save the word vectors
word_vectors = KeyedVectors.load("word2vec.wordvectors", mmap='r')  # Load the word vectors (mmap='r' maps them read-only)
We can also query the words most similar to a given word:
model = Word2Vec(sentences=common_texts, vector_size=5, min_count=1)
sims = model.wv.most_similar('computer', topn=3)
print(sims)
# [('minors', 0.4151746332645416), ('time', 0.18495501577854156), ('interface', 0.05030104145407677)]
We can also use an off-the-shelf pre-trained model:
word_vectors = api.load('glove-twitter-25')
print(word_vectors.most_similar('twitter', topn=10))
# [('facebook', 0.948005199432373), ('tweet', 0.9403423070907593), ('fb', 0.9342358708381653), ('instagram', 0.9104824066162109), ('chat', 0.8964965343475342), ('hashtag', 0.8885937333106995), ('tweets', 0.8878158330917358), ('tl', 0.8778461217880249), ('link', 0.877821147441864), ('internet', 0.8753896355628967)]
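The loaded KeyedVectors object supports other queries as well, for example pairwise similarity; this is a small usage example, and the printed value depends on the downloaded model (it should roughly match the first entry of the most_similar output above):

print(word_vectors.similarity('twitter', 'facebook'))  # cosine similarity between the two word vectors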