A Brief Summary of Word2vec
2022-07-22 01:23:00 【aelum】
1. Skip-Gram
The Skip-Gram model assumes that the center word can be used to generate its context words.
Each word in the vocabulary is represented by two $d$-dimensional vectors. Specifically, for the word with index $i$, we use $\boldsymbol{v}_i$ and $\boldsymbol{u}_i$ to denote its vectors when it is used as the center word and as a context word, respectively.
Suppose the word $w_o$ lies in the context window of the center word $w_c$; then

$$P(w_o \mid w_c)=\frac{\exp(\boldsymbol{u}_o'\boldsymbol{v}_c)}{\sum_{i\in \mathcal{V}}\exp(\boldsymbol{u}_i'\boldsymbol{v}_c)}$$

where $\mathcal{V}=\{0,1,\cdots,|\mathcal{V}|-1\}$ is the index set of the vocabulary.
Let the size of the context window be $m$; then the likelihood function of the Skip-Gram model is

$$\mathcal{L}=\prod_{t=1}^T \prod_{-m\leq j\leq m,\, j\neq 0}P(w_{t+j}\mid w_t)$$

Maximizing $\mathcal{L}$ is equivalent to minimizing $-\log\mathcal{L}$, i.e.

$$-\sum_{t=1}^T\sum_{-m\leq j\leq m,\, j\neq 0}\log P(w_{t+j}\mid w_t)$$
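To make these formulas concrete, below is a minimal NumPy sketch of the Skip-Gram conditional probability and one term of the negative log-likelihood. The toy vocabulary size, dimension, and all names (V, U, skip_gram_prob) are illustrative assumptions, not part of any library:

import numpy as np

# Toy setup: |V| = 6 words, d = 4 dimensional vectors (illustrative values only)
rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
V = rng.normal(size=(vocab_size, dim))  # v_i: vectors used as the center word
U = rng.normal(size=(vocab_size, dim))  # u_i: vectors used as context words

def skip_gram_prob(center, context):
    """P(w_o | w_c): softmax of u_i' v_c over the vocabulary, evaluated at i = o."""
    scores = U @ V[center]                 # u_i' v_c for every i in V
    probs = np.exp(scores - scores.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return probs[context]

# One term of the negative log-likelihood: -log P(w_{t+j} | w_t)
print(-np.log(skip_gram_prob(center=2, context=5)))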
2. CBOW
In contrast to Skip-Gram, the CBOW model assumes that the context words can be used to generate the center word.
Let the size of the context window be $m$; then the number of context words is $2m$. When computing the conditional probability, these context word vectors are usually averaged, i.e.

$$\overline{\boldsymbol{v}_o}=\frac{1}{2m}\sum_{1\leq|i|\leq m}\boldsymbol{v}_{c+i}$$

Writing $\mathcal{W}_{o(c)}=\{\boldsymbol{v}_{c-m},\cdots,\boldsymbol{v}_{c-1},\boldsymbol{v}_{c+1},\cdots,\boldsymbol{v}_{c+m}\}$, we have

$$P(w_c\mid\mathcal{W}_{o(c)})=\frac{\exp(\boldsymbol{u}_c'\overline{\boldsymbol{v}_o})}{\sum_{i\in\mathcal{V}}\exp(\boldsymbol{u}_i'\overline{\boldsymbol{v}_o})}$$
The likelihood function of the CBOW model is

$$\mathcal{L}=\prod_{t=1}^T P(w_t\mid\mathcal{W}_{o(t)})$$

where each position $t$ is treated in turn as the center word.
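Similarly, here is a minimal NumPy sketch of the CBOW conditional probability, assuming the same toy vocabulary as above; the averaging follows the formula for the mean context vector, and all names are illustrative:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
V = rng.normal(size=(vocab_size, dim))  # v_i: vectors used as context words
U = rng.normal(size=(vocab_size, dim))  # u_i: vectors used as the center word

def cbow_prob(center, context_indices):
    """P(w_c | W_o(c)): softmax of u_i' v_bar over the vocabulary, evaluated at i = c."""
    v_bar = V[context_indices].mean(axis=0)  # average of the 2m context word vectors
    scores = U @ v_bar
    probs = np.exp(scores - scores.max())    # subtract the max for numerical stability
    probs /= probs.sum()
    return probs[center]

# Center word at index 2 with 2m = 4 surrounding context word indices
print(-np.log(cbow_prob(center=2, context_indices=[0, 1, 3, 4])))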
3. Gensim Implementation
Related documentation: https://radimrehurek.com/gensim/models/word2vec.html
First, import the packages needed below:
from gensim.models import Word2Vec, KeyedVectors
from gensim.test.utils import common_texts
import gensim.downloader as api
We can use the Word2Vec class that comes with gensim to carry out the relevant computations:
sentences = [['first', 'sentence'], ['second', 'sentence']]
# Word vector dimension is 5
# Words occurring fewer than min_count=1 times are discarded (so every word is kept here)
# Train with 4 worker threads
model = Word2Vec(sentences=sentences, vector_size=5, min_count=1, workers=4)
word_vectors = model.wv
# After training, the learned word vectors can be inspected
print(word_vectors['sentence'])
# [-0.01072454 0.00472863 0.10206699 0.18018547 -0.186059 ]
After training, the model can be saved for later use:
model.save('word2vec.model')  # Save the model
model = Word2Vec.load('word2vec.model')  # Load the model
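A loaded model can also continue training on new data. The following is a small sketch assuming additional tokenized sentences in the same format; more_sentences is an illustrative name, not from the original post:

more_sentences = [['third', 'sentence'], ['fourth', 'sentence']]
model.build_vocab(more_sentences, update=True)  # add any new words to the existing vocabulary
model.train(more_sentences, total_examples=len(more_sentences), epochs=model.epochs)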
Of course, the trained word vectors can also be saved on their own:
word_vectors.save("word2vec.wordvectors")  # Save the word vectors
word_vectors = KeyedVectors.load("word2vec.wordvectors", mmap='r')  # Load the word vectors (mmap='r' maps them read-only)
We can also query the words most similar to a given word:
model = Word2Vec(sentences=common_texts, vector_size=5, min_count=1)
sims = model.wv.most_similar('computer', topn=3)
print(sims)
# [('minors', 0.4151746332645416), ('time', 0.18495501577854156), ('interface', 0.05030104145407677)]
We can also use an off-the-shelf pre-trained model:
word_vectors = api.load('glove-twitter-25')
print(word_vectors.most_similar('twitter', topn=10))
# [('facebook', 0.948005199432373), ('tweet', 0.9403423070907593), ('fb', 0.9342358708381653), ('instagram', 0.9104824066162109), ('chat', 0.8964965343475342), ('hashtag', 0.8885937333106995), ('tweets', 0.8878158330917358), ('tl', 0.8778461217880249), ('link', 0.877821147441864), ('internet', 0.8753896355628967)]
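The loaded KeyedVectors object supports other queries as well, for example pairwise similarity; this is a small usage example, and the printed value depends on the downloaded model (it should roughly match the first entry of the most_similar output above):

print(word_vectors.similarity('twitter', 'facebook'))  # cosine similarity between the two word vectors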