Source code analysis of Bert model call
2022-07-21 20:19:00 【iAmZard】
Reposted from: https://blog.csdn.net/wshzd/article/details/89639392
Prerequisites for using BERT: download the BERT source code yourself and place the unzipped folder in the directory of your script (the imports below assume a local `bert` package).
Purpose of using BERT: to obtain the vector the model finally returns (i.e., the output of step 4).
The steps for using the BERT model are:
1) Load and restore the model's configuration parameters and model weights;
2) Call the convert_single_example function to generate the data the model needs, namely input_ids, input_mask, and segment_ids;
3) Feed the three lists generated in step 2 into the corresponding modeling.BertModel parameters input_ids, input_mask, and token_type_ids;
4) Choose the output according to the NLP task: model.get_sequence_output() returns data of shape [batch_size, seq_length, embedding_size] and is used for seq2seq or NER, while model.get_pooled_output() returns the sentence-level ([CLS]) vector of shape [batch_size, hidden_size], used for sentence-level tasks such as classification.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tensorflow as tf
from bert import modeling
from bert import tokenization
import os
```
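The script uses the TensorFlow 1.x API (tf.placeholder, tf.Session, tf.logging), as in the official BERT repository. A minimal compatibility shim, assuming you have only TensorFlow 2.x installed and want to run this 1.x-style code:

```python
# Assumption: TensorFlow 2.x is installed; restore 1.x behavior via the compat module.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # re-enables placeholders, sessions, tf.logging, etc.
```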
1. Load the BERT model
```python
# The config file that ships with the downloaded BERT checkpoint
bert_config = modeling.BertConfig.from_json_file("chinese_L-12_H-768_A-12/bert_config.json")

# Create the inputs of BERT
input_ids = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="input_ids")
input_mask = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="input_mask")
segment_ids = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="segment_ids")

# Create the BERT model
model = modeling.BertModel(
    config=bert_config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False  # True is faster on TPU; False is faster on CPU or GPU.
)

# Checkpoint from which the BERT parameters are initialized
init_checkpoint = "chinese_L-12_H-768_A-12/bert_model.ckpt"
use_tpu = False

# Get all trainable variables in the model
tvars = tf.trainable_variables()

# Load the BERT checkpoint
(assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
    tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")

# Print the variables loaded from the checkpoint
for var in tvars:
    init_string = ""
    if var.name in initialized_variable_names:
        init_string = ", INIT_FROM_CKPT"
    tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape, init_string)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
```
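A quick sanity check, in case the checkpoint path or variable names do not match (a sketch, not part of the original script): list the variables the checkpoint actually provides.

```python
# Print the names and shapes stored in the checkpoint to verify that
# init_checkpoint points at the right .ckpt file family.
for name, shape in tf.train.list_variables(init_checkpoint):
    print(name, shape)
```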
2. Get the BERT model output
```python
# Per-token output of shape [batch_size, seq_length, embedding_size];
# use this for seq2seq or NER
output_layer = model.get_sequence_output()
# Sentence-level (pooled [CLS]) output of shape [batch_size, hidden_size]
output_layer = model.get_pooled_output()
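```

To make the difference concrete, here is a minimal sketch of a sentence-classification head on top of the pooled output, in the style of BERT's run_classifier.py; num_labels and the labels placeholder are illustrative and not part of the original script.

```python
num_labels = 2  # illustrative: binary classification
labels = tf.placeholder(shape=[64], dtype=tf.int32, name="labels")

pooled = model.get_pooled_output()    # [64, 768]
hidden_size = pooled.shape[-1].value  # 768 for chinese_L-12_H-768_A-12

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

logits = tf.nn.bias_add(
    tf.matmul(pooled, output_weights, transpose_b=True), output_bias)  # [64, 2]
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```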
3. Get the BERT model input
```python
def convert_single_example(max_seq_length, tokenizer, text_a, text_b=None):
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = None
    if text_b:
        tokens_b = tokenizer.tokenize(text_b)  # for Chinese this mainly splits into characters
    if tokens_b:
        # With a sentence pair, the combined length must be at most max_seq_length - 3,
        # because [CLS], [SEP], [SEP] have to be inserted.
        # _truncate_seq_pair comes from BERT's run_classifier.py.
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # With a single sentence only [CLS] and [SEP] are added,
        # so the sentence must be at most max_seq_length - 2 tokens.
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # Convert to BERT's input. Note: type_ids here corresponds to segment_ids in the source code.
    # (a) Sentence pair:
    #   tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #   type_ids: 0     0  0    0    0     0      0 0     1  1  1  1   1 1
    # (b) Single sentence:
    #   tokens:   [CLS] the dog is hairy . [SEP]
    #   type_ids: 0     0   0   0  0     0 0
    # type_ids distinguishes the first sentence (0) from the second (1); it is added to the
    # token embeddings during pre-training. Strictly it is not necessary, since [SEP]
    # already separates the two sentences, but type_ids makes learning easier.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    # Map the tokens (here, Chinese characters) to vocabulary ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Create the mask
    input_mask = [1] * len(input_ids)

    # Pad the inputs with zeros up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    # These correspond to the input_ids, input_mask, segment_ids parameters
    # used when creating the BERT model
    return input_ids, input_mask, segment_ids
```
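convert_single_example produces one example, while the placeholders defined earlier expect a whole batch of shape [64, 128]. A small helper sketch, assuming NumPy is available, that stacks several texts into such arrays:

```python
import numpy as np

def convert_batch(texts, tokenizer, max_seq_length=128):
    # Stack per-example lists into [batch_size, max_seq_length] int32 arrays
    all_ids, all_mask, all_seg = [], [], []
    for text in texts:
        ids, mask, seg = convert_single_example(max_seq_length, tokenizer, text)
        all_ids.append(ids)
        all_mask.append(mask)
        all_seg.append(seg)
    return (np.array(all_ids, dtype=np.int32),
            np.array(all_mask, dtype=np.int32),
            np.array(all_seg, dtype=np.int32))
```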
Now call the tokenizer and generate the input data:

```python
vocab_file = "chinese_L-12_H-768_A-12/vocab.txt"
token = tokenization.FullTokenizer(vocab_file=vocab_file)
# The original post passes a Chinese example sentence here; rendered in English:
input_ids, input_mask, segment_ids = convert_single_example(
    100, token,
    "You can buy it directly from the editorial department. Address: 8 Gymnasium Road, "
    "Beijing, in the compound of the China Sports Daily press.")
print(input_ids)
print(input_mask)
print(segment_ids)
```
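Note that the placeholders were declared with shape [64, 128], so an example padded to length 100 cannot be fed to them directly. A minimal end-to-end sketch, assuming one example padded to 128 is tiled across the batch, that actually fetches the vectors:

```python
import numpy as np

# Pad to the placeholder length (128), then repeat the example 64 times
ids, mask, seg = convert_single_example(128, token, "some example text")
feed = {
    input_ids: np.tile(np.array(ids, dtype=np.int32), (64, 1)),    # [64, 128]
    input_mask: np.tile(np.array(mask, dtype=np.int32), (64, 1)),
    segment_ids: np.tile(np.array(seg, dtype=np.int32), (64, 1)),
}
with tf.Session() as sess:
    # init_from_checkpoint was called above, so the initializer
    # loads the pretrained BERT weights
    sess.run(tf.global_variables_initializer())
    pooled = sess.run(model.get_pooled_output(), feed_dict=feed)
    print(pooled.shape)  # (64, 768)
```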