Source code analysis of Bert model call
2022-07-21 20:19:00 【iAmZard】
Reposted from: https://blog.csdn.net/wshzd/article/details/89639392
Prerequisites for using BERT: download the BERT source code yourself and place the unzipped folder in the directory of your script (the imports below assume a local `bert` package).
Purpose of using BERT: to obtain the vector the model finally returns (i.e., the output of step 4).
The steps for using the BERT model are:
1) Load and restore the model's configuration parameters and model weights;
2) Call the convert_single_example function to generate the data the model needs, namely input_ids, input_mask, and segment_ids;
3) Feed the three lists generated in step 2 into the corresponding modeling.BertModel parameters input_ids, input_mask, and token_type_ids;
4) Choose the output according to the NLP task: model.get_sequence_output() returns data of shape [batch_size, seq_length, embedding_size] and is used for seq2seq or NER, while model.get_pooled_output() returns the sentence-level ([CLS]) vector of shape [batch_size, hidden_size], used for sentence-level tasks such as classification.
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tensorflow as tf
from bert import modeling
from bert import tokenization
import os
```
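The script uses the TensorFlow 1.x API (tf.placeholder, tf.Session, tf.logging), as in the official BERT repository. A minimal compatibility shim, assuming you have only TensorFlow 2.x installed and want to run this 1.x-style code:

```python
# Assumption: TensorFlow 2.x is installed; restore 1.x behavior via the compat module.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # re-enables placeholders, sessions, tf.logging, etc.
```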
1. Load the BERT model
```python
# The config file that ships with the downloaded BERT checkpoint
bert_config = modeling.BertConfig.from_json_file("chinese_L-12_H-768_A-12/bert_config.json")

# Create the inputs of BERT
input_ids = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="input_ids")
input_mask = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="input_mask")
segment_ids = tf.placeholder(shape=[64, 128], dtype=tf.int32, name="segment_ids")

# Create the BERT model
model = modeling.BertModel(
    config=bert_config,
    is_training=True,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False  # True is faster on TPU; False is faster on CPU or GPU.
)

# Checkpoint from which the BERT parameters are initialized
init_checkpoint = "chinese_L-12_H-768_A-12/bert_model.ckpt"
use_tpu = False

# Get all trainable variables in the model
tvars = tf.trainable_variables()

# Load the BERT checkpoint
(assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
    tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")

# Print the variables loaded from the checkpoint
for var in tvars:
    init_string = ""
    if var.name in initialized_variable_names:
        init_string = ", INIT_FROM_CKPT"
    tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape, init_string)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
```
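A quick sanity check, in case the checkpoint path or variable names do not match (a sketch, not part of the original script): list the variables the checkpoint actually provides.

```python
# Print the names and shapes stored in the checkpoint to verify that
# init_checkpoint points at the right .ckpt file family.
for name, shape in tf.train.list_variables(init_checkpoint):
    print(name, shape)
```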
2. Get the BERT model output
```python
# Per-token output of shape [batch_size, seq_length, embedding_size];
# use this for seq2seq or NER
output_layer = model.get_sequence_output()
# Sentence-level (pooled [CLS]) output of shape [batch_size, hidden_size]
output_layer = model.get_pooled_output()
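```

To make the difference concrete, here is a minimal sketch of a sentence-classification head on top of the pooled output, in the style of BERT's run_classifier.py; num_labels and the labels placeholder are illustrative and not part of the original script.

```python
num_labels = 2  # illustrative: binary classification
labels = tf.placeholder(shape=[64], dtype=tf.int32, name="labels")

pooled = model.get_pooled_output()    # [64, 768]
hidden_size = pooled.shape[-1].value  # 768 for chinese_L-12_H-768_A-12

output_weights = tf.get_variable(
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

logits = tf.nn.bias_add(
    tf.matmul(pooled, output_weights, transpose_b=True), output_bias)  # [64, 2]
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```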
3. Get the BERT model input
```python
def convert_single_example(max_seq_length, tokenizer, text_a, text_b=None):
    tokens_a = tokenizer.tokenize(text_a)
    tokens_b = None
    if text_b:
        tokens_b = tokenizer.tokenize(text_b)  # for Chinese this mainly splits into characters
    if tokens_b:
        # With a sentence pair, the combined length must be at most max_seq_length - 3,
        # because [CLS], [SEP], [SEP] have to be inserted.
        # _truncate_seq_pair comes from BERT's run_classifier.py.
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # With a single sentence only [CLS] and [SEP] are added,
        # so the sentence must be at most max_seq_length - 2 tokens.
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # Convert to BERT's input. Note: type_ids here corresponds to segment_ids in the source code.
    # (a) Sentence pair:
    #   tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #   type_ids: 0     0  0    0    0     0      0 0     1  1  1  1   1 1
    # (b) Single sentence:
    #   tokens:   [CLS] the dog is hairy . [SEP]
    #   type_ids: 0     0   0   0  0     0 0
    # type_ids distinguishes the first sentence (0) from the second (1); it is added to the
    # token embeddings during pre-training. Strictly it is not necessary, since [SEP]
    # already separates the two sentences, but type_ids makes learning easier.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    # Map the tokens (here, Chinese characters) to vocabulary ids
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Create the mask
    input_mask = [1] * len(input_ids)

    # Pad the inputs with zeros up to max_seq_length
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    # These correspond to the input_ids, input_mask, segment_ids parameters
    # used when creating the BERT model
    return input_ids, input_mask, segment_ids
```
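convert_single_example produces one example, while the placeholders defined earlier expect a whole batch of shape [64, 128]. A small helper sketch, assuming NumPy is available, that stacks several texts into such arrays:

```python
import numpy as np

def convert_batch(texts, tokenizer, max_seq_length=128):
    # Stack per-example lists into [batch_size, max_seq_length] int32 arrays
    all_ids, all_mask, all_seg = [], [], []
    for text in texts:
        ids, mask, seg = convert_single_example(max_seq_length, tokenizer, text)
        all_ids.append(ids)
        all_mask.append(mask)
        all_seg.append(seg)
    return (np.array(all_ids, dtype=np.int32),
            np.array(all_mask, dtype=np.int32),
            np.array(all_seg, dtype=np.int32))
```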
Now call the tokenizer and generate the input data:

```python
vocab_file = "chinese_L-12_H-768_A-12/vocab.txt"
token = tokenization.FullTokenizer(vocab_file=vocab_file)
# The original post passes a Chinese example sentence here; rendered in English:
input_ids, input_mask, segment_ids = convert_single_example(
    100, token,
    "You can buy it directly from the editorial department. Address: 8 Gymnasium Road, "
    "Beijing, in the compound of the China Sports Daily press.")
print(input_ids)
print(input_mask)
print(segment_ids)
```
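Note that the placeholders were declared with shape [64, 128], so an example padded to length 100 cannot be fed to them directly. A minimal end-to-end sketch, assuming one example padded to 128 is tiled across the batch, that actually fetches the vectors:

```python
import numpy as np

# Pad to the placeholder length (128), then repeat the example 64 times
ids, mask, seg = convert_single_example(128, token, "some example text")
feed = {
    input_ids: np.tile(np.array(ids, dtype=np.int32), (64, 1)),    # [64, 128]
    input_mask: np.tile(np.array(mask, dtype=np.int32), (64, 1)),
    segment_ids: np.tile(np.array(seg, dtype=np.int32), (64, 1)),
}
with tf.Session() as sess:
    # init_from_checkpoint was called above, so the initializer
    # loads the pretrained BERT weights
    sess.run(tf.global_variables_initializer())
    pooled = sess.run(model.get_pooled_output(), feed_dict=feed)
    print(pooled.shape)  # (64, 768)
```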