Two-dimensional convolution Chinese Weibo sentiment classification project
2022-07-21 04:55:00 【Don't wait for shy brother to develop】
1、 Data set description
This is a Chinese Weibo (microblog) sentiment classification project. The dataset used here was collected from Sina Weibo and contains 120,000 samples, with an even split of positive and negative examples. A label of 1 marks a positive comment and 0 marks a negative comment. The data comes from https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb. If you have other data, you can use that instead.
The data needs to be preprocessed by ourselves this time, so we first segment each sentence into words, and after segmentation every word is assigned an index according to its frequency. The segmentation tool used here is jieba ("stuttering" word segmentation), an easy-to-use Chinese word segmentation tool. To install it, open a command prompt and type:
pip install jieba
After installation, you can use it directly in a Python program with import jieba.
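Before the full pipeline below, here is a minimal sketch (not part of the project code) of what jieba segmentation and frequency-based word indexing look like; the toy sentences are made up purely for illustration:
import jieba
from tensorflow.keras.preprocessing.text import Tokenizer
# In practice the space-separated text comes from jieba, e.g. ' '.join(jieba.cut('我喜欢看电影'))
sample_texts = ['我 喜欢 看 电影', '我 不 喜欢 下雨']
toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(sample_texts)
# More frequent words get smaller indices, e.g. {'我': 1, '喜欢': 2, ...}
print(toy_tokenizer.word_index)
print(toy_tokenizer.texts_to_sequences(sample_texts))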
2、 Two-dimensional convolution Chinese sentiment classification in practice
# Install jieba for word segmentation
# pip install jieba
import jieba
import pandas as pd
import numpy as np
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras.layers import Conv2D, GlobalMaxPool2D, Embedding, concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
from sklearn.model_selection import train_test_split
import json
# pip install plot_model
from plot_model import plot_model
from matplotlib import pyplot as plt
# Batch size
batch_size = 128
# Number of training epochs
epochs = 3
# Word vector length
embedding_dims = 128
# Number of filters
filters = 32
# The first half of this dataset consists of positive samples, the second half of negative samples
data = pd.read_csv('weibo_senti_100k.csv')
# View the first and last 5 rows of the data
data.head(),data.tail()
# Calculate the number of positive samples
poslen = sum(data['label']==1)
# Calculate the number of negative samples
neglen = sum(data['label']==0)
print('Number of positive samples:', poslen)
print('Number of negative samples:', neglen)
# Test jieba word segmentation
print(list(jieba.cut("Parents must have Liu Yong's mentality, keep learning, keep improving")))
# Define the segmentation function, which segments the incoming x
cw = lambda x: list(jieba.cut(x))
# apply takes a function; here cw is applied to every row of data['review']
# and the segmentation results are saved to data['words']
data['words'] = data['review'].apply(cw)
# View the first 5 rows again
data.head()
# Calculate the maximum number of words in a piece of data
max_length = max([len(x) for x in data['words']])
# The printed result is 202, so even the longest sentence is not very long
# We therefore use 202 as the standard and pad every sentence to a length of 202
# If the longest sentence were, say, 2000, some sentences would be far too long, and we could pick a smaller
# value as the standard length, e.g. 1000: sentences longer than 1000 keep only the first 1000 words, shorter ones are padded to 1000
print(max_length)
# Convert every word list in data['words'] into a single space-separated string
texts = [' '.join(x) for x in data['words']]
texts
# View one comment: the data is now a string with words separated by spaces,
# which is the format required by the next processing step with Tokenizer
texts[4]
# Instantiate Tokenizer and set the maximum dictionary size to 30000
# Tokenizer automatically filters out symbols such as: !"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n
tokenizer = Tokenizer(num_words=30000)
# Pass in the training texts to build the dictionary; indices are assigned by word frequency: the higher the frequency, the smaller the index
tokenizer.fit_on_texts(texts)
# Convert words to indices; words whose index is greater than 30000 are filtered out
sequences = tokenizer.texts_to_sequences(texts)
# Pad/truncate every sequence to max_length: parts beyond max_length are discarded, shorter sequences are padded with 0
# padding='pre' pads at the front of the sentence, padding='post' pads at the end
X = pad_sequences(sequences, maxlen=max_length, padding='pre')
# X = np.array(sequences)
# Get the dictionary
dict_text = tokenizer.word_index
# Look up the index of a word in the dictionary (the key must be a word from the corpus; here the Chinese word for 'dream' is assumed)
dict_text['梦']
# Save the tokenizer configuration to token_config.json so the prediction stage can reuse it
file = open('token_config.json', 'w', encoding='utf-8')
# Convert the tokenizer to JSON
token_config = tokenizer.to_json()
# Save the JSON data and close the file
json.dump(token_config, file)
file.close()
print(X[4])
# Define the labels
# [0, 1] is a positive sample, [1, 0] is a negative sample (one-hot encoding)
positive_labels = [[0, 1] for _ in range(poslen)]
negative_labels = [[1, 0] for _ in range(neglen)]
# Merge tags
Y = np.array(positive_labels + negative_labels)
print(Y)
# Split the dataset into training and test sets
x_train,x_test,y_train,y_test = train_test_split(X, Y, test_size=0.2)
# Define a functional model
# Define the model input, shape (batch, 202)
sequence_input = Input(shape=(max_length,))
# Embedding layer: a vocabulary of 30000 words, each mapped to a 128-dimensional vector
embedding_layer = Embedding(input_dim=30000, output_dim=embedding_dims)
# embedded_sequences has shape (batch, 202, 128)
embedded_sequences = embedding_layer(sequence_input)
# Add a channel dimension so embedded_sequences has shape (batch, 202, 128, 1), as required by Conv2D
embedded_sequences = K.expand_dims(embedded_sequences, axis=-1)
# Convolution kernel of height 3; the kernel width must equal the embedding dimension
cnn1 = Conv2D(filters=filters, kernel_size=(3,embedding_dims), activation='relu')(embedded_sequences)
cnn1 = GlobalMaxPool2D()(cnn1)
# Convolution kernel of height 4; the kernel width must equal the embedding dimension
cnn2 = Conv2D(filters=filters, kernel_size=(4,embedding_dims), activation='relu')(embedded_sequences)
cnn2 = GlobalMaxPool2D()(cnn2)
# Convolution kernel of height 5; the kernel width must equal the embedding dimension
cnn3 = Conv2D(filters=filters, kernel_size=(5,embedding_dims), activation='relu')(embedded_sequences)
cnn3 = GlobalMaxPool2D()(cnn3)
# Concatenate the three pooled feature vectors
merge = concatenate([cnn1, cnn2, cnn3], axis=-1)
# Fully connected layer
x = Dense(128, activation='relu')(merge)
# Dropout layer
x = Dropout(0.5)(x)
# Output layer
preds = Dense(2, activation='softmax')(x)
# Defining models
model = Model(sequence_input, preds)
plot_model(model, dpi=200)
# Define the loss function and optimizer
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'])
# Train the model
history=model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test))
# Save the model
model.save('cnn_model.h5')
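The history object returned by model.fit and the matplotlib import above are not used in the original listing; as an optional extra, here is a minimal sketch of plotting the training curves with them (it assumes the 'acc' metric name set in model.compile):
# Optional: plot training and validation accuracy from the history object
plt.plot(history.history['acc'], label='train acc')
plt.plot(history.history['val_acc'], label='val acc')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()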
3、 Model prediction
# Use tensorflow.keras, consistent with the training code
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.preprocessing.sequence import pad_sequences
import jieba
import numpy as np
import json
# load tokenizer
json_file=open('token_config.json','r',encoding='utf-8')
token_config=json.load(json_file)
tokenizer=tokenizer_from_json(token_config)
# Load model
model=load_model('cnn_model.h5')
# Sentiment prediction
def predict(text):
    # Segment the sentence into words
    cw = list(jieba.cut(text))  # e.g. ['The weather today', 'really', 'good', ',', 'I want to', 'go', 'play ball']
    # Join the list into a single string with the elements separated by ' '
    texts = ' '.join(cw)  # e.g. 'The weather today really good , I want to go play ball'
    # print([texts])
    # Convert words to indices; words whose index is greater than 30000 are filtered out
    sequences = tokenizer.texts_to_sequences([texts])  # [texts] wraps the string in a list
    # model.input_shape is (None, 202); 202 is the sequence length used when training the model
    # Pad/truncate the sequence to length 202: parts beyond 202 are discarded, shorter sequences are padded with 0
    sequences = pad_sequences(sequences=sequences,
                              maxlen=model.input_shape[1],  # 202
                              padding='pre')
    # Model prediction, result shape (1, 2)
    predict_result = model.predict(sequences)
    # Take the index of the largest element in predict_result
    result = np.argmax(predict_result)
    if result == 1:
        print('Positive sentiment')
    else:
        print('Negative sentiment')

if __name__ == '__main__':
    predict("It's a beautiful day, I'm going to play")
    predict("A large room full of people, and the water was cut off early in the morning; I've been going crazy until now [crazy]")
    predict("I'm very happy today")
There are a lot of emoticon tags in the raw data, such as [love you], [haha], [applause], [cute], and so on. The words inside these emoticons largely determine the sentiment of the sentence, so the very high accuracy achieved in this project has a lot to do with them. If you use other datasets for sentiment classification you should still get good results, but it will probably be hard to reach an accuracy as high as 98%.
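To get a feel for how much the emoticon tags contribute, one could strip the bracketed tokens from the text before prediction (or before training) and compare the results. A minimal sketch, assuming the emoticons always appear as ASCII-bracketed [...] tokens, which is a simplification:
import re

def strip_emoticons(text):
    # Remove bracketed emoticon tags such as [love you] or [haha]
    return re.sub(r'\[[^\]]*\]', '', text)

# Usage: compare the prediction with and without the emoticon tag
# predict("A large room full of people, and the water was cut off early in the morning [crazy]")
# predict(strip_emoticons("A large room full of people, and the water was cut off early in the morning [crazy]"))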