Building and Training a Neural Network on the IMDB Data

We use the IMDB dataset that ships with Keras to build a simple neural network model. IMDB is short for the Internet Movie Database; the dataset contains 50,000 movie reviews, split into a training set of 25,000 and a test set of 25,000.

In [1]:
import numpy as np
In [2]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Using TensorFlow backend.
In [3]:
np.shape(train_data)
Out[3]:
(25000,)
In [4]:
train_labels[1]
Out[4]:
0
In [5]:
word_index = imdb.get_word_index()
reverse_word_index = dict([(value,key) for (key,value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3,'?') for i in train_data[0]])
In [6]:
decoded_review
Out[6]:
"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

1. Data Preparation

Convert the integer sequences into a binary matrix with one-hot encoding: build a 10,000-dimensional vector for each review, setting a position to 1 if the corresponding word appears and 0 otherwise.

In [7]:
import numpy as np
def vectorize_sequence(sequences, dimension=10000):
    # all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the positions of the word indices in this review to 1
    return results
x_train = vectorize_sequence(train_data)
x_test = vectorize_sequence(test_data)
In [8]:
x_train[0]
Out[8]:
array([0., 1., 1., ..., 0., 0., 0.])
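Since duplicate word indices land on the same position, the number of 1s in a row equals the number of distinct indices in that review. A quick sanity check (an illustrative snippet, not part of the original notebook):

print(int(x_train[0].sum()))     # how many positions are set to 1
print(len(set(train_data[0])))   # distinct word indices in review 0; the two numbers should match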
In [9]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
In [10]:
y_test
Out[10]:
array([0., 1., 1., ..., 0., 0., 0.], dtype=float32)

2. Building the Network

In [11]:
from keras import models
from keras import layers

model = models.Sequential()  ## a Sequential model; this is almost always the right choice
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))  ## 16 units; the 10000-dimensional input shape only needs to be given for the first layer
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))  ### binary classification, so the last layer has a single unit with a sigmoid activation
In [12]:
## Binary classification with 0/1 output, so the loss is 'binary_crossentropy'; for multi-class problems it would be 'categorical_crossentropy'.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
In [13]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 16)                160016    
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
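The parameter counts in the summary follow from weights plus biases: a Dense layer has input_dim * units weights and units biases.

print(10000 * 16 + 16)   # dense_1: 160016
print(16 * 16 + 16)      # dense_2: 272
print(16 * 1 + 1)        # dense_3: 17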

3. Validation Approach

In [14]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
In [15]:
history = model.fit(partial_x_train,
                   partial_y_train,
                   epochs = 20,
                   batch_size=512,
                   validation_data=(x_val,y_val))
Train on 15000 samples, validate on 10000 samples
Epoch 1/20
15000/15000 [==============================] - 6s 386us/step - loss: 0.5040 - accuracy: 0.7925 - val_loss: 0.3804 - val_accuracy: 0.8694
Epoch 2/20
15000/15000 [==============================] - 5s 347us/step - loss: 0.2962 - accuracy: 0.9071 - val_loss: 0.3016 - val_accuracy: 0.8874
Epoch 3/20
15000/15000 [==============================] - 5s 322us/step - loss: 0.2208 - accuracy: 0.9280 - val_loss: 0.2774 - val_accuracy: 0.8909
Epoch 4/20
15000/15000 [==============================] - 5s 301us/step - loss: 0.1720 - accuracy: 0.9450 - val_loss: 0.2749 - val_accuracy: 0.8880
Epoch 5/20
15000/15000 [==============================] - 4s 234us/step - loss: 0.1427 - accuracy: 0.9544 - val_loss: 0.2899 - val_accuracy: 0.8849
Epoch 6/20
15000/15000 [==============================] - 3s 226us/step - loss: 0.1151 - accuracy: 0.9666 - val_loss: 0.3024 - val_accuracy: 0.8803
Epoch 7/20
15000/15000 [==============================] - 3s 195us/step - loss: 0.0947 - accuracy: 0.9718 - val_loss: 0.3203 - val_accuracy: 0.8797
Epoch 8/20
15000/15000 [==============================] - 3s 223us/step - loss: 0.0764 - accuracy: 0.9777 - val_loss: 0.3512 - val_accuracy: 0.8754
Epoch 9/20
15000/15000 [==============================] - 3s 181us/step - loss: 0.0622 - accuracy: 0.9836 - val_loss: 0.3986 - val_accuracy: 0.8713
Epoch 10/20
15000/15000 [==============================] - 3s 218us/step - loss: 0.0528 - accuracy: 0.9874 - val_loss: 0.3868 - val_accuracy: 0.8789
Epoch 11/20
15000/15000 [==============================] - 3s 215us/step - loss: 0.0407 - accuracy: 0.9904 - val_loss: 0.4127 - val_accuracy: 0.8774
Epoch 12/20
15000/15000 [==============================] - 3s 220us/step - loss: 0.0330 - accuracy: 0.9935 - val_loss: 0.4508 - val_accuracy: 0.8707
Epoch 13/20
15000/15000 [==============================] - 3s 209us/step - loss: 0.0264 - accuracy: 0.9951 - val_loss: 0.4787 - val_accuracy: 0.8748
Epoch 14/20
15000/15000 [==============================] - 3s 203us/step - loss: 0.0222 - accuracy: 0.9953 - val_loss: 0.5061 - val_accuracy: 0.8700
Epoch 15/20
15000/15000 [==============================] - 4s 239us/step - loss: 0.0163 - accuracy: 0.9979 - val_loss: 0.5327 - val_accuracy: 0.8698
Epoch 16/20
15000/15000 [==============================] - 3s 209us/step - loss: 0.0130 - accuracy: 0.9981 - val_loss: 0.5649 - val_accuracy: 0.8693
Epoch 17/20
15000/15000 [==============================] - 3s 216us/step - loss: 0.0117 - accuracy: 0.9978 - val_loss: 0.6013 - val_accuracy: 0.8678
Epoch 18/20
15000/15000 [==============================] - 3s 225us/step - loss: 0.0055 - accuracy: 0.9997 - val_loss: 0.6358 - val_accuracy: 0.8662
Epoch 19/20
15000/15000 [==============================] - 3s 226us/step - loss: 0.0083 - accuracy: 0.9983 - val_loss: 0.6573 - val_accuracy: 0.8665
Epoch 20/20
15000/15000 [==============================] - 3s 200us/step - loss: 0.0033 - accuracy: 0.9999 - val_loss: 0.6849 - val_accuracy: 0.8665
In [16]:
history_dict = history.history
history_dict.keys()
Out[16]:
dict_keys(['val_loss', 'val_accuracy', 'loss', 'accuracy'])

4. Plotting the Results

In [17]:
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

epochs = range(1,len(loss_values)+1)

plt.plot(epochs,loss_values,'bo',label = "Training loss")
plt.plot(epochs, val_loss_values,'b',label = "Validation loss")
plt.title('Training and Validation loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

plt.show()
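The history dictionary also holds the accuracy curves under the 'accuracy' and 'val_accuracy' keys shown in Out[16]; a sketch for plotting them in the same style:

plt.clf()   # clear the loss figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc_values, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc_values, 'b', label='Validation accuracy')
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()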

5. Training a Model from Scratch

The validation loss reaches its minimum at epoch 4; after that the training loss keeps dropping and stays far below the validation loss, i.e. the model is overfitting. We therefore fix epoch = 4 and retrain a model from scratch.

In [18]:
model = models.Sequential()
model.add(layers.Dense(16,activation='relu',input_shape = (10000,)))
model.add(layers.Dense(16,activation='relu'))
model.add(layers.Dense(1,activation='sigmoid'))

model.compile(optimizer = 'rmsprop',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

model.fit(x_train,y_train,epochs = 4,batch_size=512)
results = model.evaluate(x_test,y_test)
Epoch 1/4
25000/25000 [==============================] - 5s 181us/step - loss: 0.4431 - accuracy: 0.8228
Epoch 2/4
25000/25000 [==============================] - 4s 151us/step - loss: 0.2545 - accuracy: 0.9088
Epoch 3/4
25000/25000 [==============================] - 3s 133us/step - loss: 0.1978 - accuracy: 0.9298
Epoch 4/4
25000/25000 [==============================] - 3s 119us/step - loss: 0.1687 - accuracy: 0.9398
25000/25000 [==============================] - 6s 230us/step
In [19]:
results
Out[19]:
[0.2947778236961365, 0.883840024471283]
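The two numbers are the test loss and test accuracy, so the retrained model reaches roughly 88.4% accuracy on the test set. As a side note, instead of reading the best epoch off the plot by hand, Keras can stop training automatically via the EarlyStopping callback; a minimal sketch (the patience value is an arbitrary choice):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit(partial_x_train, partial_y_train,
          epochs=20, batch_size=512,
          validation_data=(x_val, y_val),
          callbacks=[early_stop])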

6. Using the Trained Network to Predict on New Data

In [20]:
model.predict(x_test)
Out[20]:
array([[0.16397169],
       [0.99960154],
       [0.819988  ],
       ...,
       [0.10560978],
       [0.06917291],
       [0.7328982 ]], dtype=float32)
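predict returns the raw sigmoid outputs, i.e. the estimated probability that each review is positive. To turn them into hard 0/1 predictions, threshold at 0.5 (an illustrative snippet):

pred_labels = (model.predict(x_test) > 0.5).astype('int32').ravel()
print(pred_labels[:3])   # [0 1 1], matching the probabilities shown above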