from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Notes:
# 1. Two classes do the work: CountVectorizer and TfidfTransformer.
# 2. CountVectorizer.fit_transform converts the words in the texts into a term-count matrix:
#    1) element weight[i][j] is the number of times word j occurs in document i;
#    2) get_feature_names_out() lists the vocabulary terms, and toarray() shows the count matrix.
# 3. TfidfTransformer also has a fit_transform method, which computes the tf-idf values.
#
# What is being tested:
# 1) whether transforming documents one at a time gives the same result as transforming them together as one list,
# 2) i.e. whether the result for test matches the results for test_a and test_b,
# 3) i.e. whether the test data affects the tf-idf values produced by CountVectorizer and TfidfTransformer.
#
# Conclusions:
# 1) As long as train and test go through the same fitted CountVectorizer, the one-at-a-time and batch results are identical.
# 2) So that the test set can reuse the training set's vocabulary, save the CountVectorizer when saving the model.

train = ['This is the first document.', 'This is the second second document.']
test = ['And the third one.', 'Is this the first document?']
test_a = ['And the third one.']
test_b = ['Is this the first document?']

vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()

# Note: once vectorizer.fit_transform has run, the vocabulary is fixed
count_train = vectorizer.fit_transform(train)
print('count:')
print(vectorizer.vocabulary_)
print('feature_names:')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())
print(count_train.toarray())

tfidf = tfidftransformer.fit_transform(count_train)
train_weight = tfidf.toarray()
print(tfidf.shape)
print(train_weight)

count_test = vectorizer.transform(test)
# Note: test_a and test_b are transformed with the same fixed vocabulary
count_test_a = vectorizer.transform(test_a)
count_test_b = vectorizer.transform(test_b)
print('count_train:')
print(vectorizer.get_feature_names_out().tolist())
print('Count matrix comparison:')
print(count_test.toarray())
print(count_test_a.toarray())
print(count_test_b.toarray())

test_tfidf = tfidftransformer.transform(count_test)
test_weight = test_tfidf.toarray()
test_weight_a = tfidftransformer.transform(count_test_a).toarray()
test_weight_b = tfidftransformer.transform(count_test_b).toarray()
print('tf-idf comparison:')
print(test_weight)
print(test_weight_a)
print(test_weight_b)

The output is as follows:

count:
{'first': 1, 'the': 4, 'is': 2, 'second': 3, 'this': 5, 'document': 0}
feature_names:
['document', 'first', 'is', 'second', 'the', 'this']
[[1 1 1 0 1 1]
 [1 0 1 2 1 1]]
(2, 6)
[[0.4090901 0.57496187 0.4090901 0. 0.4090901 0.4090901 ]
 [0.28986934 0. 0.28986934 0.81480247 0.28986934 0.28986934]]
count_train:
['document', 'first', 'is', 'second', 'the', 'this']
Count matrix comparison:
[[0 0 0 0 1 0]
 [1 1 1 0 1 1]]
[[0 0 0 0 1 0]]
[[1 1 1 0 1 1]]
tf-idf comparison:
[[0. 0. 0. 0. 1. 0. ]
 [0.4090901 0.57496187 0.4090901 0. 0.4090901 0.4090901 ]]
[[0. 0. 0. 0. 1. 0.]]
[[0.4090901 0.57496187 0.4090901 0. 0.4090901 0.4090901 ]]
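
To see why the batch and one-at-a-time results agree, it may help to recompute the weights by hand. The sketch below is not scikit-learn code. It assumes the default TfidfTransformer settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf = ln((1 + n) / (1 + df)) + 1 and every row is L2-normalized separately. The helper names fit_idf and tfidf_row are made up for illustration, and the count matrices are copied from the output above.

```python
import math

# Term counts copied from the output above; columns follow the vocabulary
# ['document', 'first', 'is', 'second', 'the', 'this'].
train_counts = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 2, 1, 1],
]
test_counts = [
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
]


def fit_idf(counts):
    """Smoothed idf per term: ln((1 + n_docs) / (1 + df)) + 1 (assumed defaults)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    idf = []
    for j in range(n_terms):
        # df = number of documents in which term j occurs at least once
        df = sum(1 for row in counts if row[j] > 0)
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1)
    return idf


def tfidf_row(row, idf):
    """Weight one document's counts by idf, then L2-normalize that single row."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]


# idf is learned from the training counts only; test data never changes it.
idf = fit_idf(train_counts)

train_weights = [tfidf_row(row, idf) for row in train_counts]
# "Batch" and "one at a time" go through the same per-row function,
# so they cannot differ once idf is fixed.
batch_weights = [tfidf_row(row, idf) for row in test_counts]
single_weights = [tfidf_row(test_counts[0], idf), tfidf_row(test_counts[1], idf)]

for row in train_weights + batch_weights:
    print([round(v, 8) for v in row])
```

Each row depends only on its own counts and on the idf vector fitted on the training data, which is consistent with conclusion 1. The printed values should agree with the matrices above up to rounding.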