A few conventions (so that variable names stay consistent throughout and the code works out of the box, the following fixed symbols are used):

y
: target labels, 0/1, or 0/1/2 in multi-class models; one-dimensional, passed directly to the model

sentences_raw
: the raw corpus, list<str>

sentences
: the corpus after initial cleaning, list<str>

X
: the feature vectors after feature extraction
1. Initial cleaning of text data
Step 1: load the data
import pandas as pd
df = pd.read_csv('http://www.guofei.site/datasets_for_ml/SMSSpamCollection/SMSSpamCollection.csv', sep='\t', header=None, names=['label', 'sentences'])
y = ((df.label == 'spam') * 1).values
sentences_raw = df.sentences.values
Step 2: initial cleaning
Cleaning goals:
- convert everything to lowercase
- remove punctuation, digits, etc.
- remove single-letter words
- remove stopwords, if a stopword list is available
- lemmatization with NLTK, e.g. reducing progressive tense, past tense, and plural forms (see the sketch after the code below)
import re
regex = re.compile('[a-zA-Z]{2,}')  # runs of 2 or more letters; this drops punctuation, digits and single-letter words
sentences = [regex.findall(sentence.lower()) for sentence in sentences_raw]
sentences = [' '.join(words) for words in sentences]  # join the words back into sentences
# sentences = [' '.join([word for word in words if word not in stops]) for words in sentences]  # if a stopword list `stops` is available
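The goal list also mentions stopword removal and lemmatization with NLTK, which the code above leaves out. A minimal sketch, assuming NLTK and its stopwords/wordnet corpora are available (the nltk.download calls are one-time only):
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')  # one-time download of the required corpora
nltk.download('wordnet')

stops = set(stopwords.words('english'))  # the stopword set used in the commented line above
lemmatizer = WordNetLemmatizer()

# drop stopwords and lemmatize what is left
# (lemmatize() treats words as nouns by default; pass pos='v' to reduce verb forms)
sentences = [' '.join(lemmatizer.lemmatize(word) for word in sentence.split() if word not in stops)
             for sentence in sentences]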
# train_test_split
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.2)
Cleaning Chinese text
Load the corpus
import pandas as pd
df = pd.read_csv('外卖评价语料.csv')
y = df.label
sentences_raw = df.sentences.values
- segment into words with jieba
- remove digits
- remove stopwords
- join the words back with spaces
import jieba
import re
sentences = [jieba.lcut(sentence, cut_all=False) for sentence in sentences_raw]  # word segmentation
regex = re.compile(u'[\u4e00-\u9fa5]')
sentences = [[word for word in sentence if regex.match(word) is not None] for sentence in sentences]  # jieba also returns runs of digits as tokens; keep only tokens that start with a Chinese character
# sentences = [[word for word in sentence if word not in stops] for sentence in sentences]  # stopwords
sentences = [' '.join(sentence) for sentence in sentences]
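The commented stopword line assumes a set named stops. A minimal sketch that builds it from a plain-text stopword list, one word per line (the file name chinese_stopwords.txt is a placeholder):
# load a Chinese stopword list; replace the placeholder file name with your own list
with open('chinese_stopwords.txt', encoding='utf-8') as f:
    stops = {line.strip() for line in f if line.strip()}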
For more complex corpora, additionally:
- convert traditional characters to simplified (no code here yet; usually done with a library such as OpenCC)
- replace numbers with 5 or more digits by the token NUM (no code here yet; see the sketch after the code below)
- lowercase any English text (also covered by the sketch below)
- register domain-specific proper nouns so that segmentation does not split them incorrectly
import jieba
import re
add_words = ['沙瑞金', '田国富', '高育良', '侯亮平', '钟小艾', '陈岩石', '欧阳菁', '易学习', '王大路', '蔡成功', '孙连城', '季昌明', '丁义珍', '郑西坡', '赵东来', '高小琴', '赵瑞龙', '林华华', '陆亦可', '刘新建', '刘庆祝']
for add_word in add_words:
    jieba.add_word(add_word)  # register the proper nouns with jieba
sentences = [jieba.lcut(sentence, cut_all=False) for sentence in sentences_raw]  # word segmentation
regex = re.compile(u'[\u4e00-\u9fa5]')
sentences = [[word for word in sentence if regex.match(word) is not None] for sentence in sentences]  # keep only tokens that start with a Chinese character (drops digit runs)
# sentences = [[word for word in sentence if word not in stops] for sentence in sentences]  # stopwords
sentences = [' '.join(sentence) for sentence in sentences]
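The digit and lowercase steps from the list above only need the standard library. A minimal sketch, applied to sentences_raw before the segmentation code above; note that the Chinese-character filter above would also have to be relaxed if NUM and English tokens should survive it, and traditional-to-simplified conversion would additionally need a library such as OpenCC and is not shown:
import re

num_regex = re.compile(r'\d{5,}')  # runs of 5 or more digits

# lowercase English text and collapse long digit runs into the single token NUM
sentences_raw = [num_regex.sub(' NUM ', sentence.lower()) for sentence in sentences_raw]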
2. VocabularyProcessor
Assign every word an integer index (a dictionary), then convert each sentence into a list of integers using that one-to-one mapping.
tensorflow.contrib.learn.preprocessing.VocabularyProcessor provides this functionality, but it is deprecated and will be removed,
and the implementations found online are not very Pythonic,
so here is a hand-written class: lightweight and easy to use.
import collections


class VocabularyProcessor:
    '''
    author: guofei
    site: www.guofei.site

    Input: a list of sentences; the words in each sentence are separated by spaces.
    The class does no cleaning of its own, so clean the corpus first and then use it.
    Output: a list of lists; each inner list has one integer per word (the word's index in the dictionary),
    e.g. [[2, 3, 1, 0, 4], [2, 3, 1, 0, 0, 4], [0, 1, 0, 0], [3, 2, 1, 0, 4]]

    ```
    corpus = ['this is the first document',
              'this is the second second document',
              'and the third one',
              'is this the first document']
    vp = VocabularyProcessor(max_vocabulary_size=5)
    vp.fit(corpus)
    X = vp.transform(corpus)
    ```

    - Low-frequency words are dropped from the dictionary and all mapped to 'RARE', whose index is fixed at 0.
    - max_vocabulary_size=10000: the maximum number of words to keep (including 'RARE').
    '''

    def __init__(self, max_vocabulary_size=10000):
        self.max_vocabulary_size = max_vocabulary_size
        self.word_count = None
        self.dictionary = None
        self.reverse_dictionary = None

    def fit(self, sentences):
        words = [word for sentence in sentences for word in sentence.split(' ')]
        word_count_ = collections.Counter(words)  # word frequencies
        word_count = word_count_.most_common(self.max_vocabulary_size - 1)  # keep the most frequent words
        words_most = [word for word, num in word_count]  # the most frequent words
        len_words_most = len(words_most)
        dictionary = dict(zip(words_most, range(1, len_words_most + 1)))
        reverse_dictionary = dict(zip(range(1, len_words_most + 1), words_most))
        # add 'RARE', i.e. all low-frequency words
        rare_count = sum(word_count_[word] for word in word_count_ if word not in words_most)
        word_count = [('RARE', rare_count)] + word_count
        dictionary['RARE'] = 0
        reverse_dictionary[0] = 'RARE'
        self.word_count = word_count
        self.dictionary = dictionary
        self.reverse_dictionary = reverse_dictionary

    def transform(self, sentences):
        return [[self.dictionary.get(word, 0) for word in sentence.split(' ')] for sentence in sentences]

    def fit_transform(self, sentences):
        self.fit(sentences)
        return self.transform(sentences)
Usage
corpus = ['this is the first document',
'this is the second second document',
'and the third one',
'is this the first document']
vp = VocabularyProcessor(max_vocabulary_size=5)
vp.fit(corpus)
X = vp.transform(corpus)
# vp.word_count, vp.dictionary, vp.reverse_dictionary
print(X)
[[2, 3, 1, 0, 4], [2, 3, 1, 0, 0, 4], [0, 1, 0, 0], [3, 2, 1, 0, 4]]
(the exact indices assigned to equal-frequency words depend on the order in which they first appear)
After fit, other information about the vocabulary can be queried as well:
vp.word_count  # [(word, count), ...]; the first entry is ('RARE', total count of the dropped low-frequency words)
vp.dictionary # {'word': idx}
vp.reverse_dictionary # {idx: 'word'}
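The reverse dictionary makes it easy to map an encoded sentence back into (cleaned) text, with all dropped words showing up as RARE:
# decode the first encoded sentence back into words
print(' '.join(vp.reverse_dictionary[idx] for idx in X[0]))
# e.g. 'this is the RARE document'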
3. CountVectorizer
Also known as bag of words, or TF (term frequency).
Bag-of-words overview
The resulting vector is as long as the vocabulary, and each element is the number of times the corresponding word occurs in the sentence.
For example, with the vocabulary
{'tensorflow': 4, 'makes': 3, 'machine': 2, 'learning': 1, 'easy': 0}
the bag-of-words vectors are
sentence1 = 'TensorFlow does make machine learning easy!'
sentence1_list = [1, 1, 1, 0, 1]  # 'does' and 'make' are not in the vocabulary
sentence2 = 'Machine learning is easy'
sentence2_list = [1, 1, 1, 0, 0]  # 'is' is not in the vocabulary
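This can be checked directly with scikit-learn by passing the vocabulary in explicitly (a quick verification sketch):
from sklearn.feature_extraction import text

vocabulary = {'tensorflow': 4, 'makes': 3, 'machine': 2, 'learning': 1, 'easy': 0}
cv = text.CountVectorizer(vocabulary=vocabulary)

# columns are ordered by index: easy, learning, machine, makes, tensorflow
print(cv.transform(['TensorFlow does make machine learning easy!',
                    'Machine learning is easy']).toarray())
# [[1 1 1 0 1]
#  [1 1 1 0 0]]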
Code implementation
sklearn.feature_extraction.text.CountVectorizer
from sklearn.feature_extraction import text
count_vectorizer = text.CountVectorizer()
# ngram_range
# max_df : float in range [0.0, 1.0] or int, default=1.0
# min_df : float in range [0.0, 1.0] or int, default=1
# max_features : int or None, default=None
# vocabulary : Mapping or iterable, optional

# source data
sentences = ['This is the first document.',
             'This is the second second document.',
             'And the third one.',
             'Is this the first document?']
count_vectorizer.fit(sentences)
X = count_vectorizer.transform(sentences)
# 1. words not in the vocabulary are ignored during transform
# 2. the result is a scipy sparse matrix; use X.toarray() to get an ordinary array
X.toarray()  # an array of shape (num_sentences, num_words)
# count_vectorizer.fit_transform(sentences)  # fit and transform in one call
count_vectorizer.vocabulary_  # the vocabulary (a dict, {'word': idx})
# {u'and': 0, u'document': 1, u'first': 2, u'is': 3, u'one': 4, u'second': 5, u'the': 6, u'third': 7, u'this': 8}
count_vectorizer.get_feature_names()  # a list of the words ordered by index (use get_feature_names_out() in newer scikit-learn versions)
ngram
ngram_range=(1, 2)  # besides single words, also keep n-grams of neighbouring words; for this example the vocabulary becomes:
# ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
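A runnable sketch of the n-gram setting (the vocabulary keys come out in alphabetical order when sorted):
from sklearn.feature_extraction import text

cv = text.CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
cv.fit(['bi grams are cool'])
print(sorted(cv.vocabulary_))
# ['are', 'are cool', 'bi', 'bi grams', 'cool', 'grams', 'grams are']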
Chinese
# the default token_pattern drops single-character tokens, which is fine for English,
# but single-character Chinese words must not be dropped
token_pattern = r'(?u)\b\w\w+\b'  # the default
# change it so that every whitespace-separated token is kept, including one-character words:
token_pattern = r'(?u)\b\w+\b'
# (the even simpler r'\S' keeps every non-space character as its own token, i.e. character-level features)
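A small comparison on jieba-style, space-separated text, showing that the default pattern silently drops one-character words such as 很 (a minimal sketch):
from sklearn.feature_extraction import text

docs = ['味道 很 好', '送餐 很 慢']  # already segmented, space separated

default_cv = text.CountVectorizer()  # default token_pattern
chinese_cv = text.CountVectorizer(token_pattern=r'(?u)\b\w+\b')  # keep 1-character tokens

print(sorted(default_cv.fit(docs).vocabulary_))  # ['味道', '送餐'] -- 很/好/慢 are dropped
print(sorted(chinese_cv.fit(docs).vocabulary_))  # ['味道', '好', '很', '慢', '送餐']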
TF-IDF
TF-IDF (term frequency - inverse document frequency)
Sometimes we want to adjust the weights of particular words: raise the weight of informative words and lower the weight of common or uninformative ones.
$f_{ij}=$frequency of term (feature) i in doc j
$TF_{ij}=\dfrac{f_{ij}}{\sum_k f_{kj}}$
$n_i=$number of docs that mention term i
$N=$total number of docs
$IDF_{i}=\log\dfrac{N}{n_i}$; the more documents a term appears in, the smaller this value
$\text{TF-IDF}_{ij}=TF_{ij}\times IDF_i$
from sklearn.feature_extraction import text
tf_idf_transformer = text.TfidfTransformer()
# note: by default TfidfTransformer uses a smoothed idf (smooth_idf=True) and L2-normalizes
# every row (norm='l2'), so its numbers differ slightly from the plain formula above
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tf_idf_transformer.fit(counts)
tf_idf_transformer.transform(counts).toarray()
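For comparison, a minimal sketch that computes the plain textbook formula above with NumPy (rows are documents, columns are terms):
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

tf = counts / counts.sum(axis=1, keepdims=True)  # TF_ij = f_ij / sum_k f_kj
n_i = (counts > 0).sum(axis=0)                   # n_i: number of docs containing term i
idf = np.log(counts.shape[0] / n_i)              # IDF_i = log(N / n_i)
tf_idf = tf * idf                                # TF-IDF_ij = TF_ij * IDF_i
print(tf_idf)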
TfidfVectorizer
TfidfVectorizer = CountVectorizer + TfidfTransformer
from sklearn.feature_extraction import text
tf_idf = text.TfidfVectorizer()
tf_idf.fit(sentences)
# tf_idf.vocabulary_  # {'word': idx}
# tf_idf.fixed_vocabulary_  # whether a vocabulary was passed in explicitly
# tf_idf.idf_  # the idf value of each word
# tf_idf.stop_words_
X = tf_idf.transform(sentences)
Other feature extraction methods
text.HashingVectorizer  # uses less memory and builds features faster (no vocabulary is stored)
VectorizerMixin  # the base class shared by the vectorizers above
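A minimal sketch of HashingVectorizer, reusing the English sentences from above; the output dimension is fixed up front and no vocabulary is kept, so it scales to very large corpora:
from sklearn.feature_extraction import text

hashing_vectorizer = text.HashingVectorizer(n_features=2 ** 10)  # fixed number of output columns
X = hashing_vectorizer.transform(sentences)  # stateless: no fit / vocabulary needed
X.toarray().shape  # (num_sentences, 1024)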