[NLP] The LDA Model and Its Implementation

LDA (Latent Dirichlet Allocation) is a topic model used in NLP. It should not be confused with the other LDA (Linear Discriminant Analysis), which is a supervised dimensionality-reduction model.

First, run CountVectorizer

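from sklearn.feature_extraction.text import CountVectorizer

# Ignore terms that appear in more than 95% of documents or in fewer than 2 documents,
# keep at most n_features terms, and drop English stop words.
count_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

# Sparse document-term count matrix, shape [n_samples, vocabulary size]
X_cnt_vec = count_vectorizer.fit_transform(data_samples)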

Then fit LDA

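from sklearn.decomposition import LatentDirichletAllocation

# Online variational inference; max_iter is the maximum number of passes over the data.
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(X_cnt_vec)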
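
As a rough end-to-end sketch of the two steps so far, the following fits LDA on a toy corpus and reads back the per-document topic mixture. The corpus, the choice of 2 topics, and the names `docs`, `counts`, `toy_lda` and `doc_topic` are assumptions for illustration; the topics learned from so little data are not meaningful.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# 2 topics is an arbitrary choice for this toy example.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# fit_transform returns one row per document: its distribution over topics.
doc_topic = toy_lda.fit_transform(counts)

print(doc_topic.shape)        # (number of documents, n_components)
print(doc_topic.sum(axis=1))  # each row sums to 1
```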

lda.components_  # array of shape [n_components, n_features]; components_[i, j] is the strength of association between topic i and word j (an unnormalized weight, not a probability).

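n_components: the number of topics.
doc_topic_prior: parameter of the Dirichlet prior on the document-topic distribution $\theta$; defaults to 1/n_components.
topic_word_prior: parameter of the Dirichlet prior on the topic-word distribution $\beta$; defaults to 1/n_components.
learning_method: the algorithm used to fit LDA, either 'batch' or 'online'. 'batch' is the variational-inference EM algorithm covered in the theory post. 'online' is online variational-inference EM: it builds on 'batch' by splitting the training samples into mini-batches and updating the topic-word distribution one mini-batch at a time. The default is 'batch'. With a small dataset, use 'batch', which needs less tuning; with a large dataset, use 'online', which is faster.
learning_decay: only meaningful when the method is 'online'; values are best chosen in (0.5, 1.0].
learning_offset: only meaningful when the method is 'online'; should be greater than 1. It downweights the influence of the early mini-batches on the final model.
max_iter: the maximum number of EM iterations.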
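batch_size: only meaningful when the method is 'online'; the number of documents used in each EM iteration.
evaluate_every: how often perplexity is evaluated during fit; 0 or a negative value disables the evaluation.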
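
The parameters above can be compared side by side with a small experiment. The sketch below fits the same data with 'batch' and 'online' and then streams it through `partial_fit`. The count matrix is synthetic (random Poisson counts standing in for a real document-term matrix), and every parameter value is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic stand-in for a document-term matrix: 200 documents, 30 terms.
rng = np.random.RandomState(0)
X = rng.poisson(lam=1.0, size=(200, 30))

# Batch variational inference: every update uses all documents.
batch_lda = LatentDirichletAllocation(
    n_components=3, learning_method="batch", max_iter=10, random_state=0
).fit(X)

# Online variational inference: updates use mini-batches of batch_size documents.
online_lda = LatentDirichletAllocation(
    n_components=3,
    doc_topic_prior=0.1,      # explicit priors instead of the 1/n_components default
    topic_word_prior=0.01,
    learning_method="online",
    learning_decay=0.7,
    learning_offset=50.0,
    batch_size=64,
    max_iter=10,
    random_state=0,
).fit(X)

batch_perplexity = batch_lda.perplexity(X)
online_perplexity = online_lda.perplexity(X)
print("batch perplexity :", batch_perplexity)
print("online perplexity:", online_perplexity)

# partial_fit performs one online update per call, useful when data arrives in chunks.
stream_lda = LatentDirichletAllocation(
    n_components=3, learning_offset=50.0, random_state=0
)
for start in range(0, X.shape[0], 50):
    stream_lda.partial_fit(X[start:start + 50])
```

Lower perplexity suggests a better fit, but on random counts like these the numbers only show that the calls work, not which method is better.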

Official documentation for the code: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Theory post: https://zhuanlan.zhihu.com/p/31470216
