Pipeline for training LSA models using scikit-learn.

Instead of writing custom code for latent semantic analysis, you just need to:
- install pipeline:

  ```shell
  pip install latent-semantic-analysis
  ```

- run pipeline:
  - either in terminal:

    ```shell
    lsa-train --path_to_config config.yaml
    ```

  - or in python:

    ```python
    import latent_semantic_analysis

    latent_semantic_analysis.train(path_to_config="config.yaml")
    ```
NOTE: more about the config file below.
No data preparation is needed: all you need is a csv file with a raw text column (the column name is arbitrary).
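For example, a minimal `data/data.csv` (hypothetical sample rows, matching the default config below) might look like:

```csv
text
"The cat sat on the mat."
"Dogs are loyal animals."
"Cats and dogs are popular pets."
```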
The user interface consists of only one file:
- `config.yaml` - general configuration with sklearn TF-IDF and SVD parameters
Change `config.yaml` to create the desired configuration and train the LSA model with the following command:
- terminal:

  ```shell
  lsa-train --path_to_config config.yaml
  ```

- python:

  ```python
  import latent_semantic_analysis

  latent_semantic_analysis.train(path_to_config="config.yaml")
  ```
Default `config.yaml`:

```yaml
seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack
```
NOTE: the `tf-idf` and `svd` sections hold sklearn TfidfVectorizer and TruncatedSVD parameters respectively, so you can parameterize instances of these classes however you want.
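For intuition, here is a rough sketch of the equivalent sklearn pipeline built by hand from the default config above. This is illustrative, not the package's actual internals; in particular, wiring `seed` into `random_state` is an assumption:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# parameter values mirror the default config.yaml above
pipeline = Pipeline(
    steps=[
        ("tf-idf", TfidfVectorizer(lowercase=True, ngram_range=(1, 1), max_df=1.0, min_df=1)),
        # NOTE: 'arpack' requires n_components to be strictly less than the
        # number of tf-idf features; random_state mirrors the 'seed' key (assumption)
        ("svd", TruncatedSVD(n_components=10, algorithm="arpack", random_state=42)),
    ]
)

# read the csv as described in the 'data' section of the config
df = pd.read_csv("data/data.csv", sep=",")

# LSA document embeddings, shape: (n_documents, n_components)
embeddings = pipeline.fit_transform(df["text"])
```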
After training the model, the pipeline will save the following files to `path_to_save_folder`:
- `model.joblib` - sklearn pipeline with LSA (TF-IDF and SVD steps)
- `config.yaml` - config that was used to train the model
- `logging.txt` - logging file
- `doc2topic.json` - document embeddings
- `term2topic.json` - term embeddings
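Since `model.joblib` is a regular sklearn pipeline, you can load it with joblib and embed new documents. A minimal sketch (the exact path under `path_to_save_folder` is illustrative and may differ):

```python
import joblib

# illustrative path: model.joblib is saved under path_to_save_folder ("models" by default)
model = joblib.load("models/model.joblib")

# embed new documents with the fitted TF-IDF and SVD steps
embeddings = model.transform(["a new document to embed"])
print(embeddings.shape)  # (1, n_components)
```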
Requires Python >= 3.6.
If you use latent-semantic-analysis in a scientific publication, we would appreciate references to the following BibTeX entry:

```
@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}
```