Pipeline for training LSA models using scikit-learn.

Instead of writing custom code for latent semantic analysis, you just need to:
- install pipeline:

  ```shell
  pip install latent-semantic-analysis
  ```

- run pipeline:
  - either in terminal:

    ```shell
    lsa-train --path_to_config config.yaml
    ```

  - or in python:

    ```python
    import latent_semantic_analysis

    latent_semantic_analysis.train(path_to_config="config.yaml")
    ```
NOTE: more about the config file below.
No data preparation is needed: all you need is a csv file with a raw text column (the column name is arbitrary).
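For example, a minimal `data/data.csv` (hypothetical sample rows, matching the default config below) might look like:

```csv
text
"The cat sat on the mat."
"Dogs are loyal animals."
"Cats and dogs are popular pets."
```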
The user interface consists of only one file:
- `config.yaml` - general configuration with sklearn TF-IDF and SVD parameters
Change `config.yaml` to create the desired configuration and train the LSA model with the following command:
- terminal:

  ```shell
  lsa-train --path_to_config config.yaml
  ```

- python:

  ```python
  import latent_semantic_analysis

  latent_semantic_analysis.train(path_to_config="config.yaml")
  ```
Default `config.yaml`:

```yaml
seed: 42
path_to_save_folder: models

# data
data:
  data_path: data/data.csv
  sep: ','
  text_column: text

# tf-idf
tf-idf:
  lowercase: true
  ngram_range: (1, 1)
  max_df: 1.0
  min_df: 1

# svd
svd:
  n_components: 10
  algorithm: arpack
```
NOTE: the `tf-idf` and `svd` sections hold sklearn TfidfVectorizer and TruncatedSVD parameters respectively, so you can parameterize instances of these classes however you want.
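For intuition, here is a rough sketch of the equivalent sklearn pipeline built by hand from the default config above. This is illustrative, not the package's actual internals; in particular, wiring `seed` into `random_state` is an assumption:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# parameter values mirror the default config.yaml above
pipeline = Pipeline(
    steps=[
        ("tf-idf", TfidfVectorizer(lowercase=True, ngram_range=(1, 1), max_df=1.0, min_df=1)),
        # NOTE: 'arpack' requires n_components to be strictly less than the
        # number of tf-idf features; random_state mirrors the 'seed' key (assumption)
        ("svd", TruncatedSVD(n_components=10, algorithm="arpack", random_state=42)),
    ]
)

# read the csv as described in the 'data' section of the config
df = pd.read_csv("data/data.csv", sep=",")

# LSA document embeddings, shape: (n_documents, n_components)
embeddings = pipeline.fit_transform(df["text"])
```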
After training the model, the pipeline will save the following files to `path_to_save_folder`:
- `model.joblib` - sklearn pipeline with LSA (TF-IDF and SVD steps)
- `config.yaml` - config that was used to train the model
- `logging.txt` - logging file
- `doc2topic.json` - document embeddings
- `term2topic.json` - term embeddings
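Since `model.joblib` is a regular sklearn pipeline, you can load it with joblib and embed new documents. A minimal sketch (the exact path under `path_to_save_folder` is illustrative and may differ):

```python
import joblib

# illustrative path: model.joblib is saved under path_to_save_folder ("models" by default)
model = joblib.load("models/model.joblib")

# embed new documents with the fitted TF-IDF and SVD steps
embeddings = model.transform(["a new document to embed"])
print(embeddings.shape)  # (1, n_components)
```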
Requires Python >= 3.6.
If you use latent-semantic-analysis in a scientific publication, we would appreciate references to the following BibTeX entry:

```
@misc{dayyass2021lsa,
    author       = {El-Ayyass, Dani},
    title        = {Pipeline for training LSA models},
    howpublished = {\url{https://github.com/dayyass/latent-semantic-analysis}},
    year         = {2021}
}
```