MeoMaya Documentation

1. Introduction

MeoMaya is a lightweight, high-performance Natural Language Processing (NLP) framework built entirely in Python. It is designed to be simple, modular, and efficient, making it an ideal choice for developers, researchers, and students who need a powerful NLP toolkit without the overhead of larger, more complex libraries.

The framework provides a complete text processing pipeline, including normalization, tokenization, part-of-speech (POS) tagging, and parsing. Additionally, it features a pure-Python machine learning stack with a TF-IDF vectorizer and a centroid-based classifier, allowing for straightforward implementation of text classification and analysis tasks.

Philosophy: MeoMaya is built on the principle of providing core NLP functionalities in a clear and accessible manner. It prioritizes speed and low resource consumption, making it suitable for a wide range of applications, from web backends to embedded systems.

2. Installation

Prerequisites

  • Python 3.11 or higher
  • pip package manager

Steps

1. Clone the repository:

git clone https://github.com/KashyapSinh-Gohil/meomaya.git
cd meomaya

2. Install core dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r meomaya/requirements.txt

3. Optional Dependencies:

For Indian language support, you need indic-nlp-library:

pip install indic-nlp-library

For API and development tools, install the development dependencies:

pip install -r requirements-dev.txt

3. Quick Start

Using the Modelify Class

The Modelify class is the simplest entry point to MeoMaya's NLP pipeline. It encapsulates all the core components and runs them in the correct order.

from meomaya.core.modelify import Modelify

# Initialize the model for text processing
m = Modelify(mode="text")

# Run the pipeline on your text
result = m.run("MeoMaya makes NLP easy and fun!")

# The result is a dictionary containing the processed text
import json
print(json.dumps(result, indent=2))

Using the CLI

For quick tasks, you can use MeoMaya directly from your terminal. This is useful for testing or integrating with shell scripts.

python -m meomaya "MeoMaya is great for command-line use." --mode text

4. Core Components

MeoMaya's core functionality is built around a pipeline of four main components. You can use them individually or together to build custom NLP workflows.

Normalizer

The Normalizer cleans and standardizes text. Its primary function is to convert text to lowercase, but it can be extended to handle other normalization tasks like removing punctuation or expanding contractions.

from meomaya.core.normalizer import Normalizer

normalizer = Normalizer(lang="en")
normalized_text = normalizer.normalize("This is an EXAMPLE sentence with some UPPERCASE words.")
# Output: "this is an example sentence with some uppercase words."

Tokenizer

The Tokenizer breaks down a string of text into a list of individual tokens. This is a fundamental step in most NLP tasks. MeoMaya's tokenizer is designed to handle various languages and can be customized.

from meomaya.core.tokenizer import Tokenizer

tokenizer = Tokenizer(lang="en")
tokens = tokenizer.tokenize("Hello, world! This is MeoMaya.")
# Output: ['Hello', ',', 'world', '!', 'This', 'is', 'MeoMaya', '.']

Tagger

The Tagger, or Part-of-Speech (POS) Tagger, assigns a grammatical category (like noun, verb, adjective, etc.) to each token. This provides valuable semantic information for further analysis.

from meomaya.core.tagger import Tagger

tagger = Tagger(lang="en")
tagged_tokens = tagger.tag(['MeoMaya', 'is', 'a', 'powerful', 'tool'])
# Output: [('MeoMaya', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('tool', 'NN')]

Parser

The Parser analyzes the grammatical structure of a sentence and creates a dependency parse tree. This reveals the relationships between words in the sentence. Currently, the parser in MeoMaya is a placeholder for a more complex implementation, but it demonstrates the structure of the pipeline.

from meomaya.core.parser import Parser

parser = Parser(lang="en")
parse_tree = parser.parse([('MeoMaya', 'NNP'), ('is', 'VBZ'), ('cool', 'JJ')])
# Output: {'tree': [('MeoMaya', 'NNP'), ('is', 'VBZ'), ('cool', 'JJ')]}

5. Machine Learning Utilities

MeoMaya includes a set of pure-Python machine learning tools for common NLP tasks like text classification.

Vectorizer

The Vectorizer converts text documents into numerical representations. It uses the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, which reflects how important a word is to a document in a collection or corpus.

from meomaya.ml.vectorizer import Vectorizer

texts = [
    "MeoMaya is a great NLP framework.",
    "I enjoy using Python for NLP.",
    "This framework is fast and efficient."
]
vectorizer = Vectorizer()
X = vectorizer.fit_transform(texts)

# X is a list of TF-IDF vectors, representing the input texts
for i, doc in enumerate(texts):
    print(f"Document {i+1}: {X[i]}")
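
To make the weighting concrete, here is a minimal, illustrative TF-IDF computation in plain Python. This is a sketch of the general algorithm, not MeoMaya's exact implementation; the library's weighting and normalization details may differ.

import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by its in-document frequency, scaled down by how
    many documents contain it (ubiquitous terms score near zero)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

docs = [t.lower().split() for t in [
    "MeoMaya is fast",
    "MeoMaya is simple",
    "Python is fun",
]]
print(tfidf_vectors(docs)[0])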

Classifier

The Classifier is a centroid-based classifier that uses cosine similarity to determine the category of a given text. It's a simple yet effective algorithm that works well for many text classification problems. For a more detailed example, including model training, saving, and prediction, please refer to the Advanced Sentiment Demo section.

from meomaya.ml.classifier import Classifier
from meomaya.ml.vectorizer import Vectorizer

# Train on a small labelled corpus
texts = [
    "This is a great product!",
    "Terrible experience, would not recommend.",
]
labels = ["positive", "negative"]

vectorizer = Vectorizer()
X = vectorizer.fit_transform(texts)

classifier = Classifier()
classifier.train(X, labels)

# Classify a new document with the already-fitted vectorizer
X_new = vectorizer.transform(["This is a great product!"])
prediction = classifier.classify(X_new)
print(prediction)  # e.g. ['positive']
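
For intuition about how such a classifier works, here is an illustrative sketch of the centroid-plus-cosine-similarity idea, not MeoMaya's internal code: each class is summarized by the average of its training vectors, and a new vector is assigned to the class whose centroid it most resembles.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# One centroid per class, built from toy training vectors
centroids = {
    "positive": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "negative": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
new_vector = [0.9, 0.1]
label = max(centroids, key=lambda c: cosine(new_vector, centroids[c]))
print(label)  # positive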

6. REST API

Run the local FastAPI server (no third-party services required):

uvicorn meomaya.api.server:app --host 0.0.0.0 --port 8000

Call the API:

curl -X POST http://localhost:8000/run -H 'Content-Type: application/json' \
-d '{"input": "Hello from MeoMaya!", "mode": "text"}'

Batch endpoint:

curl -X POST http://localhost:8000/run/batch -H 'Content-Type: application/json' \
-d '{"inputs": ["hi", "there"], "mode": "text"}'
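
The endpoints can also be called from Python. A minimal sketch using the requests library, assuming the server above is running on localhost:8000:

import requests

resp = requests.post(
    "http://localhost:8000/run",
    json={"input": "Hello from MeoMaya!", "mode": "text"},
)
resp.raise_for_status()
print(resp.json())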

7. Hardware Selection

MeoMaya detects the available hardware (CPU, CUDA, or MPS) automatically and does not require torch by default.

from meomaya.core.hardware import select_device
print(select_device())  # 'cpu', 'cuda', or 'mps'

Override via environment variable:

export MEOMAYA_DEVICE=cpu
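
The same override can be applied from Python before the device is queried; this sketch assumes select_device reads MEOMAYA_DEVICE at call time:

import os

os.environ["MEOMAYA_DEVICE"] = "cpu"  # force CPU regardless of detected hardware

from meomaya.core.hardware import select_device
print(select_device())  # 'cpu'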

8. Advanced Sentiment Demo

The full_nlp_workflow_demo.py script located in the meomaya/examples/ directory demonstrates the MeoMaya NLP workflow end-to-end, including vectorization and classification.

Training the Model

Before you can predict sentiment, you must train the model. The following command trains on a sample dataset and saves the fitted vectorizer and classifier to the sentiment_model/ directory.

python meomaya/examples/full_nlp_workflow_demo.py

Note: Run the command from the repository root (or add the repository to PYTHONPATH) so the script can locate the MeoMaya modules. You only need to train the model once.

Batch processing example

The demo also shows how to process multiple inputs using TextPipeline.process_batch.
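
As a rough sketch of what that could look like, the snippet below assumes TextPipeline lives at meomaya.core.pipeline and that process_batch accepts a list of strings; check the demo script for the actual import path and signature.

from meomaya.core.pipeline import TextPipeline  # import path assumed; see the demo script

pipeline = TextPipeline(lang="en")  # constructor arguments assumed
results = pipeline.process_batch([
    "MeoMaya handles batches too.",
    "Each input runs through the same pipeline.",
])
for result in results:
    print(result)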

9. Command-Line Interface (CLI)

MeoMaya's CLI provides a convenient way to access its features without writing any Python code.

Basic Usage

Use the module entry point:

python -m meomaya "Your text here" --mode text

Options

  • --mode: Override auto-detected mode: text, audio, image, video, 3d, fusion.
  • --model: Model name (placeholder; reserved for future model selection).

10. Advanced Usage

Building a Custom Pipeline

One of MeoMaya's strengths is its modularity. You can easily create your own custom NLP pipelines by combining the core components in different ways.

from meomaya.core.normalizer import Normalizer
from meomaya.core.tokenizer import Tokenizer
from meomaya.core.tagger import Tagger
from meomaya.core.parser import Parser

def custom_pipeline(text: str, lang: str = "en"):
    """A custom NLP pipeline that normalizes, tokenizes, tags, and parses text."""
    normalizer = Normalizer(lang)
    tokenizer = Tokenizer(lang)
    tagger = Tagger(lang)
    parser = Parser(lang)

    normalized_text = normalizer.normalize(text)
    tokens = tokenizer.tokenize(normalized_text)
    tagged_tokens = tagger.tag(tokens)
    parse_tree = parser.parse(tagged_tokens)

    return {
        'normalized': normalized_text,
        'tokens': tokens,
        'tagged': tagged_tokens,
        'parse': parse_tree,
    }

# Run the custom pipeline
result = custom_pipeline("This is a demonstration of a custom pipeline.")
print(result)

11. API Reference

Core Module

  • Normalizer(lang: str = "en")
    • normalize(text: str) -> str
  • Tokenizer(lang: str = "en")
    • tokenize(text: str) -> list[str]
  • Tagger(lang: str = "en")
    • tag(tokens: list[str]) -> list[tuple[str, str]]
  • Parser(lang: str = "en")
    • parse(tagged_tokens: list[tuple[str, str]]) -> dict

ML Module

  • Vectorizer()
    • fit(documents: list[str])
    • transform(documents: list[str]) -> list[list[float]]
    • fit_transform(documents: list[str]) -> list[list[float]]
  • Classifier()
    • train(X: list[list[float]], y: list[str])
    • classify(X: list[list[float]]) -> list[str]

12. Troubleshooting

Common Issues

  1. ImportError for indic_nlp_library: If you are working with Indian languages and get an import error, make sure you have installed the optional dependency: pip install indic-nlp-library.
  2. Incorrect Path for Corpus: When using the CLI with a corpus file, ensure you provide the correct path to the file.
  3. Performance: For very large datasets, process the data in batches to manage memory usage; see the sketch below.
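
A generic chunking pattern for the batching tip above (illustrative plain Python, not a MeoMaya API):

def batched(items, size):
    """Yield successive fixed-size chunks of a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

corpus = [f"document number {i}" for i in range(10_000)]
for chunk in batched(corpus, 1_000):
    print(len(chunk))  # process each chunk here, e.g. vectorizer.transform(chunk)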

Getting Help

If you're stuck, you can:

  • Review the test files in the tests/ directory for more usage examples.
  • Open an issue on the GitHub issue tracker.