Output template that formats for loading into a vector database

JDRay · April 25, 2023, 5:50pm

Hot new topics in the world today are Artificial Intelligence and Large Language Models (LLMs), with the latter being less well understood but key to the development of the former.

Often used alongside LLMs are what are called vector databases, which store numeric representations of text so that number-crunching computers can analyze and compare text for various attributes. Various methods of “tokenization” exist for preparing data for ingestion into vector databases, but in gross terms, the basic approach is to break text into database-consumable chunks that can be vectorized and later used.

Terms used for such text representations include “Corpus”, a collection of works; “Document”, a container of text on a particular subject (e.g. a book or a product review); “Feature”, which can be individual words or sets of words that, together, have meaning; and “named entities”, so that things like places and people (e.g. “First Street” or “Bob Smith”) have particular meaning.

Analysis of text is a complex process, but it can be aided by preparing text for consumption by an analysis application. I’m very new to this subject, but it appears that most datasets are represented in JSON format, with the simplest representation being an array of text blocks, each under 4096 characters in length.
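
To make that simplest representation concrete, here is a minimal sketch in Python of what I mean by an array of blocks under 4096 characters. The function name `chunk_paragraphs` and the choice to split on blank lines are my own assumptions for illustration, not something any particular tool requires.

```python
import json

MAX_CHARS = 4096  # the block limit mentioned above; an assumption, not a standard


def chunk_paragraphs(text, max_chars=MAX_CHARS):
    """Group blank-line-separated paragraphs into blocks shorter than max_chars."""
    blocks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single paragraph that is too long on its own gets hard-split.
        while len(para) >= max_chars:
            if current:
                blocks.append(current)
                current = ""
            blocks.append(para[:max_chars - 1])
            para = para[max_chars - 1:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) < max_chars:
            current = candidate
        else:
            blocks.append(current)
            current = para
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    # Serialize the blocks as a plain JSON array of strings.
    print(json.dumps(chunk_paragraphs(sample), indent=2))
```

Splitting on paragraph boundaries rather than at a fixed character count keeps sentences intact wherever possible, which seems preferable for prose.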

Given that fiction, for example, is best represented in sentences, paragraphs, scenes, chapters, acts, and volumes, I would like to see an output template that uses already-established Scrivener structures to generate a JSON representation of a work that can then be analyzed by one of these tools. I’ll take a shot at this myself soon, but I am not sufficiently adept with Scrivener to do a good job of it.
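
As a rough picture of the kind of output I have in mind, here is a small Python sketch that nests paragraphs inside scenes inside chapters. The field names (`title`, `chapters`, `scenes`, `paragraphs`) and the `build_work` helper are hypothetical; this only illustrates the target shape and says nothing about how Scrivener itself would produce it.

```python
import json


def build_work(title, chapters):
    """Build a nested dict from a list of (chapter_title, [scene_text, ...]) pairs.

    Paragraphs are assumed to be separated by blank lines within each scene.
    """
    return {
        "title": title,
        "chapters": [
            {
                "title": chapter_title,
                "scenes": [
                    {
                        "paragraphs": [
                            p.strip() for p in scene.split("\n\n") if p.strip()
                        ]
                    }
                    for scene in scenes
                ],
            }
            for chapter_title, scenes in chapters
        ],
    }


if __name__ == "__main__":
    work = build_work(
        "Example Novel",
        [("Chapter One", ["It was night.\n\nThe rain fell.", "Morning came."])],
    )
    print(json.dumps(work, indent=2))
```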

Can this be achieved by an end user with the tools available in the template construction feature?

Thank you.

J.D.

kewms · April 25, 2023, 6:08pm

You might find this thread relevant: