
HTMLReader

Defined in: packages/readers/src/html.ts:11

Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the tag's inner text. All other tags are removed, with their inner text kept intact. HTML entities (e.g., &amp;) are not decoded.
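The rules above can be approximated in a few lines of plain TypeScript. This is only a rough, regex-based sketch and not how HTMLReader works internally (the reader delegates to string-strip-html). The function name stripHtmlSketch and the "text followed by URL" layout for links are assumptions made for illustration.

```typescript
// Illustrative only: a regex-based approximation of the rules described above.
// It is NOT the HTMLReader implementation, which delegates to string-strip-html.
function stripHtmlSketch(html: string): string {
  return html
    // 1. Drop head, script, style, and xml elements together with their contents.
    .replace(/<(head|script|style|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // 2. Keep link text and append the href (this output layout is an assumption).
    .replace(
      /<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a\s*>/gi,
      (_match: string, href: string, text: string) => `${text} ${href}`,
    )
    // 3. Remove every remaining tag but keep its inner text.
    .replace(/<[^>]+>/g, " ")
    // 4. Collapse whitespace. Entities such as &amp; are deliberately left as-is.
    .replace(/\s+/g, " ")
    .trim();
}

const sampleHtml =
  "<html><head><title>T</title></head><body><script>var x = 1;</script>" +
  '<p>Tom &amp; Jerry</p><a href="https://example.com">site</a></body></html>';

console.log(stripHtmlSketch(sampleHtml));
```

Regular expressions are not a robust way to parse real-world HTML; the sketch is meant only to make the four rules concrete.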

new HTMLReader(): HTMLReader

Returns: HTMLReader

FileReader.constructor

static addMetaData(filePath): (doc, index) => void

Defined in: packages/core/src/schema/type.ts:94

Parameters: filePath (string)

Returns: a callback, (doc, index): void, where doc is a BaseNode and index is a number.

FileReader.addMetaData
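The signature describes a function factory: it takes a file path and returns a (doc, index) callback. A minimal sketch of that shape follows. NodeSketch, addMetaDataSketch, and the metadata keys it writes are all hypothetical stand-ins, since this page does not state what the real callback stores.

```typescript
// Minimal stand-in for BaseNode; only a metadata bag is modeled.
interface NodeSketch {
  metadata: Record<string, unknown>;
}

// Hypothetical factory with the same shape as addMetaData:
// it closes over filePath and returns a (doc, index) callback.
function addMetaDataSketch(
  filePath: string,
): (doc: NodeSketch, index: number) => void {
  return (doc, index) => {
    // The keys written here are assumptions, not the library's documented keys.
    doc.metadata["file_path"] = filePath;
    doc.metadata["position"] = index;
  };
}

const sketchNodes: NodeSketch[] = [{ metadata: {} }, { metadata: {} }];
// The (doc, index) shape lines up with the callback Array.prototype.forEach expects.
sketchNodes.forEach(addMetaDataSketch("pages/index.html"));
console.log(sketchNodes);
```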


loadData(filePath): Promise<Document<Metadata>[]>

Defined in: packages/core/src/schema/type.ts:65

Parameters: filePath (string)

Returns: Promise<Document<Metadata>[]>

FileReader.loadData


loadDataAsContent(fileContent): Promise<Document<Metadata>[]>

Defined in: packages/readers/src/html.ts:18

Public method for this reader. Required by the BaseReader interface.

Parameters: fileContent (Uint8Array). The content of the file.

Returns: Promise<Document<Metadata>[]>. A Promise that resolves to an array holding zero or one Document parsed from the supplied HTML content.

FileReader.loadDataAsContent
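Because this method takes raw bytes rather than a path, HTML that is already in memory has to be encoded first. The sketch below shows one way to do that. DocLike, ContentReader, loadHtmlString, and echoReader are hypothetical names introduced here; echoReader is a stand-in so the example runs without the library, and it performs no HTML stripping.

```typescript
// Minimal structural types, assumed for illustration only.
interface DocLike {
  text: string;
}
interface ContentReader {
  loadDataAsContent(fileContent: Uint8Array): Promise<DocLike[]>;
}

// Hypothetical helper: feed an in-memory HTML string to a content reader.
async function loadHtmlString(
  reader: ContentReader,
  html: string,
): Promise<DocLike[]> {
  const bytes = new TextEncoder().encode(html); // string -> UTF-8 Uint8Array
  return reader.loadDataAsContent(bytes);
}

// Stand-in reader so the sketch runs without the library installed.
// It only decodes the bytes; it does not strip any HTML.
const echoReader: ContentReader = {
  async loadDataAsContent(fileContent: Uint8Array): Promise<DocLike[]> {
    const text = new TextDecoder("utf-8").decode(fileContent);
    return text.length > 0 ? [{ text }] : [];
  },
};

loadHtmlString(echoReader, "<p>Hello</p>").then((docs) => {
  console.log(docs.length);
});
```

Any object with a matching loadDataAsContent method, presumably including an HTMLReader instance, could be passed in place of echoReader.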


parseContent(html, options): Promise<string>

Defined in: packages/readers/src/html.ts:33

Wrapper around the string-strip-html library.

Parameters:

html (string): Raw HTML content to be parsed.

options (Partial<Opts>, default {}): An object of options for the underlying library.

Returns: Promise<string>. The HTML content, stripped of unwanted tags and attributes.

See: getOptions


getOptions(): Partial<Opts>

Defined in: packages/readers/src/html.ts:46

Returns the configuration options that this reader passes to the string-strip-html library.

Returns: Partial<Opts>. An object of options for the underlying library.

See: https://codsen.com/os/string-strip-html/examples
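Since getOptions is a public method that supplies the configuration, one plausible way to customize the stripping is to override it in a subclass. The sketch below models only that hook with local stand-ins. OptsSketch, ReaderSketch, and NoChromeReaderSketch are hypothetical; the two option names come from string-strip-html, and the default values shown are inferred from the class description, not confirmed by this page.

```typescript
// Local stand-in for the options type; the two field names follow
// string-strip-html's options, and all other fields are omitted.
interface OptsSketch {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
}

// Stand-in for the reader: only the getOptions hook is modeled here.
class ReaderSketch {
  getOptions(): Partial<OptsSketch> {
    // Assumed defaults, inferred from the class description.
    return {
      skipHtmlDecoding: true,
      stripTogetherWithTheirContents: ["head", "script", "style", "xml"],
    };
  }
}

// Hypothetical subclass that also drops nav and footer elements with their contents.
class NoChromeReaderSketch extends ReaderSketch {
  getOptions(): Partial<OptsSketch> {
    const base = super.getOptions();
    return {
      ...base,
      stripTogetherWithTheirContents: [
        ...(base.stripTogetherWithTheirContents ?? []),
        "nav",
        "footer",
      ],
    };
  }
}

const sketchOptions = new NoChromeReaderSketch().getOptions();
console.log(sketchOptions.stripTogetherWithTheirContents);
```

Spreading the base options first keeps any inherited settings and changes only the list being extended.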