WIKINDX API trunk

FILETOTEXT

Convert files of various types to text ready for searching.

Table of Contents

Methods

__construct()  : mixed
convertToText()  : string
Convert files of various types to text ready for searching and return it
readAbiWord()  : string
Extract the text content of AbiWord files (ABW, AWT, ZABW)
readDjVu()  : string
Extract the text content of DjVu files (DJV, DJVU) with djvutxt utility
readDVI()  : string
Extract the text content of DeVice Independent files (DVI) with catdvi utility
readFictionBook()  : string
Extract the text content of FictionBook ebooks (FB1, FB2)
readPostScript()  : string
Extract the text content of PostScript files (PS, EPS) with ps2pdf utility
readScribus()  : string
Extract the text content of Scribus files (SLA)

Methods

convertToText()

Convert files of various types to text ready for searching and return it

public convertToText(string $filepath[, string $mimetype = "text/plain" ]) : string

This function dispatches the conversion to functions specialized by mime-type.

Dispatching is based on both the mime-type AND the file extension, so you MUST pass a file with an appropriate extension.
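
The dispatch rule described above can be pictured with a small lookup, given here as a minimal sketch and not as the WIKINDX implementation. The function name pickReader() is hypothetical, and the mime-type strings in the table are assumptions chosen for illustration; only the extensions and the reader method names come from this page.

```php
<?php
/**
 * Hypothetical helper: choose a reader method name from a mime-type AND a file extension.
 * Returns null when the pair does not match an entry of the table.
 */
function pickReader(string $filepath, string $mimetype = 'text/plain'): ?string
{
    // Assumed mime-types (illustration only); extensions and method names are from this page.
    static $readers = [
        'application/x-abiword'         => ['exts' => ['abw', 'awt', 'zabw'], 'reader' => 'readAbiWord'],
        'image/vnd.djvu'                => ['exts' => ['djv', 'djvu'],        'reader' => 'readDjVu'],
        'application/x-dvi'             => ['exts' => ['dvi'],                'reader' => 'readDVI'],
        'application/x-fictionbook+xml' => ['exts' => ['fb1', 'fb2'],         'reader' => 'readFictionBook'],
        'application/postscript'        => ['exts' => ['ps', 'eps'],          'reader' => 'readPostScript'],
        'application/vnd.scribus'       => ['exts' => ['sla'],                'reader' => 'readScribus'],
    ];

    // Compare extensions case-insensitively.
    $ext = strtolower(pathinfo($filepath, PATHINFO_EXTENSION));
    $entry = $readers[$mimetype] ?? null;

    // Both the mime-type and the extension must agree.
    if ($entry !== null && in_array($ext, $entry['exts'], true)) {
        return $entry['reader'];
    }

    return null;
}
```

In this sketch a file with the right mime-type but a wrong extension is not dispatched, which is why the extension matters.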

Parameters
$filepath : string

An absolute or relative file path

$mimetype : string = "text/plain"

A mime-type. Default is "text/plain"

Return values
string

Text extracted

readAbiWord()

Extract the text content of AbiWord files (ABW, AWT, ZABW)

public readAbiWord(string $filepath) : string

This XML format is not documented, but the text appears to always be enclosed in "p" elements.
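
As a minimal sketch of that observation, and not the WIKINDX implementation, the text of every "p" element can be collected with PHP's DOM extension. The helper name extractParagraphText() is hypothetical, it works on an XML string that is already loaded, and it assumes the DOM extension is available.

```php
<?php
/**
 * Hypothetical helper: return the text of all "p" elements, one paragraph per line.
 */
function extractParagraphText(string $xml): string
{
    // DOMDocument::loadXML() rejects an empty source, so handle it up front.
    if ($xml === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Suppress parser warnings; malformed input yields an empty result.
    if (!@$doc->loadXML($xml)) {
        return '';
    }

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent also includes the text of nested inline elements.
        $text = trim($p->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }

    return implode("\n", $paragraphs);
}
```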

Parameters
$filepath : string

An absolute or relative file path

Tags
see
http://www.abisource.com/wiki/AbiWord

AbiWord Format

Return values
string

Text extracted

readDjVu()

Extract the text content of DjVu files (DJV, DJVU) with djvutxt utility

public readDjVu(string $filepath) : string

This format is used for archiving and contains text only if OCR has been applied.

The djvutxt utility is included in the DjVuLibre toolbox.
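
Calling an external utility such as djvutxt from PHP could look like the following minimal sketch. It is not the WIKINDX implementation: the helper names buildExtractorCommand() and runExtractor() are hypothetical, the stderr redirection assumes a Unix-like shell, and the utility is assumed to be on the PATH and to print the extracted text to standard output when it is given only an input file.

```php
<?php
/**
 * Hypothetical helper: build a shell command with both parts safely quoted.
 */
function buildExtractorCommand(string $utility, string $filepath): string
{
    return escapeshellarg($utility) . ' ' . escapeshellarg($filepath);
}

/**
 * Hypothetical helper: run the utility and return what it printed,
 * or an empty string if the file is missing or nothing was produced.
 */
function runExtractor(string $utility, string $filepath): string
{
    if (!is_file($filepath)) {
        return '';
    }

    // Discard error messages (Unix-like shells only).
    $output = shell_exec(buildExtractorCommand($utility, $filepath) . ' 2>/dev/null');

    // shell_exec() returns null or false when there is no output or on failure.
    return is_string($output) ? $output : '';
}
```

A call such as runExtractor('djvutxt', $filepath) would then return the text layer if the file has one.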

Parameters
$filepath : string

An absolute or relative file path

Tags
see
http://djvu.sourceforge.net/doc/man/djvutxt.html

djvutxt manpage

http://djvu.sourceforge.net

DjVuLibre website

Return values
string

Text extracted

readDVI()

Extract the text content of DeVice Independent files (DVI) with catdvi utility

public readDVI(string $filepath) : string

This format is a byproduct of a TeX compilation.

The catdvi utility is included in most TeX distributions, such as TeX Live.

Parameters
$filepath : string

An absolute or relative file path

Tags
see
http://catdvi.sourceforge.net/

catdvi website

https://tug.org/texlive/

TeX Live website

Return values
string

Text extracted

readFictionBook()

Extract the text content of FictionBook ebooks (FB1, FB2)

public readFictionBook(string $filepath) : string

Russian XML ebook format.

Versions supported:

  • v1 (no documentation found, but it should work)
  • v2

Parameters
$filepath : string

An absolute or relative file path

Tags
see
http://www.gribuser.ru/xml/fictionbook/index.html.en

FictionBook 2.0 Specification

Return values
string

Text extracted

readPostScript()

Extract the text content of PostScript files (PS, EPS) with ps2pdf utility

public readPostScript(string $filepath) : string

This Adobe format is a scripted document that needs Ghostscript to be interpreted.

The ps2pdf utility is included in Ghostscript.

Parameters
$filepath : string

An absolute or relative file path

Tags
see
http://web.mit.edu/ghostscript/www/Ps2pdf.htm

ps2pdf manpage

https://www.ghostscript.com/

Ghostscript website

Return values
string

Text extracted

readScribus()

Extract the text content of Scribus files (SLA)

public readScribus(string $filepath) : string

This XML format is not documented, but the text appears to always be enclosed in the "CH" attribute of "ITEXT" elements.
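
The following minimal sketch illustrates that observation with PHP's DOM extension. It is not the WIKINDX implementation: the helper name extractItextRuns() is hypothetical, and joining the text runs with a single space is a simplification chosen for this example.

```php
<?php
/**
 * Hypothetical helper: collect the "CH" attribute of every "ITEXT" element.
 */
function extractItextRuns(string $xml): string
{
    // DOMDocument::loadXML() rejects an empty source, so handle it up front.
    if ($xml === '') {
        return '';
    }

    $doc = new DOMDocument();
    // Suppress parser warnings; malformed input yields an empty result.
    if (!@$doc->loadXML($xml)) {
        return '';
    }

    $runs = [];
    foreach ($doc->getElementsByTagName('ITEXT') as $itext) {
        // Skip elements without a CH attribute instead of adding empty runs.
        if ($itext->hasAttribute('CH')) {
            $runs[] = $itext->getAttribute('CH');
        }
    }

    // Simplification: separate runs with a space.
    return implode(' ', $runs);
}
```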

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://wiki.scribus.net/canvas/(FR)_Introdution_au_Format_de_fichier_SLA_pour_Scribus_1.4

Scribus File Format

https://github.com/scribusproject/scribus/tree/master/resources/tests

Scribus file samples

Return values
string

Text extracted


        