FILETOTEXT
Convert files of various types to text ready for searching.
Table of Contents
Methods
- __construct() : mixed
- convertToText() : string
- Convert files of various types to text ready for searching and return it
- readAbiWord() : string
- Extract the text content of AbiWord files (ABW, AWT, ZABW)
- readDjVu() : string
- Extract the text content of DjVu files (DJV, DJVU) with the djvutxt utility
- readDVI() : string
- Extract the text content of DeVice Independent files (DVI) with the catdvi utility
- readFictionBook() : string
- Extract the text content of FictionBook ebooks (FB1, FB2)
- readPostScript() : string
- Extract the text content of PostScript files (PS, EPS) with the ps2pdf utility
- readScribus() : string
- Extract the text content of Scribus files (SLA)
Methods
__construct()
public
__construct() : mixed
convertToText()
Convert files of various types to text ready for searching and return it
public
convertToText(string $filepath[, string $mimetype = "text/plain" ]) : string
This function dispatches the conversion to functions specialized by mime-type.
Dispatching takes both the mime-type AND the file extension into account, so you MUST pass a file with an appropriate extension.
Parameters
- $filepath : string
  An absolute or relative file path
- $mimetype : string = "text/plain"
  A mime-type. Default is "text/plain"
Return values
string —Text extracted
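The dispatch rule described above can be sketched as a lookup that only selects a reader when both the mime-type and the extension agree. This is a minimal sketch, not the class's actual code: the `READERS` table, the `pickReader()` helper and the mime-type strings are all assumptions made for illustration, and only the method names and extensions come from this page.

```php
<?php
// Hypothetical dispatch table: the mime-types below are illustrative
// assumptions, not the class's real lookup table.
const READERS = [
    'readAbiWord'     => ['application/x-abiword', ['abw', 'awt', 'zabw']],
    'readDjVu'        => ['image/vnd.djvu', ['djv', 'djvu']],
    'readDVI'         => ['application/x-dvi', ['dvi']],
    'readFictionBook' => ['application/x-fictionbook+xml', ['fb1', 'fb2']],
    'readPostScript'  => ['application/postscript', ['ps', 'eps']],
    'readScribus'     => ['application/vnd.scribus', ['sla']],
];

// Return the name of the reader method to call, or null when the
// mime-type and the file extension do not both match an entry.
function pickReader(string $filepath, string $mimetype = 'text/plain'): ?string
{
    $extension = strtolower(pathinfo($filepath, PATHINFO_EXTENSION));
    foreach (READERS as $method => [$expectedMime, $extensions]) {
        if ($mimetype === $expectedMime && in_array($extension, $extensions, true)) {
            return $method;
        }
    }
    return null;
}
```

Under this reading, a FictionBook file renamed to `book.txt` would not reach `readFictionBook()` even with the right mime-type, which is why the extension requirement matters.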
readAbiWord()
Extract the text content of AbiWord files (ABW, AWT, ZABW)
public
readAbiWord(string $filepath) : string
This XML format is not documented but it seems the text is always enclosed inside "p" elements.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
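Collecting the text of every "p" element, as described above, might look like the following. This is a sketch under assumptions: `extractAbiWordParagraphs()` is a hypothetical helper, it takes the XML as a string rather than a file path, and it does not handle the compressed ZABW variant.

```php
<?php
// Hypothetical helper: gather the text of all <p> elements of an
// AbiWord XML document, one paragraph per line.
function extractAbiWordParagraphs(string $xml): string
{
    $document = new DOMDocument();
    // Suppress parser warnings; an unparsable document yields no text.
    if (!@$document->loadXML($xml)) {
        return '';
    }
    $paragraphs = [];
    foreach ($document->getElementsByTagName('p') as $paragraph) {
        // textContent flattens nested inline elements into plain text.
        $text = trim($paragraph->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n", $paragraphs);
}
```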
readDjVu()
Extract the text content of DjVu files (DJV, DJVU) with the djvutxt utility
public
readDjVu(string $filepath) : string
This format is used for archiving and only contains text if OCR has been applied to the document.
The djvutxt utility is part of the DjVuLibre toolkit.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
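Delegating to an external tool such as djvutxt could be sketched as below. The helper names `buildExtractionCommand()` and `runDjvutxt()` are hypothetical, and the sketch assumes `djvutxt` is on the PATH and prints the extracted text to standard output when given only an input file.

```php
<?php
// Hypothetical helper: build a shell command with the file path safely quoted.
function buildExtractionCommand(string $tool, string $filepath): string
{
    return escapeshellcmd($tool) . ' ' . escapeshellarg($filepath);
}

// Hypothetical wrapper: run djvutxt and return whatever it prints,
// or an empty string when the command fails or prints nothing.
function runDjvutxt(string $filepath): string
{
    $output = shell_exec(buildExtractionCommand('djvutxt', $filepath));
    return is_string($output) ? $output : '';
}
```

Quoting the path with `escapeshellarg()` matters here because file names may contain spaces or shell metacharacters.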
readDVI()
Extract the text content of DeVice Independent files (DVI) with the catdvi utility
public
readDVI(string $filepath) : string
This format is a byproduct of a TeX compilation.
The catdvi utility ships with most TeX distributions, such as TeX Live.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readFictionBook()
Extract the text content of FictionBook ebooks (FB1, FB2)
public
readFictionBook(string $filepath) : string
An XML ebook format of Russian origin.
Supported versions:
- v1 (no documentation found, but it should work)
- v2
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
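This page does not say which elements the method reads. One plausible approach, sketched here as an assumption, is to keep only the paragraphs found inside "body" elements, so that metadata in "description" and base64 payloads in "binary" elements are left out. `extractFictionBookText()` is a hypothetical helper working on an XML string.

```php
<?php
// Hypothetical helper: collect the <p> paragraphs found under each <body>
// element of a FictionBook document, one paragraph per line.
function extractFictionBookText(string $xml): string
{
    $document = new DOMDocument();
    // Suppress parser warnings; an unparsable document yields no text.
    if (!@$document->loadXML($xml)) {
        return '';
    }
    $paragraphs = [];
    foreach ($document->getElementsByTagName('body') as $body) {
        foreach ($body->getElementsByTagName('p') as $paragraph) {
            $text = trim($paragraph->textContent);
            if ($text !== '') {
                $paragraphs[] = $text;
            }
        }
    }
    return implode("\n", $paragraphs);
}
```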
readPostScript()
Extract the text content of PostScript files (PS, EPS) with the ps2pdf utility
public
readPostScript(string $filepath) : string
This Adobe format is a scripted document that needs Ghostscript to be interpreted.
The ps2pdf utility is included in Ghostscript.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
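Since ps2pdf produces a PDF rather than text, a second step is needed to get a string back. This page does not say how that step is done, so the sketch below assumes Poppler's `pdftotext` for it. Both helper names are hypothetical, and both tools are assumed to be on the PATH.

```php
<?php
// Hypothetical helper: the two commands of the assumed pipeline.
// Step 1 (from this page): ps2pdf converts PostScript to PDF.
// Step 2 (an assumption): pdftotext prints the PDF text to stdout ("-").
function postScriptPipeline(string $psPath, string $pdfPath): array
{
    return [
        'ps2pdf ' . escapeshellarg($psPath) . ' ' . escapeshellarg($pdfPath),
        'pdftotext ' . escapeshellarg($pdfPath) . ' -',
    ];
}

// Hypothetical wrapper: run both steps through a temporary PDF file.
function readPostScriptSketch(string $psPath): string
{
    $pdfPath = tempnam(sys_get_temp_dir(), 'ps2pdf');
    if ($pdfPath === false) {
        return '';
    }
    [$toPdf, $toText] = postScriptPipeline($psPath, $pdfPath);
    exec($toPdf, $unusedOutput, $status);
    // Only attempt text extraction when the conversion succeeded.
    $text = $status === 0 ? shell_exec($toText) : null;
    @unlink($pdfPath);
    return is_string($text) ? $text : '';
}
```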
readScribus()
Extract the text content of Scribus files (SLA)
public
readScribus(string $filepath) : string
This XML format is not documented, but the text always seems to be stored in the "CH" attribute of "ITEXT" elements.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
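Reading the "CH" attribute of every "ITEXT" element, as described above, can be sketched as follows. `extractScribusText()` is a hypothetical helper that works on an XML string. Joining the text runs with newlines is a simplification chosen here, not something this page specifies.

```php
<?php
// Hypothetical helper: gather the CH attribute of all ITEXT elements
// of a Scribus document, one text run per line.
function extractScribusText(string $xml): string
{
    $document = new DOMDocument();
    // Suppress parser warnings; an unparsable document yields no text.
    if (!@$document->loadXML($xml)) {
        return '';
    }
    $runs = [];
    foreach ($document->getElementsByTagName('ITEXT') as $item) {
        // getAttribute returns an empty string when CH is missing.
        $text = $item->getAttribute('CH');
        if ($text !== '') {
            $runs[] = $text;
        }
    }
    return implode("\n", $runs);
}
```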