FILETOTEXT
Convert files of various types to text ready for searching.
Table of Contents
Methods
- readEPUB() : string
- Extract the text content of EPUB ebooks (EPUB)
- readHtml() : string
- Extract the text content of (X)HTML files (loosely)
- readMultipart() : string
- Extract the text content of multipart files (RFC2557) (EML, MHT)
- readOpenDocument() : string
- Extract the text content of LibreOffice/OpenDocument/SunOffice Document and Presentation files (ODP, ODT...)
- readPDF() : string
- Extract the text content of PDF files (PDF)
- readPowerPointXML() : string
- Extract the text content of Microsoft Office PowerPoint files (2007 and higher)
- readRTF() : string
- Extract the text content of Rich Text Format (RTF) files
- readText() : string
- Extract the text content of plain text files
- readWordBinary() : string
- Extract the text content of Microsoft Office Word files (before 2007) (DOC, DOT)
- readWordXML() : string
- Extract the text content of Microsoft Office Word files (2007 and higher) (DOCX, DOCM...)
- readXPS() : string
- Extract the text content of Open XML Paper Specification files (XPS, OXPS)
Methods
readEPUB()
Extract the text content of EPUB ebooks (EPUB)
    private
                    readEPUB(string $filepath) : string
    All versions of EPUB are supported by a single function, because the parts of the specification that matter for text extraction (the container structure and the content documents) have changed very little.
Versions supported:
- EPUB 3.2
- EPUB 3.1
- EPUB 3.0
- EPUB 2.0.1
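One reason a single function can plausibly cover these versions is that EPUB 2 and EPUB 3 both describe the book in a package document (OPF) whose manifest lists the content files and whose spine gives their reading order. The following is a minimal sketch of that idea, not the code of readEPUB(): the function name epubSpineHrefs() is hypothetical, and it works on the OPF XML passed in as a string rather than on an archive.

```php
<?php
// Hypothetical helper: returns the manifest hrefs in spine (reading) order.
function epubSpineHrefs(string $opfXml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($opfXml)) {
        return [];
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('opf', 'http://www.idpf.org/2007/opf');

    // Map each manifest item id to its href.
    $hrefs = [];
    foreach ($xpath->query('//opf:manifest/opf:item') as $item) {
        $hrefs[$item->getAttribute('id')] = $item->getAttribute('href');
    }

    // Walk the spine, which defines the reading order.
    $ordered = [];
    foreach ($xpath->query('//opf:spine/opf:itemref') as $ref) {
        $id = $ref->getAttribute('idref');
        if (isset($hrefs[$id])) {
            $ordered[] = $hrefs[$id];
        }
    }
    return $ordered;
}
```

Each returned href points to an (X)HTML content document, so its text could then be extracted the same way as for a standalone HTML file.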
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readHtml()
Extract the text content of (X)HTML files (loosely)
    private
                    readHtml(string $filepath) : string
    Accepts (X)HTML elements from all versions of the standard. Elements that are non-textual or purely technical are removed.
The document is assumed to be potentially malformed, so it is normalized before parsing, and its text is assumed to appear in reading order.
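The approach described above can be illustrated with PHP's DOM extension, whose HTML parser repairs malformed markup while loading. This is a sketch under assumptions, not the code of readHtml(): the function name htmlToTextSketch() is hypothetical, it takes the markup as a string, and treating only script and style elements as "technical" is a simplification.

```php
<?php
// Hypothetical helper: loose text extraction from possibly malformed HTML.
function htmlToTextSketch(string $html): string
{
    $previous = libxml_use_internal_errors(true); // tolerate malformed markup
    $doc = new DOMDocument();
    // The leading declaration hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Drop elements whose content is technical rather than textual.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse the runs of whitespace left over from the markup.
    return trim((string) preg_replace('/\s+/u', ' ', $root->textContent));
}
```

Because the text is read from the repaired tree in document order, the result follows the reading order of the source.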
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readMultipart()
Extract the text content of multipart files (RFC2557) (EML, MHT)
    private
                    readMultipart(string $filepath) : string
    This format is a container for any number of files of arbitrary MIME type, separated by text boundaries. It is used for packed HTML and email storage.
Each file is extracted, re-encoded in UTF-8 (with transliteration) if possible, and parsed with convertToText().
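The extraction and re-encoding steps can be sketched as follows. This is an illustration under assumptions rather than the code of readMultipart(): the function name multipartPartsSketch() is hypothetical, the boundary is passed in instead of being read from the top-level Content-Type header, nested multiparts are ignored, and the final hand-off of each part to convertToText() is left out.

```php
<?php
// Hypothetical helper: splits a multipart body on its boundary and returns
// each part decoded and converted to UTF-8.
function multipartPartsSketch(string $body, string $boundary): array
{
    $chunks = explode('--' . $boundary, $body);
    array_shift($chunks); // discard the preamble before the first boundary

    $parts = [];
    foreach ($chunks as $chunk) {
        // "--boundary--" closes the message; what follows is the epilogue.
        if (strncmp($chunk, '--', 2) === 0) {
            break;
        }
        // Headers and content are separated by the first blank line.
        $split = preg_split('/\r?\n\r?\n/', ltrim($chunk, "\r\n"), 2);
        if (count($split) < 2) {
            continue;
        }
        [$rawHeaders, $content] = $split;

        // Undo the transfer encoding, if one is declared.
        if (preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $rawHeaders, $m)) {
            $encoding = strtolower($m[1]);
            if ($encoding === 'base64') {
                $content = (string) base64_decode($content);
            } elseif ($encoding === 'quoted-printable') {
                $content = quoted_printable_decode($content);
            }
        }

        // Re-encode to UTF-8; //TRANSLIT approximates unmappable characters.
        if (preg_match('/charset="?([\w.:-]+)"?/i', $rawHeaders, $m)) {
            $converted = @iconv($m[1], 'UTF-8//TRANSLIT', $content);
            if ($converted !== false) {
                $content = $converted;
            }
        }
        $parts[] = rtrim($content, "\r\n");
    }
    return $parts;
}
```

When the conversion fails, the sketch keeps the original bytes, which matches the "if possible" wording above.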
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readOpenDocument()
Extract the text content of LibreOffice/OpenDocument/SunOffice Document and Presentation files (ODP, ODT...)
    private
                    readOpenDocument(string $filepath) : string
    All versions of OpenDocument are supported by a single function, because the parts of the specification that matter for text extraction (the container structure and the text elements) have changed very little.
Versions supported:
- Open Document Format for Office Applications (OpenDocument) Specification v1.3
- Open Document Format for Office Applications (OpenDocument) Specification v1.2
- Open Document Format for Office Applications (OpenDocument) Specification v1.1
- Open Document Format for Office Applications (OpenDocument) Specification v1.0
- Flat Open Document (Open Document without container)
Types supported:
- Document
- Presentation
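Text documents and presentations can share one extraction routine because both store their visible text in text:p (paragraph) and text:h (heading) elements. The sketch below illustrates this; it is not the code of readOpenDocument(). The function name openDocumentTextSketch() is hypothetical, and it expects the XML as a string: either content.xml taken from the container, or a whole flat document.

```php
<?php
// Hypothetical helper: extracts paragraph and heading text from the XML of
// an OpenDocument file (content.xml, or a flat document).
function openDocumentTextSketch(string $xml): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    $lines = [];
    // The union is returned in document order, so headings and paragraphs
    // stay interleaved as they appear in the file.
    foreach ($xpath->query('//text:h | //text:p') as $node) {
        $line = trim($node->textContent);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return implode("\n", $lines);
}
```

Paragraphs nested inside other paragraphs (for example in frames or annotations) would appear twice with this query; a fuller implementation would need to account for that.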
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readPDF()
Extract the text content of PDF files (PDF)
    private
                    readPDF(string $filepath) : string
    Adobe Portable Document Format, extracted with either Christian Vigh's PdfToText PHP class or the Xpdf utilities.
Which extensions of the format are supported is not well defined. With Xpdf, support is expected to keep up with recent versions of the format.
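For the Xpdf route, extraction comes down to calling the pdftotext command-line tool. The sketch below shows one way to do that; it is not the code of readPDF(). It assumes pdftotext is installed and on the PATH, the function names pdftotextCommand() and pdfToTextSketch() are hypothetical, and the PdfToText class route is not covered.

```php
<?php
// Builds the command line for Xpdf's pdftotext.
// "-enc UTF-8" selects the output encoding; the final "-" sends the
// extracted text to standard output instead of a file.
function pdftotextCommand(string $filepath): string
{
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($filepath) . ' -';
}

// Hypothetical wrapper: returns the extracted text, or '' on failure.
function pdfToTextSketch(string $filepath): string
{
    $output = shell_exec(pdftotextCommand($filepath));
    return is_string($output) ? $output : '';
}
```

Quoting the path with escapeshellarg() keeps file names containing spaces or shell metacharacters from breaking the command.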
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readPowerPointXML()
Extract the text content of Microsoft Office PowerPoint files (2007 and higher)
    private
                    readPowerPointXML(string $filepath) : string
    XML format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions starting with PowerPoint 2007 are supported.
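In this format each slide is a separate XML part inside a ZIP archive (ppt/slides/slide1.xml, slide2.xml, and so on), and the visible text sits in a:t elements. The sketch below is an illustration under assumptions, not the code of readPowerPointXML(): both function names are hypothetical, and ordering slides by file name is a simplification, since the actual presentation order is defined in ppt/presentation.xml.

```php
<?php
// Hypothetical helper: picks the slide parts from a list of archive entry
// names and orders them naturally (slide2.xml before slide10.xml).
function orderedSlideEntries(array $entryNames): array
{
    $slides = preg_grep('#^ppt/slides/slide\d+\.xml$#', $entryNames);
    natsort($slides);
    return array_values($slides);
}

// Hypothetical helper: joins the text runs (a:t) of one slide.
function slideTextSketch(string $slideXml): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($slideXml)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('a', 'http://schemas.openxmlformats.org/drawingml/2006/main');

    $runs = [];
    foreach ($xpath->query('//a:t') as $run) {
        $runs[] = $run->textContent;
    }
    // Joining with a space may split a word that spans two runs.
    return implode(' ', $runs);
}
```

The natural sort matters because a plain string sort would place slide10.xml before slide2.xml.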
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readRTF()
Extract the text content of Rich Text Format (RTF) files
    private
                    readRTF(string $filepath) : string
    Extracted with Christian Vigh's RTF classes. All versions of RTF are supported.
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readText()
Extract the text content of plain text files
    private
                    readText(string $filepath) : string
    Markdown and reStructuredText files are handled as plain text.
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readWordBinary()
Extract the text content of Microsoft Office Word files (before 2007) (DOC, DOT)
    private
                    readWordBinary(string $filepath) : string
    Binary format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions before Word 2007 are supported.
The function uses the PHPWord library, from which only the code for reading Word binary files has been kept.
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readWordXML()
Extract the text content of Microsoft Office Word files (2007 and higher) (DOCX, DOCM...)
    private
                    readWordXML(string $filepath) : string
    XML format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions starting with Word 2007 are supported.
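In this format the main text lives in the word/document.xml part of a ZIP archive, as paragraphs (w:p) made of text runs (w:t). The sketch below illustrates the extraction; it is not the code of readWordXML(). The function name wordXmlTextSketch() is hypothetical, and it expects the XML of word/document.xml as a string, which could be obtained for instance with ZipArchive::getFromName().

```php
<?php
// Hypothetical helper: extracts one line of text per paragraph from the
// XML of word/document.xml.
function wordXmlTextSketch(string $documentXml): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($documentXml)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        // Runs are concatenated without a separator: Word often splits a
        // single word across several runs.
        $text = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $run) {
            $text .= $run->textContent;
        }
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n", $paragraphs);
}
```

Headers, footers, footnotes and comments are stored in other parts of the archive and are not covered by this sketch.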
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted
readXPS()
Extract the text content of Open XML Paper Specification files (XPS, OXPS)
    private
                    readXPS(string $filepath) : string
    All versions are supported.
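One way to handle both XPS and OXPS with the same code is to ignore the XML namespace, which differs between the two, and read the UnicodeString attribute of the Glyphs elements on each page. The sketch below illustrates that idea; it is not the code of readXPS(). The function name xpsPageTextSketch() is hypothetical, and it expects the XML of a single page part (.fpage) as a string.

```php
<?php
// Hypothetical helper: collects the text of one fixed page.
function xpsPageTextSketch(string $fpageXml): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($fpageXml)) {
        return '';
    }
    $xpath = new DOMXPath($doc);

    $strings = [];
    // local-name() matches Glyphs elements whatever their namespace is.
    foreach ($xpath->query('//*[local-name()="Glyphs"]/@UnicodeString') as $attribute) {
        $strings[] = $attribute->nodeValue;
    }
    return implode(' ', $strings);
}
```

Glyphs elements are positioned individually on the page, so their order in the file is only an approximation of the reading order.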
Parameters
- $filepath : string
- An absolute or relative file path
Return values
string —Text extracted