FILETOTEXT
Convert files of various types to text ready for searching.
Table of Contents
Methods
- readEPUB() : string
- Extract the text content of EPUB ebooks (EPUB)
- readHtml() : string
- Extract the text content of (X)HTML files (loosely)
- readMultipart() : string
- Extract the text content of multipart files (RFC2557) (EML, MHT)
- readOpenDocument() : string
- Extract the text content of LibreOffice/OpenDocument/SunOffice Document and Presentation files (ODP, ODT...)
- readPDF() : string
- Extract the text content of PDF files (PDF)
- readPowerPointXML() : string
- Extract the text content of Microsoft Office PowerPoint files (2007 and higher)
- readRTF() : string
- Extract the text content of Rich Text Format (RTF) files
- readText() : string
- Extract the text content of plain text files
- readWordBinary() : string
- Extract the text content of Microsoft Office Word files (before 2007) (DOC, DOT)
- readWordXML() : string
- Extract the text content of Microsoft Office Word files (2007 and higher) (DOCX, DOCM...)
- readXPS() : string
- Extract the text content of Open XML Paper Specification files (XPS, OXPS)
Methods
readEPUB()
Extract the text content of EPUB ebooks (EPUB)
private
readEPUB(string $filepath) : string
All versions of EPUB are supported by a single function because, as far as structure and text extraction are concerned, the specification has changed very little.
Versions supported:
- EPUB 3.2
- EPUB 3.1
- EPUB 3.0
- EPUB 2.0.1
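To illustrate why a single function can cover every version listed above, here is a minimal sketch (not the class's actual code) of the part of an EPUB that has stayed stable: the package document, whose manifest maps ids to files and whose spine gives the reading order. The function name `sketchEpubReadingOrder()` is hypothetical, and the sketch assumes the package document has already been read out of the archive into a string.

```php
<?php
// Hypothetical helper, not part of FILETOTEXT: list the content documents
// of an EPUB package document (.opf) in reading order.
function sketchEpubReadingOrder(string $opfXml): array
{
    if ($opfXml === '') {
        return [];
    }

    // Collect parser errors quietly instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($opfXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    // The package namespace is shared by EPUB 2 and EPUB 3.
    $xpath->registerNamespace('opf', 'http://www.idpf.org/2007/opf');

    // manifest: id => href
    $hrefById = [];
    foreach ($xpath->query('//opf:manifest/opf:item') as $item) {
        $hrefById[$item->getAttribute('id')] = $item->getAttribute('href');
    }

    // spine: ordered list of manifest ids
    $ordered = [];
    foreach ($xpath->query('//opf:spine/opf:itemref') as $ref) {
        $id = $ref->getAttribute('idref');
        if (isset($hrefById[$id])) {
            $ordered[] = $hrefById[$id];
        }
    }

    return $ordered;
}
```

Each returned path is relative to the package document and points to an (X)HTML file whose text can then be extracted in turn.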
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readHtml()
Extract the text content of (X)HTML files (loosely)
private
readHtml(string $filepath) : string
Accepts elements from all versions of (X)HTML. Elements that are non-textual or purely technical are removed.
The document is assumed to be potentially malformed, so it is normalized first, and it is assumed that it can be parsed in reading order.
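The following is a minimal sketch of this kind of tolerant extraction, not the method's actual implementation. It assumes PHP's DOM extension is available, lets libxml repair malformed markup, and treats `script`, `style` and `noscript` as the "technical" elements to drop; that list and the function name `sketchHtmlToText()` are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of FILETOTEXT: loosely extract text from (X)HTML.
function sketchHtmlToText(string $html): string
{
    // Tolerate malformed markup: collect parser errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The leading XML declaration is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->documentElement === null) {
        return '';
    }

    // Remove elements that carry no readable text (assumed list).
    $xpath = new DOMXPath($doc);
    $technical = iterator_to_array($xpath->query('//script | //style | //noscript'));
    foreach ($technical as $node) {
        $node->parentNode->removeChild($node);
    }

    // textContent concatenates the remaining text nodes in document order.
    $text = $doc->documentElement->textContent;

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text));
}
```

Because the parser closes unclosed tags itself, input such as `<p>Hello <b>world` still yields its text.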
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readMultipart()
Extract the text content of multipart files (RFC2557) (EML, MHT)
private
readMultipart(string $filepath) : string
This format is a container for any number of files of arbitrary MIME type, separated by text boundaries. It is used for packed HTML and email storage.
Each file is extracted, re-encoded in UTF-8 (with transliteration) if possible, and parsed with convertToText().
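As an illustration of the container layout described above, here is a minimal sketch (not the method's actual code) that splits a multipart message on its boundary, undoes the transfer encoding of each part and, where iconv is available, re-encodes the body to UTF-8 with transliteration. The function name `sketchSplitMultipart()` and the returned array shape are assumptions; folded headers and nested multipart parts are deliberately ignored.

```php
<?php
// Hypothetical helper, not part of FILETOTEXT: split a multipart message into decoded parts.
function sketchSplitMultipart(string $raw): array
{
    // The boundary is declared in the top-level Content-Type header.
    if (!preg_match('/boundary="?([^";\r\n]+)"?/i', $raw, $m)) {
        return [];
    }

    $chunks = explode('--' . $m[1], $raw);
    array_shift($chunks); // drop the top-level headers and preamble

    $parts = [];
    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {
            break; // closing delimiter "--boundary--"
        }

        // A blank line separates the part headers from the part body.
        $split = preg_split('/\r?\n\r?\n/', ltrim($chunk, "\r\n"), 2);
        $headers = $split[0];
        $body = rtrim($split[1] ?? '', "\r\n");

        $type = preg_match('/^Content-Type:\s*([^;\s]+)/mi', $headers, $t)
            ? strtolower($t[1]) : 'text/plain';
        $encoding = preg_match('/^Content-Transfer-Encoding:\s*(\S+)/mi', $headers, $e)
            ? strtolower($e[1]) : '';
        $charset = preg_match('/charset="?([^";\s]+)"?/i', $headers, $c)
            ? strtolower($c[1]) : '';

        // Undo the transfer encoding.
        if ($encoding === 'base64') {
            $body = (string) base64_decode($body);
        } elseif ($encoding === 'quoted-printable') {
            $body = quoted_printable_decode($body);
        }

        // Re-encode to UTF-8 with transliteration when possible.
        if ($charset !== '' && $charset !== 'utf-8' && function_exists('iconv')) {
            $converted = @iconv($charset, 'UTF-8//TRANSLIT', $body);
            if ($converted !== false) {
                $body = $converted;
            }
        }

        $parts[] = ['type' => $type, 'body' => $body];
    }

    return $parts;
}
```

Each part can then be passed to the extractor that matches its MIME type.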
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readOpenDocument()
Extract the text content of LibreOffice/OpenDocument/SunOffice Document and Presentation files (ODP, ODT...)
private
readOpenDocument(string $filepath) : string
All versions of OpenDocument are supported by a single function because, as far as structure and text extraction are concerned, the specification has changed very little.
Versions supported:
- Open Document Format for Office Applications (OpenDocument) Specification v1.3
- Open Document Format for Office Applications (OpenDocument) Specification v1.2
- Open Document Format for Office Applications (OpenDocument) Specification v1.1
- Open Document Format for Office Applications (OpenDocument) Specification v1.0
- Flat Open Document (Open Document without container)
Types supported:
- Document
- Presentation
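The element names that carry text are the same across the versions listed above, which the following minimal sketch relies on. It is not the method's actual code: it assumes the XML has already been obtained as a string (the `content.xml` entry of a packaged file, or the whole file for a flat document) and simply collects headings and paragraphs from the ODF text namespace. The function name `sketchOdfXmlToText()` is hypothetical.

```php
<?php
// Hypothetical helper, not part of FILETOTEXT: extract text from OpenDocument XML.
function sketchOdfXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok) {
        return '';
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    // Headings (text:h) and paragraphs (text:p), returned in document order.
    $lines = [];
    foreach ($xpath->query('//text:h | //text:p') as $node) {
        $line = trim($node->textContent);
        if ($line !== '') {
            $lines[] = $line;
        }
    }

    return implode("\n", $lines);
}
```

A paragraph nested inside another paragraph (for example through a text box) would be reported twice by this sketch; a real extractor would need to handle that case.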
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readPDF()
Extract the text content of PDF files (PDF)
private
readPDF(mixed $filepath) : string
Adobe Portable Document Format files are extracted with either the PdfToText PHP class by Christian Vigh or the Xpdf utilities.
Which extensions of the format are supported is not well defined. With Xpdf, support is expected to be up to date.
Parameters
- $filepath : mixed
  An absolute or relative file path
Return values
string —Text extracted
readPowerPointXML()
Extract the text content of Microsoft Office PowerPoint files (2007 and higher)
private
readPowerPointXML(string $filepath) : string
XML format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions starting with PowerPoint 2007 are supported.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readRTF()
Extract the text content of Rich Text Format (RTF) files
private
readRTF(string $filepath) : string
Text is extracted with the RTF classes by Christian Vigh. All versions of RTF are supported.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readText()
Extract the text content of plain text files
private
readText(mixed $filepath) : string
Markdown and reStructuredText files are handled as plain text.
Parameters
- $filepath : mixed
  An absolute or relative file path
Return values
string —Text extracted
readWordBinary()
Extract the text content of Microsoft Office Word files (before 2007) (DOC, DOT)
private
readWordBinary(string $filepath) : string
Binary format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions before Word 2007 are supported.
The function uses the PHPWord library, of which only the code for reading Word binary files has been kept.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readWordXML()
Extract the text content of Microsoft Office Word files (2007 and higher) (DOCX, DOCM...)
private
readWordXML(string $filepath) : string
XML format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions starting with Word 2007 are supported.
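For orientation, the text of such a file lives in the `word/document.xml` entry of the archive, in runs (`w:t`) grouped into paragraphs (`w:p`). The sketch below is not the method's actual code: it assumes that entry has already been read into a string and joins the runs of each paragraph. The function name `sketchWordXmlToText()` is hypothetical.

```php
<?php
// Hypothetical helper, not part of FILETOTEXT: extract text from WordprocessingML.
function sketchWordXmlToText(string $documentXml): string
{
    if ($documentXml === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($documentXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok) {
        return '';
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        // A paragraph's text is split across runs; join them without a separator.
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }

    return implode("\n", $paragraphs);
}
```

Headers, footers, footnotes and comments are stored in separate entries of the archive and are not covered by this sketch.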
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted
readXPS()
Extract the text content of Open XML Paper Specification files (XPS, OXPS)
private
readXPS(string $filepath) : string
All versions are supported.
Parameters
- $filepath : string
  An absolute or relative file path
Return values
string —Text extracted