WIKINDX API trunk

FILETOTEXT
in package

Convert files of various types to text ready for searching.

Table of Contents

Methods

readEPUB()  : string
Extract the text content of EPUB ebooks (EPUB)
readHtml()  : string
Extract the text content of (X)HTML files (loosely)
readMultipart()  : string
Extract the text content of multipart files (RFC2557) (EML, MHT)
readOpenDocument()  : string
Extract the text content of LibreOffice/OpenDocument/SunOffice Document and Presentation files (ODP, ODT...)
readPDF()  : string
Extract the text content of PDF files (PDF)
readPowerPointXML()  : string
Extract the text content of Microsoft Office PowerPoint files (2007 and higher)
readRTF()  : string
Extract the text content of Rich Text Format (RTF) files
readText()  : string
Extract the text content of plain text files
readWordBinary()  : string
Extract the text content of Microsoft Office Word files (before 2007) (DOC, DOT)
readWordXML()  : string
Extract the text content of Microsoft Office Word files (2007 and higher) (DOCX, DOCM...)
readXPS()  : string
Extract the text content of Open XML Paper Specification files (XPS, OXPS)

Methods

readEPUB()

Extract the text content of EPUB ebooks (EPUB)

private readEPUB(string $filepath) : string

All versions of EPUB are supported by a single function because, as far as structure and text extraction are concerned, the specification has changed very little.

Versions supported:

  • EPUB 3.2
  • EPUB 3.1
  • EPUB 3.0
  • EPUB 2.0.1

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://www.w3.org/publishing/epub3/epub-spec.html

EPUB 3.2 Spec.

see
http://idpf.org/epub/dir/

EPUB Specifications and Projects

Return values
string

Text extracted
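
As an illustration of why one function can cover every EPUB version listed above, the following minimal sketch resolves the reading order of a package document (the OPF file inside the EPUB container). The manifest and spine elements it relies on exist in both EPUB 2 and EPUB 3. The function name epubReadingOrderSketch() is hypothetical and is not taken from the WIKINDX source; the sketch also assumes the OPF content has already been read from the ZIP container into a string.

```php
<?php
/**
 * Hypothetical helper (not part of FILETOTEXT): return the paths of the
 * content documents of an EPUB package, in reading order.
 *
 * @return string[] hrefs relative to the OPF file
 */
function epubReadingOrderSketch(string $opfXml): array
{
    $doc = new DOMDocument();
    // An empty or unparsable package document yields an empty list.
    if ($opfXml === '' || !@$doc->loadXML($opfXml, LIBXML_NONET)) {
        return [];
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('opf', 'http://www.idpf.org/2007/opf');

    // The manifest maps item ids to the paths of the packaged files.
    $hrefs = [];
    foreach ($xpath->query('//opf:manifest/opf:item') as $item) {
        $hrefs[$item->getAttribute('id')] = $item->getAttribute('href');
    }

    // The spine lists those ids in reading order.
    $order = [];
    foreach ($xpath->query('//opf:spine/opf:itemref') as $ref) {
        $id = $ref->getAttribute('idref');
        if (isset($hrefs[$id])) {
            $order[] = $hrefs[$id];
        }
    }
    return $order;
}
```

Each returned path points to an XHTML content document, which could then be reduced to text in the same loose way as any other (X)HTML file.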

readHtml()

Extract the text content of (X)HTML files (loosely)

private readHtml(string $filepath) : string

Accepts elements from all versions of (X)HTML. Elements that are not textual, or that are purely technical, are removed.

We assume that the document may be malformed (it is normalized first) and that it can be parsed in reading order.

Parameters
$filepath : string

An absolute or relative file path

Return values
string

Text extracted
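
The loose extraction described above can be sketched with standard PHP string functions. This is only an illustration under stated assumptions: the function name htmlToTextSketch() and the list of removed elements (script, style, head) are hypothetical, and the actual method may normalize and filter the document differently.

```php
<?php
/**
 * Hypothetical helper (not part of FILETOTEXT): loosely reduce an
 * (X)HTML string to its text content.
 */
function htmlToTextSketch(string $html): string
{
    // Drop elements whose content is technical rather than textual
    // (assumed list: script, style, head).
    $html = preg_replace('#<(script|style|head)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Drop comments.
    $html = preg_replace('/<!--.*?-->/s', ' ', $html) ?? $html;

    // Put a space before every tag so that adjacent blocks do not fuse,
    // then remove the tags themselves.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Decode entities such as &amp; and &eacute; to UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

// prints "Café & bar Second"
echo htmlToTextSketch(
    '<html><head><title>T</title></head><body><p>Caf&eacute; &amp; bar</p>'
    . '<script>var x = 1;</script><p>Second</p></body></html>'
);
```

Because the sketch never builds a DOM tree, it tolerates unclosed or badly nested tags, at the price of being unaware of the document structure.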

readMultipart()

Extract the text content of multipart files (RFC2557) (EML, MHT)

private readMultipart(string $filepath) : string

This format is a container for any number of files of arbitrary MIME type, separated by text boundaries. It is used for packed HTML and email storage.

Each file is extracted, re-encoded in UTF-8 (with transliteration) if possible, and parsed with convertToText().

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://tools.ietf.org/html/rfc2557

RFC2557 - MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)

Return values
string

Text extracted
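
The extraction and re-encoding steps described above can be sketched as follows. This is a simplified illustration, not the WIKINDX implementation: the function name splitMultipartSketch() is hypothetical, nested multipart bodies and folded headers are ignored, and the text conversion step (convertToText() in the description) is left out.

```php
<?php
/**
 * Hypothetical helper (not part of FILETOTEXT): split a multipart message
 * into its parts and return each decoded body, re-encoded in UTF-8.
 *
 * @return array<int, array{type: string, body: string}>
 */
function splitMultipartSketch(string $raw): array
{
    // The boundary is declared in the top-level Content-Type header.
    if (!preg_match('/boundary="?([^";\r\n]+)"?/i', $raw, $m)) {
        return [];
    }
    // Parts are delimited by "--boundary"; the closing delimiter ends with "--".
    $chunks = explode('--' . $m[1], $raw);
    array_shift($chunks); // Top-level headers and preamble.

    $parts = [];
    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {
            break; // Closing delimiter reached.
        }
        // Part headers and body are separated by the first empty line.
        $split = preg_split('/\r?\n\r?\n/', ltrim($chunk, "\r\n"), 2);
        if ($split === false || count($split) < 2) {
            continue;
        }
        [$head, $body] = $split;
        $type = preg_match('/^Content-Type:\s*([^;\s]+)/im', $head, $t) ? strtolower($t[1]) : 'text/plain';
        $charset = preg_match('/charset="?([^";\s]+)"?/i', $head, $c) ? $c[1] : 'UTF-8';
        $encoding = preg_match('/^Content-Transfer-Encoding:\s*(\S+)/im', $head, $e) ? strtolower($e[1]) : '7bit';

        // Undo the transfer encoding.
        if ($encoding === 'base64') {
            $body = (string) base64_decode($body);
        } elseif ($encoding === 'quoted-printable') {
            $body = quoted_printable_decode($body);
        }

        // Re-encode in UTF-8, with transliteration where the platform
        // supports it; keep the original bytes if no conversion succeeds.
        $converted = @iconv($charset, 'UTF-8//TRANSLIT', $body);
        if ($converted === false) {
            $converted = @iconv($charset, 'UTF-8', $body);
        }
        $parts[] = ['type' => $type, 'body' => trim($converted === false ? $body : $converted)];
    }
    return $parts;
}
```

Each returned body could then be handed to a converter chosen from its MIME type, for example an HTML extractor for text/html parts.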

readOpenDocument()

Extract the text content of LibreOffice/OpenDocument/SunOffice Document and Presentation files (ODP, ODT...)

private readOpenDocument(string $filepath) : string

All versions of OpenDocument are supported by a single function because, as far as structure and text extraction are concerned, the specification has changed very little.

Versions supported:

  • Open Document Format for Office Applications (OpenDocument) Specification v1.3
  • Open Document Format for Office Applications (OpenDocument) Specification v1.2
  • Open Document Format for Office Applications (OpenDocument) Specification v1.1
  • Open Document Format for Office Applications (OpenDocument) Specification v1.0
  • Flat Open Document (Open Document without container)

Types supported:

  • Document
  • Presentation

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://www.oasis-open.org/standards/

Open Document Format for Office Applications (OpenDocument) Version 1.X

Return values
string

Text extracted
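
As a hedged illustration of how both packaged and flat documents can share one code path, the sketch below collects paragraphs and headings from an OpenDocument XML string. The function name odfXmlToTextSketch() is hypothetical, and the choice of text:p and text:h as the only text carriers is a simplifying assumption; paragraphs nested inside other paragraphs (for example in frames) would be reported twice.

```php
<?php
/**
 * Hypothetical helper (not part of FILETOTEXT): collect the paragraphs and
 * headings of an OpenDocument XML string (content.xml or a flat document).
 */
function odfXmlToTextSketch(string $xml): string
{
    $doc = new DOMDocument();
    // An empty or unparsable document yields no text.
    if ($xml === '' || !@$doc->loadXML($xml, LIBXML_NONET)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    // Namespace of the text elements in OpenDocument.
    $xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    $lines = [];
    // The union is returned in document order, i.e. reading order.
    foreach ($xpath->query('//text:p | //text:h') as $node) {
        $line = trim($node->textContent);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return implode("\n", $lines);
}
```

For a packaged document, the XML would first have to be read from the content.xml entry of the ZIP container, for instance with ZipArchive::getFromName(); a flat document can be passed to the function directly.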

readPDF()

Extract the text content of PDF files (PDF)

private readPDF(mixed $filepath) : string

Adobe Portable Document Format, extracted with either the PdfToText PHP class by Christian Vigh or the XPdf utilities.

Which extensions of the format are supported is not well defined. With XPdf, support is expected to be up to date.

Parameters
$filepath : mixed

An absolute or relative file path

Tags
see
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf

Adobe Portable Document Format v1.4

see
https://www.xpdfreader.com/pdftotext-man.html

pdftotext manpage

see
https://www.xpdfreader.com/pdfinfo-man.html

pdfinfo manpage

Return values
string

Text extracted
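
For the XPdf path, a minimal sketch of how a pdftotext command line could be assembled is shown below. The function name pdftotextCommandSketch() and the default binary name are assumptions, and the options actually passed by WIKINDX are not documented here. The -enc option and the "-" output argument are described in the pdftotext manpage referenced above.

```php
<?php
/**
 * Hypothetical helper (not part of FILETOTEXT): build a shell command that
 * makes pdftotext write the text of a PDF file to standard output.
 */
function pdftotextCommandSketch(string $filepath, string $binary = 'pdftotext'): string
{
    // -enc UTF-8 selects the output encoding; the final "-" sends the
    // extracted text to stdout instead of a file.
    // escapeshellarg() protects paths containing spaces or shell metacharacters.
    return escapeshellcmd($binary) . ' -enc UTF-8 ' . escapeshellarg($filepath) . ' -';
}
```

The resulting string could be passed to shell_exec(), whose return value would then hold the extracted text when the utility is installed and the file is readable.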

readPowerPointXML()

Extract the text content of Microsoft Office PowerPoint files (2007 and higher)

private readPowerPointXML(string $filepath) : string

XML format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions starting with PowerPoint 2007 are supported.

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://www.ecma-international.org/publications/standards/Ecma-376.htm

ECMA-376 - Office Open XML file formats

Return values
string

Text extracted

readText()

Extract the text content of plain text files

private readText(mixed $filepath) : string

Markdown and reStructuredText files are supported as plain text.

Parameters
$filepath : mixed

An absolute or relative file path

Return values
string

Text extracted

readWordBinary()

Extract the text content of Microsoft Office Word files (before 2007) (DOC, DOT)

private readWordBinary(string $filepath) : string

Binary format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions before Word 2007 are supported.

The function uses a subset of the PHPWord library: only the code for reading Word binary files was kept.

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://download.microsoft.com/download/0/b/e/0be8bdd7-e5e8-422a-abfd-4342ed7ad886/word97-2007binaryfileformat(doc)specification.pdf

Return values
string

Text extracted

readWordXML()

Extract the text content of Microsoft Office Word files (2007 and higher) (DOCX, DOCM...)

private readWordXML(string $filepath) : string

XML format of the Microsoft Office suite. Templates and plain documents are supported, with and without macros. All versions starting with Word 2007 are supported.

Parameters
$filepath : string

An absolute or relative file path

Tags
see
https://www.ecma-international.org/publications/standards/Ecma-376.htm

ECMA-376 - Office Open XML file formats

Return values
string

Text extracted
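
To illustrate the kind of extraction involved, the sketch below turns the main part of a Word XML document (the word/document.xml entry of the ZIP container) into one line of text per paragraph. The function name wordXmlToTextSketch() is hypothetical. The sketch assumes the transitional WordprocessingML namespace, ignores tabs, line breaks, headers, footers and notes, and would report text box paragraphs twice.

```php
<?php
/**
 * Hypothetical helper (not part of FILETOTEXT): extract one line of text
 * per paragraph from the XML of word/document.xml.
 */
function wordXmlToTextSketch(string $documentXml): string
{
    $doc = new DOMDocument();
    // An empty or unparsable document yields no text.
    if ($documentXml === '' || !@$doc->loadXML($documentXml, LIBXML_NONET)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    // Assumption: transitional namespace, as written by most Word versions.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $lines = [];
    foreach ($xpath->query('//w:body//w:p') as $paragraph) {
        // A paragraph is split into runs; the text lives in w:t elements.
        $line = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $line .= $textNode->textContent;
        }
        if (trim($line) !== '') {
            $lines[] = trim($line);
        }
    }
    return implode("\n", $lines);
}
```

The XML string itself could be obtained with ZipArchive::getFromName('word/document.xml') once the file has been opened as a ZIP archive.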


        