Understanding and Overcoming Issues in PHPWord Document Generation

Creating MS Word documents is not an easy task for a programmer, as Word provides a large number of settings that can conflict with code when creating documents dynamically.

The main difficulties arise when creating Word documents from HTML. The customer will expect the content of the document to match what is displayed in the browser. Achieving that result took more than one cup of coffee and more than one day of debugging how the tags in the markup are handled.

One of the most popular open-source libraries is PHPWord from PHPOffice. It allows you to read and dynamically create documents in various formats.

Supported formats:

  • HTML
  • Word2007 (DOCX; legacy DOC files can only be read)
  • ODText
  • PDF
  • RTF
  • OOXML or OpenXML

The library covers many Microsoft Word features, but I believe it is still far from ideal. Since the developers of the library have not implemented every feature of Microsoft Word, we have to come up with workarounds that cover all of the client's requirements.

With PHPWord it is possible to create documents dynamically using PHP scripts. You can read about all the features on GitHub.

Before we begin

In this article we will use additional tools like:

  • Symfony DomCrawler
  • PHP Tidy

The above utilities will help you save time and make your Word documents look the way you intended.

Installing PHPWord

Installation is fairly straightforward, as with any Composer package. Install the stable version.

composer require phpoffice/phpword

I hope programmers have stopped using zip archives to install project dependencies. With an archive you lose the ability to quickly install and update the package without storing it alongside your own code.

The package requires some basic PHP dependencies like:

  • php-xml
  • libzip
  • gd (optional)

Install PHPWord dependencies in Docker

If you like to use Docker in your work with PHP projects, here is the Dockerfile snippet to install PHPWord correctly.

# Install the system libraries required by the zip and xml extensions
RUN apt-get update && apt-get install -y \
 zip \
 libxml2-dev \
 libzip-dev

# Configure and install the gd extension (optional, used for images)
RUN apt-get update && apt-get install -y \
 libfreetype6-dev \
 libjpeg62-turbo-dev \
 libpng-dev \
 && docker-php-ext-configure gd --with-freetype --with-jpeg \
 && docker-php-ext-install -j$(nproc) gd

# Install the zip and xml PHP extensions
RUN docker-php-ext-install zip xml

The Dockerfile snippet is not optimized; it is written this way to make each installation step clear.

Remember that each RUN instruction creates a new layer in Docker and increases the size of the final image.

Creating an MS Word Document

There are several options for creating documents:

  • using templates, where variables from the script are substituted into the template
  • creating a Word document from scratch using sections and elements
  • creating dynamic documents from HTML

PHPWord contains a large number of markup elements, but the main one is a section, which is where all objects are placed. A document may contain several sections, but at least one must be there.

The most flexible method is to create a document through elements. This method minimizes display defects, but increases the time for parsing the whole document, especially when each document has a different structure.
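To give a feel for the element-based approach, here is a minimal sketch. It assumes PHPWord was installed with Composer and that the autoloader sits at vendor/autoload.php next to the script; the report text, the table rows and the output path are made-up sample data.

```php
<?php
// Assumption: dependencies were installed with "composer require phpoffice/phpword".
require __DIR__ . '/vendor/autoload.php';

use PhpOffice\PhpWord\IOFactory;
use PhpOffice\PhpWord\PhpWord;
use PhpOffice\PhpWord\SimpleType\Jc;

$phpWord = new PhpWord();
$section = $phpWord->addSection();

// A centered, bold heading line.
$section->addText('Monthly report', ['bold' => true, 'size' => 16], ['alignment' => Jc::CENTER]);

// A plain paragraph.
$section->addText('Generated from PHP without any HTML involved.');

// A two-column table built row by row (sample data).
$rows = [['Item', 'Qty'], ['Coffee', '3'], ['Tea', '1']];
$table = $section->addTable();
foreach ($rows as $row) {
    $table->addRow();
    foreach ($row as $cellText) {
        $table->addCell(3000)->addText($cellText);
    }
}

// Write the document as DOCX into the system temp directory.
$path = sys_get_temp_dir() . '/elements-example.docx';
IOFactory::createWriter($phpWord, 'Word2007')->save($path);
```

Every element is placed explicitly, which is why the result is predictable, and also why this approach takes more code when each document has a different structure.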

Creating a document from HTML is preferable: "Write once, run anywhere", as a Java developer would say. But this approach has its pitfalls, which I will describe along with solutions that are not perfect but definitely work.

Creating a basic document looks like this:

// Assumes Composer's autoloader; adjust the path to your project
require 'vendor/autoload.php';

// Create the new document
$phpWord = new \PhpOffice\PhpWord\PhpWord();
// Add an empty section to the document
$section = $phpWord->addSection();
$footer = $section->addFooter();

// {PAGE} and {NUMPAGES} are turned into Word page-number fields
$footer->addPreserveText('Page {PAGE} of {NUMPAGES}.', null, [
 'alignment' => \PhpOffice\PhpWord\SimpleType\Jc::CENTER,
]);

// Add a basic HTML source
$html = '<p>Hello, CoderDen!</p>';

try {
 // Write the HTML fragment into the section, preserving whitespace
 \PhpOffice\PhpWord\Shared\Html::addHtml($section, $html, false, true);
} catch (\Exception $exception) {
 $section->addText($exception->getMessage());
 // TODO 
}

// Save document
$fileName = 'CoderDen.docx';

// Send the document to the browser as a download
header('Content-Description: File Transfer');
header('Content-Disposition: attachment; filename="' . $fileName . '"');
header('Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document');
header('Content-Transfer-Encoding: binary');
header('Cache-Control: must-revalidate, post-check=0, pre-check=0');
header('Expires: 0');

$writer = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'Word2007');
$writer->save("php://output");

The try/catch construct is needed to handle errors that occur while parsing the document. Sometimes the HTML is not valid for this method (wrong document structure, extra tags, or characters in different encodings). But there is always a way out; read on to see how to handle these exceptions on the fly.
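It can also help to check the markup before handing it to the library, so that the problem is reported with a line number instead of a generic exception. The sketch below assumes that markup which is not well-formed XML is the usual cause of such errors; findMarkupErrors() is a hypothetical helper, not part of PHPWord.

```php
<?php
/**
 * Returns libxml error messages for an HTML fragment that is not well-formed XML.
 * An empty array means the fragment parsed cleanly.
 */
function findMarkupErrors(string $htmlFragment): array
{
    // Collect parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    // Wrap the fragment so that it has a single root element.
    $dom->loadXML('<body>' . $htmlFragment . '</body>');

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous libxml error handling.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$good = findMarkupErrors('<p>Hello, <b>world</b></p>');
$bad  = findMarkupErrors('<p>Unclosed <b>tag</p>');
```

An empty result does not guarantee that the conversion will succeed, but a non-empty one is a cheap early warning that the markup needs cleaning first.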

Drawbacks and problem areas

As I said above, the library requires customization for specific tasks, and has bugs and incompatibilities in working with HTML.

Compatibility and Formatting Issues

One of the main problems is compatibility between versions of Microsoft Word. The document can be distorted, especially when it relies on complex formatting with styles, fonts and element alignment.

Not the best documentation

Although the library has documentation, many aspects remain in the shadows, and only by studying the code can you work out what works and how.

Performance

Large MS Word documents and weak servers can be a big problem in choosing a library for a project. Large data sets can take up all the resources, which will affect the stability of the whole system.

Loading and rendering images

For example, you want to include text in a document along with images that may be large in width or height. The picture will not always be displayed the same way as in the browser and may extend beyond the document boundaries.


To solve this problem we need to set attributes that limit the image to the width of the document container. XPath will help us with that.

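use Symfony\Component\DomCrawler\Crawler;

// $html holds the markup that will be converted
$crawler = new Crawler($html);

$crawler->filterXPath('//img')->each(function (Crawler $crawler) {
    foreach ($crawler as $node) {
        $node->removeAttribute('style');

        $width = $node->getAttribute('width');

        if (is_numeric($width) && $width > 600) {
            // Cap oversized images at 600px and let the height follow
            $node->setAttribute('width', '600');
            $node->setAttribute('height', 'auto');
        } elseif ($width === '') {
            // No explicit width: fall back to the container width
            $node->setAttribute('width', '100%');
        }

        $node->setAttribute('style', 'max-width: 100%; height: auto;');
    }
});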
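If you do not want to pull in DomCrawler just for this step, the same idea can be sketched with PHP's built-in DOM extension. This is an illustration under assumptions: clampImageWidths() is a hypothetical helper, the 600px limit is the same arbitrary value as above, and the height is scaled proportionally instead of being set to auto.

```php
<?php
/**
 * Caps the width attribute of every <img> at $maxWidth pixels and scales the
 * height attribute by the same ratio when both dimensions are known.
 */
function clampImageWidths(string $html, int $maxWidth = 600): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The leading XML declaration tells libxml to read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8"><body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $width = (int) $img->getAttribute('width');
        $height = (int) $img->getAttribute('height');

        if ($width > $maxWidth) {
            if ($height > 0) {
                // Keep the aspect ratio.
                $img->setAttribute('height', (string) (int) round($height * $maxWidth / $width));
            }
            $img->setAttribute('width', (string) $maxWidth);
        }
    }

    // Serialize only the children of <body>, not the wrapper itself.
    $body = $dom->getElementsByTagName('body')->item(0);
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }

    return $result;
}

$out = clampImageWidths('<p><img src="a.png" width="1200" height="800"></p>');
```

Images that are already narrow enough, or that carry no numeric width at all, are left untouched.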

Displaying and rendering large base64 images

An encoded image takes up a lot of space and memory, which causes execution errors while converting HTML to XML.

To solve the problem, replace the base64 data in the src attribute with a short cid placeholder (as used for embedded images) before the conversion.

// Str::random() is assumed here to be Laravel's Illuminate\Support\Str helper
$contentIds = [];
$source = preg_replace_callback('~src="(data:image\/[^;]+;base64[^"]+)"~s', function ($matches) use (&$contentIds) {
    // Remember the data URI under a random placeholder id
    $cid = Str::random();
    $contentIds[$cid] = $matches[1];

    return 'src="cid:' . $cid . '"';
}, $source);

After converting or manipulating the markup, the image contents must be put back in place:

$source = preg_replace_callback('~cid:([\w.\-]+)~', function (array $matches) use ($contentIds) {
    // Swap the placeholder back for the stored data URI, if we know it
    if (isset($matches[1]) && array_key_exists($matches[1], $contentIds)) {
        return $contentIds[$matches[1]];
    }
    return $matches[0];
}, $source);
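Put together, the round trip can be tried without any framework. In this sketch stashDataUris() and restoreDataUris() are hypothetical helper names, random_bytes() stands in for Str::random(), and the image data is a truncated sample string rather than a real picture.

```php
<?php
/**
 * Swaps every base64 data URI found in src="..." for a short cid: placeholder.
 * Returns the rewritten markup and the placeholder => data URI map.
 */
function stashDataUris(string $html): array
{
    $map = [];
    $html = preg_replace_callback(
        '~src="(data:image/[^;]+;base64[^"]+)"~s',
        function (array $m) use (&$map): string {
            // A random hex string serves as the placeholder id.
            $cid = bin2hex(random_bytes(8));
            $map[$cid] = $m[1];

            return 'src="cid:' . $cid . '"';
        },
        $html
    );

    return [$html, $map];
}

/** Puts the original data URIs back in place of their cid: placeholders. */
function restoreDataUris(string $html, array $map): string
{
    return preg_replace_callback(
        '~cid:([0-9a-f]+)~',
        fn (array $m): string => $map[$m[1]] ?? $m[0],
        $html
    );
}

$original = '<p><img src="data:image/png;base64,iVBORw0KGgo="></p>';
[$light, $map] = stashDataUris($original);   // lightweight markup for DOM work
$restored = restoreDataUris($light, $map);   // full markup again
```

All DOM manipulation and cleanup then runs on the lightweight string, and the heavy payload is only restored right before the document is written.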

Incorrect semantics of HTML markup

Since a Word document is a set of XML files, each tag must be closed. Also, each link must have an href attribute, id and name values must not be duplicated, and each image must have a src attribute. Otherwise, an error will occur when the file is written.

For example: <a>Link</a>, <br/>, <br></br>. Here is an example of the error raised when the mandatory href attribute is missing.

Invalid parameters passed. {"userId":1,"exception":"[object] (PhpOffice\\PhpWord\\Exception\\Exception(code: 0): Invalid parameters passed. at /var/www/backend/vendor/phpoffice/phpword/src/PhpWord/Writer/Word2007/Part/Rels.php:122)

Here is a working option to solve this error. Create an attribute from a random string if one is missing.

$crawler->filterXPath('//a')->each(function (Crawler $crawler) {
    foreach ($crawler as $node) {
        // Set default href attribute
        if (! $node->hasAttribute('href')) {
            $node->setAttribute('href', Str::random(16));
        }
        $node->setAttribute('style', 'color: #008cff;');
    }
});
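The other two rules, images without src and duplicated ids, can be handled in the same spirit. The following is a minimal sketch with the built-in DOM extension; fixImagesAndIds() and the numeric suffix scheme are illustrative assumptions, and the function expects markup that is already well-formed.

```php
<?php
/**
 * Removes <img> elements that have no src and makes duplicate id values unique
 * by appending a counter. Expects a well-formed XHTML fragment.
 */
function fixImagesAndIds(string $xhtmlFragment): string
{
    $dom = new DOMDocument();
    $dom->loadXML('<body>' . $xhtmlFragment . '</body>');
    $xpath = new DOMXPath($dom);

    // Copy the matches into an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//img[not(@src) or @src=""]')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // The first occurrence keeps its id; later ones get "-2", "-3", ...
    $seen = [];
    foreach ($xpath->query('//*[@id]') as $element) {
        $id = $element->getAttribute('id');
        $seen[$id] = ($seen[$id] ?? 0) + 1;
        if ($seen[$id] > 1) {
            $element->setAttribute('id', $id . '-' . $seen[$id]);
        }
    }

    // Serialize the children of the wrapper element only.
    $result = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $result .= $dom->saveXML($child);
    }

    return $result;
}

$fixed = fixImagesAndIds('<p id="a"><img/>One</p><p id="a"><img src="x.png"/>Two</p>');
```

Whether a broken image should be dropped or replaced with a placeholder picture depends on the project; dropping is simply the shortest option to show.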

For proper semantics, I suggest using the Tidy extension. The tool handles the above problems well in a couple of lines of code.

Many parameters are enabled by default, but some need to be enabled via config. Here are some of the useful parameters:

  • anchor-as-name and drop-proprietary-attributes - will tidy up tag attributes.
  • drop-empty-elements, drop-empty-paras - will remove all empty elements.
  • bare - will replace non-breaking spaces, smart quotes and em dashes with ASCII characters.
  • clean - performs HTML cleanup and replaces obsolete tags as needed.

Here is a link to the full list of available parameters: https://api.html-tidy.org/tidy/quickref_5.8.0.html.

Recommended list of Tidy parameters for cleaning up all the markup:

$config = [
    'replace-color' => true,
    'drop-proprietary-attributes' => true,
    'drop-empty-elements' => true,
    'drop-empty-paras' => true,
    'bare' => true,
    'indent' => false,
    'indent-spaces' => 0,
    'clean' => true,
    'show-body-only' => true,
    'wrap' => 0,
    'hide-comments' => true,
    'output-html' => true,
    'merge-divs' => true,
    'merge-spans' => true,
    'word-2000' => true,
    'logical-emphasis' => true,
    'ascii-chars' => true,
    'numeric-entities' => true,
    'quote-ampersand' => false,
    'escape-cdata' => true,
    'anchor-as-name' => false,
];

$tidy = new \tidy();
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

$html = tidy_get_output($tidy);

How to clean up Microsoft HTML?

Cleaning up content pasted from Microsoft Word is a fairly common problem, and "What's the best way to clean up markup inserted by Microsoft Word?" is a frequently asked question. The best way is to use PHP Tidy with merge-divs and merge-spans enabled to remove unwanted Microsoft attributes and unwanted nesting.

An example of copied content from a Microsoft Word document is shown below.

<span style="color: rgb(224, 62, 45);">
  <span xml:lang="EN-US" data-contrast="none">
    <span>Hello, Denis</span>
  </span>
</span>

After the Tidy conversion you will get the following result:

<span class="c1">Hello, Denis</span>

Escaping special characters

HTML content may contain characters in different encodings, and an error occurs when the HTML is written to the file. When you open the Word document, the system notifies you that the file has been corrupted.

Figure: the system notifies you that the file has been corrupted

To solve this problem, proper output escaping is required. To enable it, set the outputEscapingEnabled option to true in the PHPWord configuration file or use the following instruction at runtime.

\PhpOffice\PhpWord\Settings::setOutputEscapingEnabled(true);

The instruction will allow you to open the file without errors, but additional settings are required to display special characters correctly.

Figure: special characters are not displayed correctly

Before the HTML is added to the section, its special characters must be converted to HTML entities.

$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
\PhpOffice\PhpWord\Shared\Html::addHtml($section, $html, true, true);
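One caveat: since PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding() triggers a deprecation notice. A possible replacement is mb_encode_numericentity(), sketched below. Note that it produces numeric entities such as &#233; rather than named ones, and toNumericEntities() is a hypothetical wrapper name.

```php
<?php
/**
 * Converts every non-ASCII character of a UTF-8 string into a numeric entity,
 * leaving the ASCII markup itself untouched.
 */
function toNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, bitmask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$encoded = toNumericEntities('<p>Café 10 €</p>');
```

The result can be passed to addHtml() in the same way as the output of the mb_convert_encoding() call above.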

Figure: special characters are displayed correctly

Alternatives to PHPWord

  • pandoc is a powerful command-line tool with rich functionality. From PHP it is best used as an external binary, which suits any project. An example invocation looks like this: pandoc -s input.html -o output.docx --from html.
  • Any other paid and free libraries or APIs
  • pure PHP
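For the pandoc route, the call from PHP could be prepared roughly as follows. This is a sketch, not a tested integration: buildPandocCommand() is a hypothetical helper, the file names are examples, and the pandoc binary is assumed to be installed and available on the PATH.

```php
<?php
/**
 * Builds a shell command that asks pandoc to convert an HTML file to DOCX.
 * Both paths are escaped so that spaces or quotes cannot break the command.
 */
function buildPandocCommand(string $inputHtml, string $outputDocx): string
{
    return sprintf(
        'pandoc --from=html --to=docx --output=%s %s',
        escapeshellarg($outputDocx),
        escapeshellarg($inputHtml)
    );
}

$command = buildPandocCommand('input.html', 'my report.docx');

// Running the command requires the pandoc binary, for example:
// exec($command, $outputLines, $exitCode);
```

Building the command separately from running it keeps the escaping easy to check before anything is executed.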

Conclusion

Working with HTML in PHP is resource-intensive, but with some effort any feature can be adapted. PHPWord is a good and fairly mature library, but it has its own nuances when working with HTML. Symfony DomCrawler and PHP Tidy make a great combination with it: together they are indispensable for cleaning up HTML, fixing semantics and making a Word document look great.

Despite all the problems described in this article, PHPWord remains a powerful and useful tool for generating Word documents in PHP.
