This function converts Microsoft Word (.docx) documents into clean, readable Markdown. It uses PHPWord to render the document as HTML, strips the image tags from that HTML, and converts what remains to Markdown. This is particularly useful for content management systems, documentation platforms, and Markdown-based projects.
Uses
- Converting Word documents to Markdown for easy integration into static site generators (e.g., Jekyll, Hugo).
- Importing Word content into blogging platforms or knowledge bases.
- Simplifying documentation workflow by maintaining source content in Word while managing outputs in Markdown.
- Cleaning HTML content by removing images and preserving only the text and table structures.
Details
- File Conversion: The function uses `PHPWord` to read the .docx file and convert it to HTML.
- HTML Processing: It uses `DOMDocument` to extract the body content from the generated HTML.
- Image Removal: Any `<img>` tags within the HTML body are detected and removed to ensure a clean Markdown output.
- HTML to Markdown Conversion: The `HtmlConverter` from `league/html-to-markdown` is used to transform the HTML content into well-structured Markdown.
- Table Support: Using `TableConverter`, the function retains table structures during the conversion.
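The image-removal step can be tried on its own, without PHPWord or any Composer package. The following is a minimal sketch using only PHP's built-in DOM extension; the helper name `stripImages` and the sample HTML are illustrative assumptions, not part of the original function.

```php
<?php

// Hypothetical helper: removes every <img> element from an HTML fragment.
function stripImages(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // Wrap the fragment so that a <body> element is always present.
    $dom->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();

    // getElementsByTagName() returns a live list: removing item(0)
    // shifts the next image into position 0, so the loop ends
    // once no images are left.
    $images = $dom->getElementsByTagName('img');
    while ($img = $images->item(0)) {
        $img->parentNode->removeChild($img);
    }

    // Serialize the children of <body> back into an HTML string.
    $body = $dom->getElementsByTagName('body')->item(0);
    $result = '';
    foreach ($body->childNodes as $node) {
        $result .= $dom->saveHTML($node);
    }

    return $result;
}

echo stripImages('<p>Hello <img src="a.png" alt="">world</p><img src="b.png" alt="">'), PHP_EOL;
```

The `while` loop is used instead of `foreach` because removing nodes while iterating a live list with `foreach` can skip elements.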
PHP Function
Here is the complete PHP function for converting Word documents to Markdown:
```php
use PhpOffice\PhpWord\IOFactory;
use League\HtmlToMarkdown\HtmlConverter;
use League\HtmlToMarkdown\Converter\TableConverter;

// Declared as a plain function; add `public` if you place it inside a class.
function convertWordToMarkdown(string $filePath): string
{
    // Convert DocX to HTML
    $phpWord = IOFactory::load($filePath);
    $htmlWriter = IOFactory::createWriter($phpWord, 'HTML');
    ob_start();
    $htmlWriter->save('php://output');
    $html = ob_get_clean();

    // Retrieve body content from HTML
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        // Nothing to convert if the generated HTML has no body
        return '';
    }
    $bodyHtml = '';

    // Remove <img> tags from the body content
    $images = $dom->getElementsByTagName('img');
    while ($img = $images->item(0)) {
        $img->parentNode->removeChild($img);
    }

    // Rebuild the body HTML without the images
    foreach ($body->childNodes as $node) {
        $bodyHtml .= $dom->saveHTML($node);
    }

    // Convert HTML to Markdown
    $converter = new HtmlConverter(['strip_tags' => true]);
    $converter->getEnvironment()->addConverter(new TableConverter());

    return $converter->convert($bodyHtml);
}
```
Installation Instructions
To use this function, ensure you have the following dependencies installed using Composer:
```shell
composer require phpoffice/phpword
composer require league/html-to-markdown
```
Example Use Case
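```php
// Load the Composer autoloader (skip this if your framework already does it)
require 'vendor/autoload.php';

$filePath = 'path/to/document.docx';
$markdown = convertWordToMarkdown($filePath);
echo $markdown;
```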
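In a documentation workflow the Markdown is usually written to a file rather than echoed. The sketch below shows one way to do that, assuming the output should sit next to the source file with an `.md` extension. The helper name `convertToFile` and its callable parameter are hypothetical; the callable stands in for `convertWordToMarkdown` so the helper can be tested without a real document.

```php
<?php

// Hypothetical helper: converts one .docx file and writes the Markdown
// next to it. $convert is any callable that takes a file path and
// returns a Markdown string, such as convertWordToMarkdown.
function convertToFile(string $docxPath, callable $convert): string
{
    if (!is_readable($docxPath)) {
        throw new InvalidArgumentException("Cannot read file: $docxPath");
    }

    // Build the output path: same directory, same base name, .md extension.
    $info = pathinfo($docxPath);
    $mdPath = $info['dirname'] . DIRECTORY_SEPARATOR . $info['filename'] . '.md';

    if (file_put_contents($mdPath, $convert($docxPath)) === false) {
        throw new RuntimeException("Cannot write file: $mdPath");
    }

    return $mdPath;
}
```

With the function from this article in scope, it could be called as `convertToFile('path/to/document.docx', 'convertWordToMarkdown');`, which returns the path of the Markdown file it wrote.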
This function suits developers and content creators who need an automated Word-to-Markdown conversion that preserves text and table structure. Because images are removed during conversion, documents that rely on figures need those handled separately. Automating the conversion reduces the time spent on manual content migration and formatting adjustments.