CodexWP

Convert Word to Markdown Function

This function converts Microsoft Word (.docx) documents to clean, readable Markdown. It uses PHPWord to render the document as HTML, removes image tags from the result, and passes the remaining HTML to league/html-to-markdown. This is particularly useful for content management systems, documentation platforms, and Markdown-based projects.

Uses

  • Converting Word documents to Markdown for easy integration into static site generators (e.g., Jekyll, Hugo).
  • Importing Word content into blogging platforms or knowledge bases.
  • Simplifying documentation workflow by maintaining source content in Word while managing outputs in Markdown.
  • Cleaning HTML content by removing images and preserving only the text and table structures.

Details

  • File Conversion: The function uses PHPWord to read the .docx file and convert it to HTML.
  • HTML Processing: It uses DOMDocument to extract the body content from the generated HTML.
  • Image Removal: Any <img> tags within the HTML body are detected and removed to ensure a clean Markdown output.
  • HTML to Markdown Conversion: The HtmlConverter from league/html-to-markdown is used to transform the HTML content into well-structured Markdown.
  • Table Support: Using TableConverter, the function retains table structures during the conversion.
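
The image-removal step can be tried on its own with nothing but PHP's DOM extension. The following is a minimal sketch, not the function itself: the helper name stripImages and the sample HTML are assumptions made for illustration, and the input is wrapped in html/body tags so that a bare fragment can be parsed.

```php
<?php
// Hypothetical helper (not part of the original function): removes every
// <img> element from an HTML fragment and returns the remaining markup.
function stripImages(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();

    // getElementsByTagName() returns a live list: removing a node shrinks it,
    // so the loop keeps taking item(0) until the list is empty.
    $images = $dom->getElementsByTagName('img');
    while ($img = $images->item(0)) {
        $img->parentNode->removeChild($img);
    }

    // Serialize the children of <body> back into a string.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $node) {
        $out .= $dom->saveHTML($node);
    }
    return $out;
}

echo stripImages('<p>Intro <img src="a.png" alt="">text</p><table><tr><td>1</td></tr></table>');
```

Because the node list is live, a plain foreach over it while removing nodes would skip elements; taking item(0) repeatedly avoids that.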

PHP Function

Here is the complete PHP function for converting Word documents to Markdown:

use PhpOffice\PhpWord\IOFactory;
use League\HtmlToMarkdown\HtmlConverter;
use League\HtmlToMarkdown\Converter\TableConverter;

// Declared as a plain function; add a visibility keyword such as "public"
// only when placing it inside a class.
function convertWordToMarkdown(string $filePath): string
{
    // Convert DocX to HTML, capturing the writer's output in a buffer
    $phpWord = IOFactory::load($filePath);
    $htmlWriter = IOFactory::createWriter($phpWord, 'HTML');
    ob_start();
    $htmlWriter->save('php://output');
    $html = ob_get_clean();

    // Parse the HTML, suppressing libxml warnings about imperfect markup
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Retrieve the body element; without one there is nothing to convert
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Remove <img> tags (the node list is live, so always take item(0))
    $images = $dom->getElementsByTagName('img');
    while ($img = $images->item(0)) {
        $img->parentNode->removeChild($img);
    }

    // Rebuild the body HTML without the images
    $bodyHtml = '';
    foreach ($body->childNodes as $node) {
        $bodyHtml .= $dom->saveHTML($node);
    }

    // Convert HTML to Markdown, with table support enabled
    $converter = new HtmlConverter(['strip_tags' => true]);
    $converter->getEnvironment()->addConverter(new TableConverter());

    return $converter->convert($bodyHtml);
}

Installation Instructions

To use this function, ensure you have the following dependencies installed using Composer:

composer require phpoffice/phpword
composer require league/html-to-markdown

Example Use Case

$filePath = 'path/to/document.docx';
$markdown = convertWordToMarkdown($filePath);
echo $markdown;
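
In practice the Markdown is often written to a file rather than echoed. The sketch below shows one way to do that; the wrapper name saveAsMarkdown and its callable parameter are assumptions introduced here for illustration. The callable stands in for convertWordToMarkdown, which keeps the wrapper usable without the Composer dependencies.

```php
<?php
// Hypothetical wrapper (not part of the original post). $convert is any
// callable that takes a file path and returns Markdown, for example the
// convertWordToMarkdown function.
function saveAsMarkdown(string $docxPath, callable $convert): string
{
    if (!is_readable($docxPath)) {
        throw new InvalidArgumentException("Cannot read file: $docxPath");
    }

    // Swap the extension for .md, keeping the same directory and base name
    $info = pathinfo($docxPath);
    $target = $info['dirname'] . DIRECTORY_SEPARATOR . $info['filename'] . '.md';

    if (file_put_contents($target, $convert($docxPath)) === false) {
        throw new RuntimeException("Cannot write file: $target");
    }
    return $target;
}

// Usage, assuming convertWordToMarkdown() is defined as a plain function:
// $mdPath = saveAsMarkdown('path/to/document.docx', 'convertWordToMarkdown');
```

Checking readability up front gives a clearer error than letting the conversion fail on a missing file.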

This function suits developers and content creators who need to automate Word-to-Markdown conversion while keeping text and table structures intact; images are dropped from the output. Automating the conversion reduces the time spent on manual content migration and formatting adjustments.
