﻿id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc	blockedby	blocking
1371	Problems after update when using HTML compression and MS Word product descriptions	Torsten Riemer	somebody	"After updating from 1.0x to 2.x, the new ""/includes/external/compactor/compactor.php"" can cause some product detail pages to render as blank white pages.

The culprit is the regex on line 261:
{{{
		  $html = preg_replace('/<!--(.|\s)*?-->/', '', $html);
}}}
With this line commented out, the blank pages no longer occur, even with MS Word product descriptions, however ugly that MS Word XML markup may be.
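
A possible explanation, which this ticket does not confirm: the alternation (.|\s)*? makes PCRE keep a backtracking state for every single character, so a long MS Word comment can exhaust the backtrack limit or the stack. In that case preg_replace() returns null instead of a string, and assigning null to $html would leave nothing to output. The sketch below uses a hypothetical helper, try_strip_comments(), which is not part of compactor.php, to show how such a failure could be detected while keeping the original markup:
{{{
<?php
// Hypothetical helper (an assumption for illustration, not shop code):
// runs the original pattern and reports a PCRE failure instead of
// silently passing null on to the caller.
function try_strip_comments($html)
{
    $result = preg_replace('/<!--(.|\s)*?-->/', '', $html);
    if ($result === null) {
        // preg_replace() returns null on error; keep the input so the page is not blank.
        return array($html, preg_last_error());
    }
    return array($result, PREG_NO_ERROR);
}

list($out, $err) = try_strip_comments('<p>a</p><!-- x --><p>b</p>');
// For this small input the comment is removed and $err is PREG_NO_ERROR.
}}}
Whether a given product description actually triggers the error depends on its length and on the PCRE settings of the server, so logging the error code for an affected product would be the way to verify this.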

For this reason I once put together a Smarty modifier based on the following function:
{{{
<?php

    function strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br>')
    {
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
        $replace = array('\'', '\'', '""', '""', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
        if(mb_stripos($text, '/*') !== FALSE){
            $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
        }
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments */
        $num_matches = preg_match_all(""/\<!--/u"", $text, $matches);
        if($num_matches){
              $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
        }
        return $text;
    }
?>
}}}
Source: http://man.hubwiz.com/docset/PHP.docset/Contents/Resources/Documents/php.net/manual/en/function.strip-tags.html

The same function is also used in the following two sources:
https://gist.github.com/dave1010/674071
https://gist.github.com/purwandi/2862265

The following lines caused problems for me:
{{{
    #mb_regex_encoding('UTF-8'); // Tomcraft - not used!
    #$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // Tomcraft - not used!
    #$text = preg_replace($search, $replace, $text); // Tomcraft - not used!
}}}
I therefore commented them out, so my function now looks like this:
{{{
<?php

function smarty_modifier_stripwordhtml($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><p><span><script><fb:like-box><iframe><img><a><h1><h2><h3><h4><div><table><tr><td><tbody>')
{
    #mb_regex_encoding('UTF-8'); // Tomcraft - not used!
    //replace MS special characters first
    $search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
    $replace = array('\'', '\'', '""', '""', '-');
    $text = preg_replace($search, $replace, $text);
    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    #$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // Tomcraft - not used!
    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if(mb_stripos($text, '/*') !== FALSE){
        $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
    }
    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
    //'<1' becomes '< 1'(note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);
    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
    //strip out inline css and simplify style tags
    $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    #$text = preg_replace($search, $replace, $text); // Tomcraft - not used!
    //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
    //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
    //some MS Style Definitions - this last bit gets rid of any leftover comments */
    $num_matches = preg_match_all(""/\<!--/u"", $text, $matches);
    if($num_matches){
          $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
    }
    return $text;
}
?>
}}}
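One detail in the function above may be worth checking: mb_eregi_replace() takes its pattern without delimiters, so in '#/\*.*?\*/#s' the # characters and the trailing s are matched literally, and the call is unlikely to remove any C-style comments. A minimal sketch of the same step with preg_replace(), where # acts as the delimiter and s as a modifier, could look like this; strip_c_comments() is a hypothetical name, not part of any of the quoted sources:
{{{
<?php
// Sketch under the assumption that the intent is to drop /* ... */ blocks,
// including ones that span several lines.
function strip_c_comments($text)
{
    // # is the PCRE delimiter here; the s modifier lets . match newlines.
    $result = preg_replace('#/\*.*?\*/#s', '', $text);
    // On a PCRE error preg_replace() returns null; fall back to the input.
    return $result === null ? $text : $result;
}

$sample = 'a /* first line' . chr(10) . 'second line */ b';
$stripped = strip_c_comments($sample);
// $stripped is 'a  b' (the spaces on both sides of the comment remain).
}}}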
I have now searched Google for the function again and found a more recent source here: https://github.com/OpenUpSA/pmg-export/blob/master/application/controllers/convert.php
{{{
	protected function _strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><p><table><tr><td><th><ul><ol><li>')
    {
    	// if (strlen($text) > 100000) {
    	// 	return ""Too big to process"";
    	// }
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
        $replace = array('\'', '\'', '""', '""', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
        if(mb_stripos($text, '/*') !== FALSE){
            $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
        }
        $text = str_replace( chr( 194 ) . chr( 160 ), ' ', $text );
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        
        //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments */
        // $num_matches = preg_match_all(""/\<!--/u"", $text, $matches);
        // if($num_matches){
              
        // }
        $text = preg_replace('/<p.*?>(.*?)<\/p>/isu', '<p>$1</p>', $text);
        $text = preg_replace(':<[^/>]*>\s*</[^>]*>:', '', $text);
        $search = array('#<(strong|b )[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<strong>$2</strong>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        $text = preg_replace('/<!--(.*?)-->/isu', '', $text);
        $text = preg_replace('/<br(.*?)\/>/isu', '<br/>', $text);
        return $text;
    }
}}}
As a test, line 261 in ""/includes/external/compactor/compactor.php"":
{{{
$html = preg_replace('/<!--(.|\s)*?-->/', '', $html);
}}}
could be replaced with:
{{{
$html = preg_replace('/<!--(.*?)-->/isu', '', $html);
}}}
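As a rough check of the pattern /<!--(.*?)-->/isu, the sketch below runs it over a made-up Word-style conditional comment that spans several lines. The sample markup and the variable names are assumptions for illustration. The s modifier lets the dot match newlines, so the alternation is no longer needed. The u modifier makes PCRE treat the input as UTF-8, which means that input that is not valid UTF-8 makes preg_replace() return null, so the result is checked before it is used:
{{{
<?php
// Made-up sample: a Word conditional comment spread over three lines.
$html = implode(chr(10), array(
    '<p>Before</p><!--[if gte mso 9]><xml>',
    '<w:WordDocument></w:WordDocument>',
    '</xml><![endif]--><p>After</p>',
));

$clean = preg_replace('/<!--(.*?)-->/isu', '', $html);
if ($clean === null) {
    // Keep the unmodified markup rather than emitting an empty page.
    $clean = $html;
}
// $clean is now '<p>Before</p><p>After</p>'

// A lone 0xE4 byte (the letter a-umlaut in ISO-8859-1) is not valid UTF-8,
// so the same call fails and returns null.
$bad = preg_replace('/<!--(.*?)-->/isu', '', chr(0xE4) . '<!-- x -->');
$badError = preg_last_error(); // PREG_BAD_UTF8_ERROR
}}}
If the shop pages are not delivered as UTF-8, dropping the u modifier may be the safer choice; this has not been tested against the shop.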
In general, though, I am in favor of providing a proper MS Word filter as an optional feature."	Bug/Fehler	new	hoch	modified-shop-2.0.4.0	Shop	2.0.3.0					
