Opened 8 years ago

Last modified 6 years ago

#1371 closed Bug

Problems after update when using HTML compression and MS Word product descriptions (at Version 1)

Reported by: Torsten Riemer Owned by: somebody
Priority: high Milestone: modified-shop-2.0.5.0
Component: Shop Version: 2.0.3.0
Keywords: Cc:
Blocked By: Blocking:

Description (last modified by Torsten Riemer)

After updating from 1.0x to 2.x, the new "/includes/external/compactor/compactor.php" can in some cases cause blank (white) product detail pages.

The culprit is the regex in line 261:

		  $html = preg_replace('/<!--(.|\s)*?-->/', '', $html);

If this line is commented out, the blank pages no longer occur even with MS Word product descriptions, however ugly the MS Word XML markup may be.
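One plausible explanation, stated here as an assumption and not confirmed in this ticket: the alternation `(.|\s)*?` makes PCRE do a lot of work per character, so a long Word comment block can run into `pcre.backtrack_limit` or a similar internal limit. When that happens, `preg_replace()` returns `null` instead of a string, and assigning that result back to `$html` would leave nothing to output. The following is a minimal sketch of a guard against that case. The helper name `strip_html_comments_safe` is hypothetical and not part of compactor.php.

```php
<?php
// Hypothetical helper, not part of compactor.php.
// Removes HTML comments, but never returns null to the caller.
function strip_html_comments_safe($html)
{
    // The "s" modifier lets "." match newlines as well,
    // so the (.|\s) alternation is not needed.
    $result = preg_replace('/<!--.*?-->/s', '', $html);

    // preg_replace() returns null when PCRE reports an error
    // (preg_last_error() then holds the error code).
    if ($result === null) {
        // Fall back to the untouched markup instead of a blank page.
        return $html;
    }

    return $result;
}
```

With a guard like this, a failing regex would only cost the comment stripping for that page, not the page itself.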

For this reason, I once put together a Smarty modifier based on the following function:

<?php

    function strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br>')
    {
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
        $replace = array('\'', '\'', '"', '"', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
        if(mb_stripos($text, '/*') !== FALSE){
            $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
        }
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments */
        $num_matches = preg_match_all("/\<!--/u", $text, $matches);
        if($num_matches){
              $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
        }
        return $text;
    }
?>

Source: http://man.hubwiz.com/docset/PHP.docset/Contents/Resources/Documents/php.net/manual/en/function.strip-tags.html

The same function is also used in the following two sources:
https://gist.github.com/dave1010/674071
https://gist.github.com/purwandi/2862265

The following lines caused problems for me:

    #mb_regex_encoding('UTF-8'); // Tomcraft - not used!
    #$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // Tomcraft - not used!
    #$text = preg_replace($search, $replace, $text); // Tomcraft - not used!

So I commented them out, which leaves my function looking like this:

<?php

function smarty_modifier_stripwordhtml($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><p><span><script><fb:like-box><iframe><img><a><h1><h2><h3><h4><div><table><tr><td><tbody>')
{
    #mb_regex_encoding('UTF-8'); // Tomcraft - not used!
    //replace MS special characters first
    $search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
    $replace = array('\'', '\'', '"', '"', '-');
    $text = preg_replace($search, $replace, $text);
    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    #$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // Tomcraft - not used!
    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if(mb_stripos($text, '/*') !== FALSE){
        $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
    }
    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
    //'<1' becomes '< 1'(note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);
    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
    //strip out inline css and simplify style tags
    $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    #$text = preg_replace($search, $replace, $text); // Tomcraft - not used!
    //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
    //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
    //some MS Style Definitions - this last bit gets rid of any leftover comments */
    $num_matches = preg_match_all("/\<!--/u", $text, $matches);
    if($num_matches){
          $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
    }
    return $text;
}
?>
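One detail in the function above may be worth checking: the `mb_ereg*` functions take patterns without delimiters, so in `mb_eregi_replace('#/\*.*?\*/#s', ...)` the `#` characters and the trailing `s` are presumably matched literally, and C-style comments would then never be removed. The following is a minimal sketch of the same step using `preg_replace()`, which does expect delimiters. The function name `strip_c_style_comments` is hypothetical.

```php
<?php
// Hypothetical helper: removes C-style comments such as /* ... */
// from Word-generated markup before strip_tags() is applied.
function strip_c_style_comments($text)
{
    // Cheap check first: nothing to do if no comment opener is present.
    if (strpos($text, '/*') === false) {
        return $text;
    }

    // "#" is the delimiter here; "s" lets "." span multiple lines.
    $result = preg_replace('#/\*.*?\*/#s', '', $text);

    // Keep the original text if PCRE reports an error.
    return $result === null ? $text : $result;
}
```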

I have now searched Google for the function again and found a more recent source here: https://github.com/OpenUpSA/pmg-export/blob/master/application/controllers/convert.php

	protected function _strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><p><table><tr><td><th><ul><ol><li>')
    {
    	// if (strlen($text) > 100000) {
    	// 	return "Too big to process";
    	// }
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
        $replace = array('\'', '\'', '"', '"', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
        if(mb_stripos($text, '/*') !== FALSE){
            $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
        }
        $text = str_replace( chr( 194 ) . chr( 160 ), ' ', $text );
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        
        //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments */
        // $num_matches = preg_match_all("/\<!--/u", $text, $matches);
        // if($num_matches){
              
        // }
        $text = preg_replace('/<p.*?>(.*?)<\/p>/isu', '<p>$1</p>', $text);
        $text = preg_replace(':<[^/>]*>\s*</[^>]*>:', '', $text);
        $search = array('#<(strong|b )[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<strong>$2</strong>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        $text = preg_replace('/<!--(.*?)-->/isu', '', $text);
        $text = preg_replace('/<br(.*?)\/>/isu', '<br/>', $text);
        return $text;
    }

As a test, line 261 in "/includes/external/compactor/compactor.php":

$html = preg_replace('/<!--(.|\s)*?-->/', '', $html);

could be replaced with:

$html = preg_replace('/<!--(.*?)-->/isu', '', $html);
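One caveat with the `u` modifier in this pattern: it makes PCRE treat the subject as UTF-8, and if `$html` is not valid UTF-8 (for example in a shop that runs on a single-byte charset), `preg_replace()` returns `null`. The following is a minimal sketch that retries without `u` in that case. The helper name `strip_html_comments_u` is hypothetical, and whether this fallback is wanted depends on the shop's charset setup.

```php
<?php
// Hypothetical helper illustrating the proposed pattern with a charset fallback.
function strip_html_comments_u($html)
{
    $result = preg_replace('/<!--(.*?)-->/isu', '', $html);

    // With "u", a subject that is not valid UTF-8 makes PCRE fail.
    if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
        // Retry byte-wise; the pattern itself contains only ASCII.
        $result = preg_replace('/<!--(.*?)-->/is', '', $html);
    }

    // Never hand null back to the caller.
    return $result === null ? $html : $result;
}
```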

In general, though, I am in favor of us offering a proper MS Word filter as an optional feature.

Change History (1)

comment:1 by Torsten Riemer, 8 years ago

Description: modified (diff)