Opened 8 years ago
Last modified 6 years ago
#1371 closed Bug
Problems after the update when using HTML compression with MS Word product descriptions (at Version 1)
| Reported by: | Torsten Riemer | Owned by: | somebody |
|---|---|---|---|
| Priority: | high | Milestone: | modified-shop-2.0.5.0 |
| Component: | Shop | Version: | 2.0.3.0 |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
Description
After updating from 1.0x to 2.x, the new "/includes/external/compactor/compactor.php" can cause some product detail pages to render as blank white pages.
The culprit is the regex on line 261:
$html = preg_replace('/<!--(.|\s)*?-->/', '', $html);
If this line is commented out, the blank pages no longer occur, even with product descriptions pasted from MS Word, unsightly as that MS Word XML markup may be.
Because of that Word markup, I once put together a Smarty modifier based on the following function:
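A plausible explanation, offered here as an assumption rather than a confirmed diagnosis: `preg_replace()` returns `NULL` when PCRE aborts on an error (for example when a limit is hit on a large Word comment), and assigning that result back to `$html` leaves nothing to output. The sketch below guards against that case. The helper name `strip_html_comments_safe` is hypothetical and not part of compactor.php.

```php
<?php
// Hypothetical helper, not part of compactor.php: strips HTML comments but
// falls back to the unmodified input if PCRE reports an error.
function strip_html_comments_safe($html)
{
    // [\s\S] matches any character including newlines, without the
    // capturing alternation inside a repeated group used by (.|\s)*?
    $result = preg_replace('/<!--[\s\S]*?-->/', '', $html);

    // preg_replace() returns null on failure; keep the original markup
    // rather than emitting an empty page.
    if ($result === null) {
        error_log('strip_html_comments_safe: PCRE error code ' . preg_last_error());
        return $html;
    }

    return $result;
}
```

With this guard, a failing regex would at worst leave the comments in the page instead of blanking it.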
<?php
function strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br>')
{
    mb_regex_encoding('UTF-8');
    //replace MS special characters first
    $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array('\'', '\'', '"', '"', '-');
    $text = preg_replace($search, $replace, $text);
    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if(mb_stripos($text, '/*') !== FALSE){
        $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
    }
    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
    //'<1' becomes '< 1'(note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);
    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
    //strip out inline css and simplify style tags
    $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    $text = preg_replace($search, $replace, $text);
    //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
    //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
    //some MS Style Definitions - this last bit gets rid of any leftover comments */
    $num_matches = preg_match_all("/\<!--/u", $text, $matches);
    if($num_matches){
        $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
    }
    return $text;
}
?>
The same function is also used in the following two sources:
https://gist.github.com/dave1010/674071
https://gist.github.com/purwandi/2862265
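One detail of that function is worth a note. The `mb_ereg*` family takes its pattern without delimiters, so in `mb_eregi_replace('#/\*.*?\*/#s', ...)` the `#` characters and the trailing `s` are, as far as I can tell, treated as literal text and the C-style comments are probably never removed. A minimal sketch of what the step appears to intend, using `preg_replace()` instead; the helper name `strip_c_style_comments` is hypothetical:

```php
<?php
// Hypothetical helper: removes /* ... */ blocks (e.g. CSS comments that
// MS Word embeds in its style definitions) from a string.
function strip_c_style_comments($text)
{
    // Cheap pre-check so the regex only runs when a comment opener exists.
    if (strpos($text, '/*') === false) {
        return $text;
    }

    // '#' is the PCRE delimiter here; the 's' modifier lets '.' match newlines.
    $result = preg_replace('#/\*.*?\*/#s', '', $text);

    // preg_replace() returns null on error; fall back to the input.
    return $result === null ? $text : $result;
}
```

Whether this step is needed at all depends on how `strip_tags()` copes with the Word comments in a given PHP version, so it is best treated as optional.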
The following lines caused problems for me:
#mb_regex_encoding('UTF-8'); // Tomcraft - not used!
#$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // Tomcraft - not used!
#$text = preg_replace($search, $replace, $text); // Tomcraft - not used!
So I commented them out, and my function now looks like this:
<?php
function smarty_modifier_stripwordhtml($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><p><span><script><fb:like-box><iframe><img><a><h1><h2><h3><h4><div><table><tr><td><tbody>')
{
    #mb_regex_encoding('UTF-8'); // Tomcraft - not used!
    //replace MS special characters first
    $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array('\'', '\'', '"', '"', '-');
    $text = preg_replace($search, $replace, $text);
    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    #$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // Tomcraft - not used!
    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if(mb_stripos($text, '/*') !== FALSE){
        $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
    }
    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
    //'<1' becomes '< 1'(note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);
    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
    //strip out inline css and simplify style tags
    $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    #$text = preg_replace($search, $replace, $text); // Tomcraft - not used!
    //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
    //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
    //some MS Style Definitions - this last bit gets rid of any leftover comments */
    $num_matches = preg_match_all("/\<!--/u", $text, $matches);
    if($num_matches){
        $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
    }
    return $text;
}
?>
I have now searched Google for the function again and found a more recent source here: https://github.com/OpenUpSA/pmg-export/blob/master/application/controllers/convert.php
protected function _strip_word_html($text, $allowed_tags = '<b><i><sup><sub><em><strong><u><br><p><table><tr><td><th><ul><ol><li>')
{
    // if (strlen($text) > 100000) {
    //     return "Too big to process";
    // }
    mb_regex_encoding('UTF-8');
    //replace MS special characters first
    $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array('\'', '\'', '"', '"', '-');
    $text = preg_replace($search, $replace, $text);
    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if(mb_stripos($text, '/*') !== FALSE){
        $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
    }
    $text = str_replace( chr( 194 ) . chr( 160 ), ' ', $text );
    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
    //'<1' becomes '< 1'(note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);
    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
    //strip out inline css and simplify style tags
    //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
    //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
    //some MS Style Definitions - this last bit gets rid of any leftover comments */
    // $num_matches = preg_match_all("/\<!--/u", $text, $matches);
    // if($num_matches){
    // }
    $text = preg_replace('/<p.*?>(.*?)<\/p>/isu', '<p>$1</p>', $text);
    $text = preg_replace(':<[^/>]*>\s*</[^>]*>:', '', $text);
    $search = array('#<(strong|b )[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<strong>$2</strong>', '<i>$2</i>', '<u>$1</u>');
    $text = preg_replace($search, $replace, $text);
    $text = preg_replace('/<!--(.*?)-->/isu', '', $text);
    $text = preg_replace('/<br(.*?)\/>/isu', '<br/>', $text);
    return $text;
}
As a test, one could replace line 261 in "/includes/external/compactor/compactor.php":
$html = preg_replace('/<!--(.|\s)*?-->/', '', $html);
with:
$html = preg_replace('/<!--(.*?)-->/isu', '', $html);
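One caveat with the `u` modifier in this pattern: `preg_replace()` returns `NULL` when the subject is not valid UTF-8, which would again blank the page for shops whose descriptions are stored in another encoding. A minimal sketch that tries the UTF-8 pattern first and retries without `u`; the helper name `compactor_strip_comments` is hypothetical and not part of compactor.php:

```php
<?php
// Hypothetical wrapper around the proposed replacement pattern.
function compactor_strip_comments($html)
{
    // Proposed pattern: lazy match, case-insensitive, dot matches newlines, UTF-8.
    $result = preg_replace('/<!--(.*?)-->/isu', '', $html);

    // With the 'u' modifier, invalid UTF-8 input makes preg_replace() return
    // null; retry in byte mode instead of losing the whole page.
    if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
        $result = preg_replace('/<!--(.*?)-->/is', '', $html);
    }

    // Any remaining failure: leave the markup untouched.
    return $result === null ? $html : $result;
}
```

For example, a Word conditional block such as `<!--[if gte mso 9]>...<![endif]-->` is removed as one comment, because the lazy match stops at the first `-->`.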
In general, though, I am in favour of us providing a proper MS Word filter as an optional feature.
