One of the side projects of the Scorpio Framework is an article editing system that includes syntax highlighting. This was done so that documentation, examples etc. could be easily produced.
Unfortunately, nothing ever runs smoothly! I soon found issues where the code would be converted back to the actual characters instead of the encoded entities I had entered. A little investigation showed that TinyMCE was creating the correct syntax; it was just being mangled somewhere afterwards.
In true developer style, I treated it as more of an annoyance than anything and found ways around it - using textareas, adding spaces etc. - but this wore thin, as I knew it should not be necessary.
So I finally decided to investigate what was happening.
The results are surprising and not at all what I was expecting. Before getting there though, I went through all the usual steps:
- has the code been escaped correctly?
- is it being submitted?
- is it being escaped back into TinyMCE?
- is TinyMCE set up to encode the characters correctly? etc.
After trying a few things, it became apparent that something was happening at the PHP end. As the article system is built on Scorpio, it runs the utilityInputFilter::filterUnsafeRaw() filter over the page content variable. I added logging to compare the filtered data with the raw _REQUEST data.
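The kind of logging described above can be sketched roughly as follows. This is only an illustration: it assumes that utilityInputFilter::filterUnsafeRaw() ends up calling filter_var() with FILTER_UNSAFE_RAW, and the helper name compareRawAndFiltered() and the sample string are made up for the example rather than taken from Scorpio.

```php
<?php
// Hypothetical helper: returns the raw value, the filtered value and
// whether the two are byte-for-byte identical.
// Assumption: the framework filter boils down to this filter_var() call.
function compareRawAndFiltered(string $raw): array
{
    $filtered = filter_var($raw, FILTER_UNSAFE_RAW);

    return [
        'raw'       => $raw,
        'filtered'  => $filtered,
        'identical' => ($raw === $filtered),
    ];
}

// Stand-in for the submitted page content. In the real system this would
// come from the request data rather than a hard-coded string.
$sample = '<pre>&lt;blah_blah&gt;</pre>';
$result = compareRawAndFiltered($sample);

// Log both values side by side so that any difference is easy to spot.
error_log('raw:       ' . $result['raw']);
error_log('filtered:  ' . var_export($result['filtered'], true));
error_log('identical: ' . var_export($result['identical'], true));
```

Comparing with === matters here, because a loose comparison could hide subtle differences between the two strings.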
And that was when I finally saw it.
The filtered data was NOT the same as the raw data. Instead of the HTML entities I had submitted, I was seeing the actual characters: where the raw data contained &lt;blah_blah&gt;, the filtered data contained the literal <blah_blah>.
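The symptom is the same transformation that html_entity_decode() performs, so one way to confirm it in a log check is to test whether the filtered value equals the decoded raw value. The function name entitiesWereDecoded() is hypothetical, and this is a diagnostic sketch rather than anything Scorpio provides.

```php
<?php
// Hypothetical diagnostic: true when $filtered differs from $raw and is
// exactly what you get by decoding the HTML entities in $raw.
function entitiesWereDecoded(string $raw, string $filtered): bool
{
    if ($raw === $filtered) {
        return false; // nothing changed, so nothing was decoded
    }

    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return $decoded === $filtered;
}

// The pair of values described in the article.
$raw      = '&lt;blah_blah&gt;';
$filtered = '<blah_blah>';

var_dump(entitiesWereDecoded($raw, $filtered)); // bool(true)
var_dump(entitiesWereDecoded($raw, $raw));      // bool(false)
```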
Checking the (rather sparse) documentation did not turn up anything; only that filtering raw "does nothing" unless a flag is specified. No flags were in use.
So if you want to use character entities inside an HTML string and then filter it, the filter extension should not be used.
Perhaps, with that said, this article should really be called: When is FILTER_UNSAFE_RAW not filtered UNSAFE_RAW?!