More LaTeX Hacking

After I presented “Are Text-Only Data Formats Safe? Or, Use This LaTeX Class File to Pwn Your Computer” at LEET 2010, I was asked if I planned to do any more research on TeX. My answer was no. It was a fun little project, but not really part of my Research Agenda™. Imagine my surprise to find myself once again writing about TeX hacking.

One question that came out of this project was whether, having researched the internals of LaTeX, I could write a blacklist filter that would make it safe to embed a preview service in a website. Publicly, my answer was no, but secretly, my answer was yes. Of course I can do this, I told myself. After all, there were just a few macros that really needed to be filtered out. I could prevent category code changes. I could leverage TeX itself to perform the filtering, simply redefining the troublesome macros and primitives to be harmless. This isn’t a new idea, but I figured that I could do it correctly.

Ah, hubris. Imagine my surprise when I discovered, via the excellent TeX-SX, a new primitive in pdfTeX: \pdfprimitive. This primitive causes the control sequence that follows it to be executed with its primitive meaning, if it has one. For example, if \input has been redefined (e.g., to \relax), one can write \pdfprimitive\input to get the original meaning back. I guess it is a good thing I never wrote such a blacklist.
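
To make the bypass concrete, here is a minimal plain pdfTeX sketch. The file name secret.tex is hypothetical, and the \let line merely stands in for whatever a redefinition-based blacklist might do; it is not taken from any real filter.

```latex
% Minimal sketch for plain pdfTeX; run with: pdftex demo.tex
% Assumption: a file named secret.tex (hypothetical) sits in the current directory.

% A naive blacklist neutralizes \input by redefining it.
\let\input\relax

% \pdfprimitive makes the next control sequence act with its primitive
% meaning for this one use, so the file is read despite the redefinition.
\pdfprimitive\input secret.tex

\bye
```

The redefinition is still in place after this; \pdfprimitive only sidesteps it for the one control sequence that follows, which is all an attacker needs.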

I’m happy to say that our “promising approach” of using Web2C’s runtime configuration parameter openin_any still appears to be the best way to prevent data exfiltration.
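
For reference, a sketch of what that setting could look like in a texmf.cnf file. The choice of the p value here is my assumption about what a locked-down preview service would want; check the comments in your distribution's own texmf.cnf before relying on it.

```text
% texmf.cnf fragment (illustrative only)
% openin_any controls which files TeX may open for reading:
%   a (any)        : any file can be opened
%   r (restricted) : disallow opening dotfiles
%   p (paranoid)   : as r, and also disallow parent directories and
%                    restrict absolute paths to be under $TEXMFOUTPUT
openin_any = p
```

Because this check is enforced by the Web2C runtime rather than by TeX macros, it applies no matter how the input reaches the primitive.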

So now that I know about \pdfprimitive, could I write a safe blacklist?