Aug 13, 2024

Recovering Redacted Content: It's Scary How Easy It Is

Blurred and pixelated images can often be reverse-engineered to reveal censored details, making them unreliable for secure redaction.
Black bar redaction in digital documents is not foolproof despite being the industry standard; hidden text can be retrieved using simple software tools.
Proper redaction involves using the right tools for the job, but can go as far as printing, physically cutting out portions, and rescanning in pure black and white.

I censor images and documents every day as part of my job.

It's nothing overly sensitive like passwords or confidential information; it's mainly for communication and document creation: logos, names, or references that are irrelevant to my recipient.

If it's irrelevant, it's a distraction — so I crop, blur, pixelate, and even compress my images & documents before submission using a combination of CleanShot X and Clop. So imagine my surprise when I came across a YouTube short of Thor explaining that blur is an insecure form of censorship.

This sent me down a deep rabbit hole on the (lack of) security of not just blurred text, but also mosaics, pixelation, black bar censorship, and photo cropping.

Non-destructive Editing — Images and Text

I'll start with the history of Christopher Paul Neil, or Vico, his assigned nickname within the Interpol operation.

Vico was involved in a high-profile case of child sexual abuse involving at least 12 boys in Vietnam, Cambodia, and Thailand.
He appeared in more than 200 online photographs depicting the abuse, his face obscured by a digital swirl.
However, reversing the censorship simply involved applying the same swirl filter in the opposite direction, which made his face clearly visible.
Posting these reconstructed images online led to hundreds of witnesses contacting Interpol, which led to his arrest in October 2007.
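
To see why a swirl is reversible, here is a minimal sketch. The exact filter used in the Vico photos is not something I know, so the `swirl` function, its `strength` and `radius` parameters, and the sample point are all assumptions made for illustration. The idea is that a swirl only rotates each pixel around a centre, and the amount of rotation depends on the pixel's distance from that centre. Rotation does not change that distance, so twisting by the negative amount lands every pixel back where it started.

```python
import math

def swirl(x, y, strength, radius=50.0, center=(0.0, 0.0)):
    """Rotate (x, y) around center by an angle that fades with distance."""
    dx, dy = x - center[0], y - center[1]
    r = math.hypot(dx, dy)
    angle = strength * math.exp(-r / radius)  # strongest twist near the center
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (center[0] + dx * cos_a - dy * sin_a,
            center[1] + dx * sin_a + dy * cos_a)

original = (12.0, -7.5)                            # hypothetical pixel position
twisted = swirl(*original, strength=3.0)           # the "censored" position
restored = swirl(*twisted, strength=-3.0)          # same filter, opposite direction

print(original, twisted, restored)
```

On a real image some detail is lost to resampling each time the filter runs, so the result is an approximation rather than a perfect copy, but it is usually more than enough to recognise a face.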

Hubris was Vico's downfall, but we are just as likely to make the same mistakes in our day-to-day censoring efforts.

Reverse Engineering Text

What surprised me is how accessible these reversal tools are to the wider public.

For text-blurring reversal, there are two tools that I came across: Bishop Fox's Unredacter and Spipm's Depix. The two tools use different algorithms, but the underlying principles remain the same:

Use image cropping tools to isolate the censored text.
Obtain and load the font reference materials.
The software then conducts multiple experiments, creating blurred versions of every alphanumeric character within the reference font.
These blurred texts are then compared to the censored text, character by character.
If the generated blurred text looks strikingly similar to the censored text, it becomes easy to decipher the original characters.
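
The steps above can be sketched in a few lines. This is a toy illustration and not how Unredacter or Depix are actually implemented: the 4x4 bitmap `FONT`, the block size, and the secret string are all made up, and real text rarely lines up this neatly with the pixel grid. It only shows the core loop of pixelating every candidate character and picking the closest match.

```python
# Hypothetical 4x4 bitmap "font"; a real attack renders the actual typeface.
FONT = {
    "0": [".##.", "#..#", "#..#", ".##."],
    "1": ["..#.", ".##.", "..#.", ".###"],
    "7": ["####", "...#", "..#.", ".#.."],
    "L": ["#...", "#...", "#...", "####"],
}

def pixelate(glyph, block=2):
    """Average each block x block tile of a glyph into one grey value."""
    size = len(glyph)
    tiles = []
    for top in range(0, size, block):
        for left in range(0, size, block):
            ink = sum(glyph[r][c] == "#"
                      for r in range(top, top + block)
                      for c in range(left, left + block))
            tiles.append(ink / (block * block))
    return tuple(tiles)

def distance(a, b):
    """Sum of squared differences between two pixelated cells."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recover(censored_cells):
    """For each censored cell, pick the character whose pixelation is closest."""
    candidates = {ch: pixelate(glyph) for ch, glyph in FONT.items()}
    return "".join(
        min(candidates, key=lambda ch: distance(candidates[ch], cell))
        for cell in censored_cells
    )

secret = "L170"                                   # what the author tried to hide
censored = [pixelate(FONT[ch]) for ch in secret]  # what the reader actually sees
guess = recover(censored)
print(guess)
```

The attacker never sees `secret`, only `censored`. Because pixelation is deterministic, the same font and block size always produce the same blocks, which is what makes the comparison possible.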

This solution is straightforward, but not perfect:

It requires the user to know the font family of the censored text.
- That's because the same text in different fonts will produce wildly different censored images, which can be hard to reverse-engineer.
- It shouldn't be difficult to identify font families using WhatTheFont, or GPT-assisted font identification tools.
What's actually hard to decipher is the blur type and intensity, which requires trial and error.
- Photoshop alone contains 16 types of blurs which can affect the reverse-engineering process.
- Also, the higher the blur intensity, the harder it is to recover the original text.

On the flip side, this deciphering method is versatile in theory, allowing users to recover text that's been blurred, mosaicked, or even pixelated.

Image De-blurring

Face de-pixelation demonstration. Source: Google Brain

Apparently, a 2017 Google project had already found ways to fill in the details of very low-res images. The results are not exactly accurate, but it's amazing how much information can be extracted from such limited input. Now imagine how far this technology has come since then, given recent AI advancements and the availability of high-quality training data.

In my research, I was surprised to see a distinct overlap between reversing image censorship, image sharpening, and image upscaling.

This makes sense because all these technologies involve analysing surrounding pixels and using advanced algorithms to make informed decisions on what the missing pixel data should be.

I've found a GitHub repo compiling most of the important research done in this area since 2006. There are also research papers going as far back as 1980 that use diffraction gratings to enhance blurred images the analogue way, before pixels were even widespread (the pixel was invented in 1957).

For pixelated images, the techniques fall into two broad groups: deep learning (DL) and non-deep learning (non-DL).

It even works with video. Source: GitHub

DL Characteristics:

Produces higher-quality images
More versatile with various blur types and noise
Requires significant computational resources for model training
Models are more complex

Non-DL Characteristics:

Good quality if blur type and intensity are known
Less versatile than DL methods if data contains varying types of blur
Computationally and resource efficient
Algorithm is simple, and thus, more predictable
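
To make the non-DL side concrete, here is a minimal sketch of one classic approach, the Van Cittert iteration, on a one-dimensional "image". It assumes the blur kernel is known exactly, which is the best case described above. The `KERNEL` values, the sample signal, and the iteration count are assumptions picked for illustration, and the example contains no noise, which real photos always do.

```python
KERNEL = (0.25, 0.5, 0.25)  # assumed known blur; in practice found by trial and error

def blur(signal):
    """Circular convolution of a 1-D signal with KERNEL."""
    n = len(signal)
    return [KERNEL[0] * signal[(i - 1) % n]
            + KERNEL[1] * signal[i]
            + KERNEL[2] * signal[(i + 1) % n]
            for i in range(n)]

def deblur(blurred, iterations=200):
    """Van Cittert iteration: re-blur the estimate and add back the residual."""
    estimate = list(blurred)
    for _ in range(iterations):
        reblurred = blur(estimate)
        estimate = [e + (b - rb)
                    for e, b, rb in zip(estimate, blurred, reblurred)]
    return estimate

def error(a, b):
    """Euclidean distance between two signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

sharp = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]  # hypothetical row of pixels
blurred = blur(sharp)
restored = deblur(blurred)

print("error before:", error(blurred, sharp))
print("error after: ", error(restored, sharp))
```

The whole algorithm is a loop and a subtraction, which is why these methods are cheap and predictable. Swap in the wrong kernel, though, and the iteration converges to the wrong picture, which is the versatility gap that DL methods try to close.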

I can't find any case studies on image de-blurring being used to reverse censorship, but it is certainly being used by police authorities.

Different de-blurring algorithms. Source: Journal of Physics

There was a case involving CCTV footage of a violent crime in Delhi, where the footage was so blurred that the perpetrator could not be identified, despite being directly in front of the camera. De-blurring then played a crucial role in face detection, which led to better suspect identification. (I wonder if it's a more cost-effective option than investing in better CCTV.)

You can even try out de-blurring for free through GitHub projects like DeepMosaics.

Black Bar Redaction

Google search results for "Document Redaction"

I'd comfortably say that black bar redaction is the industry standard. In fact, it's the only thing that comes up while googling the term "redaction".

Even then, the adage still rings true — if it's not destructive, it's likely reversible. For digital documents, it's dangerous to assume that just because the user can't see it, neither can the computer.

An example would be black highlights, which are different from a proper redaction feature. Here's a Reddit post of a user copying redacted text to the clipboard by simply selecting the redacted areas.

Redacted text being bypassed with a simple highlight + copy & paste. Source: Reddit

This also applies to covering the target text with black boxes. If saved improperly, users can simply remove the box layer in Adobe Illustrator or any PDF editor.

Removing black boxes by simply moving them using PDF tools

Sensitive information may also exist within the PDF's raw data. It can be extracted by converting the PDF into plain text or, for the more technically competent, by digging through the file's source.
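
Here is a simplified sketch of what "digging through the source" can look like. The `page_content` below is a hand-written, uncompressed fragment of PDF page instructions with made-up account details; real files usually compress these streams and may encode text differently, so treat this as an illustration of the principle and not a general-purpose extractor. The page draws two lines of text, then paints a black rectangle over the second one.

```python
import re

# Hypothetical, uncompressed page content. "Tj" draws a string,
# "rg" sets the fill colour, "re" defines a rectangle, "f" fills it.
page_content = b"""
BT /F1 12 Tf 72 700 Td (Account holder: Jane Doe) Tj ET
BT /F1 12 Tf 72 680 Td (Account number: 1234-5678) Tj ET
0 0 0 rg
70 676 200 16 re f
"""

def drawn_strings(content):
    """Return every literal string passed to the Tj (show text) operator."""
    matches = re.findall(rb"\(([^)]*)\)\s*Tj", content)
    return [m.decode("latin-1") for m in matches]

recovered = drawn_strings(page_content)
print(recovered)
```

The black rectangle is just one more drawing instruction placed after the text. It hides the account number from the eye, but the string is still sitting in the file for anything that reads the instructions directly.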

Obfuscating Passwords Using Asterisks

A quick note regarding asterisk obfuscation: it's crucial that you complete your login process and not leave it unattended halfway — don't assume it's safe just because the password is censored.

Revealing a password is as simple as editing the HTML code in the client's browser

I've managed to find this flaw in KWSP's EPF login page, bypassing the censorship by editing a single line of HTML. There are also times when I can retrieve censored passwords by simply copying them to my clipboard and pasting them somewhere else.

Maybank's login system is harder to bypass

Fortunately, Maybank's login page has security measures against this: some form of encryption and a randomly generated UUID on every keypress. It even blocks right-clicks on the webpage and autocomplete from password managers.

I have yet to try it out on other banking portals.

Recovering Cropped Screenshots

Another form of censorship is to remove entire sections by simply cropping the image. This, unfortunately, is not entirely safe either.

Pro geo-guesser discovers more un-cropped image data in RAW file. Source: RainBolt

As it turns out, you could un-crop PNG screenshots as well.

The Acropalypse is a vulnerability discovered in 2023, allowing users to view an un-cropped version of screenshots captured using several variants of Google Pixel phones.

In 2018, the new version of Android (Pie) was released, and the phones received a new screenshot editor called Markup. The issue went unnoticed for years, until a user found it strange that a cropped image had an abnormally large file size, which led to the discovery. A patch was released on March 13, 2023 to fix it.
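
The oversized file is the giveaway. As I understand the bug, the cropped screenshot was written over the original file without cutting off the leftover bytes, so the tail of the old image stayed behind after the new image's end marker. The sketch below only detects such leftovers in a PNG (the format Pixel screenshots use); it does not reconstruct the hidden pixels, which takes considerably more work. The 1x1 test image and the placeholder trailing bytes are made up for the demonstration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC."""
    return (struct.pack(">I", len(data)) + kind + data
            + struct.pack(">I", zlib.crc32(kind + data)))

def tiny_png():
    """A valid 1x1 greyscale PNG, standing in for a cropped screenshot."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))

def trailing_bytes(data):
    """Return whatever follows the IEND chunk, which image viewers ignore."""
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, kind = struct.unpack(">I4s", data[pos:pos + 8])
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        if kind == b"IEND":
            return data[pos:]
    return b""

cropped = tiny_png()
leaky_file = cropped + b"...leftover bytes of the original screenshot..."

print(len(trailing_bytes(cropped)), len(trailing_bytes(leaky_file)))
```

A clean file has nothing after `IEND`. Anything beyond it is invisible in a viewer but still travels with the file whenever it is shared.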

You can try un-cropping the affected photos yourself by using the free tools here:

The Proper Way of Redaction

Given the vulnerabilities in document redaction, how should we go about properly redacting documents and images then?

Personally, I think the best resources come from the Legal & Defence sectors because information redaction is integral to their operations.

But here are some key ideas I've managed to capture:

Replace sensitive text with "[redacted]".

Firstly, replacing the text entirely hides the length of the redacted text, whereas a black bar lets readers guesstimate it. Plus, replacing the text entirely is less likely to result in user error, unlike black boxes, which could go either way.

Use Adobe Acrobat's built-in redaction tool properly.

Adobe currently dominates the market, reportedly holding 76.85% of the share in 2021, and frankly, it is good enough for the job. Personally, I use PDFGear; other editors are fine too, as long as you use the appropriate features and tools for redaction.

Utilise image and file compression tools.

Not only can this strip hidden metadata that might compromise your redaction efforts, but it also removes unnecessary bloat, which makes sharing files easier. I personally use Clop for asset compression, but do note that it does not remove all metadata from the files.

Make the extra effort for truly sensitive data.

This involves printing out the PDF, physically cutting out sensitive portions, and re-scanning the document in 1-bit colour format (pure black & white). Apparently, a US court says that this method is 100% effective, despite the disclaimer that they do not explicitly support any specific redaction methods.

Password protect files/links.

Passwords help ensure that your redacted documents end up with the right person. To go one step further, use a different password for each recipient, so that if a document were to be compromised, there is a chance that it can be traced back to a particular leaker.
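
A minimal sketch of the per-recipient idea, using Python's standard `secrets` module. The function names, the example addresses, and the password length are my own assumptions, and this covers only generating and matching passwords, not applying them to a file.

```python
import secrets

def assign_passwords(recipients, nbytes=16):
    """Give every recipient their own random, URL-safe password."""
    return {name: secrets.token_urlsafe(nbytes) for name in recipients}

def trace_leak(passwords, leaked_password):
    """Return the recipient whose password matches a leaked copy, if any."""
    for name, password in passwords.items():
        if secrets.compare_digest(password, leaked_password):
            return name
    return None

# Hypothetical recipients
passwords = assign_passwords(["alice@example.com", "bob@example.com"])
suspect = trace_leak(passwords, passwords["bob@example.com"])
print(suspect)
```

Keep the mapping somewhere safe. It only points to a leaker if the password travelled with the leaked copy, so treat a match as a lead and not as proof.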

Develop a system to manage unredacted original files.

All your redaction efforts would be wasted if bad actors got hold of the source files or, god forbid, you shared them publicly by mistake. There are ways around this, such as having proper file naming structures or encrypted hard drives, but that is material for future articles.

Writer's Note

With that, thanks for reading my second article from De-Code!
So far I've just been writing tech explanations, but do expect some variation in content, as I've just secured a few interviews with subject-matter experts for next week.

For now, I'm aiming to adhere to a weekly publishing schedule. I will also loop in a colleague of mine in the near future.
