While useful, it needs a big red warning for potential leakers. If they were personally served documents (such as via email, while logged in, etc.), there really isn't much that can be done to ascertain the safety of leaking them. It's not even safe if there are two or more leakers and they "compare notes" to try and "clean" something for release.
The watermark can even be contained in the wording itself (multiple versions of sentences, word choice, etc. store the entropy). The only moderately safe thing to leak would be a pure-text full paraphrasing of the material. But that wouldn't inspire much trust as a source.
This doesn't seem to be designed for leakers, i.e. people sending PDFs -- it's specifically for people receiving untrusted files, i.e. journalists.
And specifically about them not being hacked by malicious code. I'm not seeing anything that suggests it's about trying to remove traces of a file's origin.
I don't see why it would need a warning for something it's not designed for at all.
It would be natural for a leaker to assume that the PDF contains something "extra" and to try to remove it with this method. It may not occur to them that this something extra could be part of the content they are going to get back.
> Dangerzone works like this: You give it a document that you don't know if you can trust (for example, an email attachment). Inside of a sandbox, Dangerzone converts the document to a PDF (if it isn't already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, outside of the sandbox, Dangerzone takes this pixel data and converts it back into a PDF.
With this in mind, Dangerzone wouldn't even remove conventional watermarks (that inlay small amounts of text on the image).
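To make the quoted description concrete, here is a minimal sketch of what the trusted side of such a design might look like. This is not Dangerzone's actual code: the wire format (two big-endian 16-bit integers for width and height, followed by raw RGB bytes), the `read_page` name and the `MAX_DIM` limit are all assumptions for illustration. The point is that the only thing crossing the sandbox boundary is a fixed-size block of pixels, which is also why anything visible in those pixels (such as a watermark) comes through unchanged.

```python
import io
import struct

MAX_DIM = 10_000  # hypothetical sanity limit on page dimensions


def read_page(stream):
    """Read one page of raw RGB pixels from an untrusted stream.

    Assumed format: 2-byte width, 2-byte height (big-endian),
    then exactly width * height * 3 bytes of RGB data.
    """
    header = stream.read(4)
    if len(header) != 4:
        raise ValueError("truncated header")
    width, height = struct.unpack(">HH", header)
    if not (0 < width <= MAX_DIM and 0 < height <= MAX_DIM):
        raise ValueError("implausible page size")
    expected = width * height * 3
    pixels = stream.read(expected)
    if len(pixels) != expected:
        raise ValueError("truncated pixel data")
    # Nothing is parsed beyond this point: the pixels are opaque bytes.
    return width, height, pixels


# Usage: a 2x1 page with one red and one blue pixel.
page = struct.pack(">HH", 2, 1) + bytes([255, 0, 0, 0, 0, 255])
width, height, pixels = read_page(io.BytesIO(page))
```

Because the reader never interprets the pixel bytes, there is very little attack surface left on the trusted side, but equally nothing that would strip marks embedded in the image itself.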
I think the "freedomofpress" GitHub repo primed you to think about protecting someone leaking to journalists, but really it's designed to keep journalists (and other security-minded folk) safe from untrusted attachments.
The official website -- https://dangerzone.rocks/ -- is a lot more clear about exactly what the tool does. It removes malware, removes network requests, supports various filetypes, and is open source.
I seem to remember Yahoo finance (I think it was them, maybe someone else) introducing benign errors into their market data feeds, to prevent scraping.
This led to people doing 3 requests instead of just 1, to correct the errors, which was very expensive for them, so they turned it off.
I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.
> I don't think watermarking is a winning game for the watermarker, with enough copies any errors can be cancelled.
This is a very common assumption that turns out to be false.
There are Tardos probabilistic codes (see the paper I linked), in which the watermark length scales as the square of the traitor count.
For example, with a watermark of just 400 bits, 4 traitors (who try their best to corrupt the watermark) will stand out enough to merit investigation and with 800 bits be accused without any doubt. This is for a binary alphabet, with text you can generate a bigger alphabet and have shorter watermarks.
These are typically intended for tracing pirated content, so they carry the so-called Marking Assumption: if given two or more versions of a piece of content, you must choose one. (A pirate isn't going to corrupt or remove a piece of video, since that would be unsuitable for leaking.) So it would likely be possible to get better results with documents, though it may require larger watermarks to catch such traitors reliably.
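If you want to see why "just take the majority of the copies" doesn't cancel this kind of watermark, here is a rough simulation sketch. It uses the symmetric variant of the Tardos accusation score with a simple majority-vote attack, which respects the Marking Assumption. The function name `tardos_scores`, the cutoff of 0.01 and the user counts are my own choices for illustration, not parameters taken from the paper.

```python
import math
import random
import statistics


def tardos_scores(m, n_users=204, n_colluders=4, cutoff=0.01, seed=0):
    """Simulate an m-bit Tardos-style code; return (colluder, innocent) scores."""
    rng = random.Random(seed)
    lo = math.asin(math.sqrt(cutoff))
    hi = math.pi / 2 - lo
    # Per-position bias p_i, drawn from an arcsine-like distribution.
    p = [math.sin(rng.uniform(lo, hi)) ** 2 for _ in range(m)]
    # Each user's watermark: bit i is 1 with probability p_i.
    codes = [[1 if rng.random() < p[i] else 0 for i in range(m)]
             for _ in range(n_users)]
    # The first n_colluders users pool their copies and majority-vote
    # each position (ties broken at random).
    pirated = []
    for i in range(m):
        ones = sum(codes[j][i] for j in range(n_colluders))
        if 2 * ones > n_colluders:
            pirated.append(1)
        elif 2 * ones < n_colluders:
            pirated.append(0)
        else:
            pirated.append(rng.randrange(2))
    # Symmetric score: reward agreeing with the pirated bit, weighted by
    # how rare that bit was; penalise disagreement.
    scores = []
    for code in codes:
        s = 0.0
        for i in range(m):
            g_one = math.sqrt((1 - p[i]) / p[i])
            g_zero = math.sqrt(p[i] / (1 - p[i]))
            if pirated[i] == 1:
                s += g_one if code[i] == 1 else -g_zero
            else:
                s += g_zero if code[i] == 0 else -g_one
        scores.append(s)
    return scores[:n_colluders], scores[n_colluders:]


colluders, innocents = tardos_scores(m=800)
print("mean colluder score:", round(statistics.mean(colluders), 1))
print("innocent score stdev:", round(statistics.pstdev(innocents), 1))
```

Innocent scores have mean zero and a spread of about sqrt(m), while each colluder's expected score grows linearly in m (roughly 2m/(πc) for c colluders under this attack), so the gap widens as the watermark gets longer.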
I'm not totally convinced that the threat model is realistic. The watermarker has to embed the watermark, and the only place to do that is in the least significant bits of whatever the message is.
If it's an audio file then the least significant bits of each sample would work.
If it's a video file then the LSBs in a DCT bin may also be unnoticeable.
It can really only go in certain places, without it affecting the content in a meaningful way.
If it's in a header, or separate known location, then the pirate can just delete those bits.
The threat model presented says the pirates have to go with one of the copies, or only correct errors that are different between 2 copies.
That's the part that I don't think is realistic.
If the pirates knew that the file was marked, and the scheme used to mark it, but didn't know the key (a standard threat model for things like encryption), then they could inject their own noise into wherever the watermark could be hiding, and now the problem is the watermarker trying to send a message on a noisy channel, where the pirates have a jammer.
I don't even think you have to sacrifice quality, since the copy you have already has noise, and you just need to inject the same amount (or more).
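A toy sketch of the jamming idea, for the naive case where the mark really does live in the least significant bit of each sample. The helper names (`embed_lsb`, `extract_lsb`, `jam_lsb`) are made up for illustration, and this models only the LSB scheme described here; it says nothing about more robust embedding schemes.

```python
import random


def embed_lsb(samples, bits):
    """Overwrite the least significant bit of each sample with a mark bit."""
    return [(s & ~1) | b for s, b in zip(samples, bits)]


def extract_lsb(samples):
    """Read the mark back out of the least significant bits."""
    return [s & 1 for s in samples]


def jam_lsb(samples, rng):
    """The pirate's 'jammer': replace every LSB with a fresh random bit."""
    return [(s & ~1) | rng.getrandbits(1) for s in samples]


rng = random.Random(0)
samples = [rng.randrange(256) for _ in range(10_000)]   # e.g. 8-bit audio
mark = [rng.getrandbits(1) for _ in samples]

marked = embed_lsb(samples, mark)
jammed = jam_lsb(marked, rng)

# Each sample changes by at most 1, the same distortion the mark introduced.
agreement = sum(a == b for a, b in zip(extract_lsb(jammed), mark)) / len(mark)
print("fraction of mark bits surviving:", agreement)
```

After jamming, the extracted bits agree with the original mark only about half the time, which is what you would get by guessing.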
It's more sophisticated than that. A single movie can be split into thousands of fragments, with each fragment carrying 1 bit. It's called A/B forensic watermarking. So you only need to insert a 1-bit watermark into a video segment that is a few megabytes in size; there is no feasible way to defeat this as a pirate unless the watermarker is incompetent. Averaging will not work.
See AWS offering:
For large-scale per-viewer, implement a content identification strategy that allows you to trace back to specific clients, such as per-user session-based watermarking. With this approach, media is conditioned during transcoding and the origin serves a uniquely identifiable pattern of media segments to the end user. A session to a user-mapping service receives encrypted user ID information in the header or cookies of the request context and uses this information to determine the uniquely identifiable pattern of media segments to serve to the viewer. This approach requires multiple distinctly watermarked copies of content to be transcoded, with a minimum of two sets of content for A/B watermarking. Forensic watermarking also requires YUV decompression, so encoding time for 4K feature length content can take upwards of 20 hours. DRM service providers in the AWS Partner Network (APN) are available to aid in the deployment of per-viewer content forensics.
This will be more challenging for text. Not as difficult for images.
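A minimal sketch of the A/B idea: each segment exists in two watermarked variants, and the sequence of variants served to a session spells out an identifier. The function names and the direct use of the ID's bits are assumptions for illustration. A real deployment would presumably use a collusion-resistant code rather than raw ID bits, and detecting whether a leaked segment is variant A or B is assumed to be solved elsewhere.

```python
def segment_pattern(user_id: int, n_segments: int) -> str:
    """Pick variant 'A' (bit 0) or 'B' (bit 1) for each segment of a stream."""
    if user_id < 0 or user_id >= 1 << n_segments:
        raise ValueError("user_id does not fit in n_segments bits")
    return "".join("B" if (user_id >> i) & 1 else "A"
                   for i in range(n_segments))


def recover_user_id(pattern: str) -> int:
    """Rebuild the ID from the variants observed in a leaked copy."""
    return sum(1 << i for i, variant in enumerate(pattern) if variant == "B")


pattern = segment_pattern(user_id=5, n_segments=4)
print(pattern, recover_user_id(pattern))
```

Since every segment only has to carry a single bit spread over megabytes of video, the mark can be embedded very robustly, which is the point being made above.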
> the only place to do that is in the least significant bits
This is also false; it's the most naive way to watermark content. They do it in the mid-range frequencies these days, and then make the watermarks robust to resizing, re-encoding, cropping and even rotation in some cases. They survive when someone holds a camera to record a screen.
Why not leak a dataset of N full text paraphrasings of the material, together with a zero-knowledge proof of how to take one of the paraphrasings and specifically "adjust" it to the real document (revealed in private to trusted asking parties)? Then the leaker can prove they released "at least the one true leak" without incriminating themselves. There is a cryptographic solution to this issue.
For some reason, printing 1 page of an Excel or Word document to a PDF often gets up to around 4MB in size. Passing it through this compresses it quite well.
Heh, I've seen this a bunch of times and it's of interest to me, but honestly? It's sooooo limiting by being an interface without a complementary command line tool. Like, I'd like to put this into some workflows but it doesn't really make sense to without using something like pyautogui. But maybe I'm missing something hidden in the documentation.
Not much further than their documentation, friend! But thanks for finding that, that's actually super helpful! I hope somebody puts in a PR for updating the documentation to make it clear what functionality their tool has.
They have some kind of virus scanner for files you open via a share link. Not sure about the ones you have stored on your own drive unshared.
But probably the main security here is just using the Chrome PDF viewer instead of the Adobe one, which you can do without Google Drive. The browser PDF viewers ignore all the strange and risky parts of the PDF spec that would likely be exploited.
I often view PDFs in Drive, and it's definitely not just displaying the document with the native web browser. It is rendered with their "Drive renderer", whatever that is. They don't even display a simple .txt file natively in the browser.
And what special sauce does the web preview use? At some point, someone has to actually parse and process the data. I feel like on a tech site like Hacker News, speculating that Google has somehow done a perfect job of preventing malicious PDFs raises the question: how do you actually do that and prove that it's safe? And is that even possible in perpetuity?
> how do you actually do that and prove that it's safe?
Obviously you can't. You assume it's best in class based on various factors, including the fact that this is the same juggernaut that runs Project Zero. They also somehow manage to secure their cloud offering against malicious clients, so presumably they can manage to parse a PDF to an image without getting pwned.
It would certainly be interesting to know what their internal countermeasures are but I don't know if that's publicized or not.
Is there some reason why just viewing the PDF with a FLOSS, limited PDF viewer (e.g. atril) would not accomplish the same level of safety? What can a "dangerous PDF" do inside atril?
A crafted PDF can potentially exploit a bug in atril to compromise the recipient's computer since writing memory-safe C is difficult. This approach was famously used by a malware vendor to exploit iMessage through a compressed image format that's part of the PDF standard:
This is why Firefox chose to implement a custom PDF reader in pure JS for better sandboxing leveraging the existing browser JS sandboxing.
As a side effect, it's been a helpful JS library for embedding PDFs on websites.
The Chrome PDF parser, originating from Foxit (now open-sourced as PDFium), has been the source of many exploits in Chrome itself over the years.
Shameless self promotion: preview.ninja is a site I built that does this and supports 300+ file formats. I'm currently weekend coding version 2.0 which will support 500+ formats and allow direct data extraction in addition to safe viewing.
It is a passion project and will always be free because commercial CDR[1] solutions are insanely expensive and everyone should have access to the tools to compute securely.
https://en.wikipedia.org/wiki/Traitor_tracing#Watermarking
https://arxiv.org/abs/1111.3597
Their about page ( https://dangerzone.rocks/about/ ) shows common use cases for journalists and others.
Isn't this what newspapers do?
See AWS offering:
https://docs.aws.amazon.com/wellarchitected/latest/streaming...
https://github.com/caradoc-org/caradoc
http://spw16.langsec.org/slides/guillaume-endignoux-slides.p...
Just ran a quick test:
- 1-page Excel PDF export: 3.7MB
- Processing with Dangerzone (OCR enabled): 131KB
How hard did you look the other times?
It doesn't seem to be meant for usage at scale -- it's not for general-purpose conversion, as the resulting files are huge, will have OCR errors, etc.
https://github.com/mate-desktop/atril
https://projectzero.google/2021/12/a-deep-dive-into-nso-zero...
1. https://en.wikipedia.org/wiki/Content_Disarm_%26_Reconstruct...
I imagine that folks like journalists could have that type of attack in their threat model, and EFF already do a lot of great stuff in this space :)
0. https://isc.sans.edu/diary/31998
1. https://www.cloudflare.com/cloudforce-one/research/svgs-the-...