Post · BT Free Social


R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

The lossless data compression fairies are having fun with me today...

Scan 8.5" x 11" document at 1200dpi @ greyscale
-> 60 MiB PNG, thank you
Open PNG in GIMP, select a good threshold point, convert to 1bpp
-> 514 KiB PNG
Wait... 116:1 compression from 8-bit PNG to 1-bit PNG? HOW??
convert to pdf
"Warning, this file is really huge and may actually be a decompression bomb" lol, ok.
-> 515 KiB PDF, nice
ocrmypdf foo.pdf document.pdf
-> 194 KiB PDF
WHAT? HOW?!?
pdfimages -png document.pdf foo
-> 514 KiB PNG
WHAT IS HAPPENING?!?

P.S., I found out that by default, ocrmypdf uses (lossless) #JBIG2 compression. That's why it was so well compressed. Also, the resultant PNG file at the end (which was basically the same PNG file that went into the PDF) was converted from JBIG — pdfimages converts images, it doesn't extract them in their natively stored format (but a -list will show you what the native format is). Also, I think pdfimages -all will just export the native format, whatever it is, but I haven't tried that yet.
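For context on the sizes quoted in the post, here is a small arithmetic sketch. It assumes the scan covers the full 8.5" x 11" page at exactly 1200 dpi and that the "raw" size is simply one byte (or one bit) per pixel with no container overhead; the 60 MiB and 514 KiB figures are taken from the post as given.

```python
# Page geometry from the post: 8.5" x 11" at 1200 dpi.
DPI = 1200
WIDTH_IN, HEIGHT_IN = 8.5, 11.0

width_px = round(WIDTH_IN * DPI)    # 10200
height_px = round(HEIGHT_IN * DPI)  # 13200
pixels = width_px * height_px

MIB = 1024 * 1024
KIB = 1024

# Uncompressed sizes: 1 byte per pixel for 8-bit grey,
# 1 bit per pixel (rows padded to whole bytes) for 1bpp.
raw_grey = pixels
raw_mono = ((width_px + 7) // 8) * height_px

print(f"{width_px} x {height_px} = {pixels:,} pixels")
print(f"raw 8-bit: {raw_grey / MIB:.1f} MiB, raw 1-bit: {raw_mono / MIB:.1f} MiB")
# Ratios of raw size to the PNG sizes reported in the post.
print(f"raw grey vs 60 MiB PNG: {raw_grey / (60 * MIB):.1f}:1")
print(f"raw 1-bit vs 514 KiB PNG: {raw_mono / (514 * KIB):.0f}:1")
```

Seen this way, the greyscale PNG barely compresses relative to its raw size, while the 1-bit PNG compresses well even relative to its already 8x smaller raw size.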

mirabilos

@mirabilos@toot.mirbsd.org replied · 16 hours ago

@rl_dane PDF can PNG? I thought it could only TIFF or JPEG.

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 15 hours ago

@mirabilos

Actually, you're right. The native lossless image format in PDF isn't PNG. I'm not totally sure what it is.
pdfimages just says "image," not "PNG", "TIFF", or "PPM"

rld@Intrepid:tmp$ pdfimages -list foo.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2040  2040  rgb     3   8  image  no         7  0    96    96 3155K  26%
   1     1 smask    2040  2040  gray    1   8  image  no         7  0    96    96 8188B 0.2%
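The `enc` column is the one that answers the "what is the native format" question. A minimal sketch of pulling it out of `pdfimages -list` output follows; the `COLUMNS` list and the `parse_pdfimages_list` helper are hypothetical names made up for illustration, and the sample text is the listing quoted above. As far as the pdfimages man page describes it, an `enc` value of `image` means plain raster samples (possibly Flate- or LZW-compressed) rather than a dedicated codec such as `jpeg`, `jbig2`, or `ccitt`.

```python
# Column names as they appear in the header row of `pdfimages -list`
# ("object ID" is two columns: object number and generation).
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "ID", "x-ppi", "y-ppi", "size", "ratio"]

def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed rule, and anything that is not a data row.
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows

SAMPLE = """\
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2040 2040 rgb 3 8 image no 7 0 96 96 3155K 26%
1 1 smask 2040 2040 gray 1 8 image no 7 0 96 96 8188B 0.2%
"""

rows = parse_pdfimages_list(SAMPLE)
for row in rows:
    print(row["page"], row["num"], row["type"], "stored as:", row["enc"])
```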

mirabilos

@mirabilos@toot.mirbsd.org replied · 14 hours ago

@rl_dane gah, don’t make me open ~/Misc/books/specs/pdfreference1.0.pdf at this time of the night…

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 14 hours ago

@mirabilos

There's a quiet cruelty in the fact that the pdf reference is a PDF.

Kinda like the "how to use your VCR" videocassettes of old 😁

kabel42

@kabel42@polymaths.social replied · 16 hours ago

@mirabilos @rl_dane can't you embed everything? Wasn't there at least one printer manufacturer that embedded firmware updates?

mirabilos

@mirabilos@toot.mirbsd.org replied · 15 hours ago

@kabel42 @rl_dane you can attach arbitrary files, yes, see for example the PDFs under https://mbsd.evolvis.org/music/free/, but that’s not inline as graphic


mirabilos

@mirabilos@toot.mirbsd.org replied · 14 hours ago

@kabel42 @rl_dane Okular shows these, btw, do give it a try

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 14 hours ago

@mirabilos @kabel42

I'll have to wait until I'm on one of my Plasma machines. ;)

rld@Intrepid:~$ doas pkg install okular
doas (rld@Intrepid) password:
Updating FreeBSD repository catalogue...
FreeBSD repository is up to date.
Updating FreeBSD-kmods repository catalogue...
FreeBSD-kmods repository is up to date.
All repositories are up to date.
The following 75 package(s) will be affected (of 0 checked):

...

Number of packages to be installed: 75

The process will require 292 MiB more space.
69 MiB to be downloaded.

Proceed with this action? [y/N]: n
rld@Intrepid:~$

kabel42

@kabel42@polymaths.social replied · 14 hours ago

@rl_dane @mirabilos 292 MiB for 75 Pkgs, that's not a lot :)


kabel42

@kabel42@polymaths.social replied · 16 hours ago

@rl_dane that's cursed, why would you unnecessarily convert images? The only time i heard anything about JBIG was this ccc talk

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 15 hours ago

@kabel42

ocrmypdf doesn't even use JBIG by default, so I have no idea how that happened. But that is what happened.

kabel42

@kabel42@polymaths.social replied · 15 hours ago

@rl_dane it would make sense as preprocessing for OCR

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 14 hours ago

@kabel42

Why though? It would cause more errors in the OCR! XD

I mean, yes, both tesseract and JBIG have to identify something akin to character cells, but they're not exactly sharing algorithms, AFAIK.

kabel42

@kabel42@polymaths.social replied · 14 hours ago

@rl_dane you could maybe reuse the extraction from JBIG?

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 14 hours ago

@kabel42

Dunno. I think tesseract is much older than JBIG.


pixx

@pixx@merveilles.town replied · 4 days ago

@rl_dane
Is ocrmypdf replacing images with text? I don't get the "what why??"

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 2 days ago

@pixx

no, ocrmypdf just performs the OCR (using tesseract) and inserts it as textual metadata with the original images intact.

Someone suggested it may be using JBIG compression (lossy, cursed) for the image, but that would be weird! I've never seen ocrmypdf compress that well before.

If I had thought of it, I'd have looked to see if the resultant PNG file (once extracted) was the same as the original going in, but I don't think I have the intermediary files anymore.

Nathan Vander Wilt

@natevw@toot.cafe replied · 5 days ago

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 2 days ago

@natevw

To your first point, you're absolutely right. Thresholding yields far more than an 8:1 compression because PNG is far more able to crunch bilevel graphics vs. grayscale.
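A rough way to see this effect is sketched below. It uses `zlib` directly, since PNG's compression stage is Deflate, and it runs on a synthetic "scan" (regular dark bars plus random noise), not on the actual document from the post. The `synthetic_scan` and `pack_1bpp` helpers, the noise level, and the threshold of 128 are all assumptions chosen for illustration.

```python
import random
import zlib

def synthetic_scan(width=512, height=512, seed=0):
    """Rows of 8-bit grey pixels: dark bars on light paper, plus scanner-like noise."""
    rng = random.Random(seed)
    rows = []
    for y in range(height):
        row = bytearray()
        for x in range(width):
            ink = (y // 16) % 2 == 0 and (x // 8) % 3 == 0  # crude stand-in for text strokes
            base = 40 if ink else 215
            row.append(base + rng.randint(-20, 20))  # noise stays within 0..255
        rows.append(bytes(row))
    return rows

def pack_1bpp(rows, threshold=128):
    """Threshold each pixel and pack 8 pixels per byte, most significant bit first."""
    packed = []
    for row in rows:
        out = bytearray()
        for i in range(0, len(row), 8):
            byte = 0
            for px in row[i:i + 8]:
                byte = (byte << 1) | (1 if px >= threshold else 0)
            out.append(byte)
        packed.append(bytes(out))
    return packed

rows = synthetic_scan()
grey_size = len(zlib.compress(b"".join(rows), 9))
mono_size = len(zlib.compress(b"".join(pack_1bpp(rows)), 9))
print(f"8-bit: {grey_size} bytes, 1-bit: {mono_size} bytes, "
      f"ratio {grey_size / mono_size:.0f}:1")
```

The point of the sketch: in the 8-bit data, Deflate has to encode the noise in every pixel, while thresholding throws the noise away and leaves long, repetitive runs, so the gain is far larger than the 8:1 from the bit depth alone.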

To your second point, you're describing the #JBIG lossy compressor for scanned documents and monochrome images, and yeah, that's super cursed. I'd be surprised if that's what ocrmypdf is doing, but it's possible? ¯\_(ツ)_/¯

McTwist

@mctwist@social.accum.se replied · 5 days ago

@rl_dane In certain specific cases you can take an already compressed image and encode it with base64 and then compress it 10 times further.

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 2 days ago

@mctwist

That would be a very strange edge case where expanding the data stream into base64 somehow exposed regularities that the compression algorithm somehow missed in the original data.

I've personally never seen base64/uuencoded files become smaller than the original files when compressed. (compared to the original files compressed the same way)
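That observation matches what you would expect for data that is already close to incompressible. A minimal sketch, using seeded pseudo-random bytes as a stand-in for an already-compressed file (an assumption; real compressed files are only approximately random):

```python
import base64
import random
import zlib

rng = random.Random(0)
# Stand-in for an already well-compressed file: bytes with no exploitable structure.
payload = bytes(rng.getrandbits(8) for _ in range(100_000))

direct = zlib.compress(payload, 9)                    # compress the bytes as-is
via_b64 = zlib.compress(base64.b64encode(payload), 9) # base64 first, then compress

print(f"original: {len(payload)}  compressed: {len(direct)}  "
      f"base64 then compressed: {len(via_b64)}")
```

Base64 inflates the data by a third, and the compressor can at best win that third back, so the base64 route does not end up smaller than the original here.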

McTwist

@mctwist@social.accum.se replied · 2 days ago

@rl_dane I've seen one on the fediverse that did this to poison AI bots. A seemingly 2MB image extracts to 32GB.

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 2 days ago

@mctwist

Wouldn't that poison, I dunno, fediverse clients as well? ^___^

McTwist

@mctwist@social.accum.se replied · 2 days ago

@rl_dane Nah, it's only located publicly through robots.txt, a file only bots *should* read.
Technically it is a 4GB bitmap compressed to 20MB PNG, encoded to base64 and then put inline into an HTML that the webserver compresses down to 800kB, or something. Not sure about values, but I guess you get the point.
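The layering described here can be sketched at a much smaller scale. This is an illustration only: it uses 16 MiB of identical bytes instead of a 4 GB bitmap, raw `zlib` in place of both the PNG encoder and the web server's compression, and the variable names are made up.

```python
import base64
import zlib

bitmap = bytes(16 * 1024 * 1024)   # 16 MiB of identical "pixels" (scaled-down stand-in)
inner = zlib.compress(bitmap, 9)   # stands in for the PNG's Deflate stage
inline = base64.b64encode(inner)   # as it would be embedded inline in the HTML
outer = zlib.compress(inline, 9)   # stands in for the web server's compression

for label, blob in [("bitmap", bitmap), ("inner", inner),
                    ("base64", inline), ("outer", outer)]:
    print(f"{label:>6}: {len(blob):>10} bytes")

# Unwrapping the layers in reverse gives back the full bitmap.
restored = zlib.decompress(base64.b64decode(zlib.decompress(outer)))
```

The second pass helps in this case because Deflate's output for extremely repetitive input is itself repetitive, and base64 preserves that regularity. That is different from the typical case, where compressed data looks random and a second pass gains nothing.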

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 2 days ago

@mctwist

That's wild. XD

Screwtapello

@Screwtapello@teh.entar.net replied · 5 days ago

@rl_dane I know PDF can store 1-bit images with more specialised compression formats developed for fax machines and the like, I wonder if they outperform PNG for that specific use-case.

R.L. Dane :Debian: :OpenBSD: :FreeBSD: 🍵 :MiraLovesYou:

@rl_dane@polymaths.social replied · 2 days ago

@Screwtapello

The fax compression algorithms are very limited, designed for a time when RAM was scarce. They basically compressed a couple of rows of pixels at a time, nothing more, AFAIK.
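To illustrate the row-at-a-time idea, here is a minimal sketch of the run-length step that fax-style coding builds on. It is not an implementation of the actual CCITT schemes (those add fixed code tables and, in the 2D modes, references to the previous row); `run_lengths` and `expand_runs` are hypothetical helper names.

```python
def run_lengths(row):
    """Alternating run lengths for one row of 0/1 pixels, starting with a white (0) run.

    If the row starts with black, the first white run has length 0.
    """
    runs = []
    current = 0
    count = 0
    for px in row:
        if px == current:
            count += 1
        else:
            runs.append(count)
            current = px
            count = 1
    runs.append(count)
    return runs

def expand_runs(runs):
    """Inverse of run_lengths: rebuild the row of pixels."""
    row = []
    colour = 0
    for length in runs:
        row.extend([colour] * length)
        colour ^= 1  # runs alternate white, black, white, ...
    return row

row = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
runs = run_lengths(row)
print(runs)                      # [3, 2, 4, 1]
print(expand_runs(runs) == row)  # True
```

Only one row needs to be held in memory at a time, which fits the constraint mentioned above.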

There is #JBIG2, which has a LOSSY mode for 1-bit monochrome image compression (alongside a lossless one). Yes, the lossy mode is exactly as cursed as that sounds: there have been many cases where numbers and figures were changed by JBIG2 because a 6 looked like an 8. Horrifying. XD

I'm not totally sure what happened in this example, because I realized after I posted this that the pdfimages utility is converting whatever the PDF stored the images as into PNGs, not just extracting any embedded PNGs it finds.
