@rl_dane I know PDF can store 1-bit images with more specialised compression formats developed for fax machines and the like, I wonder if they outperform PNG for that specific use-case.
@rl_dane PDF can PNG? I thought it could only TIFF or JPEG.
Actually, you're right. The native lossless image format in PDF isn't PNG. I'm not totally sure what it is.
pdfimages just says "image," not "PNG", "TIFF", or "PPM"
rld@Intrepid:tmp$ pdfimages -list foo.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2040  2040  rgb     3   8  image  no         7  0    96    96 3155K  26%
   1     1 smask    2040  2040  gray    1   8  image  no         7  0    96    96 8188B 0.2%
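For what it's worth, PDF streams name their compression in a /Filter entry (FlateDecode is the zlib/deflate one, DCTDecode is JPEG, CCITTFaxDecode and JBIG2Decode are the fax-style ones). Below is a rough sketch for peeking at which filters a file mentions. It is a naive byte scan, not a PDF parser: it counts every stream's filter rather than only images, and it misses anything tucked inside compressed object streams. The function name and the sample bytes are made up for illustration.

```python
import re
from collections import Counter

# Matches "/Filter /Name" as well as the array form "/Filter [/A /B]".
FILTER_RE = re.compile(rb"/Filter\s*(?:/(\w+)|\[([^\]]*)\])")

def count_filters(pdf_bytes):
    """Count the filter names that appear after /Filter keys in raw PDF bytes."""
    counts = Counter()
    for single, array in FILTER_RE.findall(pdf_bytes):
        if single:
            counts[single.decode("ascii")] += 1
        else:
            # Array form: every /Name inside the brackets is one filter stage.
            for name in re.findall(rb"/(\w+)", array):
                counts[name.decode("ascii")] += 1
    return counts

# Hypothetical fragment standing in for a real file.
SAMPLE = (
    b"<< /Type /XObject /Subtype /Image /Filter /FlateDecode >>\n"
    b"<< /Type /XObject /Subtype /Image /Filter [/ASCII85Decode /DCTDecode] >>\n"
)

for name, n in sorted(count_filters(SAMPLE).items()):
    print(name, n)
```

On a real file, something like count_filters(open("foo.pdf", "rb").read()) would be the equivalent call.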
@rl_dane gah, don’t make me open ~/Misc/books/specs/pdfreference1.0.pdf at this time of the night…
@mirabilos @rl_dane can't you embed everything? Wasn't there at least one printer manufacturer that embedded firmware updates?
@kabel42 @rl_dane you can attach arbitrary files, yes, see for example the PDFs under https://mbsd.evolvis.org/music/free/, but that’s not inline as graphic
I'll have to wait until I'm on one of my Plasma machines. ;)
rld@Intrepid:~$ doas pkg install okular
doas (rld@Intrepid) password:
Updating FreeBSD repository catalogue...
FreeBSD repository is up to date.
Updating FreeBSD-kmods repository catalogue...
FreeBSD-kmods repository is up to date.
All repositories are up to date.
The following 75 package(s) will be affected (of 0 checked):
...
Number of packages to be installed: 75
The process will require 292 MiB more space.
69 MiB to be downloaded.
Proceed with this action? [y/N]: n
rld@Intrepid:~$
@rl_dane that's cursed, why would you unnecessarily convert images? The only time I heard anything about JBIG was this CCC talk
@rl_dane it would make sense as preprocessing for OCR
@rl_dane you could maybe reuse the extraction from JBIG?
@rl_dane
Is ocrmypdf replacing images with text? I don't get the "what why??"
no, ocrmypdf just performs the OCR (using tesseract) and inserts it as textual metadata with the original images intact.
Someone suggested it may be using JBIG2 compression (lossy mode, cursed) for the image, but that would be weird! I've never seen ocrmypdf compress that well before.
If I had thought of it, I'd have looked to see if the resultant PNG file (once extracted) was the same as the original going in, but I don't think I have the intermediary files anymore.
@rl_dane the surprising steps are the lossy ones ;-)
* the (lossy) downsampling to 1bpp and (lossy) thresholding enabled "lossless" run-length encoding or whatnot to compress at such a high ratio
* the OCR step likely also wasn't lossless — for every very-slightly-unique splotch on the page with a visual pattern _close enough_ to a prototypical `a`/`b`/`c`/… (visually) it probably got replaced with a shared version of said ~letter instead
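The second bullet (near-identical splotches being swapped for one shared glyph) can be shown with a toy model. This is only a sketch of the idea behind symbol substitution, which is usually attributed to the lossy mode of JBIG2; real encoders use far more careful matching, and the 5x7 glyphs, the function names and the max_diff threshold here are all invented for the example.

```python
# Two tiny hand-drawn glyphs; '#' is ink, '.' is paper.
SIX = (
    ".###.",
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    ".###.",
)
EIGHT = (
    ".###.",
    "#...#",
    "#...#",
    ".###.",
    "#...#",
    "#...#",
    ".###.",
)

def hamming(a, b):
    """Number of pixels that differ between two same-sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def substitute(glyphs, max_diff):
    """Store each glyph once; reuse an earlier one if it is 'close enough'."""
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, proto in enumerate(dictionary):
            if hamming(glyph, proto) <= max_diff:
                indices.append(i)  # lossy step: glyph replaced by prototype
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

page = [SIX, EIGHT, SIX]
print(hamming(SIX, EIGHT))            # how far apart the two digits are
print(substitute(page, max_diff=0))   # exact matching keeps both digits
print(substitute(page, max_diff=3))   # sloppy matching turns the 8 into a 6
```

With exact matching the page keeps two dictionary entries; with the sloppy threshold the 8 is stored as a reference to the 6, which is the kind of silent digit swap people complain about.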
To your first point, you're absolutely right. Thresholding yields far more than an 8:1 compression because PNG is far more able to crunch bilevel graphics vs. grayscale.
To your second point, you're describing the #JBIG2 lossy compressor for scanned documents and monochrome images, and yeah, that's super cursed. I'd be surprised if that's what ocrmypdf is doing, but it's possible? ¯\_(ツ)_/¯
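The first point (thresholding buys much more than the raw 8:1) is easy to try out. The sketch below uses plain zlib as a stand-in for PNG, since PNG's compression is deflate plus per-row filters, and a synthetic noisy "scan" rather than a real page, so the exact numbers mean nothing; only the gap between the two results is of interest. The page generator, sizes and cutoff are assumptions made for the demo.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256  # arbitrary demo size; WIDTH is a multiple of 8

def synthetic_gray_page(seed=0):
    """Fake 8-bit scan: dark bars on light paper, plus a little sensor noise."""
    rng = random.Random(seed)
    page = bytearray()
    for y in range(HEIGHT):
        for x in range(WIDTH):
            ink = (y // 8) % 2 == 0 and (x // 4) % 3 == 0
            base = 40 if ink else 220
            page.append(base + rng.randrange(16))
    return bytes(page)

def threshold_and_pack(gray, cutoff=128):
    """Threshold to black/white and pack 8 pixels per byte (1 bit per pixel)."""
    packed = bytearray()
    for i in range(0, len(gray), 8):
        byte = 0
        for px in gray[i:i + 8]:
            byte = (byte << 1) | (px >= cutoff)
        packed.append(byte)
    return bytes(packed)

gray = synthetic_gray_page()
bilevel = threshold_and_pack(gray)
gray_z = zlib.compress(gray, 9)
bilevel_z = zlib.compress(bilevel, 9)

print("gray    raw/compressed:", len(gray), len(gray_z))
print("bilevel raw/compressed:", len(bilevel), len(bilevel_z))
```

Packing alone gives exactly 8:1. The extra gain comes from the thresholding wiping out the noise, so the compressor only sees the clean repeating pattern.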
@rl_dane In certain specific cases you can take an already compressed image and encode it with base64 and then compress it 10 times further.
That would be a very strange edge case, where expanding the data stream into base64 exposed regularities that the compression algorithm had missed in the original data.
I've personally never seen base64/uuencoded files become smaller than the original files when compressed. (compared to the original files compressed the same way)
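A quick way to check the usual case is to compress some text-like data directly and again after base64-encoding it. This is a sketch with made-up sample data (random picks from a short word list) and zlib as the compressor; other inputs or compressors could behave differently.

```python
import base64
import random
import zlib

rng = random.Random(0)
words = [b"scan", b"page", b"pixel", b"threshold", b"bilevel", b"gray", b"fax", b"pdf"]

# Text-like sample: 20000 words picked at random, separated by spaces.
original = b" ".join(rng.choice(words) for _ in range(20000))

direct = zlib.compress(original, 9)
encoded = base64.b64encode(original)
via_b64 = zlib.compress(encoded, 9)

print("original:           ", len(original))
print("compressed directly:", len(direct))
print("base64 encoded:     ", len(encoded))
print("base64 + compressed:", len(via_b64))
```

Base64 regroups the input three bytes at a time, so a repeated word no longer always maps to the same characters, and the compressor finds fewer matches.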
@rl_dane I've seen one on the fediverse that did this to poison AI bots. A seemingly 2MB image extracts to 32GB.
@rl_dane Nah, it's only located publicly through robots.txt, a file only bots *should* read.
Technically it is a 4GB bitmap compressed to 20MB PNG, encoded to base64 and then put inline into an HTML that the webserver compresses down to 800kB, or something. Not sure about values, but I guess you get the point.
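The layering described above can be mimicked at a much smaller scale. In this sketch a blank 16 MiB buffer stands in for the bitmap, zlib stands in for both the PNG compression and the webserver's compression, and the sizes are arbitrary; it is not a reconstruction of the actual file.

```python
import base64
import zlib

# Stand-in for the huge, nearly uniform bitmap (16 MiB of zero bytes).
bitmap = bytes(16 * 1024 * 1024)

inner = zlib.compress(bitmap, 9)   # the "PNG" layer
inline = base64.b64encode(inner)   # inlined into the HTML as base64
outer = zlib.compress(inline, 9)   # what the webserver sends over the wire

print("bitmap:", len(bitmap))
print("inner: ", len(inner))
print("inline:", len(inline))
print("outer: ", len(outer))
```

The first pass cannot encode more than 258 bytes per match, so its output is itself a long run of near-identical codes. That repetition survives base64, which is why the second pass still finds a lot to squeeze.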
@rl_dane I know PDF can store 1-bit images with more specialised compression formats developed for fax machines and the like, I wonder if they outperform PNG for that specific use-case.
The fax compression algorithms are very limited, designed for a time when RAM was scarce. They basically compressed a couple rows of pixels at a time, nothing more, AFAIK.
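To give a feel for the row-at-a-time approach, here is a sketch of the run-length step only. Real fax coding then maps the run lengths to fixed code tables (and the 2D modes code a row relative to the one above it); none of that is modelled here, and the function names are invented for the example.

```python
def row_to_runs(row):
    """Turn one row of 0/1 pixels into alternating run lengths.

    Runs alternate white (0), black (1), white, ... and always start with
    white, so a row that begins with black gets a leading white run of 0.
    """
    runs = []
    current, count = 0, 0
    for px in row:
        if px == current:
            count += 1
        else:
            runs.append(count)
            current, count = px, 1
    runs.append(count)
    return runs

def runs_to_row(runs):
    """Inverse of row_to_runs."""
    row, colour = [], 0
    for n in runs:
        row.extend([colour] * n)
        colour ^= 1
    return row

row = [0] * 10 + [1] * 3 + [0] * 5
print(row_to_runs(row))        # [10, 3, 5]
print(row_to_runs([1, 1, 0]))  # [0, 2, 1]
```

Only the current row needs to be held in memory, which fits the point about scarce RAM.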
There is #JBIG2, which has a LOSSY mode for 1-bit monochrome images. Yes, it is exactly as cursed as that sounds: there have been many cases where numbers and figures were changed by JBIG2 because a 6 looked like an 8. Horrifying. XD
I'm not totally sure what happened in this example, because I realized after I posted this that the pdfimages utility is converting whatever the PDF stored the images as into PNGs, not just extracting any embedded PNGs it finds.