So one of my PDFs has a page number and a link at the bottom of every page. It's around 500 pages, so I don't want to edit it manually. Is there any way I can delete those things from all pages of the PDF at once?
Maybe Ghostscript or a Python script can do this?
I also notice there isn't a PDF community on Lemmy, maybe somebody should create one.
A PDF is (or at least can be) similar to an HTML document on the inside. A long time ago we used that at my company to edit PDFs through Java code.
Is it possible for you to share the document so we can take a closer look at it? Or if you don't want it on the internet, is there a way to share it privately?
It's not HTML, though. It's just that PDF is a structured file format (as is HTML, but a very different one). There are libraries for most programming languages that allow you to edit this structure.
You can read the HTML-like structures inside a PDF, find out details about the elements you want to remove, and then remove them by using a property they have in common.
Open the PDF with a text editor like Kate. You will see a lot of encoded content data, but also the "HTML-like" structure in plaintext (in between the encoded stuff, but also more at the bottom).
You might find that editing the PDF by hand will break it completely; that is because it is complicated. IIRC you'd need to fix the index (the cross-reference table of byte offsets), update the stream lengths, or do some other magic bullshit. But that is often taken care of by the library.
Here is a shitty Python example for that demo PDF that redacts the image on the last page by covering it with a white redaction rectangle. As far as I could tell there is no straightforward way in PyMuPDF to delete a single image or text block, but this is just an example. Other libraries might be able to do it (the one I used a decade ago in Java could). I just wanted to point you in the general direction; I hope you can see from here how iterating over all the pages, picking the right element and redacting it would work.
import pymupdf  # PyMuPDF

# Open the PDF
doc = pymupdf.open("./file-sample_150kB.pdf")
# Get the last page
page = doc[-1]
# Get all images on the page
images = page.get_images(full=True)
if images:
    # Get the xref of the first image
    xref = images[0][0]
    # Find all instances of the image and redact their bounding boxes
    for info in page.get_image_info(xrefs=True):
        if info["xref"] == xref:
            rect = pymupdf.Rect(info["bbox"])
            page.add_redact_annot(rect, fill=(1, 1, 1))  # white fill
    page.apply_redactions()
# Save the modified PDF
doc.save("./modified.pdf")
doc.close()
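To spell out the "iterate over all pages" part for the page number and link in the original question, here is a rough sketch. It assumes the footer always sits in a strip at the bottom of unrotated pages; the 50-point strip height and the names `footer_band` and `strip_footers` are made up for this example, so measure your own PDF first.

```python
def footer_band(width, height, band=50.0):
    """Return (x0, y0, x1, y1) of a strip `band` points tall at the page bottom.

    PyMuPDF measures y downward from the top edge, so the bottom strip
    starts at height - band.
    """
    band = min(band, height)  # never reach above the top of the page
    return (0.0, height - band, width, height)


def strip_footers(src, dst, band=50.0):
    """Redact the bottom strip of every page and drop links that touch it."""
    import pymupdf  # imported lazily so footer_band() works without PyMuPDF

    doc = pymupdf.open(src)
    for page in doc:
        strip = pymupdf.Rect(footer_band(page.rect.width, page.rect.height, band))
        # Remove clickable links whose hot area overlaps the strip
        for link in page.get_links():
            if strip.intersects(link["from"]):
                page.delete_link(link)
        # Mark the strip for redaction, then apply it; the text underneath
        # is removed from the page, not merely painted over
        page.add_redact_annot(strip, fill=(1, 1, 1))
        page.apply_redactions()
    doc.save(dst, garbage=3, deflate=True)
    doc.close()
```

Calling `strip_footers("input.pdf", "output.pdf")` would write a cleaned copy. Try it on a few pages first, because a strip that is too tall will also eat the last line of body text.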
A way simpler approach might be to crop all pages at the bottom.
import pymupdf  # PyMuPDF

doc = pymupdf.open("input.pdf")  # open the PDF
for page in doc:
    rect = page.rect  # current page size
    # PyMuPDF's y axis points down, so the bottom edge is y1.
    # Units are points (1/72 inch), not pixels.
    new_rect = pymupdf.Rect(rect.x0, rect.y0, rect.x1, rect.y1 - 100)  # crop bottom 100 pt
    page.set_cropbox(new_rect)
doc.save("output.pdf")  # save the cropped PDF
doc.close()
I don't know how comfortable you are writing your own script, but a PDF stores its components with coordinates, bounding boxes, etc., so you should be able to automate it with a small script that reads the PDF components directly.
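As a sketch of what such a script could look like: the helpers below work on text blocks shaped like the tuples PyMuPDF returns from `page.get_text("blocks")`, keep only those near the bottom edge, and count which texts repeat across pages. The 60-point band and the function names are assumptions made for this example.

```python
from collections import Counter
import re


def bottom_blocks(blocks, page_height, band=60.0):
    """Keep text blocks whose top edge lies in the bottom `band` points.

    `blocks` are tuples shaped like PyMuPDF's page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    return [b for b in blocks if b[1] >= page_height - band]


def normalise(text):
    """Replace digit runs with '#', so 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", text.strip())


def common_footers(path, band=60.0):
    """Count normalised bottom-of-page texts across the whole document."""
    import pymupdf  # lazy import: the helpers above need only the stdlib

    counts = Counter()
    doc = pymupdf.open(path)
    for page in doc:
        blocks = page.get_text("blocks")
        for b in bottom_blocks(blocks, page.rect.height, band):
            counts[normalise(b[4])] += 1
    doc.close()
    return counts.most_common()
```

Anything that shows up roughly once per page in the output of `common_footers("input.pdf")` is a good candidate for the footer, and its bounding box tells you where to redact or crop.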
Also try qpdf to convert the PDF into QDF format; then you can open it in a text editor and find the element you want to remove. Look at a few pages as examples, find the pattern, and do a regex replace. Make sure to keep a copy and check the diff before accepting it.
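The regex step could look roughly like this in Python. It assumes the file was first expanded with something like `qpdf --qdf --object-streams=disable in.pdf out.qdf`, and that the footer appears as a literal string such as `(Page 12) Tj` in the page content. That pattern is purely hypothetical: many PDFs encode text as hex strings or with font-specific encodings, so check what your file actually contains.

```python
import re

# Hypothetical pattern: a text-showing operator whose string is "Page <number>".
# The real pattern depends on how your PDF writes its footer.
FOOTER_OP = re.compile(rb"\(Page \d+\) Tj\r?\n")


def strip_footer_ops(qdf_bytes, pattern=FOOTER_OP):
    """Return (new_bytes, n) with every match of `pattern` removed."""
    return pattern.subn(b"", qdf_bytes)


def rewrite(src_qdf, dst_qdf):
    """Read a QDF file, strip matching operators, and write the result."""
    with open(src_qdf, "rb") as f:
        data = f.read()
    cleaned, n = strip_footer_ops(data)
    with open(dst_qdf, "wb") as f:
        f.write(cleaned)
    return n  # number of removed operators, handy as a sanity check
```

The count returned by `rewrite` should be close to the number of pages. Since removing bytes changes stream lengths and offsets, the edited file will probably need to go through qpdf's `fix-qdf` tool before viewers accept it.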
import fitz  # PyMuPDF (older import name)

doc = fitz.open('./test.pdf')
for page in doc:
    # For every page, draw a rectangle on coordinates (1,1)(100,100)
    page.draw_rect([1, 1, 100, 100], color=(0, 1, 0), width=2)
doc.save('./test-marked.pdf')  # hypothetical output name
Like it or not, Reddit isn't an illegal site or anything. Asking mods to do something about linking to Reddit is like asking them to do something about linking to Twitter. It's not even an "advertisement".
Look at how to do it with Python: you'll learn interesting stuff, get a working result, and not destroy your brain using a chat simulator as programming help.
I don't get why people are fine with comments that are as absurd as saying "to hang a painting, first stab a screwdriver in the wall then attach the painting to it, sometimes it's not too bad"