What are PDF annotations and how to remove them?

Recently, I ran into a scenario where I had to remove links from PDF data that my application was receiving, in base64 form, from a service call.

Annotations are additional objects that can be added on top of a document's content. They include text markup, highlights, sticky notes, links, and many more.

Here is a sample PDF I picked up from the internet that has multiple annotations -

[Image: sample PDF with multiple annotations]

My very first thought was to render the PDF data as HTML, where I could select the links and remove their attributes one by one. I couldn't convert it to HTML successfully, but I was able to render the PDF as SVG. SVG is a vector image format, and because the pages were now images, the links were not present at all. With the links themselves gone, I picked out the link text that was still blue and changed its color to black. But one question remained: where was the PDF viewer?

Because the data was being rendered as images, the PDF viewer was not present at all. Integrating a PDF viewer was an option, but that in turn reduced performance.

Having tried, and failed with, multiple front-end solutions to render the PDF data in an acceptable format, I went back to the service that was receiving the PDF data.

pdf-lib is a library that helps you create and modify PDFs. At first glance, its documentation offers no straightforward method to remove any or all kinds of annotations from PDF data.

After a lot of trial and error and many hours spent reading the documentation, I found a feasible solution.

The first thing you need to do is give the load function your PDF as a base64 string or a Uint8Array -

const { PDFDocument } = require('pdf-lib');
const pdfDoc = await PDFDocument.load(existingPdfBytes);

The load function accepts a base64 string, a Uint8Array, or an ArrayBuffer and parses it into the PDFDocument object that the rest of the library works with.

Go through the pages one by one, look for annotations on each page, and clear them out -

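// None of these calls are asynchronous, so a plain loop replaces the awaited forEach.
for (const page of pdfDoc.getPages()) {
    const annots = page.node.Annots();
    if (!annots) continue; // this page has no annotations
    // Delete each annotation object from the document...
    for (const annot of annots.asArray()) pdfDoc.context.delete(annot);
    // ...then empty the page's annotation array so no dangling references remain.
    while (annots.size() > 0) annots.remove(0);
}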

Save the PDF as a Uint8Array with save(), or as a base64 string with saveAsBase64(), and write or download it as you require.

const fs = require('fs');
// save() resolves to a Uint8Array containing the modified PDF.
const pdfBytes = await pdfDoc.save();
fs.writeFileSync('Download.pdf', pdfBytes);
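If the cleaned PDF needs to travel as base64 again, the way it arrived from the service call, the bytes can be converted with Node's built-in Buffer. This is a small sketch assuming a Node.js service; bytesToBase64 and base64ToBytes are hypothetical helper names, not part of pdf-lib.

```javascript
// Hypothetical helpers for a Node.js service; the names are illustrative.

// Encode raw PDF bytes (for example the pdfBytes returned by save()) as base64.
function bytesToBase64(bytes) {
  return Buffer.from(bytes).toString('base64');
}

// Decode a base64 string back into a Uint8Array.
function base64ToBytes(base64) {
  return new Uint8Array(Buffer.from(base64, 'base64'));
}
```

For instance, bytesToBase64(pdfBytes) gives a string that can be placed in a JSON response, and base64ToBytes reverses it on the receiving side.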

Voilà! Your PDF is free of annotations.

Find the GitHub repository here.