Hi everyone,
I have thousands of PDF documents that I’d like to use as a knowledge base for my Rovo agents.
I’ve tried importing them by converting the PDFs into pages, but the content is essentially vector-based images. As a result, the conversion produces pages full of images with no usable text. I’ve attempted different approaches (LibreOffice and other methods), but I haven’t been able to extract the text properly.
Would you recommend using OCR in this case? Or are there alternative approaches you’ve found effective?
Not being able to import PDFs as proper Confluence pages is quite limiting and is currently putting my “Single Source of Truth” initiative at risk.
Thanks in advance for any suggestions!
Best regards,
Giuseppe
Has anyone here ever tried building a KB ingestion layer that retrieves documents (including PDFs), extracts their content, and transforms them into AI-ready Confluence pages (i.e., not just a simple 1 document = 1 page conversion)?
Curious to hear if anyone has experimented with something like this and what approaches or tools you’ve found effective.
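To make the "not just 1 document = 1 page" idea concrete, here is a minimal sketch of one possible splitting step. It assumes the text has already been extracted from the PDF (by OCR or any other tool) and that section headings are short lines starting with a number such as "1." or "2.3". That heading rule, along with the names `PageDraft` and `split_into_pages`, is hypothetical and only for illustration. Real documents would likely need a more robust heuristic.

```python
import re
from dataclasses import dataclass


@dataclass
class PageDraft:
    """One would-be Confluence page: a title plus plain-text body."""
    title: str
    body: str


# Assumption: a heading is a short line starting with a section number
# such as "1." or "2.3". This is a heuristic, not a general rule.
HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+\S.{0,80}$")


def split_into_pages(doc_title: str, text: str) -> list[PageDraft]:
    """Split extracted text into one page draft per numbered section."""
    pages: list[PageDraft] = []
    title = doc_title
    lines: list[str] = []

    def flush() -> None:
        # Only emit a page if the section has non-blank content.
        if any(l.strip() for l in lines):
            pages.append(PageDraft(title, "\n".join(lines).strip()))

    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            flush()
            title = f"{doc_title}: {stripped}"
            lines = []
        else:
            lines.append(line)
    flush()
    return pages


if __name__ == "__main__":
    sample = "Intro text\n1. Setup\nInstall it.\n2. Usage\nRun it."
    for page in split_into_pages("Manual", sample):
        print(page.title, "->", page.body)
```

Each `PageDraft` could then be created as a separate child page under a parent page for the source document, which keeps pages small enough to be retrieved individually.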
I implemented a RAG-based system using Claude Code. It fetches all the Confluence spaces through the APIs, transforms them into embeddings using gemini-embeddings, and stores them in a Postgres database. Afterwards, I use Claude to create any kind of marketing document, answer product queries, handle Q&A, etc. It has two modes. Restricted mode limits it to using only my data to answer queries and to say no if it can't find any relevant data. This keeps it grounded in the docs we have, which sync nightly. AI mode lets it run loose, but it remains limited to the internal team for research purposes.
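As a rough illustration of the restricted mode described above, here is a minimal sketch. It is not the actual implementation: the in-memory `index`, the `min_score` threshold of 0.75, and the `generate` callback (standing in for the LLM call) are all assumptions made for the example. In a real setup the vectors would come from the embedding model and the similarity search would run in the database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, index, top_k=3, min_score=0.75):
    """Return up to top_k (score, text) pairs scoring at least min_score.

    index is a list of (text, vector) pairs; hypothetical stand-in for
    a vector table in the database.
    """
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in index),
        reverse=True,
    )
    return [(score, text) for score, text in scored[:top_k] if score >= min_score]


def answer_restricted(query_vec, index, generate, min_score=0.75):
    """Answer only from retrieved chunks; refuse when nothing is relevant."""
    hits = retrieve(query_vec, index, min_score=min_score)
    if not hits:
        return "No relevant information found in the knowledge base."
    context = "\n\n".join(text for _, text in hits)
    # generate is a placeholder for the model call that answers from context.
    return generate(context)


if __name__ == "__main__":
    index = [("Pricing page", [1.0, 0.0]), ("Release notes", [0.0, 1.0])]
    print(answer_restricted([1.0, 0.1], index, lambda ctx: "ANSWER: " + ctx))
    print(answer_restricted([-1.0, -1.0], index, lambda ctx: "ANSWER: " + ctx))
```

The refusal branch is what keeps the answers grounded: if no chunk clears the threshold, the model is never called.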
It has an additional section to drop in any type of file, such as PDF or CSV, which can be linked to existing spaces. It took a week to create using Claude Code.
You can look into gemini-embedding-2-preview. It maps text, images, video, audio, and documents into a unified embedding space, enabling cross-modal search, classification, and clustering across over 100 languages.
How about this approach?
I haven't tested whether it works, but it sounds like it could, and it is easy to try. I don't know if it works well with your vector images, which add further complexity. (I am quite sure it would work well with "text-only" PDF documents...)
Unfortunately, due to security constraints, at the enterprise level it’s not allowed to connect Rovo to sources outside the Atlassian platform (e.g., SharePoint, Google Drive, etc.).
That's a tough one. I did some manual conversions from PDF to Confluence, but we're talking tens of pages... not thousands.
I cannot offer a working solution for that volume, so I'm just sharing my past experience that may help put you on the right path.
About two years ago, I used this https://www.onlinedoctranslator.com/en/translationform to translate user manuals for a friend. It was a complex guitar multi-effect manual with many diagrams, annotations, text in columns, etc.
I uploaded a PDF and it spat out a 1:1 copy with the translated text placed where it was supposed to be, including messages that were on illustrations of the gear's screen.
This means the tool is able to extract and process text independently.
The service offers a PDF to Word conversion, so I'd give it a try, just to find out if it produces a reasonable copy.
A long time ago I got good results with Adobe Acrobat and Quark XPress.
OCR might be a good option too, and I'm pretty sure there must be a commercially available tool that handles complex content.
But... as with any migration, you will have to do a cleanup: formatting text, moving images, etc.
Thanks @Kris Klima _K15t_ for sharing this — the tool clearly delivers good results for manual or low‑volume use.
That said, from an enterprise point of view it’s hard to consider it production‑ready:
So while it’s great for experimentation or small migrations, it doesn’t really fit enterprise‑scale KB transformations or AI‑ready pipelines (e.g. preparing content for Rovo Agents) where automation, control and reliability are key.
That is exactly what most people want in order to bring their Confluence up to speed. Unfortunately, Rovo is not smart enough yet to read files; that would be a game changer. I am just concerned about what the supported file size limit would be.
@Giuseppe Torraco - yes, this is obviously true; I only suggested it to verify the plausibility. My thinking was that if a free tool can do that, there must be a commercially available solution that would meet your criteria, or you could develop something internally.
Confluence can import HTML (it's in beta, so your mileage may vary), so that's also an option.
A friend of mine suggested Foxit:
https://www.foxit.com/pdf-editor/convert-pdf/
https://www.foxit.com/resource-hub/user-manuals/foxit-pdf-editor-api-reference-for-application-communication/