Hi everyone,
We're really excited about Rovo's capabilities, especially since we're on a Premium plan and this feature would strongly tip the scales toward a jump to Enterprise. However, we have a couple of questions about how to integrate and crawl third-party documentation sites, or whether this is even possible.
Our primary goal is for Rovo to be able to crawl and learn from the official documentation of tools we heavily rely on in our operations. Some examples of these sites include:
Cloud Providers: AWS, GCP (Google Cloud Platform), CloudFlare, Snowflake.
Security: Palo Alto, Zscaler, Global Protect.
Productivity & CRM: Zoho, Salesforce.
Communities: Specialized Sub-reddits (like /r/raws and some that we moderate).
We've attempted to add these sites via the "Custom Websites" option within the Rovo Connectors. Unfortunately, we've encountered two recurring types of errors:
"Site could not be added": A generic message without further details.
"robots.txt" message: We're told that specific lines need to be added to the site's robots.txt file. We understand the purpose of robots.txt in controlling bot crawling, but we wonder whether Rovo is able to access these external resources at all, especially since we are not the owners of these sites (e.g., AWS documentation).
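As a quick sanity check before adding a site, you can test locally whether a given crawler user-agent would be permitted to fetch a URL under a site's robots.txt rules. Here is a minimal sketch using Python's standard library; the user-agent token "SomeBot" is a placeholder, since Rovo's actual crawler token isn't stated in this thread:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; in practice you would fetch and
# inspect https://example.com/robots.txt for the real site.
robots_txt = """\
User-agent: SomeBot
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether the (hypothetical) crawler may fetch these URLs.
print(rp.can_fetch("SomeBot", "https://example.com/docs/page"))  # True
print(rp.can_fetch("SomeBot", "https://example.com/private/x"))  # False
```

Running a check like this against the target site's real robots.txt can tell you quickly whether a "site could not be added" error is a crawling-policy issue or something else.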
Our question is: is there a specific way or recommended procedure to connect to and crawl these third-party sites? It would be ideal for Rovo to be fed and "trained" with the vast amount of documentation available for the applications we use daily, and this capability would be a decisive selling point for us to upgrade to Enterprise.
We'd greatly appreciate any guidance, tips, or solutions you can offer.
Thanks in advance for your help!
Hi @Mau Jimenez
We connected our own websites and had some issues at first. Even though we had already edited the robots.txt, Rovo gave us error messages. A few days later... it worked.
No changes on our side, so a classic case of "have you tried turning it off and on again". Maybe try again in a couple of days? ;)
I would expect any major software vendor to edit their robots.txt soon to allow larger bots to access their content. Your only option is to contact their support about it. It might be a deliberate strategy NOT to allow random bots/LLMs to process their content, in which case there is not much you can do about it.
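For reference, granting that access is something only the site owner can do. A hypothetical robots.txt that allows one named crawler while keeping a blanket block on everything else would look like this (the user-agent token here is a made-up example, not Rovo's actual one):

```
# Hypothetical example: allow one named crawler, block all others
User-agent: ExampleCrawler
Allow: /

User-agent: *
Disallow: /
```

Unless the site owner publishes an allowance like this for the crawler's user-agent, a well-behaved crawler is expected to skip the site.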