Hello!
I'm trying to use Apache Tika in a Confluence plugin I'm developing.
The problem seems to be dependency issues, as Tika has a lot of dependencies.
I also don't know much about OSGi or how Confluence handles it.
When using the tika-bundle dependency, I get this exception:
ClassNotFoundException: org.apache.xmlbeans.XmlException
When using the tika-parsers dependency, the following error occurs:
ClassNotFoundException: org.w3c.dom.Node
Has anybody here already included Tika in a plugin?
How can I locate the problem and fix it?
Please help, I'm stuck.
Confluence versions 4.3 and 5.1
My issues were related to OSGi package imports. The plugin loads classes through factories at runtime, so Confluence could not detect the required package imports automatically at build time, which caused the errors at runtime.
Since the goal was to extract the text content of MS Office files the same way the OfficeConnector does, here is how I got it to work:
A further issue is that a Confluence plugin has to use a specific Xerces version, as mentioned in this talk: http://www.atlassian.com/company/about/events/summit/2010/presentations/under-the-hood/plugins2-and-osgi-gotchas.jsp. Since POI (and Tika) use Xerces, I had to find a POI version that matches the Xerces version shipped with Confluence.
By looking at the atlassian-plugin.xml and pom.xml of the OfficeConnector plugin, I found out which dependencies work:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.5-FINAL</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.5-FINAL</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.5-FINAL</version>
    <exclusions>
        <exclusion>
            <groupId>stax</groupId>
            <artifactId>stax-api</artifactId>
        </exclusion>
        <exclusion>
            <groupId>xml-apis</groupId>
            <artifactId>xml-apis</artifactId>
        </exclusion>
    </exclusions>
</dependency>
With these dependencies, I can extract text from PowerPoint, Excel and Word files in the current formats.
The next step was to write the OSGi Import-Package section in the atlassian-plugin.xml by hand.
In fact, I merged the generated entries, which you can find in the OSGi browser in Confluence's admin section, with the entries from the OfficeConnector's atlassian-plugin.xml.
And that worked.
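To check whether a hand-written Import-Package list covers everything, it may help to probe the class names from the stack traces with the plugin's own class loader before calling the parser. The following is only a minimal sketch: ImportProbe and its methods are hypothetical names, not part of Confluence, POI or Tika, and it assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which class names a given class loader cannot resolve. */
final class ImportProbe {

    private ImportProbe() {
    }

    /** Returns true if the loader can resolve the class; the class is not initialized. */
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not visible from this loader, e.g. because the package is not imported.
            return false;
        }
    }

    /** Returns the subset of classNames that the loader cannot resolve, in input order. */
    static List<String> missing(ClassLoader loader, String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isVisible(name, loader)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Inside a plugin class, one could call ImportProbe.missing(getClass().getClassLoader(), "org.apache.xmlbeans.XmlException", "org.w3c.dom.Node") and log the result. Any name that comes back points to a package that is neither bundled with the plugin nor imported.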
Thanks for sharing the solution, Philipp!
So you got Tika working for parsing Office files using the above dependencies? I know it's been a long time, but if you remember, could you please share some more details?
I'm trying to integrate with Tika as well, so any info you can provide would be really helpful!
You will need to add the dependency to your plugin's pom.xml; see http://www.sonatype.com/books/mvnex-book/reference/customizing-sect-add-depend.html
You might first need to install Tika in your local Maven repository.
Do I understand you correctly that I should add the Tika dependency to the pom.xml?
I've done that.
The issue might be that Tika has a lot of transitive dependencies that already ship with Confluence, but in different versions.
I can compile the plugin, and atlas-run starts as well.
But when Tika gets called, it cannot find these classes.
The exceptions are caused by org.apache.xmlbeans.XmlException and org.w3c.dom.Node.
You may need to add dependencies for these classes.