Migration: Process for single HTML Export of existing wiki and single import using an (unknown) Bulk Import Command

Freedom Is Not Anarchy January 6, 2012


GOAL: Migration: Process for single HTML Export of existing wiki and single import using an (unknown) Bulk Import Command

##################################
### USE CASE:
##################################

(1) Company-Z has an existing wiki implemented in MediaWiki ( www.mediawiki.org ).

(2) This existing wiki is relatively simple.

** Less than 20 Images (.png)
** Less than 20 HTML Tables

(3) The plan of action that Company-Z desires is this:

** A simple Bulk-Export into a filesystem.
** Followed by a Bulk-Import using a command line.

(3.a) First, Export. A one-time HTML Export of the existing Wiki ...

* This one-time Export is trivial.

* A very satisfactory 1-time Export has already been accomplished by using "wget":

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-parent --domains media-wiki.company-z.com http://company-z.com/media-wiki/index.html

* This 1-time Export was created by "wget" and is completely browsable on the filesystem as "C:\wiki_project\company-z.com\media-wiki\index.html"
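
Before attempting any import, it can help to confirm what the exported tree actually contains. The following is a minimal sketch, not part of the original post: the function name `count_by_extension` is hypothetical, and it assumes only that the export is an ordinary directory tree like the one wget produced above.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count the files under `root`, grouped by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Example call, using the export path mentioned in the post:
# for ext, n in sorted(count_by_extension(r"C:\wiki_project\company-z.com\media-wiki").items()):
#     print(ext, n)
```

The resulting counts (HTML pages, .png images, and so on) give a rough measure of how much content any importer, or a manual copy, would have to handle.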

(3.b) Finally, Import. What is the Bulk Import Command Line Tool ?

* Our site is so simple that we will accept a simple 1-time-Import into Confluence, and proceed from there to author new content.

* Is there a simple and bulk importer of a file-system tree authored by and supported by Atlassian?

* Invoking an analogy with the common tool Subversion, one would issue the command,
svn import "C:\wiki_project\company-z.com\media-wiki" http://wiki-confluence.company-z.com/wiki-confluence/

* We imagine that Atlassian supports a bulk import command, similar to this:

wput --recursive --domains wiki-confluence.company-z.com --source "C:\wiki_project\company-z.com\media-wiki" --destination http://wiki-confluence.company-z.com/wiki-confluence/

* We need this command to support our potential purchase from Atlassian.


##################################
### BACKGROUND DISCUSSION ALREADY INCLUDES:
##################################

HTML import into Confluence: ( https://jira.atlassian.com/browse/CONF-1072 )

Importing Content Into Confluence: ( http://confluence.atlassian.com/display/DOC/Importing+Content+Into+Confluence )

Importing Content from another Wiki: ( http://confluence.atlassian.com/display/DOC/Importing+Content+from+another+Wiki )

Content Import Plugin: ( https://plugins.atlassian.com/plugin/details/18345 ) ( From "Communardo Software GmbH" )




3 answers

1 vote
Nic Brough -Adaptavist-
Community Leader
January 7, 2012

1) Html is *not* a data exchange format, it's for presenting data to a browser, so the browser can render it to a human. In some cases, there's nothing more than html, but for most applications, there is a lot less information in the rendered html than there is in the raw data. For a wiki, it's also unnecessarily complex and contains a lot of redundant data.

If you look carefully at the html and the wiki markup for a complex page, you will find that you can always work out the html from the markup, but you can NOT always work out the markup from the html.

So, html is NOT a "universal standard for wiki output" and should never be thought of as that.
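
The one-way relationship described in point 1 can be illustrated with a toy example. This is only a sketch under stated assumptions: `render_bold` is a hypothetical stand-in for a wiki renderer, not MediaWiki's or Confluence's actual parser, and it handles a single rule (triple-apostrophe bold). It shows two different pieces of markup producing identical HTML, so the original markup cannot be recovered from the HTML alone.

```python
import re


def render_bold(markup):
    """Toy renderer: turn '''text''' into <b>text</b>; pass everything else through."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", markup)


# Two different sources: one uses wiki bold, the other uses inline HTML.
source_a = "This is '''important'''."
source_b = "This is <b>important</b>."

html_a = render_bold(source_a)
html_b = render_bold(source_b)

# Both sources render to the same HTML, so an importer that sees only the
# HTML has no way to tell which markup the author originally wrote.
print(html_a == html_b)
```

A real wiki applies many such rules (templates, macros, link syntax), which multiplies the number of markup sources that can sit behind one piece of rendered HTML.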

2) It's "complicated" because migrating between wiki formats is NOT simple. Wikis are easy to use because they manipulate data before presenting it to the user making it very easy to enter data for them. But the rules they use for interpreting that data into html vary, and hence an importer needs you to explain how you want those rules interpreted in the import. Also, it needs all your settings in order to read the source. Read the documentation and you'll see that there is a need for each parameter and the properties file.

Short story is you need to provide that information to ANY conversion tool. "One simple command for importing html" is pretty much a pipe dream (either you need to write all your conversions into code, making it useless for any other wiki, or you need to tell a generalised import how to do the conversion). You really need to forget that idea and start thinking about how to import your data properly.

As I said before, I'd bin the html, it's not a useful format for a wiki conversion. (unless the data is immensely simple, to the point where you can run html2text and import plain text, or (slightly better), open each page in word, save as word document and import one page at a time from there)
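
For the "immensely simple" case mentioned above, the idea of reducing a page to plain text can be sketched with Python's standard library. This is an illustration only, not the html2text tool itself: the names `TextExtractor` and `html_to_text` are hypothetical, and the sketch deliberately discards all structure (tables, links, images), which is exactly the information a proper wiki conversion would need to keep.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the text fragments of `html`, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Each text fragment ends up on its own line, so even inline emphasis splits a sentence; the result is only a starting point for pasting content into new pages by hand.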

0 votes
Freedom Is Not Anarchy January 7, 2012

(1)
It looks like this very request is an open issue for Atlassian.
As of today this issue has 46 Yes-votes, and 22 Watching-this-issue.
HTML import into Confluence: ( https://jira.atlassian.com/browse/CONF-1072 ) ( CONF-6623 Provide simple HTML to Confluence conversion )

(2)
Many products allow bulk imports of data from a file system. Here are some ...

Microsoft SharePoint: Bulk uploads include ( http://spbulkdocumentimport.codeplex.com/ , http://www.softpedia.com/get/Internet/Servers/WEB-Servers/Document-Import-Kit-for-SharePoint-2007-DocKIT.shtml )
Knowledgetree Document Management System: Bulk uploads of documents built into the product.
Oracle: Bulk uploads of data are possible, including with CSV files
Subversion, Perforce, Clearcase: Bulk import of filesystem

0 votes
Nic Brough -Adaptavist-
Community Leader
January 6, 2012

I'm not aware of any single-line importer, mostly because there is no standard format for wiki based data. They all store data differently and hence there's always a set of data conversions to be done. No-one is going to be able to write a "single line command" that will be able to read your mind and convert all your data flawlessly.

You've already got the links to get you started on doing an import. I'd simply bin their html/wget export and use the mediawiki importer - this will be as close as you can get to a "single line" and unless their data is horrendously complex, it does work rather well. https://studio.plugins.atlassian.com/wiki/display/UWC/UWC+Mediawiki+Notes

Although, for 20 pages, I'd be tempted to do it by hand - I can "import" a page in less time than it would have taken me to type up your question, albeit without a lot of formatting. Another option might be to grab the pages as Word documents - Confluence's Word importer is good too.

Freedom Is Not Anarchy January 7, 2012

@ NicB: Thank you Nic for your response and the effort that you have taken ! Please allow me to comment on ideas that you have led me to think about ...


(1)

NicB writes, " ... no standard format for wiki based data, all store data differently and hence there's always a set of data conversions to be done." .

I am not sure I agree with this (Perhaps we are not connecting).

In the sense that I mean it, there is indeed a standard output format ... HTML output is the standard.

Every Wiki does one thing: (Read/Write Wiki-Markup Text) ==> (HTML Output) ==> (Browsers).

Since HTML Output is a Universal Standard for all Wiki Output, why would Atlassian neglect to supply an HTML importer ?

It is trivial to export an entirely operational website that is browsable and in all senses fully operational onto a filesystem.

Has Atlassian overlooked the possibility that HTML sitting on a filesystem, could be the Universal Import Format that would be the quickest way for most people to migrate to Confluence ?

Please note that in the scenario I am describing, it would be best if the user doing the HTML export is a Wiki super user, because then the wget export can capture everything in the Wiki (a super user has access to everything). Also, if you don't have access to wget/Cygwin, there are many other free tools that export websites as HTML.

(2)

NicB writes, " ... mediawiki importer - this will be as close as you can get to a 'single line' ... https://studio.plugins.atlassian.com/wiki/display/UWC/UWC+Mediawiki+Notes , https://studio.plugins.atlassian.com/wiki/display/UWC/UWC+Quick+Start ".

This tool seems to be asking for 10 or more parameters within an "exporter.mediawiki.properties" file, to start a conversion.

Again, why so complicated ?

I already have a fully functional website stored on my C:\ drive, containing 582 html files, 18 png, 16 gif, several html tables, and many hyperlinks.

Let's go, and import that tree with a clean and direct HTML import.
