How to list Confluence spaces and extract different information (last contributor and date, number of pages, blogpost, like, comment, total size attachment...) in pure SQL? and all with one quick query.
Ok let's see how to do it step by step.
The goal is to build an SQL query to list all the Confluence spaces of an Atlassian server or Atlassian Datacenter instance (*). Here is the list of information and indicators retrieved for each space in this request:
We did not do any specific benchmark to measure query execution time. But as an example we obtain in our test Confluence instance an execution time of 9.8 seconds for 768 spaces.
(*) Tested only on Confluence server 7.13 with PostgreSQL. The query is written for PostgreSQL database without use exotic SQL function, so there shouldn't be a lof of adjustement to do for other DBMS.
https://gist.github.com/v20100v/c6a196ef364276b62429c0af90bad137
SELECT
s.spaceid as space_id,
s.spacename as space_name,
s.spacekey as space_key
FROM spaces s
To do this, we need to join two table. The description of a space is in the content table. Then just filter on content type condition (contenttype=SPACEDESCRIPTION) like this :
SELECT
s.spaceid as space_id,
s.spacename as space_name,
s.spacekey as space_key,
sd.body as space_description
FROM spaces s
LEFT JOIN content ON content.spaceid = s.spaceid
LEFT JOIN bodycontent sd ON sd.contentid = content.contentid
WHERE content.contenttype = 'SPACEDESCRIPTION'
Still in the table space, we find the creator and creationdate fields that can be easily joined.
SELECT
s.spaceid as space_id,
s.spacename as space_name,
s.spacekey as space_key,
sd.body as space_description,
s.creationdate as space_creation_date,
up.username as space_creator_username,
u.display_name as space_creator_displayname,
u.email_address as space_creator_email
FROM spaces s
LEFT JOIN content ON content.spaceid = s.spaceid
LEFT JOIN bodycontent sd ON sd.contentid = content.contentid
LEFT JOIN user_mapping up ON up.user_key = s.creator
LEFT JOIN cwd_user u ON u.user_name = up.username
WHERE content.contenttype = 'SPACEDESCRIPTION'
Be carefull to the interpretation of the lastmodifier and lastmoddate field! It is by no means the last contributor in the space who would have modified a page or deposited an attachment. In fact, this is the last user to have modified the description field of the space.
We will see below how to identify the last contributor to a space, and more precisely the last creator of new content, and the last user to have modified content.
Just use the MAX() aggregate function on the creationdate field of the content table, and group the result by space. In this example, we have excluded the description of a space from the contributions of a space. So if you want to focus only on pages or blogposts, it will suffice to filter on these types of content.
SELECT c2.spaceid, MAX(c2.creationdate) as last_creationdate
FROM content c2
WHERE c2.content_status = 'current'
AND c2.creationdate IS NOT NULL
AND c2.contenttype != 'SPACEDESCRIPTION'
GROUP BY c2.spaceid
Ok it's easy to fetch the last creation date, but to add the associated creator it's a bit more difficult. To do this, we use an SQL subquery like this.
SELECT c.spaceid, up.username, c.creationdate
FROM content c
LEFT JOIN user_mapping up ON c.creator = up.user_key,
(
SELECT c2.spaceid, MAX(c2.creationdate) as last_creationdate
FROM content c2
WHERE c2.content_status = 'current'
AND c2.creationdate IS NOT NULL
GROUP BY c2.spaceid
) last_creation
WHERE c.content_status = 'current'
AND c.spaceid = last_creation.spaceid
AND c.creationdate = last_creation.last_creationdate
Ok nice! We can now add to our starting query with a simple join on the space identifier.
SELECT
s.spaceid as space_id,
s.spacename as space_name,
s.spacekey as space_key,
sd.body as space_description,
s.creationdate as space_creation_date,
up.username as space_creator_username,
u.display_name as space_creator_displayname,
u.email_address as space_creator_email,
last_creation.username as last_creator,
last_creation.creationdate as last_creation_date
FROM spaces s
LEFT JOIN (
SELECT c.spaceid,up.username, c.creationdate
FROM content c
LEFT JOIN user_mapping up ON c.creator = up.user_key,
(
SELECT c2.spaceid, MAX(c2.creationdate) as last_creationdate
FROM content c2
WHERE c2.content_status = 'current'
AND c2.creationdate IS NOT NULL
AND c2.contenttype != 'SPACEDESCRIPTION'
GROUP BY c2.spaceid
) last_creation
WHERE c.content_status = 'current'
AND c.spaceid = last_creation.spaceid
AND c.creationdate = last_creation.last_creationdate
) last_creation ON last_creation.spaceid = s.spaceid
LEFT JOIN content ON content.spaceid = s.spaceid
LEFT JOIN bodycontent sd ON sd.contentid = content.contentid
LEFT JOIN user_mapping up ON up.user_key = s.creator
LEFT JOIN cwd_user u ON u.user_name = up.username
WHERE content.contenttype = 'SPACEDESCRIPTION'
We do the same thing by doing the following left join:
SELECT
(...)
last_modification.username as last_modifier,
last_modification.lastmoddate as last_modification_date
FROM spaces s
LEFT JOIN (
SELECT c3.spaceid, up2.username, c3.lastmoddate
FROM content c3
LEFT JOIN user_mapping up2 ON c3.lastmodifier = up2.user_key,
(
SELECT c4.spaceid, MAX(c4.lastmoddate) as last_modificationdate
FROM content c4
WHERE c4.content_status = 'current'
AND c4.lastmoddate IS NOT NULL
AND c4.contenttype != 'SPACEDESCRIPTION'
GROUP BY c4.spaceid
) last_modification
WHERE c3.content_status = 'current'
AND c3.spaceid = last_modification.spaceid
AND c3.lastmoddate = last_modification.last_modificationdate
) last_modification ON last_modification.spaceid = s.spaceid
(...)
Rather than separating this calculation into several queries with GROUP BY clause, we group them into a single sub-query for performance reasons. It is much faster to do a matrix transpose than to perform several joins.
LEFT JOIN(
SELECT
c5.spaceid,
c5.contenttype,
CASE WHEN c5.contenttype = 'PAGE'
THEN COUNT(c5.contenttype) END nb_of_page,
CASE WHEN c5.contenttype = 'BLOGPOST'
THEN COUNT(c5.contenttype) END nb_of_blogpost,
CASE WHEN c5.contenttype = 'ATTACHMENT'
THEN COUNT(c5.contenttype) END nb_of_attachment,
CASE WHEN c5.contenttype = 'CUSTOM'
THEN COUNT(c5.contenttype) END nb_of_custom
FROM content c5
WHERE c5.content_status = 'current'
AND c5.contenttype != 'SPACEDESCRIPTION'
GROUP BY c5.contenttype, c5.spaceid
) t ON t.spaceid = s.spaceid
And finally we add it to our starting query with a simple join on the space identifier, as we saw previously. But this time we add an additional GROUP BY taking the MAX in order to denormalize the result, i.e. do the transpose.
SELECT DISTINCT
s.spaceid as space_id,
s.spacename as space_name,
s.spacekey as space_key,
s.spacetype as space_type,
s.spacestatus as space_status,
sd.body as space_description,
s.creationdate as space_creation_date,
up.username as space_creator_username,
u.display_name as space_creator_displayname,
u.email_address as space_creator_email,
MAX(t.nb_of_page) as nb_of_page,
MAX(t.nb_of_blogpost) as nb_of_blogpost,
MAX(t.nb_of_attachment) as nb_of_attachment,
MAX(t.nb_of_custom) as nb_of_custom
FROM spaces s
(...)
LEFT JOIN(
SELECT
c5.spaceid,
c5.contenttype,
CASE WHEN c5.contenttype = 'PAGE'
THEN COUNT(c5.contenttype) END nb_of_page,
CASE WHEN c5.contenttype = 'BLOGPOST'
THEN COUNT(c5.contenttype) END nb_of_blogpost,
CASE WHEN c5.contenttype = 'ATTACHMENT'
THEN COUNT(c5.contenttype) END nb_of_attachment,
CASE WHEN c5.contenttype = 'CUSTOM'
THEN COUNT(c5.contenttype) END nb_of_custom
FROM content c5
WHERE c5.content_status = 'current'
AND c5.contenttype != 'SPACEDESCRIPTION'
GROUP BY c5.contenttype, c5.spaceid
) t ON t.spaceid = s.spaceid
(...)
WHERE content.contenttype = 'SPACEDESCRIPTION'
GROUP BY
s.spaceid, s.spacename, s.spacekey, s.spacetype,
s.spacestatus, sd.body, up.username, u.user_name,
u.display_name, u.email_address, last_creator,
last_creation_date, last_modifier, last_modification_date
A comment is also contain into the content table. But we must do a reflexive join to find the associated space, because a comment does not have an associated spaceid. And so you have to find the page associated with the comment to find the associated space.
SELECT DISTINCT
(...)
MAX(t2.nb_of_comment) as nb_of_comment,
(...)
LEFT JOIN(
SELECT c7.spaceid, COUNT(c6.contenttype) as nb_of_comment
FROM content c6
JOIN content c7 ON c6.pageid = c7.contentid
WHERE c6.prevver IS NULL
AND c6.content_status = 'current'
AND c6.contenttype = 'COMMENT'
GROUP BY c6.contenttype, c7.spaceid
) t2 ON t2.spaceid = s.spaceid
(...)
We use the likes table.
SELECT DISTINCT
(...)
MAX(t3.nb_of_like) as nb_of_like
(...)
LEFT JOIN(
SELECT c8.spaceid, COUNT(l.id) as nb_of_like
FROM likes l
JOIN content c8 ON l.contentid = c8.contentid
WHERE c8.prevver IS NULL
AND c8.content_status = 'current'
GROUP BY c8.spaceid
) t3 ON t3.spaceid = s.spaceid
(...)
We use the contentproperties table.
SELECT DISTINCT
(...)
MAX(t4.total_attachments_size) as total_attachments_size
(...)
LEFT JOIN(
SELECT c9.spaceid, SUM(cp.LONGVAL) as total_attachments_size
FROM content c9
JOIN content c10 ON c9.contentid = c10.pageid
JOIN contentproperties cp ON c10.contentid = cp.contentid
WHERE c10.contenttype = 'ATTACHMENT'
GROUP BY c9.spaceid
) t4 ON t4.spaceid = s.spaceid
(...)
Here we go !
To implement a server-side paging system (for a Confluence data mining API), the simplest idea is to add the usual statement "LIMIT $limit OFFSET $offset" at the end. But the performance gain is extremely limited with regard to the number of rows in each join to be processed. Indeed, in our test instance, extracting 10 spaces takes almost 9.1 seconds.
The paging gain is minimal. In fact, for the gain to be significant, the dataset of each join should be filtered with the following WHERE condition :
(...)
AND s.spaceid IN (
SELECT spaceid FROM spaces ORDER BY spaceid LIMIT 10 OFFSET 0
)
By adding this WHERE condition on every join, extracting 10 spaces takes almost 518ms in our test instance. Harder, Better, Faster... Stronger !!!
Below is an example of adding this pagination clause in the join that calculates the total size of the attachments. For the implementation with an application server (Java, PHP, nodeJs...), it is enough to replace the numerical values of this example by the variables: limit and offset.
SELECT DISTINCT
(...)
LEFT JOIN(
SELECT c9.spaceid, SUM(cp.LONGVAL) as total_attachments_size
FROM content c9
JOIN content c10 ON c9.contentid = c10.pageid
JOIN contentproperties cp ON c10.contentid = cp.contentid
WHERE c10.contenttype = 'ATTACHMENT'
AND c10.spaceid IN (SELECT spaceid FROM spaces ORDER BY spaceid LIMIT 10 OFFSET 0)
GROUP BY c9.spaceid
) t4 ON t4.spaceid = s.spaceid
WHERE content.contenttype = 'SPACEDESCRIPTION'
AND s.spaceid IN (SELECT spaceid FROM spaces ORDER BY spaceid LIMIT 10 OFFSET 0)
GROUP BY
s.spaceid, s.spacename, s.spacekey, s.spacetype,
s.spacestatus, sd.body, up.username, u.user_name,
u.display_name, u.email_address, last_creator,
last_creation_date, last_modifier, last_modification_date
Vincent Blain
2 comments