Debugging database size, searchindex and revisions table bloat in Craft CMS 3

I recently helped debug a Craft CMS 3 installation where its database would grow exponentially larger over time. After researching and resolving the issue I figured it might be worth sharing some notes on that work.

The web site I needed to debug was norceresearch.no. We built the site two years ago and have added integrations to both Azure Active Directory and Cristin.no (Current Research Information System in Norway). And with Craft’s FeedMe plugin we run four periodic jobs where we import:

  1. Users from an Active Directory XML export.

  2. Research publications ordered by the research id.

  3. Research projects from Cristin. This job runs twice, where it imports first Norwegian and then English data.

This was running all well and fine but we noticed that the database was quickly growing over time. This was perplexing because we were not importing more and more content into Craft, yet our database backups were growing ever more in size.

So, what’s taking up space?

Large, wild vines covering a wall.

Thanks to this snippet I was able to get an overview of the tables’ sizes:

SELECT
    table_name as "Table",
    table_rows as "Rows",
    data_length as "Length",
    index_length as "Index",
    round(((data_length + index_length) / 1024 / 1024),2) as "Size (mb)"
FROM information_schema.TABLES
WHERE table_schema = 'nameofmynicecraftdatabase'
ORDER BY `Size (mb)` DESC;

After running this script I learned that the searchindex table was around 3GBs. Okay, so my query revealed that the searchindex table was the largest, but what exactly was taking up space inside of it?

I inspected the table and saw that it had these table fields: elementId, attribute, fieldId, siteId and keywords. There wasn’t much to learn from that but I wondered if I could do a SQL query to aggregate the most common fields and join it against a table that would reveal their names.

So, I wrote this SQL query:

select count(*) as num_search_entry, f.handle
from searchindex s, fields f
where s.fieldId = f.id
group by f.handle
order by num_search_entry DESC;

This query showed me that some of the biggest space takers were matrix block fields related to one of the FeedMe imports. When we originally setup the imports we decided to deactivate versioning for imported entries and deactivate add-to-search-index for many of the imported fields. This query showed us that we had overlooked some fields that were not necessary to add to the searchindex table.

I deactivated the fields from being included in the searchindex, but I wasn’t sure how to reset that table. I first tried calling "./craft gc” but the command failed on a SQL timeout when it tried to call DELETE FROM on the searchindex table. It was literally so large that the delete operation would run out of time. So, I could try to adjust the timeout configuration or maybe just empty the table completely. But was that safe to do?

After searching in the community Discord chat and the web I found some snippets of internal Craft CMS code that truncated the table before re-indexing, which was a strong hint that it should be okay to just empty the table (I made a backup just in case).

  1. I ran truncate searchindex.

  2. I then used the Craft CLI to run ./craft resave/entries and ./craft resave/users, to rebuild the search index.

  3. I then re-ran the SQL query to calculate the table sizes and the searchindex table had gone from from 3GB to 20MB in size (!).

The deed was done, or so I thought.

When I now looked at the results from the SQL query that ranked tables by size it showed that some tables still seemed a bit large. There was one matrix block table which when queried showed it was 692MB large and contained 3 588 088 rows. Hmm, could it be stale data somehow? I ran ./craft gc --delete-all-trashed, which resulted in reducing the matrix block table down to 1 693 906 rows, yet increasing table size to 745MB (?). It was puzzling.

Diving deeper 🌊

At this point it felt like holding a Rubik’s cube, turning it around and trying to figure out how to go forward. It’s like you’re holding the codebase, turning it around, poking and prodding it, watching to see what falls out. So, here’s what I did.

Just for kicks I tried upgrading from Craft CMS 3.3 to 3.4 to see if that would have an effect. Afterwards when I went to call ./craft resave/entries I noticed that there was a new utility CLI command called ./craft utils/prune-revisions/index. I tried calling it. I then saw the terminal print out all the work it was doing and I noticed that a lot of entries consistently had the same amount of revisions, forty to be exact.

From the terminal output I copied the ID of one of the entries that had had forty revisions and did a SQL query in the elements_sites table. I then learned that it was a Cristin project, a resource that was automatically imported by FeedMe every day. It was strange that this entry would have revisions because in the admin settings I had unchecked the setting “Enable versioning for entries in this section?".

I wrote a SQL query to tally up revisions in a way that would give hints to what entry types they were connected to:

select count(*) as total_revisions, r.sourceId, siteId, uri, e.type
from revisions r, elements_sites es, elements e
where r.sourceId = es.elementId and es.elementId = e.id
group by r.sourceId, siteId, uri, e.type
order by total_revisions DESC;

Sure enough, the SQL query showed that the database had a lot of revisions for sections where versioning was supposed to be disabled. I searched in the community chat and found some discussions about a maxRevisions setting in relations to enableVersioning:

“Hi everyone. I’m having an issue on a live site, where there are way too many revisions saved for every entry (most of them months old from a set of imports), and I can’t find a way to remove old versions that aren’t needed.” ~ cameo.

“Actually yeah, it looks like if you disable Entry Versions on a section that’s previously had them enabled - even saving the entry doesn’t remove the old versions.” ~ Daryl Knight

“I think if you set maxRevisions = 1 then re-save an entry, it works. But you’re right, if versioning is disabled nothing happens.” ~ croxton

In addition to searching in the community chat I also made a search in Craft’s Github repository for mentions of enableVersioning. I was curious of how and where it was used and mainly found this function where it was used. This supports the experience of the quoted users. If versioning is deactivated for a Craft section it won’t run any revision related code, not even delete any existing revisions.

At this point I was fairly sure that I’d found the culprit for the lingering database bloat. A way forward now would be to run ./craft utils/prune-revisions/index with maxRevisions set to 1 in order to clear out any unnecessary revisions. The only downside to this approach is that it would also remove revisions worth keeping. So, I went over to Craft’s Github repository and suggested two potential fixes in this issue.

Maybe the maintainers will address this issue, or maybe they won’t. That’s fine though, at least they’ve been made aware, and now you know as well.

Thanks for reading this debugging deep dive into Craft’s inner workings.