Is it normal for a db to keep compacting for days after no activity during the last db connection?

Hello,

I’ve embedded a RocksDB 7.7.3 database into an application and ingested 30 billion entries, racking up about 14TB on an SSD. When the ingestion finished, I took that as “job done”.

A couple of days later, when I opened the database again, the log showed compaction activity using a single CPU. It’s been days and the compaction still seems to be going on.

I don’t remember whether this db behaved the same way immediately after ingestion, when I opened and closed it a few times, or whether the days that elapsed had anything to do with it (some staleness factor?). Could that be the case, and are there time-based rules governing compaction behavior?

As the database still keeps compacting automatically, is it safe to close the db by calling db.close? And further, would there be data loss if the main application gets a segfault from somewhere else and exits?

Thanks,
Joe

https://groups.google.com/g/rocksdb/c/Lruy6LV4g7U/m/JIeo0bQxAAAJ

Reply by Hilik, Speedb’s co-founder and chief scientist:
This may be caused by TTL or periodic compaction. Please search for the term “compaction_reason” in the RocksDB LOG file …
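For reference, the time-based compactions Hilik mentions are driven by two ColumnFamilyOptions fields; a minimal sketch, with purely illustrative values rather than recommendations:

// Illustrative only: the ColumnFamilyOptions fields behind TTL and periodic compaction.
rocksdb::ColumnFamilyOptions cf_options;
cf_options.ttl = 30 * 24 * 60 * 60;                         // seconds; files with data older than this are scheduled for compaction
cf_options.periodic_compaction_seconds = 30 * 24 * 60 * 60; // seconds; data is re-compacted at least this often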

Reply from Mark Callaghan:
Congrats, 14T is a lot of data.
Compaction state is crash-safe and AFAIK you have to modify source code to make it not crash safe.
Can you share details on the RocksDB configuration? If there is just one thread doing a long compaction, I will guess you are using universal.

Reply by the question author:
It should be the default leveled compaction. During ingestion I did see multiple threads working, around 8.
I think it finally finished.
In the FAQ there is this:
Q: Can I close the DB when a manual compaction is in progress?
A: No, it’s not safe to do that. However, you can call CancelAllBackgroundWork(db, true) in another thread to abort the running compactions, so that you can close the DB sooner. Since 6.5, you can also speed it up using DB::DisableManualCompaction().
Would this be the exception to the rule of crash safety?
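For reference, a minimal sketch of what that FAQ advice looks like in code, assuming a rocksdb::DB* named db and that the cancel call is issued from another thread:

// Sketch only: abort background work so the DB can be closed sooner.
// CancelAllBackgroundWork comes from rocksdb/convenience.h.
rocksdb::CancelAllBackgroundWork(db, /*wait=*/true); // blocks until background jobs have stopped
db->DisableManualCompaction();                       // aborts a running manual CompactRange, if any
rocksdb::Status s = db->Close();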
Here are my configs. Do they look OK?
db_options.create_if_missing = true;
db_options.create_missing_column_families = true;
db_options.unordered_write = true;
auto num_threads = 32;
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::HIGH);
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::LOW);
db_options.max_background_jobs = num_threads;
db_options.bytes_per_sync = 1048576; // 1MB
std::vector<rocksdb::ColumnFamilyDescriptor> column_families;

rocksdb::ColumnFamilyOptions cf_options;
cf_options.compression = rocksdb::CompressionType::kLZ4Compression;
cf_options.bottommost_compression = rocksdb::CompressionType::kZSTD;

// custom for SSD, per the RocksDB Tuning Guide (facebook/rocksdb wiki)
cf_options.write_buffer_size = 64 << 20;
cf_options.max_write_buffer_number = 4;
cf_options.min_write_buffer_number_to_merge = 1;
cf_options.level_compaction_dynamic_level_bytes = true;
// table_options.index_type = rocksdb::BlockBasedTableOptions::kHashSearch; // maybe good for prefix db

// recommended in Setup Options and Basic Tuning (facebook/rocksdb wiki)
rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = rocksdb::NewLRUCache(128 << 20);
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_options.optimize_filters_for_memory = true;
table_options.block_size = 16 * 1024;
table_options.cache_index_and_filter_blocks = true;
table_options.pin_l0_filter_and_index_blocks_in_cache = true;
cf_options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

// prefix
cf_options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(sizeof(uint32_t)));
cf_options.memtable_prefix_bloom_size_ratio = 0.1;
cf_options.memtable_whole_key_filtering = true;

column_families.push_back(rocksdb::ColumnFamilyDescriptor(rocksdb::kDefaultColumnFamilyName, cf_options));
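For completeness, a minimal sketch of the open call that consumes the options built above; the path and variable names are placeholders:

// Sketch only: opening the DB with db_options and column_families from above.
std::vector<rocksdb::ColumnFamilyHandle*> cf_handles;
rocksdb::DB* db = nullptr;
rocksdb::Status s = rocksdb::DB::Open(db_options, "/path/to/db",
                                      column_families, &cf_handles, &db);
assert(s.ok()); // handle the error properly in real code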
Reply by Mark C.:
When compaction took a long time to run:

  • was it a manual or normal compaction
  • how many threads appeared to be in progress?
  • do you have any compaction IO stats in the RocksDB LOG (grep for L0, it is formatted as a table)

Too bad the docs don’t elaborate on what is meant by “not safe”. Compaction state should be crash-safe and I would consider it to be a bug otherwise. But I have little experience with manual compaction – long ago it was single-threaded so I stayed away from it given the poor performance.

Options look OK for the most part.

  1. I don’t know much about prefix bloom filters, so I won’t answer about that
  2. I filed a bug to improve the docs for SetBackgroundThreads: “Fix the docs for SetBackgroundThreads vs max_background_jobs” (facebook/rocksdb issue #11097)
  3. I am not sure about your usage of SetBackgroundThreads

You have:
auto num_threads = 32;
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::HIGH);
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::LOW);
db_options.max_background_jobs = num_threads;

I prefer:
auto num_threads = 32;
auto num_flushes = num_threads/4;
auto num_compactions = num_threads - num_flushes;
db_options.env->SetBackgroundThreads(num_compactions, rocksdb::Env::Priority::HIGH);
db_options.env->SetBackgroundThreads(num_flushes, rocksdb::Env::Priority::LOW);
db_options.max_background_jobs = num_threads;

Reply by the question author:
It was a non-manual compaction that started as soon as the db was opened. It seemed to be 1 thread.

Previously I saw in the logs that a few levels were holding data, and now only L6 is holding data. So it must have done the right thing.

To be clear, do you know what “safe” means in that context?

Thank you for the issue there. I should monitor it.

Reply by Matt:
Hi Joseph, were you able to figure this out? I’m seeing it on my end also. I have levels 0/1 set to 1GB, lots of smaller .sst files (about 70 of them, around 700MB each), a few 1GB .sst files, and one large 55GB .sst file, for 75 .sst files in total. It appears to be constantly trying to compact the 55GB file, which gets a little bigger each time (maybe about 500gb)… running Java RocksDB 7.7.3.

Reply by Hilik, Speedb’s co-founder and chief scientist:
Matt & Joseph: in leveled compaction, target_file_size_base and target_file_size_multiplier define the maximum size of a target file. We recommend keeping the defaults (64M and 1) so that the files created are small, which keeps the compaction steps small. There are other reasons to use small files (a smaller index that fits into the shard, etc…). Universal compaction compacts the entire level regardless, so if you use it there will be huge steps in the compaction. Using universal compaction to reduce the write amp only postpones the problem… As Mark wrote, the compaction may be stopped at any time and the consistency of the data is not impacted.
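For reference, a minimal sketch of where those two options are set; the values shown are the defaults Hilik recommends keeping:

// Sketch only: the file-size options mentioned above, at their defaults.
cf_options.target_file_size_base = 64 * 1024 * 1024; // 64MB target SST size for level 1
cf_options.target_file_size_multiplier = 1;          // keep the same target size on deeper levels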

Hi Joe and Matt,

The problem you face is a known issue with ‘universal compaction’ on a large-scale installation, and it may get worse as time goes by. Universal compaction is much better than leveled compaction in terms of write amplification, but it requires significantly more space and eventually needs to run a full compaction of the entire database.
Using leveled compaction on such a high scale is also not feasible due to the huge write amplification of random updates.
There is a lot of research on trying to achieve a better balance between write amplification and space amplification. Much of it is described in two surveys: (1) “LSM-based Storage Techniques: A Survey” and (2) “Constructing and analyzing the LSM compaction design space”. For example, the Dostoevsky and Spooky papers (SIGMOD 2018 & VLDB 2022) propose lazier and finer-grained compaction policies that better balance write and space amplification. However, there is no public code for these, and they might still suffer from various problems such as high tail latency.
At Speedb, on top of our open source offering, we have an enterprise version aimed specifically at scale and performance-at-scale use cases.
In the enterprise version, we have designed a hybrid compaction mechanism that achieves low write amplification and space amplification at the same time.

Reply by Matt:
Thanks Dan. I think what may be happening in my case is that somehow I ended up with a lot of fragmented .sst files: lots of small ones (around 60-70 files of roughly 50mb each) and one large 50gb file. Once compaction kicks in, it appears to consolidate small files with the very large file, and since there are lots of small files, the merge sort takes a really long time. If the small files could be consolidated first, it should dramatically reduce the time needed to compact the entire set. Not sure if this is what Joe experienced, but I’ve seen this at least 2x in our installation.

Reply by Hilik, Speedb’s co-founder and chief scientist:
Matt, large files are problematic in many respects (compaction, and huge indexes & filters). We would very much like to understand the reason for these files. Do you use universal compaction? Can you also look up the values of the two options that determine the file size (target_file_size_base, target_file_size_multiplier)? Do you ingest external SST files?
P.S. If you look for the table_file_creation event in the RocksDB LOG, it has a data_size field; this will let you find the job that created this huge file.
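As an alternative to grepping the LOG, the same table_file_creation information is exposed through RocksDB’s EventListener API, so unusually large files can be reported as they are created. A minimal sketch (the 1GB threshold is arbitrary):

// Sketch only: report large SST files at creation time (rocksdb/listener.h).
class LargeFileReporter : public rocksdb::EventListener {
 public:
  void OnTableFileCreated(const rocksdb::TableFileCreationInfo& info) override {
    if (info.file_size > (1ull << 30)) { // 1GB, arbitrary threshold
      fprintf(stderr, "job %d created large SST %s (%llu bytes)\n",
              info.job_id, info.file_path.c_str(),
              static_cast<unsigned long long>(info.file_size));
    }
  }
};
// Registered before opening the DB:
// db_options.listeners.push_back(std::make_shared<LargeFileReporter>());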

Reply by Mark C.:
“Using leveled compaction on such a high scale is also not feasible due to the huge write amplification of random updates.”

Not feasible? Should I turn off the web-scale deployments of MyRocks that are running with leveled compaction?

Having supported leveled compaction at scale, the write amp …

  • is larger than what you get from universal/tiered, but leveled does much better on space-amp
  • is much smaller than what you get from a world-class b-tree like InnoDB
  • isn’t as bad as the occasional claims I read in conference papers, alas too many of these claims are poorly documented

Tiered has a big problem – single-threaded compaction for a large SST is too slow. I look forward to the solutions that have arrived (in Speedb, in ScyllaDB, not sure where else). Otherwise, sharding is the workaround.

For one example of real numbers for write efficiency, see Table 1 in Section 5.1: the IO write rate with InnoDB was ~4X larger than with MyRocks.

Reply by Hilik, Speedb’s co-founder and chief scientist:
Mark, I think we agree… We would be happy to review with you the solution we have in Speedb that allows for TB-scale single-shard databases. Our hybrid compaction is adaptive, practically a combination of universal and leveled. It always uses very small compaction steps. It also tries to strike a balance between read and write amp (use as many levels as needed so the writes will flow, but not more).

Reply by Hilik, Speedb’s co-founder and chief scientist:
Joe, allow me to guess what happened. You did a 14TB load while disabling the compaction (prepare for bulk load?) and then reopened the database, and now all the L0 files need to be compacted together. Assuming you have enough disk space (you need approximately twice the size of the data for this), it will work OK (and take about a week). There are ways to make this process shorter if you are interested… If the data is now in read-only mode, then you can ignore the type of compaction…
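If that guess is right, the usual bulk-load pattern looks roughly like the sketch below; disable_auto_compactions (or Options::PrepareForBulkLoad()) is one common way to do the “disabling the compaction” part, and the SetOptions/CompactRange calls are one way to clean up afterwards:

// Sketch only: bulk load with auto compaction off, then clean up.
cf_options.disable_auto_compactions = true; // during the 14TB ingest
// ... open the DB and load the data ...

// After the load: re-enable auto compaction, or run one manual full compaction.
db->SetOptions({{"disable_auto_compactions", "false"}});
rocksdb::CompactRangeOptions cro;
cro.exclusive_manual_compaction = false; // let automatic compactions run alongside
db->CompactRange(cro, nullptr, nullptr); // compact the whole key range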

Reply by Mark C.:
Your hybrid solution sounds great. A nice side effect of publishing occasionally imperfect open source DBMS software is that smart people can come along and improve on it.