I am an SSD designer and would like to use the blktrace tool to analyze block I/O patterns and behavior when running RocksDB.
I know that RocksDB users normally issue small reads and large writes, and that compaction probably produces large file reads and writes.
Is there any way to tell whether an I/O comes from a user operation or from compaction?
When I run db_bench, blktrace only shows the process.
I cannot tell which blocks come from compaction.
Hi,
Let me start with a brief description of the i/o pattern.
- User Get: usually needs to read a single “block” of data. This is a RocksDB block, which may not be aligned, so it normally results in reads of 2 blocks.
- User Iterator: reads from a few sequential streams, one block at a time (prefetch is optional).
- Flush: writes a sequential stream.
- Compaction: reads from a few sequential streams, usually two, and writes a sequential stream.
- WAL: most RocksDB users use the WAL, so there is a small sequential write (usually with sync) for every write performed by the user.
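To illustrate why an unaligned Get usually costs two block reads, here is a minimal sketch. The function name `aligned_blocks_touched` and the 4096-byte sizes are illustrative assumptions, not RocksDB code; on-disk RocksDB blocks are often compressed, so real sizes vary.

```python
def aligned_blocks_touched(offset: int, size: int, device_block: int = 4096) -> int:
    """Count the aligned device blocks covered by the byte range [offset, offset + size)."""
    if size <= 0:
        return 0
    first = offset // device_block               # index of the first aligned block touched
    last = (offset + size - 1) // device_block   # index of the last aligned block touched
    return last - first + 1


# A 4 KiB RocksDB block that starts 1000 bytes into a file straddles two aligned blocks.
print(aligned_blocks_touched(offset=1000, size=4096))  # prints 2
# The same block starting on an aligned boundary touches only one.
print(aligned_blocks_touched(offset=8192, size=4096))  # prints 1
```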
Unless you set the “direct i/o” option, all of these requests are sent to the file system, which has its own logic for fetching and staging data.
A dedicated SSD can help by supporting multiple parallel sequential streams and by providing a dedicated persistent cache for the WAL. Response time is critical for user reads and WAL writes; bandwidth is critical for the other reads and for the sequential writes.
from Mark C.:
Some notes here → Small Datum: RocksDB internals: prefetch and/or readahead
I have little experience with blktrace. This might be easier when RocksDB is configured to use O_DIRECT for both user reads and compaction. With O_DIRECT for compaction, the option writable_file_max_buffer_size determines the size of the writes done by RocksDB from userland, and the option compaction_readahead_size determines the size of the reads done by compaction from userland. So compaction reads and writes should be much larger than the user reads done for queries (which have a size <= the block_size option).
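As a rough sketch of that size-based idea, the script below buckets requests from blkparse text output into small and large reads and writes. It assumes the default blkparse line layout (device, CPU, sequence, timestamp, PID, action, RWBS, then `sector + count [process]`) with sizes in 512-byte sectors. The names `parse_line`, `classify`, and `summarize`, the 64 KiB threshold, and the sample lines are all hypothetical. The labels are only hints: the block layer can split or merge requests, and without O_DIRECT the sizes mostly reflect the file system's readahead and writeback rather than RocksDB's own requests.

```python
from collections import Counter

SECTOR_BYTES = 512  # blktrace reports request sizes in 512-byte sectors


def parse_line(line):
    """Return (action, rwbs, size_bytes, process), or None for lines in another shape."""
    f = line.split()
    # Expected fields: dev cpu seq time pid action rwbs sector "+" count [process]
    if len(f) < 10 or f[8] != "+":
        return None
    try:
        sectors = int(f[9])
    except ValueError:
        return None
    process = f[10].strip("[]") if len(f) > 10 else ""
    return f[5], f[6], sectors * SECTOR_BYTES, process


def classify(rwbs, size_bytes, threshold=64 * 1024):
    """Label a request by direction and by size relative to an assumed threshold."""
    if "W" in rwbs:
        direction = "write"
    elif "R" in rwbs:
        direction = "read"
    else:
        return "other"
    scale = "large" if size_bytes >= threshold else "small"
    return f"{scale}_{direction}"


def summarize(lines, action="D", threshold=64 * 1024):
    """Count request classes for one trace action (D = issued to the device driver)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[0] != action:
            continue
        counts[classify(parsed[1], parsed[2], threshold)] += 1
    return counts


# Hypothetical sample lines, made up for illustration only.
SAMPLE = """\
  8,0    1        1     0.000000000  4242  D   R 1000 + 8 [db_bench]
  8,0    1        2     0.000100000  4242  D   R 2048 + 512 [db_bench]
  8,0    2        3     0.000200000  4242  D  WS 9000 + 2048 [db_bench]
  8,0    2        4     0.000300000  4242  C   R 1000 + 8 [0]
"""

if __name__ == "__main__":
    for label, count in sorted(summarize(SAMPLE.splitlines()).items()):
        print(label, count)
```

The threshold would need tuning against the configured block_size, compaction_readahead_size, and writable_file_max_buffer_size values, since those determine where the gap between user reads and compaction I/O falls.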