Deep Dive into the Implementation of Metadata Storage

This article introduces the design and implementation of metadata storage in Alluxio Master, either on heap and off heap (based on RocksDB).

Background

Alluxio is the world’s first open source data orchestration platform for analytics and AI for the cloud. Alluxio bridges the gap between data driven applications and storage systems. It brings data from the storage tier closer to the data driven applications and makes it easily accessible enabling applications to connect to numerous storage systems through a common interface. Alluxio’s architecture enables data access at speeds orders of magnitude faster than existing solutions.

As a distributed system, Alluxio employs a Master-Worker architecture. An Alluxio cluster consists of one or more Master nodes and several Worker nodes. The Alluxio Master serves all user requests and journals file system metadata changes. Alluxio workers are responsible for managing user-configurable local resources allocated to Alluxio (e.g. memory, SSDs, HDDs). Alluxio workers store data as blocks and serve client requests that read or write data by reading or creating new blocks within their local resources.

This article introduces how Alluxio implements metadata storage in Alluxio Master.

Two Ways of Storing Metadata in Alluxio

Inside Alluxio master, each file or a directory is represented by a data structure called inode which contains attributes like file permissions, and other metadata like block locations. Alluxio master stores the metadata of the entire file system as an inode tree, similar to HDFS and other UNIX-based file systems.

Alluxio provides two ways to store the metadata:

ROCKS: an on-disk, RocksDB-based metastore
HEAP: an on-heap metastore

The default is the ROCKS.

The Alluxio source code provides two interfaces, InodeStore and BlockStore, in the /core/server/master/src/main/java/alluxio/master/metastore directory. The metadata managed by InodeStore includes individual inodes and the parent-child relationship between different inodes. BlockStore is responsible for managing the block size and block location of file data blocks. The cachingInodeStore implements an on java heap configurable cache in front of RocksDB to improve performance. In HEAP, these interfaces are implemented through HeapInodeStore and HeapBlockStore. The overall dependency relationships are as follows.

During the AlluxioMasterProcess, two Factories are generated to corresponding stores, and different InodeStore and BlockStore are generated in the Factories based on the configurations.

ROCKS Metastore

RocksDB is an embedded Key-Value database. Users can achieve efficient KV data storage and access by calling the API interface.

The BlockStore interface is implemented by RocksBlockStore, with the process as follows.

Factory is added to the mContext in AlluxioMasterProcess.
MasterUtils.createMasters() will create all the Master threads in order. When creating DefaultBlockMaster, it will call BlockStore.Factory to create a RocksBlockStore instance.
RocksDB.loadLibrary() will be called when initiating RocksBlockStore to load the dependent libraries and then create an instance of RocksStore class according to the configuration file. RocksStore is used to operate the RocksDB database, including initializing the database, backing up and restoring the database, etc.
When creating RocksStore, an instance of RocksDB class is created by calling RocksDB.open() method in the create() method.
DefaultBlockMaster reads and writes to the RocksDB database by using this RocksDB class.

RocksBlockStore uses the following main methods:

getBlock(long id)
putBlock(long id, BlockMeta meta)
removeBlock(long id)
getLocations(long id)
addLocation(long id, BlockLocation location)
removeLocation(long blockId, long workerId)

Let’s take getBlock() as an example. getBlock() is used to get the metadata of the corresponding block by blockId as follows.

 @Override
  public Optional<BlockMeta> getBlock(long id) {
    byte[] meta;
    try {
  meta = db().get(mBlockMetaColumn.get(), Longs.toByteArray(id));
    } catch (RocksDBException e) {
      throw new RuntimeException(e);
    }
    if (meta == null) {
      return Optional.empty();
 }
    try {
      return Optional.of(BlockMeta.parseFrom(meta));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

Below is to interact with RocksDB:

meta = db().get(mBlockMetaColumn.get(), Longs.toByteArray(id));

The db() method in the above code will return the previously created RocksDB class and then call the RocksDB.get() method. get() method needs two parameters, the first one is the ColumnFamilyHandle and the second is blockId.

Each K-V pair in RocksDB corresponds to a ColumnFamily, and ColumnFamily is equivalent to the logical partition in RocksDB. When we need to query the data in a ColumnFamily, we need to operate the underlying database through ColumnFamilyHandle, which is created while creating RocksDB instance.

The K-V data stored in RocksDB are stored as byte strings, so we need to convert the blockId to byte[], and convert the value returned from RocksDB back into a BlockMeta Java object using Google Protocol buffers.

The InodeStore interface in ROCKS method is implemented by CachingInodeStore and RocksInodeStore. The CachingInodeStore uses memory to store the metadata cache, while the RocksInodeStore is a metastore implemented by RocksDB as the backing store of CachingInodeStore.

When the metadata of the Alluxio cluster can be completely stored in the CachingInodeStore, Alluxio does not interact with the RocksInodeStore, but uses CachingInodeStore to get better performance instead. When the storage capacity of the CachingInodeStore reaches a threshold, Alluxio automatically migrates the metadata from the CachingInodeStore to the RocksInodeStore. At this point, the performance of metadata access depends on the cache hit rate of CachingInodeStore and the performance of RocksDB.

The process of creating and using RocksInodeStore is similar to RocksBlockStore. If MASTER_METASTORE_INODE_CACHE_MAX_SIZE is set to 0, then it uses RocksInodeStore. if it is not 0, then it requires creating both CachingInodeStore and RocksInodeStore.

case ROCKS:
InstancedConfiguration conf = ServerConfiguration.global();
  if (conf.getInt(PropertyKey.MASTER_METASTORE_INODE_CACHE_MAX_SIZE) == 0) {
    return lockManager -> new RocksInodeStore(baseDir);
  } else {
    return lockManager -> new CachingInodeStore(new RocksInodeStore(baseDir), lockManager);
  }

Heap Metastore

Alluxio uses heap memory as storage in the Heap method. When creating the AlluxioMasterProcess, the HeapInodeStore and HeapBlockStore are built to implement the InodeStore and BlockStore interfaces.

When creating the HeapInodeStore, a ConcurrentHashMap called mInodes is created to store the Inode information of files and folders. At the same time, a TwoKeyConcurrentMap called mEdges will be created to store the parent-child relationship of different nodes.

private final Map<Long, MutableInode<?>> mInodes = new ConcurrentHashMap<>();
  // Map from inode id to ids of children of that inode. The inner maps are ordered by child name.
  private final TwoKeyConcurrentMap<Long, String, Long, Map<String, Long>> mEdges =
      new TwoKeyConcurrentMap<>(() -> new ConcurrentHashMap<>(4));

TwoKeyConcurrentMap is a class defined in Alluxio, which implements a ConcurrentMap supporting two keys with the following logical structure: <k1, <k2, value>>.

Each key in the mEdges data structure maps an a Inode ID to a map of its children, which then maps each child’s name to its Inode ID.

In HeapBlockStore, a ConcurrentHashMap called mBlocks is created to store the metadata of each block, and a TwoKeyConcurrentMap called mBlockLocations is created to store the location of the block in the worker.

  // Map from block id to block metadata.
  public final Map<Long, BlockMeta> mBlocks = new ConcurrentHashMap<>();
  // Map from block id to block locations.
  public final TwoKeyConcurrentMap<Long, Long, BlockLocation, Map<Long, BlockLocation>>
      mBlockLocations = new TwoKeyConcurrentMap<>(() -> new HashMap<>(4));

The two keys in mBlockLocations are blockId and workerId, and the value is the specific locations on the worker where the block is stored. The blockId can be used to get the location of the block stored in the worker.

Summary

Alluxio provides two ways to store metadata, ROCKS and HEAP, and the default storage is the ROCKS. For ROCKS, in addition to using RocksDB, Alluxio provides a memory-based cache to improve metadata read and write performance. As a result, we can get high performance when the size of the metastore is limited. With RocksDB, we can store metadata to the hard disk to get more storage space.

In general, we should use the RocksDB to store metadata in Alluxio Master. If you only need to store a small amount of metadata which requires very high metadata read and write performance, you can also consider using the HEAP method.

About the Author

Changsheng Gu, Big Data Engineer at China Mobile.

Gu works in China Mobile Cloud Center, developing cloud-native data lake with a focus on HDFS and Alluxio.