Hadoop Across Multiple Data Centers

Written by boomeer | Published 2023/03/27
Tech Story Tags: hadoop | architecture | big-data | fault-tolerance | opensource | data-structures | data-storage | programming

TL;DR: Hadoop Multi Data Center (HMDC) is a concept that makes a Hadoop cluster more fault-tolerant by distributing it across multiple data centers. It ensures high availability and efficient resource distribution among consumers. The main services provided by HMDC are HDFS and YARN. The architecture allows for easy switching between data centers, data replication through HDFS, and on-demand scalability.

Overview of Fundamental Technology

Apache Hadoop is a software and hardware stack for storing and processing large data arrays. Hadoop is designed for batch data processing using the computational power of more than one physical machine, as well as for storing datasets that do not fit on a single node.

You may need to use Hadoop if your needs meet the following criteria:

  • It is necessary to accept unstructured/semi-structured data (e.g., JSON) and then convert it into a normalized/structured form.

  • Long-term storage and processing of large data arrays are required, with volumes that cannot be accommodated in a relational database management system (more than 100GB).

  • Data processing is needed that cannot be performed within the scope of a service/set of services because it does not fit into the RAM or requires the storage of large intermediate results.

  • It is necessary to transfer large datasets (hundreds of millions of rows, hundreds of gigabytes) between services/clusters - using Hadoop as an integration bus.

  • Multi-stage data processing operations (ETL) are required that cannot be implemented within a single service, or that require the use of logic in languages prohibited for integration in services (Python).

  • Ad-hoc analytics on data stored in Hadoop.

Examples of tasks currently implemented on Hadoop:

  • Intermediate storage of data extracted from master systems for further use in DWH (ETL, integration bus).

  • Calculation of mathematical models and forecasts.

  • Storage of user behavior data in a tabular form.


Prerequisites for Hadoop Across Multiple Data Centers

As a rule, Hadoop lives in a single data center. But what if we need a service like Hadoop to be highly available, so that it can withstand the shutdown of one of the data centers? Or perhaps you simply do not have enough racks in one data center and need to grow both storage and compute capacity. We faced a choice: place consumers on separate Hadoop clusters in different data centers (DCs), or do something special.

We chose to do something special, and thus the concept we developed, tested, and implemented was born: Hadoop Multi Data Center (HMDC).

Hadoop Multi Data Center

As the name suggests, it lives in multiple data centers. This solution provides cluster resilience to the loss of capacity in any individual data center and efficiently distributes resources among consumers. HMDC provides two main services to consumers: the distributed file system HDFS and the resource manager YARN, which also acts as a task scheduler.

This technology allows for a much more fault-tolerant Hadoop cluster, although it has certain features, a brief overview of which is given below.

In each data center, there is one Master node and a set of worker Slave nodes. The Master nodes have the following components installed:

  • HDFS NameNode - responsible for HDFS metadata: where each data block is located and which file it belongs to.
  • YARN ResourceManager - our entry point for running tasks on YARN, responsible for distributing tasks among Slave nodes.

Both services operate in High-Availability mode based on the Active-Standby principle. That is, at any given moment, one Master node serves client requests. If it becomes unavailable, another Master node switches to “Active” mode and starts serving client requests. In this way, we can survive the outage of any single data center without losing functionality.
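
What does this look like from a client's point of view? Below is a minimal sketch of an HDFS client configuration for such a stretched nameservice; the nameservice name (hmdc) and master hostnames are illustrative, and in a real deployment these properties live in the hdfs-site.xml/core-site.xml distributed to clients rather than in code.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HmdcClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One logical nameservice ("hmdc") with a NameNode on the master node of each DC.
        conf.set("fs.defaultFS", "hdfs://hmdc");
        conf.set("dfs.nameservices", "hmdc");
        conf.set("dfs.ha.namenodes.hmdc", "nn-dc1,nn-dc2,nn-dc3");
        conf.set("dfs.namenode.rpc-address.hmdc.nn-dc1", "master-dc1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.hmdc.nn-dc2", "master-dc2.example.com:8020");
        conf.set("dfs.namenode.rpc-address.hmdc.nn-dc3", "master-dc3.example.com:8020");
        // The client probes the NameNodes until it finds the active one, so a
        // failover to another data center is transparent to the application.
        conf.set("dfs.client.failover.proxy.provider.hmdc",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://hmdc"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```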

Other advantages of this “stretched” architecture include:

  • Data providers only need to upload to a single cluster.

  • With a replication factor of 3, one copy of the data is stored in each data center (DC).

  • Easy switching between DCs when one of them goes down.

  • Adding new resources is practically “on-demand”.

  • Migrating clients from one DC to another is painless and takes only minutes.

  • YARN tasks of a specific queue (client) run only in one DC, and data is read only in that DC. As a result, we generate cross-DC traffic only during writing (see the queue configuration sketch after this list).
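
To make the last point more concrete, here is a rough sketch of how a consumer queue can be pinned to a single data center with YARN node labels. The queue name, label names, and numbers are hypothetical, and in a real cluster these keys are set in capacity-scheduler.xml and yarn-site.xml on the master nodes rather than built in code; the capacity/maximum-capacity pair is also what drives the Queue Elasticity mechanism described later.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the keys that bind a hypothetical "analytics" queue to the nodes of one DC.
public class QueuePerDcSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // yarn-site.xml: enable node labels; each NodeManager is labelled dc1, dc2, or dc3.
        conf.set("yarn.node-labels.enabled", "true");
        // capacity-scheduler.xml: the "analytics" queue only sees dc1 nodes.
        conf.set("yarn.scheduler.capacity.root.queues", "default,analytics");
        conf.set("yarn.scheduler.capacity.root.analytics.capacity", "30");          // guaranteed share
        conf.set("yarn.scheduler.capacity.root.analytics.maximum-capacity", "60");  // elastic ceiling
        conf.set("yarn.scheduler.capacity.root.analytics.accessible-node-labels", "dc1");
        conf.set("yarn.scheduler.capacity.root.analytics.accessible-node-labels.dc1.capacity", "100");
        conf.set("yarn.scheduler.capacity.root.analytics.default-node-label-expression", "dc1");
        return conf;
    }
}
```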


Distinctive features of HMDC

  • HDFS in HMDC uses HDFS Rack-awareness based on the principle “one data center is one rack” (see the mapping sketch after this list). The topology mapping that determines the nearest “rack” (data center) is specified in the core-site.xml file. HDFS is deployed in a High Availability configuration with QJM and, as a service, consists of the following components:
    • HDFS Namenode

    • HDFS Journalnode

    • HDFS ZKFC

    • HTTPFS

    • HDFS Datanode

  • YARN Capacity Scheduler in HMDC relies on Node Labels for resource allocation among node managers in different data centers. YARN is represented by the following components:
    • YARN Resource Manager

    • YARN Node Manager

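As promised above, here is a minimal sketch of the “one data center is one rack” mapping. The hostname patterns are hypothetical; a production setup would more commonly point net.topology.script.file.name at a small script, but the effect is the same: every node resolves to the “rack” of its data center.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

// Maps every host to the "rack" of its data center (hostname patterns are hypothetical).
public class DataCenterAsRackMapping implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<>(names.size());
        for (String host : names) {
            if (host.contains("dc1")) {
                racks.add("/dc1");
            } else if (host.contains("dc2")) {
                racks.add("/dc2");
            } else if (host.contains("dc3")) {
                racks.add("/dc3");
            } else {
                racks.add("/default-rack");   // fallback for nodes that match no data center
            }
        }
        return racks;
    }

    @Override
    public void reloadCachedMappings() { /* nothing cached in this sketch */ }

    @Override
    public void reloadCachedMappings(List<String> names) { /* nothing cached in this sketch */ }
}
```

Such a class would be referenced via net.topology.node.switch.mapping.impl in core-site.xml.
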
HDFS Datanodes (DN) and YARN Node Managers (NM) are configured on machines functionally grouped as slave nodes. These are the main working machines in the cluster: the sum of the available volumes of allocated hard drives (taking replication into account), RAM, and virtual cores on these machines makes up the total resource pool of the HMDC cluster.

Control services - HDFS Namenode, ZKFC, YARN ResourceManager, HiveServer2, and HiveMetastore - are placed on machines grouped as master nodes. In each data center, there is one master node. The configuration of each service is implemented in such a way that the three master nodes provide High Availability for the service as a whole.
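
For reference, here is a sketch of the keys behind this master-node HA layout, with illustrative hostnames; in practice they are set in hdfs-site.xml and core-site.xml on the master nodes. The JournalNodes, ZKFC, and Zookeeper components they refer to are described in the next section.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the HA wiring between the three master nodes (hostnames are illustrative).
public class MasterHaSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // QJM: JournalNodes on the three master nodes keep the shared edit log.
        conf.set("dfs.namenode.shared.edits.dir",
                "qjournal://master-dc1.example.com:8485;master-dc2.example.com:8485;"
                        + "master-dc3.example.com:8485/hmdc");
        // ZKFC-driven automatic failover, coordinated through the Zookeeper quorum.
        conf.set("dfs.ha.automatic-failover.enabled", "true");
        conf.set("ha.zookeeper.quorum",
                "master-dc1.example.com:2181,master-dc2.example.com:2181,master-dc3.example.com:2181");
        return conf;
    }
}
```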

Main Components of the System

Let's take a closer look at the main components of the system.

  • HDFS Datanode (DN)

    HDFS Datanode is a service responsible for storing data on cluster machines and a daemon that manages the data. On the machine, it is represented as a systemd service called hadoop-hdfs-datanode.service and runs as root, as it requires access to privileged ports for secure operation.

  • YARN Node Manager (NM)

    YARN NodeManager is a daemon that manages the operation of YARN application processes on a specific machine. When a YARN application is accepted by the resource manager (RM), it instructs the selected NM to allocate resources (memory and virtual cores) for a container in which a specific application process will run (for example, Spark Executor, Spark Driver, MapReduce Mapper, or Reducer).

  • HDFS NameNode (NN)

    HDFS NameNode is the main controlling service of HDFS. In the HMDC cluster, three NNs operate simultaneously, but only one can be active at any given time. The other two NNs are in Standby state during this time (they are sometimes called Standby NN or SbNN).

  • HDFS JournalNode (JN)

    HDFS JournalNode is a daemon responsible for synchronizing HDFS Namespace changes between NameNodes. On each master node, a systemd service called hadoop-hdfs-journalnode is running, and together they maintain the distributed QJM journal. The active NN reports changes in the HDFS Namespace through edit log entries (edits), which carry information about completed transactions in HDFS.

  • HDFS ZKFC

    To ensure automatic fault detection of HDFS High Availability (HA) components and automatic failover, a Zookeeper Quorum and ZKFailoverController (ZKFC) are running on HMDC master nodes. ZKFC is a process represented by the systemd service hadoop-hdfs-zkfc.service, acting as a Zookeeper client and providing:

    • Monitoring the state of NN through health check probes.

    • Managing sessions in Zookeeper: ZKFC keeps an active session in Zookeeper as long as the NN is healthy. If the NN is also the active one, ZKFC additionally holds a special lock znode.

    • If the NN is healthy and no other NN currently holds the lock znode, ZKFC will try to acquire it and win the election for a new active NN.

  • Zookeeper In HMDC

    Zookeeper servers are also deployed on the master nodes, providing a means for distributed coordination of HA services. Zookeeper has two main functions:

    • Detecting faults in NN operation.
    • Ensuring re-election of the active NN.

  • YARN ResourceManager (RM)

    YARN ResourceManager is a service responsible for managing resources in the YARN cluster. It operates in an HA configuration and is represented as a systemd service called hadoop-yarn-resourcemanager.service on the master nodes. It provides a web interface on port 8088 (Standby RM always redirects to the active RM's UI). To fully utilize resources and prevent downtime, RM implements two mechanisms:

    • Queue Elasticity: Each queue is assigned parameters that determine guaranteed and maximum possible resources. If resources in the queue are underutilized, the application can get more resources than guaranteed, but not exceeding the values set by the second parameter.

    • Container Preemption: To ensure applications always receive their guaranteed resources, a mechanism is in place to reclaim resources from already running applications that are using more than their guaranteed capacities. Preempted containers are terminated with the corresponding exit code and status in the UI.

    When a client authenticates to the cluster and submits an application, RM checks whether the required resources are available, accepts the application into the queue, and allocates memory and virtual cores to it.

  • Hive High Availability

    HA for Hive is achieved by running Hive Metastore servers and HiveServer2 instances in each of the data centers. HiveServer2 HA relies on Zookeeper: you connect to HiveServer2 using JDBC, and when serviceDiscoveryMode=zooKeeper, the connection is established to a random available HiveServer2 instance.
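
    A minimal JDBC sketch of such a connection is shown below; the Zookeeper hostnames, namespace, and user are illustrative, and the hive-jdbc driver must be on the classpath. If the HiveServer2 instance you landed on disappears with its data center, reconnecting simply picks another registered instance.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHaConnectSketch {
    public static void main(String[] args) throws Exception {
        // Connect through the Zookeeper quorum; the driver picks a registered HiveServer2.
        String url = "jdbc:hive2://master-dc1.example.com:2181,master-dc2.example.com:2181,"
                + "master-dc3.example.com:2181/default;"
                + "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";
        try (Connection conn = DriverManager.getConnection(url, "analytics_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```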

Benefits of HMDC

As I said at the very beginning, we were faced with a choice: create independent clusters, with each consumer located in its own data center (in our case, there are three), or implement this concept, HMDC. It is worth noting that even with consumers separated by data center, many of them need the same data, so the data itself cannot simply be divided.

Let's compare these two paradigms:

|                           | Independent DCs        | HMDC                     |
|---------------------------|------------------------|--------------------------|
| Data management           | ×3, mapping            | As a single cluster      |
| Error with data           | Possible loss          | Possible loss            |
| Data volume               | ×3 - ×9                | ×3                       |
| DC loss migration         | Several weeks          | Hours, config editing    |
| Inter-DC data replication | Custom service         | HDFS functionality       |
| Support                   | 3 different domains    | Unified cluster          |
| Flexibility in uploading  | Inequality of clusters | Block Placement Policies |


What Does HMDC Give Us as a whole?

  • In case of a DC outage, you can migrate users with just a configuration change.
  • This concept allows you to manage data under the paradigm of “it's a single cluster”.
  • In case of an error during data processing, data may still be lost, since replication cannot protect against it.
  • The volume of stored data is spread across three data centers, with one copy in each data center. Yes, this creates some drawbacks: by lowering replication to two, you save space but risk increasing network load, because data then has to be retrieved from other data centers (see the sketch after this list).
  • Data replication between data centers is carried out through HDFS functionality.
  • If necessary, and with proper control, you can manage stored data by isolating it in one of the DCs thanks to the HDFS block placement policy.
  • You can add servers and data centers to your installation virtually on-demand when scalability is needed.
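
As an illustration of the replication trade-off mentioned in this list, a client can lower the replication factor for an individual path through the standard FileSystem API; the path and nameservice below are hypothetical, and the cluster-wide default stays in hdfs-site.xml.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up hdfs-site.xml from the classpath
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://hmdc"), conf)) {
            // Two replicas instead of three saves space, but reads may then have to
            // cross data centers, which increases inter-DC traffic.
            boolean applied = fs.setReplication(new Path("/data/raw/events"), (short) 2);
            System.out.println("Replication changed: " + applied);
        }
    }
}
```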


Written by boomeer | Technical Project Manager
Published by HackerNoon on 2023/03/27