How AWS Data Pipeline configurations differ from region to region

Written by lakindu | Published 2019/08/09

TLDR: AWS Data Pipeline is used to move data between different storage services. This article shows how to deploy a pipeline that retrieves data from a SQL server and stores it in AWS DynamoDB tables via S3, and how the settings change from region to region: the compute resources must be configured for the target region of the DynamoDB table during the HiveActivity, and while cross-region S3 access works without a region being specified, DynamoDB and the EMR cluster need the region set explicitly.

AWS Data Pipeline is used to move data between different storage services. In this article, we discuss how to deploy data pipelines in different regions and how the settings differ from region to region.
The use case of the data pipeline in this article is to retrieve data from a SQL server and store it in AWS DynamoDB tables. There is no direct path to achieve this; it takes two steps:
1. SQL to S3 bucket — CopyActivity
2. S3 bucket to DynamoDB — HiveActivity
First, the data is copied from the SQL server to an S3 bucket in CSV format; this is the CopyActivity. Second, the data is sent from the S3 bucket to the DynamoDB tables; this is the HiveActivity.
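A pipeline definition for these two steps could be sketched as the JSON fragment below. The object IDs, table names, bucket path, and Hive script are illustrative placeholders (not from the original article), and the referenced database and resource objects are assumed to be defined elsewhere in the same pipeline:

```json
{
  "objects": [
    {
      "id": "SqlInput",
      "type": "SqlDataNode",
      "table": "source_table",
      "selectQuery": "select * from source_table",
      "database": { "ref": "SourceDatabase" }
    },
    {
      "id": "S3Staging",
      "type": "S3DataNode",
      "directoryPath": "s3://example-staging-bucket/export/"
    },
    {
      "id": "SqlToS3",
      "type": "CopyActivity",
      "input": { "ref": "SqlInput" },
      "output": { "ref": "S3Staging" },
      "runsOn": { "ref": "Ec2Instance" }
    },
    {
      "id": "DynamoTarget",
      "type": "DynamoDBDataNode",
      "tableName": "target_table"
    },
    {
      "id": "S3ToDynamo",
      "type": "HiveActivity",
      "input": { "ref": "S3Staging" },
      "output": { "ref": "DynamoTarget" },
      "runsOn": { "ref": "EmrClusterResource" },
      "hiveScript": "INSERT OVERWRITE TABLE ${output1} SELECT * FROM ${input1};"
    }
  ]
}
```

The CopyActivity runs on an EC2 instance, while the HiveActivity runs on an EMR cluster; that split is why the two activities need separate resource objects.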
When creating the two activities, we need to specify a compute resource for each one, and these resources should be configured for the target region of the DynamoDB table during the HiveActivity. Here's the overall architecture.
Let’s see how the cross-region works with each scenario below:
Scenario 1: Deploy S3, DynamoDB & the pipeline in the same region (Ireland)
After deployment, the pipeline worked well without any region being specified on the DynamoDB tables or the EMR cluster. Since the pipeline runs in the same region as S3 and DynamoDB, the default values are used, and the pipeline works smoothly.
Scenario 2: Deploy S3 & DynamoDB in N. Virginia and the pipeline in Ireland
When a cross-region data transfer occurs, S3 is not a problem even if no region is specified on the EC2 instance, because S3 bucket names are globally unique. For DynamoDB, however, the region must be specified, and the EMR cluster's region must be specified as well.
As for the pipeline's EMR resource settings, m1.medium is defined as the core instance type and the EMR release label is set to 4.4.0, with the region specified as N. Virginia. If the region is not specified for the EMR cluster, the pipeline runs successfully, but no data arrives in the DynamoDB table in the N. Virginia region.
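In pipeline-definition terms, that means setting the region field on both the EMR cluster resource and the DynamoDB data node to the target region (us-east-1 for N. Virginia). A sketch, with placeholder IDs and table name:

```json
{
  "id": "EmrClusterResource",
  "type": "EmrCluster",
  "coreInstanceType": "m1.medium",
  "releaseLabel": "emr-4.4.0",
  "region": "us-east-1"
},
{
  "id": "DynamoTarget",
  "type": "DynamoDBDataNode",
  "tableName": "target_table",
  "region": "us-east-1"
}
```

Without the region field, the cluster and data node default to the pipeline's own region (Ireland), which is why the run succeeds but the target table stays empty.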
Scenario 3: Deploy S3 & DynamoDB in Frankfurt and the pipeline in Ireland
After deploying S3 & DynamoDB to the Frankfurt region, a few extra things need to be considered: Frankfurt launched after 2014, and regions launched since then do not support some older configurations. The following need attention when moving to a newer region:
1. Instance type for EMR
2. AMI version
3. Reading logs & data from S3
As for the instance type, previous-generation instances are not supported in this region, so the instance type and versions need to be updated. From the EMR-supported instance types in the region, m4.large is selected in this article, and the EMR release needs to be 5.13.0 or later.
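With those updates, a Frankfurt-compatible EMR resource in the pipeline definition could look like the following sketch (the ID is a placeholder, and the instance type and release label follow the values in the article):

```json
{
  "id": "EmrClusterResource",
  "type": "EmrCluster",
  "masterInstanceType": "m4.large",
  "coreInstanceType": "m4.large",
  "releaseLabel": "emr-5.13.0",
  "region": "eu-central-1"
}
```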
Reading the logs from the data pipeline will not work, due to an AWS Signature Version 4 error (AWS4-HMAC-SHA256). To avoid the issue, create another S3 bucket in a different region, such as Ireland, and direct the logs there; this simply solves the problem.
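One way to express that workaround is to point the pipeline's log location at a bucket in the other region via pipelineLogUri on the Default object (the bucket name and roles below are placeholders):

```json
{
  "id": "Default",
  "name": "Default",
  "pipelineLogUri": "s3://example-logs-bucket-ireland/logs/",
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole"
}
```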
Furthermore, if the region is not defined on the EMR cluster, the HiveActivity fails with the same AWS4-HMAC-SHA256 error. If you run into any errors, please drop a comment. Thank you for reading.

Written by lakindu | Software Engineer
Published by HackerNoon on 2019/08/09