How To Copy Terabytes of Data Between AWS S3 Buckets

Written by hariohmprasath | Published 2020/07/07
Tech Story Tags: aws | aws-emr | elastic-map-reduce | aws-s3 | bigdata | data-migration | amazon | s3

TLDR We were trying to back up an S3 bucket holding 1,000,344 objects (~130 GB) and noticed that the copy took hours. The only workaround we found for the regular CLI was to run the aws commands in parallel in multiple terminals so that each one operates on a different S3 partition at the same time. When I created an S3 Batch Operations job to copy a bucket with KMS key encryption enabled, I got the following error: "Unsupported encryption type used: SSE_KMS".

Problem statement:
As part of our regular production upgrade, we were trying to back up the data in an S3 bucket with:
Item Count: 1,000,344 and Size: ~130 GB
We initiated the backup using the regular S3 commands:
aws s3 cp --recursive s3://<source-bucket> s3://<target-bucket>
aws s3 sync s3://<source-bucket> s3://<target-bucket>
During execution, we noticed that the copy took hours, and we found no option that made a single command meaningfully faster. The only workaround we found was to run these aws commands in parallel in multiple terminals, each operating on a different S3 partition (key prefix), so the copy finishes sooner. This is neither an elegant nor a scalable solution.
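The multi-terminal workaround can at least be scripted. The following is a minimal sketch, not the commands we actually ran: the bucket names, the prefix list, and the DRY_RUN switch are hypothetical placeholders, and it assumes the data is split across a handful of known key prefixes.

```shell
# Sketch: one "aws s3 sync" per key prefix, each started as a background job.
# SRC, DST and DRY_RUN are hypothetical names used only for this illustration.
SRC="source-bucket"
DST="target-bucket"

sync_prefix() {
  prefix="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run: print the command instead of executing it.
    echo "aws s3 sync s3://${SRC}/${prefix} s3://${DST}/${prefix}"
  else
    aws s3 sync "s3://${SRC}/${prefix}" "s3://${DST}/${prefix}"
  fi
}

sync_all() {
  for p in "$@"; do
    sync_prefix "$p" &   # one background job per prefix
  done
  wait                   # block until every background job has finished
}
```

For example, after setting DRY_RUN=1, calling sync_all with two prefixes prints the two sync commands without touching S3. This still runs on a single machine, so it shares the scaling limits described above.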
Other options:
We tried a couple of other options mentioned on Stack Overflow and the AWS forums:
S3 Batch operations
S3 Batch Operations seemed to solve this problem, but at this point it doesn't support objects encrypted with a KMS key. When I created a job to copy the contents of a bucket with KMS key encryption enabled, I got the following error:
Unsupported encryption type used: SSE_KMS
When I read more about this, the AWS docs stated under the “Specifying a Manifest” section: “Manifests that use server-side encryption with customer-provided keys (SSE-C) and server-side encryption with AWS KMS managed keys (SSE-KMS) are not supported.”
s3-dist-cp
s3-dist-cp seemed promising, but when I ran it against a bucket that had close to 6 TB of data, the job failed while running the “reduce” task after 40 minutes, without any clear indication of why it failed.

Custom approach:

Unfortunately, none of the approaches mentioned above solved our problem, so we came up with our own. It can be further optimized, so think of it as a first step toward solving this problem.
It's a two-step process that combines a shell command with Spark code. First, we generate the record file (with object keys); then we run Spark code that copies the files in parallel across nodes in multiple tasks.
Generating the record file:
We need to generate a text file containing the object keys of the items inside the source S3 bucket (the items to be copied). This is done by running the following command on any EC2 instance:
aws s3 ls s3://test_bucket --recursive | awk '{print $4}' > /tmp/output.txt
Output: (just object keys one in each line)
data/solution=33/test1.mov
data/solution=33/test2.mov, etc
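One caveat: awk '{print $4}' keeps only the fourth whitespace-separated field, so a key that contains spaces would be truncated. The sketch below is one possible alternative, assuming the usual "date time size key" layout of the aws s3 ls --recursive output; the function name extract_keys is made up for this example.

```shell
# Sketch: strip the date, time and size columns and keep the rest of the
# line as the key, so that spaces inside a key survive.
# extract_keys is a hypothetical helper name.
extract_keys() {
  sed -E 's/^[^ ]+ +[^ ]+ +[0-9]+ //'
}
```

It would be used as aws s3 ls s3://test_bucket --recursive | extract_keys > /tmp/output.txt. The spark-submit commands in this article read the record file from s3://test_bucket/output.txt, so we assume the file is uploaded there afterwards, for example with aws s3 cp /tmp/output.txt s3://test_bucket/output.txt.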
Spark code:
// Needs: import org.apache.spark.api.java.function.ForeachPartitionFunction;
// "sql" is the Spark session; "source", "target" and "file" come from the program arguments.
sql.read()
 .textFile(file)       // one object key per line
 .repartition(2000)    // spread the keys across many tasks
 .foreachPartition((ForeachPartitionFunction<String>) iterator -> {
       while (iterator.hasNext()) {
         String key = iterator.next();
         // Pass the arguments as a list so that keys containing spaces are not split
         new ProcessBuilder("aws", "s3", "cp",
             String.format("s3://%s/%s", source, key),
             String.format("s3://%s/%s", target, key))
           .inheritIO().start().waitFor();
       }
 });
Spark Submit:
spark-submit --class com.s3.S3Copy s3://test_bucket/copier.jar test_bucket back_up_bucket s3://test_bucket/output.txt
args[0] → Source bucket
args[1] → Target bucket
args[2] → S3 record file generated in the previous step
This code reads the “output.txt” file, splits it into multiple partitions, and processes them in parallel across multiple nodes.
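To make the fan-out idea concrete without a cluster, here is a single-machine analogue that swaps Spark for xargs -P. It is only an illustration of the "split the key list and copy in parallel" pattern, not the job we ran; the function name copy_keys and the COPY_CMD override are assumptions introduced for this sketch.

```shell
# Sketch: read keys from a record file and run one "aws s3 cp" per key,
# with a fixed number of parallel workers on a single machine.
# copy_keys and COPY_CMD are hypothetical names for this illustration.
copy_keys() {
  # $1 = record file, $2 = source bucket, $3 = target bucket, $4 = workers
  # NUL-delimit the keys so that spaces inside a key are preserved.
  tr '\n' '\0' < "$1" |
    xargs -0 -P "$4" -I {} ${COPY_CMD:-aws} s3 cp "s3://$2/{}" "s3://$3/{}"
}
```

Setting COPY_CMD=echo turns this into a dry run that prints the arguments of each copy. Unlike the Spark job, all workers share one machine's network and CPU, which is the limit the EMR cluster removes.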
Performance Test
With 15 EMR core nodes, each of the m4.xlarge instance type, we were able to copy 5.5 TB of data in less than 40 minutes. Since we pay for EMR only for the time we use it, this approach is cost-effective (further cost reduction is possible by going with Spot instances or an EC2 fleet configuration) and much more scalable than the previous approach.
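As a rough sanity check on those numbers, the aggregate and per-node throughput can be worked out as follows. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 1,000,000 MB) and treats 40 minutes as the full duration, so the result is a lower bound; the function name is hypothetical.

```shell
# Sketch: average throughput implied by "5.5 TB in 40 minutes on 15 nodes".
# throughput_mb_per_s is a hypothetical helper name.
throughput_mb_per_s() {
  # $1 = terabytes copied, $2 = minutes taken, $3 = number of core nodes
  awk -v tb="$1" -v min="$2" -v nodes="$3" 'BEGIN {
    total = tb * 1000 * 1000 / (min * 60)   # MB per second across the cluster
    printf "%.0f MB/s total, %.0f MB/s per node\n", total, total / nodes
  }'
}

throughput_mb_per_s 5.5 40 15   # prints: 2292 MB/s total, 153 MB/s per node
```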
Spark submit:
spark-submit \
 --conf spark.network.timeout=420000s \
 --conf spark.executor.heartbeatInterval=410000s \
 --conf spark.yarn.scheduler.mode=FAIR \
 --conf spark.shuffle.service.enabled=true \
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
 --conf spark.executor.memoryOverhead=1024 \
 --conf spark.driver.memoryOverhead=1024 \
 --conf spark.executor.instances=74 \
 --conf spark.executor.cores=6 \
 --conf spark.driver.cores=6 \
 --conf spark.driver.memory=10g \
 --conf spark.executor.memory=10g \
 --conf spark.default.parallelism=888 \
 --deploy-mode cluster \
 --master yarn \
 --conf spark.sql.broadcastTimeout=360000 \
 --class com.s3.S3Copy s3://dmp-dms-k8s-dev-fico-pto-tenant/copier.jar test_bucket back_up_bucket s3://test_bucket/output.txt

Written by hariohmprasath | https://www.linkedin.com/in/hariohmprasath/
Published by HackerNoon on 2020/07/07