When AWS launched its Simple Storage Service (S3) in 2006, clients gained a very simple, powerful, and reliable way to store a virtually unlimited amount of data. Having designed many S3 ecosystems, I really appreciate the portability of data. With S3, it’s easy to set up replication between buckets. But caveat emptor: pre-existing data is not automatically included in the replication process. There are a number of ways to solve this.
For small and medium-sized data sets, this is typically solved by using the CLI to run an S3 sync, cranking the number of concurrent operations as high as your machine can handle. If there are very large files, splitting them into digestible chunks that can be sent in parallel helps as well.
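The same chunked, parallel idea can be sketched with boto3 rather than the CLI; a TransferConfig controls the part size and the number of worker threads. The bucket names, key, 64 MB part size, and 32-thread concurrency below are all illustrative choices of mine, not settings from the original workflow:

```python
import math

# Pure helper: how many parts a multipart transfer of a given size produces.
def part_count(total_bytes: int, chunk_bytes: int) -> int:
    return max(1, math.ceil(total_bytes / chunk_bytes))

def copy_large_object(src_bucket: str, dst_bucket: str, key: str) -> None:
    # boto3 is imported here so the helper above works without the SDK installed.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # go multipart for objects > 64 MB
        multipart_chunksize=64 * 1024 * 1024,  # split into 64 MB parts...
        max_concurrency=32,                    # ...sent by 32 worker threads
    )
    boto3.client("s3").copy(
        {"Bucket": src_bucket, "Key": key},  # copy source
        dst_bucket,
        key,
        Config=config,
    )
```

With these settings, a 1 GB object moves as `part_count(1024**3, 64 * 1024 * 1024)` = 16 parts in flight at once.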
For larger data sets this approach isn’t tenable, and you need to offload the work to a server, or a set of servers, that can take on the load. When I’m advising clients, I often recommend a look at s3distcp, a solid solution that uses EMR to transfer the files. The cost and complexity of running an EMR cluster can be prohibitive, but these obstacles can be overcome, and it’s often worth it to gain access to a proven solution for moving large amounts of data between buckets.
Fortunately, users need not treat these as an either/or choice and can employ an all-of-the-above strategy. AWS has added S3 Batch Operations to the S3 offering. Not to be confused with AWS Batch, it was created to transfer or transform large quantities of data in S3. It’s also relatively straightforward to set up:
- Create a service IAM role for the job.
- Create a manifest of source files.
- Create a copy object job in the destination region.
- Configure a job report (optional).
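The role in the first step needs a trust policy that lets S3 Batch Operations assume it, plus read access to the source bucket and write access to the destination. A minimal sketch with boto3; the role name, inline policy name, and the exact set of actions are my assumptions, not taken from the original script:

```python
import json

# Trust policy: S3 Batch Operations assumes the role to run the job.
def batch_trust_policy() -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "batchoperations.s3.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

# Permissions: read objects from the source, write copies to the destination.
def batch_permissions_policy(src: str, dst: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:GetObjectVersion"],
             "Resource": f"arn:aws:s3:::{src}/*"},
            {"Effect": "Allow",
             "Action": ["s3:PutObject"],
             "Resource": f"arn:aws:s3:::{dst}/*"},
        ],
    }

def create_batch_role(role_name: str, src: str, dst: str) -> str:
    import boto3  # imported here so the pure helpers above need no SDK
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(batch_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="s3-batch-copy",
        PolicyDocument=json.dumps(batch_permissions_policy(src, dst)),
    )
    return role["Role"]["Arn"]
```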
I recently had a customer who was enabling replication on nearly 200 buckets and needed to seed the replicas with the existing data. Using the console would have been tedious, so instead I put together a script. I opted for Python so I could use the AWS SDK for Python, boto3, but it could easily have been done with a basic shell script as well.
The script first creates an IAM role that grants read permissions on the source bucket and write permissions on the destination bucket. Then it creates a manifest file. The manifest is very simple: a two-column CSV containing the bucket name and key name, so a list operation is all you need. Just don’t forget to paginate if you have a large number of keys in the bucket. I saved the manifest in the source bucket, but it can be stored elsewhere.
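A sketch of that manifest step, assuming a hypothetical `manifest.csv` key name. The paginator is the important part: a single `list_objects_v2` call returns at most 1,000 keys, so the paginator walks every page of a large bucket:

```python
import csv
import io

# Pure helper: render (bucket, key) pairs as the two-column CSV S3 Batch expects.
def manifest_csv(bucket: str, keys: list) -> str:
    buf = io.StringIO()
    writer = csv.writer(buf)
    for key in keys:
        writer.writerow([bucket, key])
    return buf.getvalue()

def build_and_upload_manifest(src_bucket: str, manifest_key: str = "manifest.csv") -> None:
    import boto3  # imported here so manifest_csv stays usable without the SDK
    s3 = boto3.client("s3")
    keys = []
    # The paginator issues as many list calls as needed to cover the bucket.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=src_bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    s3.put_object(
        Bucket=src_bucket,
        Key=manifest_key,
        Body=manifest_csv(src_bucket, keys).encode("utf-8"),
    )
```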
After that, it takes a template JSON config and fills in the manifest, role, source, and destination. That config is then sent to the s3control API to create the job. The job writes a report to the destination bucket, so anyone can review any errors that occurred during the run.
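Submitting the filled-in config might look like the sketch below. The job priority, report prefix, report scope, and `manifest.csv` key are assumptions on my part, not values from the original script. Note that CreateJob wants the manifest’s ETag, which a HeadObject call supplies:

```python
# Pure helper: the Manifest section of the CreateJob request.
def manifest_param(src_bucket: str, manifest_key: str, etag: str) -> dict:
    return {
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": f"arn:aws:s3:::{src_bucket}/{manifest_key}",
            "ETag": etag,
        },
    }

def create_copy_job(account_id: str, role_arn: str, src_bucket: str,
                    dst_bucket: str, manifest_key: str = "manifest.csv") -> str:
    import boto3  # imported here so the helper above needs no SDK
    s3 = boto3.client("s3")
    etag = s3.head_object(Bucket=src_bucket, Key=manifest_key)["ETag"].strip('"')
    resp = boto3.client("s3control").create_job(
        AccountId=account_id,
        ConfirmationRequired=False,  # start copying without a manual confirm
        RoleArn=role_arn,
        Priority=10,
        Operation={"S3PutObjectCopy": {
            "TargetResource": f"arn:aws:s3:::{dst_bucket}"}},
        Manifest=manifest_param(src_bucket, manifest_key, etag),
        Report={  # the completion report lands in the destination bucket
            "Bucket": f"arn:aws:s3:::{dst_bucket}",
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "Prefix": "batch-reports",
            "ReportScope": "AllTasks",
        },
    )
    return resp["JobId"]
```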
Due to the number of buckets, I added the ability to run the tool in either attached or detached mode. If just one job is needed, the tool stays attached, querying the state of the job in a loop until it finishes before exiting. In detached mode, it simply creates the job and exits, allowing many jobs to be scheduled at once.
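The attached-mode loop amounts to polling DescribeJob until the job reaches a terminal state; the 15-second poll interval here is an arbitrary choice of mine:

```python
import time

# States after which an S3 Batch job makes no further progress.
TERMINAL_STATES = {"Complete", "Failed", "Cancelled"}

def is_terminal(state: str) -> bool:
    return state in TERMINAL_STATES

def wait_for_job(account_id: str, job_id: str, poll_seconds: int = 15) -> str:
    import boto3  # imported here so is_terminal stays usable without the SDK
    s3control = boto3.client("s3control")
    while True:
        job = s3control.describe_job(AccountId=account_id, JobId=job_id)["Job"]
        if is_terminal(job["Status"]):
            return job["Status"]
        # Progress counters similar to the script's status line.
        p = job["ProgressSummary"]
        print(f"State: {job['Status']}, Total: {p['TotalNumberOfTasks']}, "
              f"Success: {p['NumberOfTasksSucceeded']}, "
              f"Failed: {p['NumberOfTasksFailed']}")
        time.sleep(poll_seconds)
```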
Here’s an example of the output:
(default):bin$ ./s3_backup.py foghorn1.bucket foghorn2.bucket
Creating manifest for bucket: foghorn1.bucket
Uploading manifest: s3://foghorn1.bucket/manifest.csv
Created s3 batch job: 7d65913e-901b-4632-a058-113fa56d01e8
State: Complete, Total: 836, Success: 836, Failed 0
One final tip to wrap things up: if you are planning to turn on replication, copy the old files after replication is enabled. You may copy a few files twice in the process, but you eliminate the gap between creating the manifest and starting the replication.
S3 Batch is useful for several other transformations on S3 resources. If needed, it can apply a custom Lambda function to every object in a bucket. Next time you need to move data between buckets, as a one-off or a periodic operation, take a look at S3 Batch.
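For instance, swapping the copy operation for a Lambda invocation in the CreateJob request is a one-line change; the function ARN below is hypothetical:

```python
def lambda_invoke_operation(function_arn: str) -> dict:
    # Replaces the S3PutObjectCopy operation in the CreateJob request;
    # S3 Batch then invokes the function once per object in the manifest.
    return {"LambdaInvoke": {"FunctionArn": function_arn}}
```

Usage: `lambda_invoke_operation("arn:aws:lambda:us-east-1:123456789012:function:demo")` slots directly into the `Operation` parameter of `create_job`.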