Modernizing our build and deployment procedures has driven significant usage of, and increased reliance on, our existing internal repository solutions. With that growth come challenges in scaling capacity to meet demand, along with an increase in the operational support required to keep internal repositories available. We therefore needed a single, managed artifact solution that could handle our scale, security, and availability requirements going forward. To accommodate the potential growth, it was clear we needed to build a solution in the cloud that could scale to handle arbitrarily large workloads.
At the beginning of the project there were already four different Artifactory installations spanning three major versions (4.x–6.x), all running on local storage, with around 70TB of binary artifacts, containers, and replicas of third-party repositories. By the end of the project, we wanted to meet the following requirements:
- The deployment and configuration of Artifactory should be defined as code and repeatable.
- Deployment and configuration of repositories should be defined as code and repeatable.
- Users should use the existing SSO solution to sign in and access repositories.
- The total capacity for artifact storage should be expandable without constraints while minimizing costs.
- Artifacts must be made available in each region to avoid cross-region latency.
- Regions should be transparent to the end-user.
- There should be in-region and cross-region automatic failover and recovery.
- All existing repositories should be migrated and consolidated to the new service.
At a high level, this architecture is similar to any standard multi-region service. At the top level is a latency-based Route53 record that directs traffic to an Application Load Balancer in the nearest region. The load balancer then sends traffic to an HA Artifactory cluster on EC2, which has a local disk cache to accelerate performance. The cluster stores state and configuration in RDS, while S3 is used as the binary store.
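The routing decision at the top of the stack can be summarized in a few lines. This is only a toy model of what Route53's latency-based records do on our behalf; the region names and latency figures are illustrative, not measurements from the actual deployment.

```python
# Toy model of latency-based routing: Route53 answers DNS queries with
# the record set whose region has the lowest measured latency to the
# resolver. Region names and latencies below are illustrative only.

def pick_region(latencies_ms: dict) -> str:
    """Return the region with the lowest latency to the client."""
    return min(latencies_ms, key=latencies_ms.get)

# Example: a client in Europe is routed to the European cluster.
client_latencies = {"us-east-1": 85.0, "eu-west-1": 12.0, "ap-southeast-1": 210.0}
print(pick_region(client_latencies))  # eu-west-1
```

In the real service this selection happens entirely inside DNS resolution, so clients need no region-aware configuration at all.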
Following zero-trust principles, a WAF is configured in front of the public load balancers to limit traffic to whitelisted sources only. We use ACM certificates on the load balancers, self-signed certificates on the EC2 instances, and TLS for the S3 and RDS endpoints. Together this ensures that traffic is encrypted end to end and that all source traffic has been actively approved.
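The WAF's allow-list behavior boils down to a CIDR membership check. The sketch below shows the idea with Python's standard `ipaddress` module; the CIDR ranges are hypothetical placeholders, not our actual approved sources.

```python
import ipaddress

# Sketch of the WAF allow-list: only traffic originating from approved
# CIDR ranges is forwarded to the load balancer. Ranges are hypothetical.
ALLOWED_CIDRS = [ipaddress.ip_network(c) for c in ("10.0.0.0/8", "203.0.113.0/24")]

def is_allowed(source_ip: str) -> bool:
    """Return True if the source address falls inside an approved range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_CIDRS)

print(is_allowed("203.0.113.7"))   # True  - inside the approved /24
print(is_allowed("198.51.100.1"))  # False - unapproved source, blocked
```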
Alongside this are the usual supporting services. We use CloudWatch to capture all system and application logs, and we create CloudWatch alarms to send notifications when problems occur.
A close look at the architecture diagram shows some additional details that were discovered and adapted over the course of the project.
Getting Artifactory running multi-region presented a few challenges. In addition to the requirement that all data be replicated cross-region, Artifactory uses a transactional database for state and configuration. To avoid cross-region write latency, we had to decouple the HA clusters into independent regional deployments and configure the repositories to replicate between them.
Fortunately, this scenario has been solved before by JFrog, which made things a little easier. We configured a multi-site mesh topology with event-based push replication. In effect, each HA cluster is configured identically and runs independently of the others. Each repository is actually a combination of one virtual repository and two or more local repositories. The virtual repositories share the same name across regions, while the local repositories are region-specific. Each region has a single read/write local repository and one or more read-only local repositories containing replicated data from each of the other regions. Push-based replication is then configured on the read/write repository to push to all the other regions.
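The mesh topology above can be sketched as a small simulation: each region writes to its own local repository and pushes to every peer, while reads go through the shared virtual name, falling through to the replicas. All names and payloads here are illustrative, and the real replication is performed by Artifactory itself, not application code.

```python
from dataclasses import dataclass, field

# Sketch of the multi-site mesh: one read/write local repo per region,
# read-only replicas of every other region, one shared virtual name.
@dataclass
class Region:
    name: str
    rw_local: dict = field(default_factory=dict)   # this region's writable repo
    replicas: dict = field(default_factory=dict)   # peer region -> replicated data

    def write(self, key, blob, mesh):
        self.rw_local[key] = blob
        # Event-based push replication to every other region in the mesh.
        for other in mesh:
            if other is not self:
                other.replicas[self.name] = dict(self.rw_local)

    def read(self, key):
        # The virtual repository checks the local repo first, then replicas.
        if key in self.rw_local:
            return self.rw_local[key]
        for data in self.replicas.values():
            if key in data:
                return data[key]
        return None

us, eu = Region("us-east-1"), Region("eu-west-1")
mesh = [us, eu]
us.write("libfoo-1.0.jar", b"bytes", mesh)
print(eu.read("libfoo-1.0.jar"))  # b'bytes' - served from the local replica
```

The key property is that a client talking to either region sees the same virtual repository name and the same artifacts, regardless of which region the write landed in.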
Some Additional Challenges
Since we used configuration management to stand up and provision Artifactory, it was easy to ensure that repositories were configured identically in each region. For replicating the existing data there were two paths. One was to create a backup, save it to S3, and then run an import. The other, which we chose, was to replicate directly from the existing instances to the new service. Based on our tests, transferring all the data would take on the order of a week, so having replication configured ensured that any artifacts added after the initial copy began would also be replicated before we cut over to the new service.
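The reason replication beats a one-shot backup/import for a week-long copy is the delta handling: pushes that land during the copy are queued as events and drained before cutover. This is a minimal sketch of that property, with made-up artifact names; the real event queue is internal to Artifactory's push replication.

```python
# Sketch of migrate-via-replication: artifacts pushed to the old service
# during the (roughly week-long) initial copy are captured as events and
# replayed on the new service before cutover, so nothing is lost.
old_repo = {"a.jar": 1, "b.jar": 1}
new_repo = {}
pending_events = []

def initial_copy():
    new_repo.update(old_repo)          # the long bulk transfer

def push_during_migration(name, version):
    old_repo[name] = version
    pending_events.append((name, version))   # event-based replication queue

initial_copy()
push_during_migration("c.jar", 1)      # a client pushes mid-migration
for name, version in pending_events:   # drain the delta before cutover
    new_repo[name] = version

print(new_repo == old_repo)  # True - old and new services agree at cutover
```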
It’s also fairly easy to overload individual Artifactory servers during large replication events, especially on local storage. To reduce load on the EC2 instances and keep EBS costs down, we use signed URLs where possible so that clients access the data in S3 directly instead of proxying through the Artifactory instances.
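A signed URL is just a time-limited link that lets the client fetch the object straight from S3, bypassing the server. The sketch below shows the idea with a plain HMAC; real S3 presigned URLs use the more involved AWS Signature Version 4 scheme, and the bucket, key, and secret here are all placeholders.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Simplified illustration of a signed URL (NOT the real AWS SigV4
# algorithm): the server signs the method, bucket, key, and expiry with
# a secret, and S3-style infrastructure can verify the signature without
# the request ever touching the Artifactory instances.
SECRET = b"demo-secret"  # stand-in for real credentials

def sign_url(bucket, key, expires_in=3600, now=None):
    expires = (now if now is not None else int(time.time())) + expires_in
    msg = f"GET\n{bucket}\n{key}\n{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    query = urlencode({"Expires": expires, "Signature": sig})
    return f"https://{bucket}.s3.amazonaws.com/{key}?{query}"

url = sign_url("artifacts-bucket", "libs/foo-1.0.jar", now=1_700_000_000)
print(url)
```

The payoff is that a multi-gigabyte artifact download costs the Artifactory cluster one small redirect instead of streaming the whole object through EC2 and EBS.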
We also had to work around Artifactory’s requirement of having a primary server in the HA cluster. Fortunately, AWS made this easy to solve: we used separate Auto Scaling groups for the primary and the secondary servers. Although only one instance runs as the primary, placing it in an Auto Scaling group allows it to recover automatically in the event of an EC2 instance failure.
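The trick is simply an Auto Scaling group pinned to a desired capacity of one: if the primary's instance dies, the group's control loop launches a replacement. The simulation below models that reconciliation; group names and sizes are illustrative, not our production values.

```python
# Sketch of the primary-node workaround: a dedicated Auto Scaling group
# with desired capacity 1 for the primary, and a second group for the
# secondaries. Terminating the primary triggers an automatic relaunch.
class AutoScalingGroup:
    def __init__(self, name, desired):
        self.name, self.desired = name, desired
        self.instances = [f"{name}-i{n}" for n in range(desired)]

    def terminate(self, instance):
        self.instances.remove(instance)
        self.reconcile()

    def reconcile(self):
        # The ASG control loop launches instances to match desired capacity.
        while len(self.instances) < self.desired:
            self.instances.append(f"{self.name}-replacement-{len(self.instances)}")

primary = AutoScalingGroup("artifactory-primary", desired=1)
secondaries = AutoScalingGroup("artifactory-secondary", desired=2)
primary.terminate(primary.instances[0])   # simulate an EC2 instance failure
print(len(primary.instances))  # 1 - a replacement primary was launched
```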
Final Requirements Checklist
| Requirement | Result |
| --- | --- |
| Automated Deployment | We used Terraform to store the IaC (infrastructure as code), consistent with the rest of the company. |
| Configuration as Code for Repositories | We also used Terraform’s Artifactory provider for repository and user configuration. |
| SSO | Artifactory has built-in support for SSO, which was straightforward to configure. |
| Arbitrary Size | Using S3 as the binary storage provider, we can scale to arbitrary sizes. In addition, using signed URLs and disabling the local S3 cache let us keep EBS volumes small and scale larger with fewer storage/compute resources. |
| Cross-Region Support and Transparency | We used the multi-site mesh push strategy documented by JFrog. End users are transparently routed to the closest region. |
| Data Availability | Artifacts are available in each region, avoiding cross-region latency and withstanding regional outages. |
| Data Resilience | By using native RDS snapshots, S3 availability, and encrypted Parameter Store strings, we can restore data as needed. This also works around Artifactory’s requirement to create its backups on the local volume, which is not possible at this data-set size. |
| Service Resilience | A combination of Route53, Elastic Load Balancing, and Auto Scaling gives us redundancy, automatic failover, and automatic recovery. |
| Data Migration | Completed using Artifactory’s push-based replication, with minor adjustments to handle replication from Artifactory 4.x to 7.x. |
| Artifact Storage | The total capacity for artifact storage was made expandable without constraints, while minimizing costs. |
| Complete Migration | All existing repositories were migrated and consolidated to the new service. |
In the end, we met all the requirements of the project and got all the data up and running on the new service. As with most large projects that move from disparate, organically grown solutions to a consolidated service, we did have some challenges. However, using AWS as a cloud provider for scale, resilience, and reliability, we were able to build a working implementation of JFrog Artifactory that can grow far beyond the current requirements.