An Anatomy of Tiny URL Provider Service

Written by yagnesh-aegis | Published 2021/01/29
Tech Story Tags: java-development | java-developers | java | java-programming | backend | programming | bitly | software-development


Basic understanding of URL shortener Service/Tiny URL Creation Service
  • In this blog, I will explain the detailed steps for designing a URL shortening service, also known as a tiny URL creation tool. This service provides users with short aliases for long original URLs; when a user accesses the short URL returned by our service, it redirects to the original long URL.
  • URL shortening is the process of creating shorter aliases of long URLs. These shortened aliases are called “short links.” Users are redirected to the original long URL when they access/invoke these short links. The main advantage is that short links save a lot of space when printed, displayed, messaged, or used across many other services. Additionally, users are far less likely to mistype a short link than a long URL.
Let’s take an example. Suppose we want to shorten this URL through our URL shortener service: https://www.aegis.com/dataEmp/booking/training/course/page/100566820/198088/
After shortening the URL: http://aegistiny.com/a8uhs
Note: The tiny/short URL is nearly one-third the size of the actual URL.
  • URL shortening is used to optimize links across various devices, and it is also useful for tracking individual URLs to analyze which links are accessed most frequently, which in turn helps in measuring campaign performance. In addition, the original affiliate URLs are not visible and remain hidden.
  • If you are still unsure about the requirement, visit some of the URL shortener tools available online for a better understanding. If you haven’t yet visited the tinyurl.com site, please give it a try and create a new shortened URL out of any long URL. Spend a few minutes going through the various functionalities and options the tinyurl.com tool offers. This will help you grasp the basic idea, and you will then get more out of this blog regarding the system design.
Requirement Anatomy
Our URL shortener tool should be capable of meeting the following requirements:
Understanding the Requirements:
  1. We will give users the option to choose a custom short URL for their original long URL. Our system should generate a tiny URL that is a unique alias of the original (input) URL. This is called a tiny URL or short link, which is very short and can be easily used in applications.
  2. Once a user gets the short link for their original URL, they will be able to invoke that tiny URL, and our service has to take care of redirecting that tiny URL (short link) to the actual long URL.
  3. When allowing users to choose short links for a given URL, we should also allow them to specify an expiry time, so that the short link/tiny URL works until that time and expires afterwards.
  4. Our system should support high availability and high scalability. It should be capable of handling heavy traffic (millions of requests) for URL redirection with low latency.
Estimating the Capacity for our System
  • Our system will be invoked far more heavily for redirection than for new tiny URL creation, so there will be many more redirection requests from users compared with new URL shortening requests. We can assume a read/write ratio of 20:1, i.e. for every new URL shortening request we expect about 20 redirection requests.
Traffic estimates:
  • Assuming we get 1M new tiny URL requests per day and a read/write ratio of 20:1, we can expect about 20M redirections per day.
  • From this we can calculate the rate of new tiny URL creation requests: 1,000,000 / (24 * 3600) ≈ 12 URLs/second. With the 20:1 read/write ratio, the number of redirection requests is roughly 20 * 12 ≈ 240 URLs/second.
Storage estimates:
  • In this case, we have to store every request that comes in for new tiny URL generation, i.e. both the original URL and the tiny URL (with its key) in our database. Let’s calculate the number of records if we want to retain them for 1 year.
  • Since we expect 1M new tiny URL generation requests per day, that is about 30M requests per month and roughly 360M in a year. So the number of objects we have to store in our DB is about 360M. Assuming each record (original URL, tiny URL key, and some metadata) takes roughly 500 bytes, that is on the order of 180GB of storage.
Memory estimates:
  • Memory is required here because we need to cache the URLs that receive the most traffic. We need to cache these frequently invoked URLs, and to cache them we need memory to store the URLs.
  • Now, let’s calculate how much memory will be required based on our traffic estimates. Suppose 10% of the URLs generate 90% of the traffic; these 10% are the “hot” URLs that we want to cache. As calculated above, we expect about 240 redirection requests per second, which is roughly 20M redirection requests per day.
  • We only need to cache about 10% of those daily requests, i.e. around 2M URLs. Assuming each cached entry (short key plus original URL and some metadata) takes roughly 500 bytes, the cache would need about 2M * 500 bytes ≈ 1GB of memory.
Backend APIs
  • In our design, we will prefer REST APIs for this purpose. They are easy to develop and will serve our requirements. Why prefer REST over SOAP here? Even though the requirement could also be met with SOAP, REST offers more advantages.
  • REST is the most practical, efficient, and widespread standard for creating APIs for microservices or internet services. REST is an interface between systems that uses HTTP to fetch data and perform operations on it, returning the result as JSON, XML, or plain text.
  • So we have to create REST endpoints to expose our service’s functionality. The payload for the REST endpoint will look like the format below:
Payloads for Generating the Tiny URL/Short URL:
Request URL format:
  • Method Type: POST
  • Request headers:
Explanations of the Headers:
  • api_key (String): The API key of an authorized, already registered account. It uniquely identifies the user account and the privileges associated with it.
  • user_email (String): The user’s email ID.
  • user_name (String): The user’s name.
  • org_URL (String): The actual URL that the user wants to shorten.
  • alias_user_preferrence (String): Optional custom key for the URL.
  • valid_upto (Date): Optional expiry date for the tiny URL.
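For illustration only, a request built from the fields above might look like the following (the endpoint path /api/v1/shorten and all sample values are assumptions, not part of the original specification):

    POST http://aegistiny.com/api/v1/shorten
    {
      "api_key": "a1b2c3d4-sample-key",
      "user_email": "john.doe@example.com",
      "user_name": "john_doe",
      "org_URL": "https://www.aegis.com/dataEmp/booking/training/course/page/100566820/198088/",
      "alias_user_preferrence": "my-training-link",
      "valid_upto": "2022-01-29"
    }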
Return Type: (JSON)
  • After successful execution, the API returns {"tiny_url": url_string} in JSON format, which is passed to the UI module to show the shortened URL to the user as a string. In case of any failure, the REST API should return an error code with a proper error message.
  • Here tiny_url is the key and url_string is the value of the shortened URL.
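For illustration, a success response and a failure response could look like this (the error payload structure is an assumption):

    HTTP 200 OK
    { "tiny_url": "http://aegistiny.com/a8uhs" }

    HTTP 400 Bad Request
    { "error_code": 400, "error_message": "Invalid or malformed original URL" }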
System Designing Process:
  • The core problem here is how to generate the tiny URL from the original long URL. In other words, we have to create a unique key for each long original URL.
  • In the example given here, we generate the short URL “http://aegistiny.com/a8uhs” from the original URL "https://www.aegis.com/dataEmp/booking/training/course/page/100566820/198088/".
  • Notice the value “a8uhs” in our tiny URL; this is the unique short key that we want to generate for every long URL. Now the question is how we can generate this short key.
The solution we can implement here is to encode (hash) the actual/original long URL.
  • For this encoding, we can use Java’s security features. We generate a keyed hash of the original input URL using a SecretKeySpec, with either the HmacSHA512 or HmacSHA256 algorithm (built on the SHA-512/SHA-256 hash functions).
Sample code for reference:
Mac mac = Mac.getInstance("HmacSHA256");
SecretKeySpec hashKey = new SecretKeySpec(original_url.getBytes(), "HmacSHA256");
mac.init(hashKey);
  • Once you generate the hash key, you initialize the Mac with mac.init(hashKey) and compute the hash with mac.doFinal(). The resulting hash bytes should then be Base64 encoded using the java.util.Base64 class and the Base64.getEncoder() (or getUrlEncoder()) method. This produces a unique hash string for the original URL that the user passed as input for generating the short/tiny URL.
  • But a sensible question arises here: what should the length of the short key be? 6, 8, or 10 characters? If we use the HmacSHA256 algorithm, it produces a 32-byte (256-bit) hash value, which becomes a string of around 43-44 characters after Base64 encoding.
  • If we use the MD5 algorithm as our hash function, it produces a 128-bit hash value. After Base64 encoding we get a string of more than 21 characters (since each Base64 character encodes 6 bits of the hash value). Since we only have space for 6-8 characters per short key, how do we pick our key? We can take the first 6 (or 8) characters of the encoded value. This can lead to key duplication; there are workarounds for that, but they are not ideal either. A minimal sketch of this hash-and-truncate approach is shown below.
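Sample sketch for reference: a minimal hash-and-truncate helper, assuming HMAC-SHA256 with the URL-safe Base64 variant and a truncation length of 8 (the class and method names are my own):

import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ShortKeyGenerator {

    // Hashes the original URL with HMAC-SHA256, Base64-encodes the result
    // (URL-safe, without padding) and keeps only the first 'length' characters.
    public static String shortKey(String originalUrl, int length) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        SecretKeySpec hashKey =
                new SecretKeySpec(originalUrl.getBytes(StandardCharsets.UTF_8), "HmacSHA256");
        mac.init(hashKey);
        byte[] hash = mac.doFinal(originalUrl.getBytes(StandardCharsets.UTF_8));
        String encoded = Base64.getUrlEncoder().withoutPadding().encodeToString(hash);
        return encoded.substring(0, length);   // truncation is what can cause collisions
    }
}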
  • Even if we use this encoding approach, it will always produce the same hash value for the same URL. In a real-world scenario, however, multiple users may want to generate a tiny URL for the same original URL, and in that case our service would return the same tiny URL to each of the different users, which breaks the functionality.
  • We can solve this issue by injecting some unique value into the request so that our service can distinguish between each new tiny URL creation request. But why make the design so complex here? Instead, we can simply create another microservice for generating a unique “serial_number” for every new request.
  • This new microservice will contain the logic for creating a random 6-digit number, called the serial_number. When a user requests a new tiny URL, our create API calls this new service (the URL serial number generation API) to fetch a new unique 6-digit value, and we append this serial_number to the URL so that every request is unique, with no duplication.
  • In that URL serial number generation service, we write the logic for creating a random 6-digit number and inserting it into a SQL DB. We make this serial_number the primary key so that duplicate numbers cannot be inserted into the DB.
  • If, by any chance, the random number generation logic produces a duplicate number, the insert will throw an exception; in that case we need retry logic that calls the random generation logic again until it produces a unique number, and only then returns a response. With this, our URL serial number generation service will always return a unique number (a rough sketch of this service follows this list).
  • In this approach, concurrency does not cause problems, because our serial_number generation service always makes sure that a unique key is generated for each new tiny URL request, so there is no chance of key duplication. Even if many requests from different hosts invoke our service concurrently, the unique number generation service takes care of allocating a unique key to each URL request.
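A rough sketch of such a serial number generation service, assuming Spring's JdbcTemplate and a hypothetical table url_serial_number whose primary key is the serial number:

import java.util.concurrent.ThreadLocalRandom;
import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class SerialNumberService {

    private final JdbcTemplate jdbcTemplate;

    public SerialNumberService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Keeps generating random 6-digit numbers until one is successfully inserted.
    // The primary-key constraint rejects duplicates, and the loop simply retries
    // with a fresh random value, so the method only ever returns a unique number.
    public long nextSerialNumber() {
        while (true) {
            long candidate = ThreadLocalRandom.current().nextLong(100_000, 1_000_000);
            try {
                jdbcTemplate.update(
                        "INSERT INTO url_serial_number (serial_number) VALUES (?)", candidate);
                return candidate;
            } catch (DuplicateKeyException duplicate) {
                // collision with an existing serial number: try again
            }
        }
    }
}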
Finding the Original URL from the Key Serial Number Stored in the Database
  • This service also makes sure never to give the same key to multiple requests. With this unique key serial_number we can now look up the original URL: we find the original long URL by looking up the key serial_number stored in our database, and we send a 302 status as the response if the lookup is successful.
  • The HTTP 302 status code is used for URL redirection: we pass the original/actual URL associated with the key serial_number in the “Location” header of the response. If the user invokes a URL with a wrong key, the key will not be found in our database, and we return a 404 HTTP error code (“URL not found”) for that particular requested key.
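As a rough illustration, assuming a Spring MVC controller and a hypothetical UrlLookupService that checks the cache and database, the redirection endpoint could be sketched as follows:

import java.net.URI;
import java.util.Optional;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class RedirectController {

    private final UrlLookupService urlLookupService;   // hypothetical service wrapping cache + DB

    public RedirectController(UrlLookupService urlLookupService) {
        this.urlLookupService = urlLookupService;
    }

    // Looks up the original URL for the given key; on a hit, returns
    // HTTP 302 with the Location header set, otherwise HTTP 404.
    @GetMapping("/{shortKey}")
    public ResponseEntity<Void> redirect(@PathVariable String shortKey) {
        Optional<String> originalUrl = urlLookupService.findOriginalUrl(shortKey);
        return originalUrl
                .map(url -> ResponseEntity.status(HttpStatus.FOUND)          // 302 redirect
                        .location(URI.create(url))
                        .<Void>build())
                .orElseGet(() -> ResponseEntity.notFound().build());         // 404 "URL not found"
    }
}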
Designing Database Schema:
  • Here we can go for any SQL database: MySQL, Oracle, or Postgres. The key point is that we will be storing millions of objects in the DB, and read operations will be much heavier than writes. We need to store the user information along with the key serial_number for each tiny URL creation request.
Database Schema:
  • According to our requirement, we only have to create two tables here. One is the “User_Details” table and the other is, let’s say, the “URL_Mapping” table.
  • In the User_Details table, we store the basic information of the user: “user_id”, “user_name”, “user_email”, “user_created_date”, and “updated_date”.
  • In the URL_Mapping table, we have a foreign key “user_id”, the primary key “key_serial_number”, plus “created_date”, “validity_date”, and the original URL as “org_URL” (a sketch of these two tables as JPA entities follows this list).
  • You can also choose a NoSQL DB; if you expect billions of requests, you can go for a NoSQL DB like MongoDB or Cassandra for better performance and scalability.
  • Again, for high scalability you can go for database partitioning. This will improve performance, and with the data spread across multiple partitioned tables it remains highly available and easier to manage.
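As a quick sketch of this schema, here are the two tables expressed as JPA entities (the column and table names follow the description above; the Java types and the identity strategy are my assumptions, and in practice each entity would live in its own source file):

import java.time.LocalDate;
import java.time.LocalDateTime;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.Table;

// Basic information about a registered user.
@Entity
@Table(name = "User_Details")
public class UserDetails {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "user_id")
    private Long userId;

    @Column(name = "user_name")
    private String userName;

    @Column(name = "user_email")
    private String userEmail;

    @Column(name = "user_created_date")
    private LocalDateTime userCreatedDate;

    @Column(name = "updated_date")
    private LocalDateTime updatedDate;

    // getters and setters omitted for brevity
}

// One row per generated tiny URL, keyed by the unique serial number.
@Entity
@Table(name = "URL_Mapping")
public class UrlMapping {

    @Id
    @Column(name = "key_serial_number")
    private Long keySerialNumber;

    @ManyToOne
    @JoinColumn(name = "user_id")
    private UserDetails user;

    @Column(name = "org_URL", length = 2048)
    private String orgUrl;

    @Column(name = "created_date")
    private LocalDateTime createdDate;

    @Column(name = "validity_date")
    private LocalDate validityDate;

    // getters and setters omitted for brevity
}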
Caching Requirement:
  • As already discussed, we need to cache the frequently accessed URLs. We can use any caching technology such as Redis, Memcached, or Guava. Both Memcached and Redis belong to the NoSQL family and are based on key-value storage. Redis is a good choice here since it has more sophisticated memory management and supports both lazy and active eviction. You can find more information about the various caching technologies elsewhere; here, let’s choose Redis for caching.
  • In our Redis cache we just have to store the original/actual long URLs keyed by their respective “key_serial_number”. When a redirection request reaches the controller, the service layer first checks whether that key exists in the cache before hitting the backend database. This improves performance.
  • We can use Spring Boot’s Redis cache support, as it is easy to implement: add the Redis dependency in the pom file, add the property “spring.cache.type=redis” in the application.properties file, and put the @EnableCaching annotation on the Spring Boot main class (a minimal configuration sketch appears after the caching example below).
  • Then you store the key/value in the cache using the @Cacheable annotation:
  • @Cacheable(value = "original_URL", key = "#key_serial_number")
    public String getURL(@PathVariable String key_serial_number) {
        return userRepository.findOne(Long.valueOf(key_serial_number));
    }
  • With this mapping, the getURL() method puts the returned value into a cache named "original_URL", identified by the key "key_serial_number".
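For reference, a minimal Spring Boot setup along the lines described above could look like this (the application class name is illustrative; the dependency and property names are the standard Spring Boot 2.x ones):

// pom.xml: add spring-boot-starter-data-redis (and spring-boot-starter-cache).
// application.properties:
//   spring.cache.type=redis
//   spring.redis.host=localhost
//   spring.redis.port=6379

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cache.annotation.EnableCaching;

@SpringBootApplication
@EnableCaching   // turns on Spring's caching abstraction, backed here by Redis
public class TinyUrlApplication {

    public static void main(String[] args) {
        SpringApplication.run(TinyUrlApplication.class, args);
    }
}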
Updating Cache:
  • We need to update the cached value for a key_serial_number whenever the underlying object is updated. This can be done using the @CachePut annotation.
  • @CachePut(value = "original_URL", key = "#key_serial_number")
Clearing Cache
  • If our cache becomes full, or some of the data is deleted from the actual database, it is no longer necessary to keep those URLs in the cache. In that case it is better to clear the cached data using the @CacheEvict annotation.
  • @CacheEvict(value = "original_URL", allEntries = true)
  • Redis is very fast, and on a 64-bit system it does not impose any hard limit on the amount of data it can store (beyond the available memory).
Scalability and Load Balancing with TLS Security:
  • To make the design scalable and to configure load balancing, you can use any of the available open-source load balancing technologies; in this case we will use the Nginx web server as a load balancer. Nginx is open source and also works as a reverse proxy server.
  • Load balancing with Nginx uses the round-robin algorithm by default if you have not defined any explicit configuration. In our case we also want to enable HTTPS for better security on the Nginx load balancer: in the load balancer configuration you just have to add “listen 443 ssl”.
  • Then we can set up encryption at the load balancer. Since our backend service sits in a private network, we only need to forward plain HTTP calls to it and terminate SSL at the load balancer level. So in your server block you add the configuration for listening on port 80 with a server name and a redirect to the HTTPS port. You can read more about configuring SSL/TLS in the Nginx load balancer; a minimal sketch follows.
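A minimal Nginx configuration along these lines might look like the following (the server name, certificate paths, and backend addresses are placeholders):

upstream tinyurl_backend {
    # round-robin is the default balancing method
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 80;
    server_name aegistiny.com;
    return 301 https://$host$request_uri;   # redirect plain HTTP to HTTPS
}

server {
    listen 443 ssl;
    server_name aegistiny.com;

    ssl_certificate     /etc/nginx/ssl/aegistiny.crt;
    ssl_certificate_key /etc/nginx/ssl/aegistiny.key;

    location / {
        # TLS terminates here; plain HTTP is forwarded to the private backend
        proxy_pass http://tinyurl_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}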
Health Checks:
  • For checking server availability, Nginx includes passive server health checks. This allows us to adapt our backend to the current demand by powering hosts on or off as needed. The “ngx_http_healthcheck_module” handles the surveying, technically called “polling”: it polls the configured backend servers, and if a server responds with a successful HTTP 200 status code (optionally with an expected response body), it is marked as healthy; otherwise it is marked as unhealthy. In this way Nginx takes good care of the performance and high availability of our service.
Application Security:
  • In our requirement, a user can create a tiny URL without even registering in our application, and in that scenario we cannot impose any security. But when a user registers, signs in to our app, and then accesses our service for new URL creation, we can impose some permissions by marking that URL as private in our database.
  • That way only that user can access that tiny URL and no one else. For users who have not signed in and who access our service for tiny URL creation, we can mark the URL as public so that any other user can access that tiny URL.
  • The second main part is to avoid attacks through malicious URLs. We have to validate the URL the user submits in our service and verify whether any malicious characters are present in it. We can validate the URL using the OWASP ESAPI validator (org.owasp.esapi), for example as sketched below.
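A small sketch of that validation, assuming a "URL" validation pattern (Validator.URL) is configured in ESAPI.properties and a maximum length of 2048 characters:

import org.owasp.esapi.ESAPI;

public class UrlValidator {

    // Returns true only if the submitted URL matches the regex configured
    // as Validator.URL in ESAPI.properties and does not exceed the length
    // limit; anything containing disallowed characters is rejected.
    public static boolean isSafeUrl(String requestedUrl) {
        return ESAPI.validator().isValidInput(
                "org_URL",        // context label used in ESAPI error reporting
                requestedUrl,     // the user-supplied URL
                "URL",            // refers to the Validator.URL pattern
                2048,             // maximum allowed length
                false);           // null/empty input is not allowed
    }
}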
Cleaning Up Database:
  • Why is DB clean-up necessary? Because, as per our requirement, a URL has an expiry or validity date after which it is no longer valid. The entries for those URLs would still persist in the database, accumulating unnecessary data, so we need to clean up those expired URLs from the database as part of a lazy cleanup process.
  • If a user tries to access an expired link, we return an error code with a message like “URL not found”. We should create cron jobs to take care of this database clean-up task: there should be two scripts, where the first runs periodically and marks the table rows with expired links as “marked for deletion”.
  • The second script then runs after the first and deletes those marked records.
  • This job needs to be very lightweight, and we should schedule it to run only when user traffic is expected to be very low; a sketch of both steps follows.
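A lightweight sketch of those two steps as Spring @Scheduled jobs (the cron expressions, the marked_for_deletion flag column, and the class name are assumptions; @EnableScheduling must be enabled on a configuration class):

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ExpiredUrlCleanupJob {

    private final JdbcTemplate jdbcTemplate;

    public ExpiredUrlCleanupJob(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Step 1: every night at 02:00, flag rows whose validity date has passed.
    @Scheduled(cron = "0 0 2 * * *")
    public void markExpiredLinks() {
        jdbcTemplate.update(
                "UPDATE URL_Mapping SET marked_for_deletion = true WHERE validity_date < CURRENT_DATE");
    }

    // Step 2: every night at 03:00, physically delete the flagged rows.
    @Scheduled(cron = "0 0 3 * * *")
    public void deleteMarkedLinks() {
        jdbcTemplate.update("DELETE FROM URL_Mapping WHERE marked_for_deletion = true");
    }
}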
Conclusion
In this blog, the requirement was to design a URL shortener service that can shorten any kind of long URL. We have walked through all the design steps so that you can implement this requirement with minimal effort.
