Analytics for RESTful interfaces

Written by igor_98383 | Published 2017/12/27
Tech Story Tags: google-cloud-platform | analytics | rest-api | bigquery | data-visualization


The following is an example project showing how to track each request to our RESTful API, and how to analyze and visualize the data in Google's Data Studio reports.

This solution is a very straightforward way of tracking API calls while keeping the ability to have different questions answered from our collected data. The GitHub link is attached at the bottom of this article.

We’ll be using the following Google Services:

Appengine Flex gives us a lot of advantages over manual server management. Our instances autoscale, so we don't have to handle increased traffic ourselves. Deployment management is easy as pie. TLS management is automated by Google, so certificate renewal is also handled for us if we want to use a custom domain secured by TLS. But for our example this could be any Java application server or hosting provider.

BigQuery is going to be our data warehouse for analytics.

We're going to use Data Studio, which connects to BigQuery, to visualize the analytics.

Code — Model

All we need is a filter that intercepts each request and queues it for BigQuery ingestion. But first, let's prepare the data structure for the events we'll be ingesting.

The Event object stores the name of the event, which is our API request path (in our example there are two API services: /api/ping and /api/greeting), the current system time of the event, a userId (random in the GitHub example), the HTTP method (GET/POST/PUT/DELETE), and the geo location the request originates from. You can easily extend the data you'd like to capture simply by adding fields to the Event object.

import java.time.LocalDateTime;

public class Event {
    public String name;
    public LocalDateTime time;
    public String userId;
    public String method;
    public String country;
    public String city;
    public String region;
}

Country, City, and Region are extracted from Google's headers, which are added by the load balancer that sits in front of our Appengine Flex instances. Check the filter code below. You can also read more about it here.

Code — Filter

With a simple filter we intercept each request to our API. We ignore all other requests, since we're trying to analyze only the RESTful API.

import java.io.IOException;
import java.time.LocalDateTime;
import java.util.Random;
import javax.servlet.*;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletRequest;

@WebFilter(filterName = "RequestFilter", urlPatterns = "/api/*")
public class RequestFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;

        // Catch every request to the API and log it as an event
        Event event = new Event();
        event.name = req.getRequestURI();
        event.time = LocalDateTime.now();
        event.method = req.getMethod();
        event.userId = "user_" + (new Random().nextInt(1000) + 1);

        event = augmentEventWithHeaders(event, req);

        // Queue the event for asynchronous BigQuery ingestion
        BigQueryUtil.eventQueue.offer(event);

        chain.doFilter(request, response);
    }

    private Event augmentEventWithHeaders(Event event, HttpServletRequest request) {
        // Geo headers set by the load balancer in front of Appengine Flex
        event.country = request.getHeader("X-AppEngine-Country");
        event.city = request.getHeader("X-AppEngine-City");
        event.region = request.getHeader("X-AppEngine-Region");
        return event;
    }

    @Override
    public void init(FilterConfig filterConfig) {}

    @Override
    public void destroy() {}
}

Before we dive into the utility methods, let's just think about scalability and reliability. In this example we chose to use the Producer-Consumer pattern. On each request we insert the event into our bounded, thread-safe BlockingQueue with a capacity of 10,000 events. Learn more about BlockingQueue here.

Now we need a consumer to push those queued events into BigQuery. We create a separate thread and start it on server start by implementing a WebListener — a ServletContextListener. That way we ingest events independently of the REST API functionality. For this example we use only one consumer, so we define an ExecutorService with a fixed thread pool of size 1.

private static ExecutorService queueService = Executors.newFixedThreadPool(1);
public static final BlockingQueue<Event> eventQueue = new LinkedBlockingQueue<>(10000);

public static void startQueueListener() {
    Runnable startTask = () -> {
        try {
            while (true) {
                int numberOfIngested = 0;
                if (eventQueue.size() > 0) {
                    List<Event> events = new ArrayList<>();
                    // Drain at most 50 events per batch
                    eventQueue.drainTo(events, 50);
                    insertEventsToBigQuery(events);
                    numberOfIngested = eventQueue.size();
                }
                // If fewer than 50 events remain queued, back off for 5 seconds
                if (numberOfIngested < 50) {
                    Thread.sleep(5000);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    };
    queueService.submit(startTask);
}
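The listener that kicks this off is not reproduced above; a minimal sketch, assuming startQueueListener() lives in BigQueryUtil (the class name QueueStartupListener is mine), could look like this:

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// Starts the consumer thread once, when the server boots
@WebListener
public class QueueStartupListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        BigQueryUtil.startQueueListener();
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // A production version would shut down the executor here
    }
}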

Disadvantages of this approach

There are a couple of things we need to be aware of:

  • What happens if the queue is full? Are we ok with dropping the events?
  • How do we handle the scenario when BigQuery is not available? (This can, and probably will, happen at some point in time.)
  • What if our consumer dies due to a bug or something else? How do we monitor it, restart it?
  • How many messages can we afford to lose?
  • Is one consumer enough if load surges?

For production systems with heavy loads, where we want ingestion of our events to be as precise as possible, this is not a good solution. When we need a more robust approach we would want to look into separating our producers and consumers. Let's see what we could gain by using something like Google Cloud Pub/Sub (a short publishing sketch follows the list):

  • Loose coupling — we can make our consumers run on different compute instances, separating the BigQuery ingestion logic from our REST API service code. This also makes our design cleaner.
  • Scalability — in cases of heavy traffic we could scale the number of consumers to match the load generated by the producers, without touching the core REST API. There is an option with Google Pub/Sub to have queue-based autoscaling. Read more about it here.
  • Easier testing and monitoring of our Pub/Sub
  • Reliability — in case there is a problem with our consumer or the BigQuery service, our events will persist in Google's Pub/Sub for about 7 days from the day they were published. That gives our consumers the chance to catch up with ingestion once all services are operational again.
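To make the loose-coupling point concrete, here is a rough sketch of publishing an event to a Pub/Sub topic with the google-cloud-pubsub client. The project id, topic name, and JSON serialization of the event are my assumptions, not code from the example project:

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class EventPublisher {

    // Hypothetical project and topic names for this sketch
    private static final TopicName TOPIC = TopicName.of("myproject", "api-events");

    public static void publish(String eventJson) throws Exception {
        // In practice you would create one Publisher and reuse it across requests
        Publisher publisher = Publisher.newBuilder(TOPIC).build();
        try {
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8(eventJson))
                    .build();
            // publish() is asynchronous; get() blocks until the message id comes back
            publisher.publish(message).get();
        } finally {
            publisher.shutdown();
        }
    }
}

The REST filter would then only serialize and publish, and the BigQuery ingestion would live in a separate subscriber service.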

BigQuery table schema and event ingestion

All of the interactions with BigQuery are handled in the BigQueryUtil class.

BigQueryUtil also implements methods that check whether the table we need exists, extract the schema from our Event object, and create the necessary table structure in BigQuery. All of this happens on server start, so by the time the first request reaches our REST API methods we're all ready.
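That code is not reproduced in the article, but a minimal sketch of such a check with the google-cloud-bigquery client might look like the following; the field names come from the Event object above, and the dataset name "stream" matches the queries later in the article:

import com.google.cloud.bigquery.*;

// Sketch: create the events table if it doesn't exist yet
private static void ensureTableExists(BigQuery bigquery, String tableName) {
    TableId tableId = TableId.of("stream", tableName);
    if (bigquery.getTable(tableId) == null) {
        Schema schema = Schema.of(
                Field.of("name", LegacySQLTypeName.STRING),
                Field.of("time", LegacySQLTypeName.TIMESTAMP),
                Field.of("userId", LegacySQLTypeName.STRING),
                Field.of("method", LegacySQLTypeName.STRING),
                Field.of("country", LegacySQLTypeName.STRING),
                Field.of("city", LegacySQLTypeName.STRING),
                Field.of("region", LegacySQLTypeName.STRING));
        bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
    }
}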

After we have a table ready in BigQuery we can start sending our events:

private static void streamRows(final BigQuery bigquery, final String tableId,
        final String suffix, final List<Map<String, Object>> rows) throws IOException {

    TableId table = TableId.of(BIGQUERY_DATASET, tableId);
    Builder insertRequestBuilder = InsertAllRequest.newBuilder(table);
    insertRequestBuilder.setTemplateSuffix(suffix);

    for (Map<String, Object> row : rows) {
        insertRequestBuilder.addRow(row);
    }

    InsertAllRequest insertAllRows = insertRequestBuilder.build();
    InsertAllResponse insertResponse = bigquery.insertAll(insertAllRows);
}

We use a template suffix in this example. This simply creates multiple tables based on our suffix, which in our case is YEAR_MONTH. The suffix can be anything we want; we just chose to have tables named eventsYEAR_MONTH (for example events2017_DECEMBER, the next month would be events2018_JANUARY, and so on). Read more about it here.
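A suffix in that format could be produced like this (my assumption of how the suffix is built, not code from the project):

import java.time.LocalDate;

// e.g. "2017_DECEMBER", giving a table named events2017_DECEMBER
private static String tableSuffix() {
    LocalDate now = LocalDate.now();
    return now.getYear() + "_" + now.getMonth().name();
}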

We can now deploy our example and start calling our REST API methods:

mvn appengine:deploy

Check the terminal for your endpoint URL and replace http://localhost:8080 with your own. Let's make a couple of calls to our API:

while true; do curl -H "Content-Type: application/json" http://localhost:8080/api/ping ;done

Ctrl-C to stop.

Let's do a couple more:

while true; do curl -H "Content-Type: application/json" -X POST -d '{"time":"2017-12-25T14:26:50.466"}' [http://localhost:8080/api/ping ;](http://localhost:8080/api/ping;)done

while true; do curl -H "Content-Type: application/json" http://localhost:8080/api/greeting ;done

while true; do curl -H "Content-Type: application/json" -X POST -d '{"userName":"Igor"}' http://localhost:8080/api/greeting ;done

Visualization

In the BigQuery console we can preview our events with a simple SQL query:

SELECT * FROM [myproject:stream.events2017_DECEMBER] order by time desc

Our events in BigQuery table
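The bracketed table reference is BigQuery's legacy SQL dialect; if you prefer standard SQL, the equivalent query would be:

SELECT * FROM `myproject.stream.events2017_DECEMBER` ORDER BY time DESC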

Let's connect our table to Google Data Studio and try to visualize how many calls were made to each of our REST API methods.

Google Data Studio

First we need to create a new custom BigQuery data source. Go to Data Sources and find BigQuery in the list. Then select Custom Query; the input field opens up. Insert the following SQL query (replace 'myproject' with your project id and events2017_DECEMBER with the desired table):

SELECT concat(method, ' ', name) as name_method, count(*) as request_count FROM [myproject:stream.events2017_DECEMBER] group by name_method

Result of the SQL query if we ran it in BigQuery Query editor

The SQL query counts all requests made to our API and groups the results by endpoint and call method (either GET or POST in our case).

Click Connect. You should see our custom BigQuery connection.

BigQuery Custom connection from Data Studio

Click Create Report and in the menu select Insert -> Pie Chart.

Now you can click the View button and voilà, we have our custom analytics for the REST API. There is also a refresh-data button, which fires a new query against our BigQuery table, so we can monitor the calls in near real-time.

Analyzing request counts to each REST method

Based on the collected data you could see the number of requests from each user, or requests grouped by country or city — basically anything the SQL language allows you to do.
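For example, a grouping by country in the same legacy SQL dialect used above:

SELECT country, count(*) as request_count FROM [myproject:stream.events2017_DECEMBER] group by country order by request_count desc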

What else can we do with those metrics?

That depends on your SQL skills and use case. If you're tracking the number of calls to each API endpoint, that is probably all you'll need. You can go pretty far with SQL querying.

If your goal is to see how many users clicked on a specific button in an email you've sent them, and what the conversion rate was in terms of new subscribers or purchases from that specific email button, then you're probably looking at the wrong approach here.

In case you are looking for a solution that gives you more insight into customer behavior from email marketing campaigns, you may want to have a look at SixthMass (still in development, but if you'd like to learn more, leave your email below).

SixthMass Customer Journeys Retention Analysis

Example GitHub project:

igorrendulic/GoogleFlexBigQuery - Analytics for RESTful interfaces via Google Appengine Flex, BigQuery and Datastudio (github.com)

