Do you know why microservice design is so popular in the development of BI tools? The answer is clear: it helps teams build scalable and flexible solutions. But microservice architecture has a significant drawback: its performance often needs considerable improvement.

The FreshCode team faced this problem too, and I’ve decided to show how we coped with it. This article was written together with the FreshCode CTO and is based on our recent case of developing a reporting microservice. Here you will find its technical scheme and estimates, as well as a list of tools for on-premise and SaaS products.

MEET MICROSERVICE DESIGN. WHAT SHOULD YOU CARE ABOUT?

If you wonder why the microservice style is so popular, think about recent IT trends. The demand for Agile and DevOps practices led to the popularity of microservices. Today such major players as Uber, Airbnb and Netflix use microservices to solve their business problems.

The best way to explain what microservice design means is to compare it with a typical monolithic app. A monolithic system runs all of its logic in a single process. A microservice system, meanwhile, is split into a few separate processes. They usually are:

database

server

In a monolith, any change in the system leads to the deployment of a new version of the server part of the system. Let’s consider the concept in detail.

MICROSERVICE DESIGN IN DETAIL

Microservice design means building an app as a set of services, but the definition is vague. I can single out 4 features that a microservice usually has:

the decentralized control of languages and data

responsibility for a specific business need

automatic deployment

endpoints

In the picture below you can see microservice design compared to a monolith app.

Scalability in microservices

One of the main benefits of microservice design is its scalability. You can scale individual services without changing the whole system. So, you save resources and keep the app less complex. One of the most famous cases that proves this is the Netflix user base. The company had to cope with a growing subscriber database, and microservice design was a great solution for scaling it.

Each microservice needs its own database. Otherwise, you can’t use all the benefits of the modularization pattern. But the variety of databases leads to challenges in the reporting process. We will discuss the problem later.

Microservice design speeds up app development and lets you launch the product earlier. Each part can be rolled out separately. So, the deployment of microservices is quicker and easier.

Pros of microservices

1. The possibility of convenient horizontal system scaling

2. Increased productivity of development team members

3. Simplification of the debugging and maintenance processes

4. The ability to work in smaller teams and use an Agile approach

5. Flexibility in continuous integration and deployment

Cons of microservices

Despite all these benefits, microservice architecture has its own drawbacks. They come from the need to operate many systems and complete various tasks in a distributed environment. So, the main microservice pitfalls are:

1. The complexity of microservice design makes developers plan and act more carefully.

2. The external API communication in microservice architecture leads to more significant risks of attacks.

3. Sometimes it’s difficult to switch between services in the development and deployment processes.

REPORTING IN MICROSERVICE SYSTEM

We worked on a legacy EdTech project. The system was very complex and included many microservices. Its main parts were:

sophisticated financial and billing system

multi-organisation structure for large group entities

workflow management tool for business processes

integrated bulk email, SMS and live chat

online system for surveys, quizzes, examination

flexible assessment and learning management system

FreshCode joined the project at the stage of migrating to a new interface. The product was being prepared for the global launch. The microservice system was supposed to process large amounts of data. As for the app target audience, it was developed for

large education networks that manage 100s of campuses

governments that have up to 200k schools, colleges and universities

Meanwhile, the EdTech app design was convenient both for large education networks and for a small school of about 100 students.

So, the FreshCode development team faced the problem of managing and improving the performance of a complex microservice architecture. It should be mentioned that the client wanted to build both SaaS and self-hosted systems, so we chose the technical solutions with this in mind.

IMPROVING PERFORMANCE IN MICROSERVICES

Generating reports required engaging with different services, and this caused performance issues. That’s why the FreshCode team decided to optimize the app architecture by creating a separate reporting microservice. It received data from all the databases, then saved the data and transformed it into custom reports.

In the picture below you can see the scheme of the reporting microservice system and the technologies for its implementation.

Yellow color marks all microservices in the system. Each of them has its own database. The reporting module tracks all changes in them with the help of a messaging system. Then, it stores the new data in its own report database.
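To make the flow above more tangible, here is a minimal Python sketch of a reporting module that listens to change messages and keeps its own copy of the data. It is only an illustration: the `MessageBus` and `ReportingModule` classes, the topic names and the message fields are all assumptions made for this example, and a plain dictionary stands in for the report database.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = Dict[str, object]


class MessageBus:
    """Tiny in-process stand-in for a real messaging system (hypothetical)."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable[[Message], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Message], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: Message) -> None:
        # Deliver the message to every handler registered for this topic.
        for handler in self.subscribers[topic]:
            handler(message)


class ReportingModule:
    """Keeps its own copy of the data it needs for reports."""

    def __init__(self, bus: MessageBus, topics: List[str]) -> None:
        # (service name, entity id) -> latest known state of that entity
        self.report_db: Dict[tuple, object] = {}
        for topic in topics:
            bus.subscribe(topic, self.on_change)

    def on_change(self, message: Message) -> None:
        key = (message["service"], message["id"])
        self.report_db[key] = message["data"]


bus = MessageBus()
reporting = ReportingModule(bus, ["billing.changes", "assessment.changes"])
bus.publish("billing.changes", {"service": "billing", "id": 1, "data": {"amount": 120}})
bus.publish("assessment.changes", {"service": "assessment", "id": 7, "data": {"grade": "A"}})
print(reporting.report_db)
```

The point of the design is that the source services never get queried at report time: the reporting module only reads from its own store.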

6 STEPS OF THE MICROSERVICE IMPLEMENTATION

Let’s look at the 6 main parts of the reporting system, the technologies that can be used for each of them, and the best solutions.

Change Data Capturing (CDC)

CDC tracks every single change (insert, update, delete) and performs some logic on it. There were 3 possible tools for the first step of implementing the microservice reporting system.

1. Apache NiFi

It allows you to create a simple CDC pipeline without coding at all. Apache NiFi has a lot of built-in processors and supports data routing, transformation and system mediation logic.

Pros:

Support of cluster mode and easy scaling

Built-in activities for publishing to Kafka and Kinesis (such as PublishKafka and PutKinesisStream)

Implementation of custom activities on any JVM language

User-friendly UI

Cons:

No predefined data format for messaging between activities

Supports only JVM languages

The quality of default activities isn’t perfect

No Oracle CDC activity

2. StreamSets Data Collector

Popular open source solution for continuous big data ingestion in a microservice reporting system. Its main advantages are simple creation of data pipelines and support of many widespread technologies.

Pros:

Built-in AWS S3, Kinesis, Kafka, Oracle, Postgres processors

Open source software can be adjusted for your needs

Simple and convenient UI

Support of most of the popular tools

Cons:

It’s a new solution that is still actively developing

It’s a little bit difficult to start working with StreamSets Data Collector

3. Matillion

The innovative ELT architecture has an easy-to-use interface. It is built specifically for Amazon Redshift, Google BigQuery and Snowflake.

Pros:

A proprietary tool

Support of the development team

Well-tested solution

Cons:

Only several databases can be used with this tool

ELT architecture doesn’t suit all projects

Oracle was the main database of our microservice reporting system. So, we chose StreamSets Data Collector because of its Oracle CDC support out of the box.
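The CDC idea itself (track every insert, update and delete, then run some logic on it) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how StreamSets or any other tool from the list works internally: production CDC tools typically read the database's change log, while this sketch derives events by comparing two snapshots of a table. The `ChangeEvent` type and both functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Row = Dict[str, object]


@dataclass(frozen=True)
class ChangeEvent:
    op: str             # "insert", "update" or "delete"
    key: int            # primary key of the changed row
    row: Optional[Row]  # new row state, None for deletes


def diff_snapshots(before: Dict[int, Row], after: Dict[int, Row]) -> List[ChangeEvent]:
    """Derive change events by comparing two snapshots keyed by primary key."""
    events: List[ChangeEvent] = []
    for key in sorted(after.keys() - before.keys()):
        events.append(ChangeEvent("insert", key, after[key]))
    for key in sorted(after.keys() & before.keys()):
        if after[key] != before[key]:
            events.append(ChangeEvent("update", key, after[key]))
    for key in sorted(before.keys() - after.keys()):
        events.append(ChangeEvent("delete", key, None))
    return events


def apply_event(replica: Dict[int, Row], event: ChangeEvent) -> None:
    """The 'logic performed on a change': keep a replica in sync."""
    if event.op == "delete":
        replica.pop(event.key, None)
    else:
        replica[event.key] = event.row


before = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ann B."}, 3: {"name": "Cid"}}

replica = dict(before)
for change in diff_snapshots(before, after):
    apply_event(replica, change)
print(replica == after)
```

Once every change is expressed as an event like this, it can be sent through the messaging system to any consumer that needs it.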

Messaging System

It allows sending messages between computer systems, as well as setting publishing standards for them.

1. Apache Kafka

One of the most famous tools for real-time analytics. Apache Kafka has high throughput and reliability characteristics.

Pros:

High throughput, fault tolerance, durability

Great scalability, high concurrency

Batch mode, native computation over stream

A great choice for an on-premise microservice reporting system

Cons:

Requires DevOps knowledge for correct setup

No built-in monitoring tool

2. AWS Kinesis

It simplifies collecting, processing and analyzing streaming data. Amazon Kinesis offers key capabilities for cost-effective processing at any scale.

Pros:

Easy to manage and scale

Great integration with other AWS services

Almost no DevOps effort

Built-in monitoring and alert system

Cons:

Needs some cost optimizations

No way to use for on-premise software

Although Apache Kafka required a bit more effort to deploy and set up, we used it as a cost-efficient on-premise solution.
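For readers who haven't worked with log-based messaging, here is a rough Python model of the core idea behind systems like Kafka: producers append records to a log, and each consumer group remembers how far it has read. This is an in-memory sketch under heavy assumptions; `TopicLog` is a hypothetical class, not the Kafka client API, and it leaves out partitions, replication and persistence.

```python
from typing import Dict, List, Tuple


class TopicLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self) -> None:
        self.records: List[dict] = []
        self.offsets: Dict[str, int] = {}  # group -> next offset to read

    def append(self, record: dict) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group: str, max_records: int = 10) -> List[Tuple[int, dict]]:
        # Return records from the group's current position, without moving it.
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        return [(start + i, record) for i, record in enumerate(batch)]

    def commit(self, group: str, offset: int) -> None:
        # Remember that everything up to and including `offset` is processed.
        self.offsets[group] = offset + 1


log = TopicLog()
for amount in (100, 50, 70):
    log.append({"table": "invoices", "amount": amount})

batch = log.poll("reporting", max_records=2)
log.commit("reporting", batch[-1][0])
print(log.poll("reporting"))   # only the record the group has not seen yet
print(len(log.poll("audit")))  # another group still sees all records
```

Because the position is committed only after processing, a consumer that crashes mid-batch simply reads the same records again after a restart.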

Streaming Computation Systems

A streaming computation system analyzes multiple data streams from many sources at high speed. It helps to prepare data before ingestion, so it’s possible to denormalize or join the streams and add any extra info if needed.
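The denormalize/join step can be sketched as follows. This is a plain Python generator rather than a job for any of the engines listed below, and the event shapes (`student` and `payment` records with these fields) are assumptions invented for the example.

```python
from typing import Dict, Iterable, Iterator


def enrich_payments(events: Iterable[dict]) -> Iterator[dict]:
    """Join payment events with the latest student record seen so far."""
    students: Dict[int, dict] = {}  # state kept by the streaming job
    for event in events:
        if event["type"] == "student":
            students[event["id"]] = event
        elif event["type"] == "payment":
            student = students.get(event["student_id"], {})
            # Emit a flat, report-friendly record.
            yield {
                "payment_id": event["id"],
                "amount": event["amount"],
                "student_name": student.get("name"),
                "campus": student.get("campus"),
            }


stream = [
    {"type": "student", "id": 1, "name": "Ann", "campus": "North"},
    {"type": "payment", "id": 10, "student_id": 1, "amount": 50},
    {"type": "payment", "id": 11, "student_id": 2, "amount": 30},
]
for record in enrich_payments(stream):
    print(record)
```

The second payment refers to a student the job has not seen yet, so its name and campus come out empty. Handling such late or missing data is one of the things real streaming engines spend a lot of machinery on.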

1. Spark Streaming

Brings Apache Spark’s language-integrated API to stream processing. So, it allows writing streaming jobs the same way we write batch jobs.

Pros:

Stateful exactly-once semantics out of the box

Fault-tolerance, scalability

In-memory computation

Cons:

Pretty expensive to use

Manual optimization

No built-in state management

2. Apache Flink

It is useful for stateful computations over unbounded and bounded data streams. Apache Flink runs in all common cluster environments and performs computations at in-memory speed.

Pros:

Exactly once state consistency

SQL on Stream & Batch Data

Low latency, scalability, fault-tolerance

Support of very large state

Cons:

Requires high programming skills

Complicated architecture

The Flink community is smaller than Spark’s, but it is growing

3. Apache Samza

The scalable data processing engine for real-time analytics that can be used in a microservice reporting system.

Pros:

Can maintain a large state

Low latency, high throughput, mature and tested at scale

Fault-tolerant and high performance

Cons:

At-least-once processing guarantee

Lack of advanced streaming features (watermarks, sessions, triggers)

4. AWS Kinesis Services

The set of tools includes Data Firehose, Data Analytics, and Data Streams. As a result, it helps to build powerful stream processing without implementing any custom code.

Pros:

Pay only for what you use

The easiest way to process data streams in real time with SQL

Handle any amount of streaming data

Cons:

No way to use on-premise

Complicated to customize

AWS provides a great set of tools for ETL and data processing. It’s a good starting point. But there is no way to deploy it on custom servers. That’s why it doesn’t fit on-premise solutions.

Apache Flink is the most feature-rich and performant solution. It allows storing large application state (multi-terabyte). But it requires more developers to be involved, and you have to deploy it yourself.
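The difference between the exactly-once and at-least-once guarantees mentioned in the lists above matters for reports: with at-least-once delivery the same event may arrive twice, and a naive sum would count it twice. One common way to cope is to make the consumer idempotent. The sketch below shows the idea in plain Python; the `event_id`, `account` and `amount` fields are assumptions for this example, and a real job would keep the set of seen ids in durable state rather than in memory.

```python
from typing import Dict, Iterable, Set


def total_by_account(events: Iterable[dict]) -> Dict[str, int]:
    """Sum amounts per account, ignoring events that were delivered twice."""
    seen: Set[str] = set()
    totals: Dict[str, int] = {}
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery, already counted
        seen.add(event["event_id"])
        totals[event["account"]] = totals.get(event["account"], 0) + event["amount"]
    return totals


deliveries = [
    {"event_id": "e1", "account": "a", "amount": 10},
    {"event_id": "e2", "account": "b", "amount": 7},
    {"event_id": "e1", "account": "a", "amount": 10},  # redelivered after a retry
    {"event_id": "e3", "account": "a", "amount": 5},
]
print(total_by_account(deliveries))
```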

Data Lake

The central repository of integrated data from one or more disparate sources. It stores current and historical data in one single place. So, we can use it for creating analytical reports, machine learning, etc.

1. AWS S3

The object storage service offers industry-leading scalability, data availability, security and performance.

Pros:

Easy to integrate with other AWS services

Designed for 99.999999999% (11 9’s) of data durability

Cost-effective for rarely accessed data

Has S3-compatible open source implementations (for example, Minio)

Cons:

High network pricing

S3 has had availability incidents in the past, but that isn’t a problem for a Data Lake

2. Apache Hadoop

Its distributed file system (HDFS) is the primary data storage system used by Hadoop applications. It allows storing and processing large amounts of data.

Pros:

Efficiently works with huge amounts of data

Integration with many analytical and operational tools (Impala, Hive, HBase, etc)

Cons:

Complicated to deploy and manage

Needs to set up monitoring and high availability

We decided to start with AWS S3. It has an S3-compatible open source implementation. That’s why we could integrate it into the on-premise microservice reporting system.
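As an illustration of how change events might be laid out in such an object store, here is a small Python sketch. The key layout (service, table, then date partitions) is a common convention and an assumption on our side for this example, not the project's actual scheme. `InMemoryObjectStore` is a hypothetical stand-in, so the example runs without any S3 client.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def object_key(service: str, table: str, event_time: datetime, event_id: str) -> str:
    """Build a date-partitioned key so historical data is easy to list."""
    return (
        f"{service}/{table}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"{event_id}.json"
    )


class InMemoryObjectStore:
    """Hypothetical stand-in for an S3-compatible bucket."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body

    def list_keys(self, prefix: str) -> List[str]:
        return sorted(k for k in self.objects if k.startswith(prefix))


store = InMemoryObjectStore()
when = datetime(2019, 3, 5, tzinfo=timezone.utc)
key = object_key("billing", "invoices", when, "evt-1")
store.put(key, json.dumps({"id": 10, "amount": 100}).encode("utf-8"))
print(store.list_keys("billing/invoices/year=2019/month=03/"))
```

Every event becomes a new object instead of overwriting an old one, which is what lets the data lake keep both current and historical data.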

Report Databases

1. AWS Aurora

According to AWS, it is up to 5 times faster than standard MySQL databases and 3 times faster than standard PostgreSQL databases.

Pros:

Pretty fast SQL database

High Availability and Durability

Fully Managed

Easy to scale

Cons:

Bad performance for analytical reports in big data projects

The smallest available instance is too big, but we can easily replace it with plain PostgreSQL

2. AWS Redshift

According to AWS, Redshift delivers up to 10 times faster performance than other data warehouses. It uses machine learning, massively parallel query execution and columnar storage on high-performance disk.

Pros:

May run queries on external S3 files

Easy to set up, use and manage

Columnar storage

Cons:

Doesn’t enforce uniqueness

Can’t be used as a live app database

It’s mostly useful for running aggregations on large amounts of data

3. Kinetica

The vectorized, columnar, memory-first database designed for analytical (OLAP) workloads. Kinetica automatically distributes any workload across CPUs and GPUs for optimal results.

Pros:

Pretty fast aggregation performance, run on GPU and CPU

Supports materialized join views, and can update them incrementally

Cons:

GPU instances still cost a lot

No way to join data between different partitions

4. Apache Druid

It generally works well with any event-oriented, clickstream, time series, or telemetry data, especially streaming datasets from Apache Kafka. Druid provides exactly once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics.

Pros:

Druid can be deployed in any *NIX environment on commodity hardware

Best for interactive dashboards with full drill-down capabilities

Stores only pre-aggregated data

Cons:

Isn’t perfect for custom reports that may be built by users

Works only on time series data

No full join support

All of these databases are amazing. But our client’s goal was to create reports based on all data from all microservices. So, the development team considered AWS Aurora as the best choice for this task. It simplified the workflow a lot.
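Here is a minimal sketch of why a single relational report database simplifies things: once data from different microservices lives in one place, a cross-service report is just a join. The example uses Python's built-in `sqlite3` as a stand-in for Aurora or PostgreSQL, and the tables, columns and sample rows are invented for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with data 'copied' from two services."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        -- From a hypothetical student management service
        CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, campus TEXT);
        -- From a hypothetical billing service
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, student_id INTEGER, amount REAL);
        INSERT INTO students VALUES (1, 'Ann', 'North'), (2, 'Bob', 'North'), (3, 'Cid', 'South');
        INSERT INTO invoices VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 3, 70.0), (13, 1, 25.0);
        """
    )
    return connection


def revenue_by_campus(connection: sqlite3.Connection) -> list:
    """One query joins data that originally lived in two separate databases."""
    query = """
        SELECT s.campus, COUNT(DISTINCT s.id) AS students, SUM(i.amount) AS revenue
        FROM invoices AS i
        JOIN students AS s ON s.id = i.student_id
        GROUP BY s.campus
        ORDER BY s.campus
    """
    return connection.execute(query).fetchall()


for row in revenue_by_campus(build_demo_db()):
    print(row)
```

Without the report database, the same report would need calls to both services and a join in application code.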

Report Microservice

The report microservice was responsible for storing information about data objects and the relations between them. It was also responsible for managing security and for generating the reports themselves, since these reports were based on the chosen data objects.
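A rough sketch of how these responsibilities could fit together is shown below. It is written in Python for consistency with the other examples in this article and does not reproduce the actual service: `DataObject`, `Relation`, the role names and `generate_report` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class DataObject:
    name: str
    fields: List[str]
    allowed_roles: Set[str]  # who may use this object in a report


@dataclass
class Relation:
    left: str         # name of the left data object
    right: str        # name of the right data object
    left_field: str   # field in the left object that refers to the right one
    right_field: str  # field in the right object it refers to


def generate_report(objects: Dict[str, DataObject], relations: List[Relation],
                    data: Dict[str, List[dict]], left: str, right: str,
                    role: str) -> List[dict]:
    # Security check: the role must be allowed to read both data objects.
    for name in (left, right):
        if role not in objects[name].allowed_roles:
            raise PermissionError(f"role {role!r} may not read {name!r}")
    relation = next((r for r in relations if r.left == left and r.right == right), None)
    if relation is None:
        raise ValueError(f"no relation between {left!r} and {right!r}")
    # Join the two objects using the stored relation.
    index = {row[relation.right_field]: row for row in data[right]}
    report = []
    for row in data[left]:
        match = index.get(row[relation.left_field])
        if match is not None:
            merged = {f"{left}.{f}": row[f] for f in objects[left].fields}
            merged.update({f"{right}.{f}": match[f] for f in objects[right].fields})
            report.append(merged)
    return report


objects = {
    "invoice": DataObject("invoice", ["id", "amount"], {"accountant"}),
    "student": DataObject("student", ["name"], {"accountant", "teacher"}),
}
relations = [Relation("invoice", "student", "student_id", "id")]
data = {
    "invoice": [{"id": 10, "amount": 100, "student_id": 1}],
    "student": [{"id": 1, "name": "Ann"}],
}
print(generate_report(objects, relations, data, "invoice", "student", "accountant"))
```

Because reports are driven by metadata about objects and relations, a new report type needs new metadata rather than new code.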

TECH SOLUTIONS

We prepared 2 variants of the technological stack for the microservice reporting system. As for the SaaS product on AWS, we used:

StreamSets for CDC

Apache Kafka as a messaging system

AWS S3 Data Lake

AWS Aurora as a reporting database

AWS ElastiCache as an in-memory data store

The reporting microservice was written in Node.js. You can see rough estimates for the SaaS solution in the table below.

Note: These are calculations for production deployment. The development process required much smaller infrastructure.

This infrastructure was the most appropriate for the client’s requirements. Its main advantage was that AWS services could easily be replaced with self-hosted solutions. It allowed us to avoid code/logic duplication for different deployment schemes.

For the on-premise version we used Minio, PostgreSQL and Redis respectively. Their APIs were fully compatible. So, we didn’t have any significant problems in the microservice reporting system at all.
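The practical consequence of compatible APIs is that the two deployments can differ only in configuration. Here is a minimal sketch of that idea in Python; every hostname, URL and name in it is a placeholder invented for this example, not a value from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentConfig:
    object_store_endpoint: str  # any S3-compatible API
    database_url: str           # any PostgreSQL-compatible database
    cache_url: str              # any Redis-compatible store


# Placeholder endpoints, for illustration only.
CONFIGS = {
    "saas": DeploymentConfig(
        object_store_endpoint="https://s3.example.com",
        database_url="postgresql://reports@aurora.example.com:5432/reports",
        cache_url="redis://elasticache.example.com:6379/0",
    ),
    "on-premise": DeploymentConfig(
        object_store_endpoint="http://minio.internal.example:9000",
        database_url="postgresql://reports@postgres.internal.example:5432/reports",
        cache_url="redis://redis.internal.example:6379/0",
    ),
}


def load_config(mode: str) -> DeploymentConfig:
    try:
        return CONFIGS[mode]
    except KeyError:
        raise ValueError(f"unknown deployment mode: {mode!r}") from None


def report_object_url(config: DeploymentConfig, bucket: str, key: str) -> str:
    """Same code path for both deployments; only the endpoint differs."""
    return f"{config.object_store_endpoint}/{bucket}/{key}"


for mode in ("saas", "on-premise"):
    print(report_object_url(load_config(mode), "reports", "2019/summary.json"))
```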

RESULTS

Our team solved the client’s technical challenges. The reporting microservice module was effective and convenient. It was capable of:

Generating clear and convenient reports

Providing many standard reporting templates

Adding a large number of filters

Customizing report interface

With the microservice reporting system in place, the FreshCode client achieved these goals:

to update the app’s architecture and design

to improve the product by adding new features

to optimize performance, increase flexibility and scalability

If you are interested in solving the same problem or have any other technical challenges, contact our team. We provide free expert advice for startups, small businesses and enterprises. Check the FreshCode portfolio to find other interesting projects.

Would you like to read more case-based articles? Let me know in the comments below and stay in touch!

***

The original article was published on the FreshCode blog

Business Intelligence in microservices: improving performance
