Compete on Data Analytics using Spring Cloud Data Flow

Written by kayalvizhi | Published 2020/02/15
Tech Story Tags: data-architecture | data-analytics | data-driven | big-data | datastreaming | architecture | distributed-architecture | scdf

TLDR NewVantage Partners’ 2019 Big Data and AI Executive Survey found that businesses are failing to become data driven. The survey covered c-level technology and business executives representing large corporations such as American Express, Ford Motor, General Electric, General Motors, and Johnson & Johnson. The culprit is the Three V’s effect - Volume, Variety & Velocity: extracting, transforming and loading data in real time and delivering analytics at the same speed is a herculean task for many ETL tools.

Data Driven

Being data-driven is one of the most essential prerequisites for any organisation to achieve the desired digital transformation. In this regard, firms have started treating their data as an asset and adjusting their strategies to emphasise it.
The bar was set high a decade ago, but the transformation has been slower than expected. There is plenty of evidence that modern enterprises struggle to become data driven. One such study is NewVantage Partners’ 2019 Big Data and AI Executive Survey.
The survey covered c-level technology and business executives representing large corporations such as American Express, Ford Motor, General Electric, General Motors, and Johnson & Johnson.
Despite tremendous investments in big data and AI initiatives, negative responses constitute more than 50% in every case the survey measured.

The Three V’s

Why are enterprises failing to become data driven? It is because of the Three V’s effect - Volume, Variety & Velocity.
Volume:
It is becoming difficult to handle the ever-growing volume of data produced and consumed within and outside the organisation. Organisations find themselves repeatedly upgrading storage and processing power to keep up with the expected data volume.
Variety:
The data produced by an enormous number of sensors and applications comes in mixed types - structured, semi-structured and unstructured. Many traditional and proprietary data platforms and tools find it difficult to onboard these different types.
Velocity:
All these sensors and applications generate data at the speed of light. Extracting, transforming and loading it in real time and delivering analytics at the same speed is a herculean task for many ETL tools.

One Such SMB:

One of our customers, a leader in Through-Channel Marketing Automation in the small and medium-sized business segment, wanted to become real-time data driven to compete on data analytics. They did not want to be left behind, analysing stale data with traditional ETL tools.
Business Challenges:
The enterprise’s product had well-defined integrations built on Pentaho (Kettle) ETL features, which had supported their business well for a decade. In the new data era, however, the business began to place a different set of expectations on it.
  • Analytics on Real-Time Data: The end users wanted to make decisions based on real-time data rather than near-real-time or stale data.
  • Not Business-Developer Friendly: Business developers depended heavily on ETL developers for every customisation in the data pipeline, however simple or complex.
  • Lack of Automation: Delivering data pipelines to production took a huge number of cycles because the process involved end-to-end manual steps.
Technical Challenges:
  • Multiple SaaS Channels: Pentaho supported Google and Salesforce as SaaS sources, but the business wanted most of the social platforms as data sources. Pentaho’s latest version 8.0 supports real-time streaming data, yet it is not designed to receive streams from all the social platforms off the shelf; instead, ETL developers have to configure a number of steps here and there. Ref - stream-data-from-twitter-api-with-oauth-using-kettle
  • Support for Multiple Authentication Protocols: Each API demands a different authentication mechanism. Although most of the platforms support OAuth or OAuth 2, Basic Authentication was needed as well (see the sketch after this list).
  • Data Field Selection & Transformation: The response of any stream is shaped very differently from how it needs to be stored for further processing, and beyond storing and transforming data, little more can be expected from an ETL integration tool. The fields to keep from the response had to be easy to select, and minimal transformations - such as changing datetime and currency formats based on locale - were expected as well.
  • Business-Developer Friendly: Business developers should be able to create a pipeline for any of the social platforms without much effort; a simple drag and drop of components should be enough to build any data pipeline. Such a tailor-made data pipeline creator is not supported out of the box by any data integration tool on the market.
  • Time to Market: If a business developer changes a step in the pipeline - say, adds a filter or removes a verification step - the change should be deployed into the respective environment for testing and delivered to production in no time.
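To give a flavour of the per-platform authentication handling involved, below is a minimal Java sketch (the endpoint and credential names are hypothetical) of how a connector component might attach either Basic or OAuth 2.0 Bearer credentials to its outbound calls using Spring's WebClient:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import org.springframework.web.reactive.function.client.WebClient;

public class ApiConnectorAuth {

    // Hypothetical endpoint; the real sources are the various social platforms' APIs.
    private static final String ENDPOINT = "https://api.example-social.com/v1/posts";

    // Basic Authentication: username and password are Base64-encoded into the Authorization header.
    static WebClient basicAuthClient(String username, String password) {
        String token = Base64.getEncoder()
                .encodeToString((username + ":" + password).getBytes(StandardCharsets.UTF_8));
        return WebClient.builder()
                .baseUrl(ENDPOINT)
                .defaultHeader("Authorization", "Basic " + token)
                .build();
    }

    // OAuth / OAuth 2: an access token obtained from the platform is sent as a Bearer credential.
    static WebClient oauth2Client(String accessToken) {
        return WebClient.builder()
                .baseUrl(ENDPOINT)
                .defaultHeader("Authorization", "Bearer " + accessToken)
                .build();
    }
}
```

The real connectors also have to handle the token acquisition and refresh flows each platform prescribes; the sketch only shows how the credentials end up on the request.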
The Approach:
We conducted day-long workshops and design-thinking sessions to understand the pain points of all the stakeholders - Product Managers, Business Developers and Data Engineers.
The data points - functional and non-functional requirements - were carefully collected, and we came up with a recommended solution: creating data pipelines dynamically by leveraging Spring Cloud Data Flow. It met the expectations!
The Functional View
  • Data Sources: The incoming real-time streams from various social platforms.
  • API Connector: Connectors supporting various API styles (SOAP, REST & Graph) and authentication protocols (OAuth, OpenID, Basic).
  • Metadata Picker: Lists all the attributes from the respective stream that can be chosen for further processing.
  • Data Formatter: Customisation of the attributes - formatting and applying transformation logic - is done here (a sketch follows this list).
  • DB Tables: The sink tables (GCP Bigtable) against which the data analytics queries are fired.
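As a rough illustration of the Metadata Picker and Data Formatter components (not the product's actual code - the field names, source format and target format are assumptions), a Spring Cloud Stream style processor could keep only the attributes the business developer selected and normalise a datetime field like this:

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DataFormatterProcessor {

    // Attributes chosen in the Metadata Picker (assumed names).
    private static final List<String> SELECTED_FIELDS = List.of("id", "author", "created_at", "text");

    // Target datetime format chosen in the Data Formatter step (an assumption).
    private static final DateTimeFormatter TARGET_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    @Bean
    public Function<Map<String, Object>, Map<String, Object>> dataFormatter() {
        return record -> {
            Map<String, Object> out = new LinkedHashMap<>();
            // Metadata Picker: keep only the selected attributes.
            for (String field : SELECTED_FIELDS) {
                if (record.containsKey(field)) {
                    out.put(field, record.get(field));
                }
            }
            // Data Formatter: minimal transformation, e.g. normalising an ISO-8601 timestamp.
            Object createdAt = out.get("created_at");
            if (createdAt instanceof String) {
                out.put("created_at", OffsetDateTime.parse((String) createdAt).format(TARGET_FORMAT));
            }
            return out;
        };
    }
}
```

Packaged as a container image and registered with Spring Cloud Data Flow, such a function bean becomes a processor that can be dropped into any stream between a source and a sink.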
Systems View
Authentication & Authorisation
  • The allowed users come from GSuite (the enterprise’s Identity Provider).
  • Access to the application is controlled with the help of Cloud IAP (a rough sketch of such a check follows).
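As a minimal sketch of what the application-side check could look like (an assumption, not the enterprise's actual code), a Spring filter sitting behind Cloud IAP might reject any request arriving without the IAP-signed assertion header; a production setup must additionally verify the JWT signature and audience against Google's published IAP keys:

```java
import java.io.IOException;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class IapAssertionFilter extends OncePerRequestFilter {

    // Header that Cloud IAP adds to requests it has authenticated.
    private static final String IAP_JWT_HEADER = "x-goog-iap-jwt-assertion";

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        String assertion = request.getHeader(IAP_JWT_HEADER);
        if (assertion == null || assertion.isEmpty()) {
            // The request did not come through IAP (or IAP is misconfigured): reject it.
            response.sendError(HttpServletResponse.SC_UNAUTHORIZED, "Missing IAP assertion");
            return;
        }
        // NOTE: a real deployment must also verify the JWT's signature and audience
        // against Google's public IAP keys before trusting the request.
        filterChain.doFilter(request, response);
    }
}
```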
GCR with App or Task Images
  • A container repository of prebuilt application/task images of the source, processor and sink types. These images serve as templates for creating the components of any stream.
  • These images accept configurable properties such as endpoint URI, accessToken, Consumer Key and Consumer Secret, and pass them to the underlying applications/tasks so that they can consume, process or receive the data according to their type (a sketch follows this list).
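For instance, each prebuilt source image might expose those configurable properties through a Spring @ConfigurationProperties class along these lines (the prefix and property names are illustrative):

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Bound from the properties supplied when the stream is deployed, e.g.
//   social-source --social.source.endpoint-uri=... --social.source.access-token=...
// (also requires @EnableConfigurationProperties or @ConfigurationPropertiesScan in the app)
@ConfigurationProperties(prefix = "social.source")
public class SocialSourceProperties {

    /** Endpoint URI of the social platform's streaming API. */
    private String endpointUri;

    /** OAuth access token, where the platform uses token-based authentication. */
    private String accessToken;

    private String consumerKey;
    private String consumerSecret;

    public String getEndpointUri() { return endpointUri; }
    public void setEndpointUri(String endpointUri) { this.endpointUri = endpointUri; }

    public String getAccessToken() { return accessToken; }
    public void setAccessToken(String accessToken) { this.accessToken = accessToken; }

    public String getConsumerKey() { return consumerKey; }
    public void setConsumerKey(String consumerKey) { this.consumerKey = consumerKey; }

    public String getConsumerSecret() { return consumerSecret; }
    public void setConsumerSecret(String consumerSecret) { this.consumerSecret = consumerSecret; }
}
```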
Data Pipeline Creator UI
  • To configure the data sources with their respective authentication settings
  • To render the metadata for selection/filtering
  • To specify the data formatting per data source
Data Pipeline Creator Service
  • A containerised microservice exposing the REST APIs used by the Data Pipeline Creator UI
  • All the configurations specific to a data pipeline - from the source to the destination - are preserved in a metadata database to which this service has access.
  • This service abstracts the Spring Cloud Data Flow services behind these needs, as shown below.
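A minimal sketch of that abstraction, assuming the service calls the SCDF server's REST endpoint for stream definitions (the server URL and the app names social-source, data-formatter and bigtable-sink are placeholders for the images registered with SCDF):

```java
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestTemplate;

public class PipelineCreatorService {

    // Assumed in-cluster address of the Spring Cloud Data Flow server.
    private static final String SCDF_URL = "http://scdf-server:9393";

    private final RestTemplate rest = new RestTemplate();

    /**
     * Creates and deploys one stream per configured data source by posting a
     * stream definition to SCDF's /streams/definitions endpoint.
     */
    public void createPipeline(String pipelineName, String endpointUri, String accessToken) {
        // Stream DSL: source | processor | sink, with the source's configurable properties inlined.
        String definition = String.format(
                "social-source --endpoint-uri=%s --access-token=%s | data-formatter | bigtable-sink",
                endpointUri, accessToken);

        MultiValueMap<String, String> form = new LinkedMultiValueMap<>();
        form.add("name", pipelineName);
        form.add("definition", definition);
        form.add("deploy", "true");   // deploy immediately via the Skipper server

        rest.postForEntity(SCDF_URL + "/streams/definitions", form, Void.class);
    }
}
```

Spring Cloud Data Flow also ships a Java client (DataFlowTemplate) that wraps this same REST API, which is another way the service could drive the server.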
Spring Cloud Data Flow Server
  • The server responsible for creating a data pipeline per data source configuration and deploying it to the Skipper Server as a data stream
  • It is configured with Kafka streams to support real-time data processing.
Mission Accomplished!
The recommended solution has been implemented: the enterprise has achieved end-to-end automation of continuously deploying real-time data services and pipelines into production, driven by business developers with little help from ETL developers. Mission accomplished!

Conclusion

There are many reasons why an enterprise may fail to achieve the goal of becoming data-driven. Irrespective of the number of excuses and failures, the amount of data continues to rise exponentially. According to the independent research firm IDC, connected IoT devices are expected to generate 79.4 zettabytes of data in 2025.
That is just five years away, and it is really HUGE! So, it is time to be aware, nurture your data and become a data-driven enterprise.

Written by kayalvizhi | Principal Software Architect, Microservices & Cloud Computing enthusiast, Hands-on Java Developer
Published by HackerNoon on 2020/02/15