Technically speaking, microservices are REST APIs

Microservices approach for data science projects

Dr. Oxana Lapteva January 21, 2020

Microservices as a modern software architecture polarize the IT world. After great enthusiasm and the corresponding hype, many companies were brought back down to earth and confronted with technical difficulties and problems. On the one hand, the modular structure of the microservices approach offers better long-term scalability of applications. On the other hand, it often confronts companies with a high entry complexity. Other characteristics, such as maintainability, were praised in comparison to monolithic systems; in reality, however, many companies struggle to actually ensure maintainability as their service landscape grows. When looking for the right approach, different factors play a role: from the technical requirements to human and organizational aspects such as team structures, management and internal communication.

This article discusses the decision-making processes involved in adopting a microservices architecture, from the perspective of data scientists. When implementing a microservices architecture for our data science projects, it was important to our team to emphasize the advantages of the approach and to put the disadvantages into perspective.

Requirements for the data science projects

The data science work in our team focuses on the development of pipelines that process and analyze data from different sources and in different formats. Among other things, the tasks include automatically extracting relevant information, identifying hidden connections and dependencies, and recognizing anomalies. The number of possible applications is enormous.

How does data science fit together with the microservices approach?

In addition to general advantages such as scalability of development, performance and resilience, microservices are ideal for creating flexible analysis pipelines and testing different algorithms against each other. A development team can also use ready-made services like building blocks. For companies, this leads to resource consumption that closely follows actual needs. A final but very central point is extensibility: newly developed analysis methods can be adopted as services in a deployment quickly and without major adjustments.

Before the decision was made for the microservices architecture, we identified the following requirements:

  • Possibility to quickly build a pipeline from the existing services and enable the exploratory approach to data analysis,
  • Possibility to set up the development environment quickly and thus concentrate on the development of services,
  • Ability to quickly complete prototypes and production solutions,
  • Possibility of using the components that have already been developed individually or in combination in other projects / use cases without additional effort,
  • Creation of pipelines without additional adjustments and
  • Language independence when developing the services (in Python, Java or other programming languages).

How can microservices meet these requirements?

In order for the advantages of the microservices approach to take effect and not be outweighed by the disadvantages, solid protocols for communication between services and clearly defined data standards are necessary. These should offer the most generic points of contact possible between services; this ensures that an existing platform can be efficiently developed further and maintained.

If the focus is on developing a data processing pipeline that is assembled from several services, communication within the pipeline must run smoothly. In addition, it must be taken into account that services can be addressed by different pipelines. And of course that individual services can be written in different programming languages. It is not uncommon for different data scientists or even teams to work on developing the services.

Let's consider the example of the tonality analysis of Twitter data in terms of the positive, neutral or negative tenor of the topics. One team works on a topic recognition service and the other team works on a neural network to recognize tonality. It can easily happen that one team opts for the Java programming language and the other for Python.

That is why the first task for us in the implementation was the definition of a communication protocol:

  • There should be a communication protocol that allows a very flexible routing between services and is supported by every programming language.

During development, everyone has their own style and logic of how a program should look. Experience has shown that when building a data processing pipeline, a lot of time is spent on bringing the services together. Accordingly, capacity is wasted at the interfaces of the services. For this reason, our team decided to develop a basic framework for developing the services:

  • There should be a service base class with various interfaces, on the basis of which you can very easily implement your own services.

Another important aspect is transparency about what is happening in the pipelines and in the system. Our team of data scientists, software architects and system administrators has jointly defined how the information on the services and pipelines should be stored. It was important to us to allow a flexible design of this information, and to have a basic framework available that enables logging and monitoring.

  • Message transport, interfaces for logging and configuration, and message tracing should be included in the base class. With the message protocol, all information about the pipeline and the data payload should be contained in the message itself.

In the microservice architecture, the services should communicate with each other and exchange information. To make this possible, we opted for a lean and flexible API server, written in Node.js. It should allow us to describe and provide RESTful APIs using configuration files. In addition, it should be easily expandable via plugins.

  • It was important for us that the definition of the endpoints takes place via the configuration file of the server and not by code.

In order to make this available automatically, an API gateway should be added, which provides the information via a REST API.
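
The endpoint definition in such a configuration file could look roughly like the following sketch; the structure, paths and plugin names are hypothetical and not the actual format of the server we used:

```json
{
  "endpoints": [
    { "path": "/services",  "method": "GET",  "plugin": "service-registry" },
    { "path": "/pipelines", "method": "POST", "plugin": "pipeline-manager" },
    { "path": "/pipelines/:name/messages", "method": "POST", "plugin": "message-dispatcher" }
  ]
}
```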

Service structure

What can a uniform service structure look like that at the same time offers enough flexibility in its design?

For this purpose, a service base class with several interfaces was developed to cover the following technical aspects:

  • Communication protocol for flexible routing between services.
  • Message transport and message tracing.
  • Interface for configuration.

This means that every developer can develop and provide their own services on this basis. At the same time, flexibility is guaranteed in the internal design of the respective service as well as in the design of messages, logs and other configurations. The original code of the service base class can be found in [1].
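
As a rough illustration only, a minimal Python sketch of such a base class might look like the following; all identifiers (BaseService, listen, forward, on_message, MyService) are assumptions made for this sketch, not the actual names used in [1]:

```python
# Hypothetical sketch of a service base class; see [1] for the original implementation.
class BaseService:
    def __init__(self, name, ip, port, **params):
        # Every service needs a name, an IP address and a port;
        # any additional parameters are kept in a dictionary.
        self.name = name
        self.ip = ip
        self.port = port
        self.params = dict(params)

    def listen(self):
        """Put the service into operation: wait for messages at (ip, port) and
        hand the payload of each incoming message to on_message()."""
        raise NotImplementedError  # transport/networking omitted in this sketch

    def forward(self, message):
        """Forward the (possibly enriched) message to the next service in the pipeline."""
        raise NotImplementedError  # transport/networking omitted in this sketch

    def on_message(self, payload, message_id):
        """Process one message: payload is the data travelling through the pipeline,
        message_id is the unique identifier used for logging and tracing."""
        raise NotImplementedError


class MyService(BaseService):
    def on_message(self, payload, message_id):
        # The service-specific logic goes here; the returned payload is
        # passed on to the next service in the pipeline.
        return payload
```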

One parameter carries the message payload that is passed along the pipeline; another is the unique identifier of the specific message. This ID is used for logging purposes (topic logging). Every service needs three parameters: name, ip and port. Together with any additional parameters, they are available via a dictionary.

These elements are all that is necessary to write a service. Inside the class MyService, the logic of the implementation can be designed according to the user's requirements.

To put the service into operation, a corresponding start call is made. The service then listens at the defined address and, once a message has been received, the message payload is handed to the processing function.

If further extension options are required for the message transport, a separate function can be used that receives the entire message object and not just the payload. This is helpful if, for example, additional pipeline information needs to be tracked. Another function forwards the payload to the next service in the pipeline. Finally, a processing function handles the incoming message; here the user can decide flexibly how the message should be processed.

Data transport

Once the services have been assembled into a pipeline and started, various pieces of information are passed through it. Let's take the following example: we want to process and analyze Twitter data. The analysis should identify the tonality (positive, negative, neutral) and the topic of each individual tweet. To make this possible, we can build the following services into a pipeline:

  • Pre-processing service,
  • Sentiment service and
  • Topic modeling service.

With every further step in the pipeline, information is enriched and passed on. At the end, we have a document that contains the results of the analysis: the original tweet, the tweet after preprocessing (all words in lower case, numbers removed), the tweet's tonality with the associated probability, and the topics of the tweet.
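
A purely illustrative example of such a result document; the field names and values are invented for this sketch:

```json
{
  "tweet": "Loving the new release, 10/10!",
  "tweet_preprocessed": "loving the new release",
  "sentiment": "positive",
  "sentiment_probability": 0.93,
  "topics": ["software release"]
}
```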

In addition, we can pass further parameters to a pipeline, e.g. the language "English". This means that our services use language-specific elements in the analysis: the sentiment service uses the English training data set to detect the tonality, and the same applies to the topic modeling service.

If you want to carry out language-specific analyses without limiting yourself to one language, a language identification service can also be connected upstream of the pre-processing service, so that the necessary information about the language of each tweet is passed on to the downstream services.

Deployment

How can we create a pipeline from the existing services in a microservices architecture?

With the message protocol, all information about the pipeline and data payload is contained in the message itself. As soon as the pipeline is created and the message is sent to the first service in the pipeline, it is activated and data processing starts.
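
As an illustration, such a self-describing message could carry the routing information and the payload side by side; the structure and field names below are hypothetical:

```json
{
  "id": "2f6c1b1e-7d4a-4c1a-9a3e-8d2f0b5e6c7a",
  "pipeline": ["preprocessing", "sentiment", "topic-modeling"],
  "params": { "language": "English" },
  "payload": { "tweet": "Loving the new release, 10/10!" }
}
```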

To simplify this process, we have added an API gateway that provides these and other functions using a REST API. The decisive factor for the API server [2] is that it can be expanded using plugins. Such a plugin allows us to start and stop services, define pipelines and send data into these pipelines. We receive information about the status of our system and can query message traces. In addition, it gives us the option of persisting and restoring our infrastructure in the form of a JSON document.

The example below illustrates how such a REST API can be used to interact with the services.
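
This is only a rough sketch using Python and the requests library; all endpoint paths and payloads are hypothetical and merely indicate the kind of operations the gateway plugin provides (defining pipelines, sending data into them, querying status and message traces):

```python
import requests

GATEWAY = "http://localhost:8080"  # hypothetical address of the API gateway

# Define a pipeline from existing services (endpoint and body are illustrative only).
pipeline = {
    "name": "twitter-analysis",
    "services": ["preprocessing", "sentiment", "topic-modeling"],
}
requests.post(f"{GATEWAY}/pipelines", json=pipeline)

# Send a message into the pipeline; the gateway forwards it to the first service.
message = {
    "payload": {"tweet": "Loving the new release, 10/10!"},
    "params": {"language": "English"},
}
requests.post(f"{GATEWAY}/pipelines/twitter-analysis/messages", json=message)

# Query the status of the system and the trace of a specific message.
print(requests.get(f"{GATEWAY}/status").json())
print(requests.get(f"{GATEWAY}/traces", params={"id": "<message-id>"}).json())
```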

Maintain system overview: logging & monitoring

With the growing number of services and running pipelines, the topic of logging and monitoring should not be ignored. It is extremely important to address these issues and define appropriate solutions before the number of services and pipelines grows.

We distinguish between two log types: service logs and system logs.

Service logs contain all information about service activity, e.g. when a message has arrived at the service. These logs have the following parameters:

  • Service - name of the service.
  • Message - messages sent by the service.
  • Level - log level.

System logs do not have a fixed structure and are dependent on the infrastructure used. An example of a system log would be an exception when the service crashed.

For the logging functionality, we have implemented a basic logger [3].

The basic logger offers different log levels: debug, info, warn, error and critical [3].
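
A minimal sketch of such a basic logger, assuming the log parameters listed above; the class and method names are hypothetical, and the actual implementation is in [3]:

```python
import datetime
import json

LEVELS = ["debug", "info", "warn", "error", "critical"]


class BasicLogger:
    """Hypothetical sketch: writes service logs with the three parameters
    described above (service name, message, log level)."""

    def __init__(self, service_name, log_file="service.log"):
        self.service_name = service_name
        self.log_file = log_file

    def log(self, level, message):
        if level not in LEVELS:
            raise ValueError(f"unknown log level: {level}")
        entry = {
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "service": self.service_name,
            "level": level,
            "message": message,
        }
        with open(self.log_file, "a") as fh:
            fh.write(json.dumps(entry) + "\n")

    # Convenience methods, one per log level.
    def debug(self, message): self.log("debug", message)
    def info(self, message): self.log("info", message)
    def warn(self, message): self.log("warn", message)
    def error(self, message): self.log("error", message)
    def critical(self, message): self.log("critical", message)
```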

The log results themselves contain the service name, the log level and the message [3].
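
A hypothetical example of what such a log entry could look like; the exact format in [3] may differ:

```json
{"timestamp": "2020-01-21T10:15:02", "service": "sentiment", "level": "info", "message": "message received"}
```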

The logs can also be monitored via the API gateway; this is done using a dedicated endpoint.

For peripheral services such as logging and monitoring, we opted for an injection mechanism that allows other implementations to be plugged in by means of configuration. Our consideration was that there will be aspects we cannot cover with the existing infrastructure, or that in one project or another we will have to work with an infrastructure that is not directly supported by us.

When it comes to monitoring, we remained true to our approach and first implemented a simple monitoring solution in the base class. This solution saves the message traces in a file.

The monitoring messages from the entire pipeline are saved in monitoring.log [3].

The basic monitoring covers different situations: when a message was created, received, sent automatically, or when it reached its final destination.

An example of such a monitoring message can be found in [3].
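
A purely hypothetical illustration of what such a trace entry could record; the actual format in [3] may differ:

```json
{"timestamp": "2020-01-21T10:15:02", "message_id": "2f6c1b1e", "service": "sentiment", "event": "received"}
```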

At this point, it was also important to us to have the option of creating custom metrics. For this purpose, the basic monitoring class contains a dedicated function that expects the service name, the metric name and a metric dictionary with the corresponding keywords as arguments [3].

As a counterpart to the basic monitoring class, there is a basic reporting class. This class contains functions to retrieve monitoring information for specific events, e.g. the complete monitoring history or only the last known status.
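
The following is a self-contained sketch of how these two classes could interact, assuming the argument list described above; the class and method names are inventions for this illustration and not the actual API from [3]:

```python
import datetime
import json


class BasicMonitoring:
    """Hypothetical sketch: appends monitoring entries, including custom metrics, to a file."""

    def __init__(self, log_file="monitoring.log"):
        self.log_file = log_file

    def custom_metric(self, service_name, metric_name, metric):
        # Expects the service name, the metric name and a dictionary with the metric's keywords.
        entry = {
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "service": service_name,
            "metric": metric_name,
            "values": metric,
        }
        with open(self.log_file, "a") as fh:
            fh.write(json.dumps(entry) + "\n")


class BasicReporting:
    """Hypothetical sketch: reads monitoring entries back, e.g. the full history or the last known status."""

    def __init__(self, log_file="monitoring.log"):
        self.log_file = log_file

    def history(self, service_name):
        with open(self.log_file) as fh:
            entries = [json.loads(line) for line in fh]
        return [e for e in entries if e.get("service") == service_name]

    def last_status(self, service_name):
        entries = self.history(service_name)
        return entries[-1] if entries else None


# Usage: record a custom metric and query it back.
BasicMonitoring().custom_metric("sentiment", "tweets_processed", {"count": 128})
print(BasicReporting().last_status("sentiment"))
```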

Summary

When designing the microservice architecture for data science projects, we set ourselves the goal of enabling uniformity and flexibility in the design. This idea was implemented with the help of the basic class with expansion options in various places:

  • When developing individual services,
  • when logging and
  • in monitoring.

Furthermore, we decided to enable the assembly and connection of individual components via plugins.

For us, independence in various forms is also in the foreground:

  • when choosing the programming language,
  • in the choice of tools and infrastructure and
  • in the choice of data storage (database, index, files, etc.).

But was the decision in favor of the microservice approach the right one? From the point of view of the requirements that we set ourselves at the beginning, there is a lot to be said for it. We can develop generic services that can be switched into a wide variety of pipelines. We can quickly integrate newly developed services. The developed infrastructure enables rapid and efficient development of prototypes and solutions for exploratory data analysis. In addition, the developed prototypes can be transferred to a production deployment without great effort.

The flip side of the coin is the development effort required to meet these requirements. It must be clear that this effort is a major cost factor that only pays off at a later point in time. In the end, what counts above all is whether you take your own development path or can rely on existing solutions:

"The golden rule: can you make a change to a service and deploy it by itself without changing anything else?"[4]

From this moment on, the efficiency and speed of development in the data analysis come into play.

  1. Github: Writing a service
  2. Github: API server
  3. Github: Logging & Monitoring
  4. Sam Newman. 2015. Building Microservices (1st ed.). O'Reilly Media, Inc.
