Bachelor's Thesis / Internship

Data Lake Streaming Platforms

Research Area

Advisers

Dipl.-Inf. André Langer

alumni

Description

The traditional approach of reading an entire data set into local memory or onto a hard disk before processing it no longer works in Big Data scenarios, where several million entities have to be processed in a short amount of time. As an example, we assume XML data containing more than a million RDF triples that are to be evaluated against data quality measures such as property completeness.
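To make the contrast with whole-file loading concrete, the following is a minimal sketch of a streaming evaluation using only the Python standard library. It assumes a flat RDF/XML layout in which every entity is an rdf:Description element and every property is a child element; the function name, the sample data, and the definition of property completeness as "share of entities that carry a given property" are illustrative assumptions, not part of the topic description.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = f"{{{RDF_NS}}}Description"


def property_completeness(source, required):
    """Stream an RDF/XML document and return, per required property,
    the fraction of rdf:Description elements that carry it.

    Assumption: entities are flat rdf:Description elements and
    properties are encoded as child elements (not as XML attributes).
    """
    total = 0
    present = Counter()
    # iterparse hands over each element once its closing tag has been
    # read, so the parser never needs the whole document at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != DESCRIPTION:
            continue
        total += 1
        found = {child.tag for child in elem}
        for prop in required:
            if prop in found:
                present[prop] += 1
        # Drop the children of the processed entity. The emptied element
        # stays attached to the root; a long-running version would
        # detach it as well.
        elem.clear()
    return {p: (present[p] / total if total else 0.0) for p in required}


# Hypothetical sample data with two entities.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/a">
    <dc:title>A</dc:title>
    <dc:creator>Alice</dc:creator>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/b">
    <dc:title>B</dc:title>
  </rdf:Description>
</rdf:RDF>
"""

DC = "http://purl.org/dc/elements/1.1/"
result = property_completeness(
    io.BytesIO(SAMPLE), [f"{{{DC}}}title", f"{{{DC}}}creator"]
)
print(result)
```

In this sample both entities have a title but only one has a creator, so the completeness values come out as 1.0 and 0.5. The same function accepts a file path or any binary file-like object, which is what allows it to sit behind a network stream instead of a local file.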

Data streams therefore play a very important role. They can be accessed from several programming languages through dedicated libraries and data service providers. These services can build on existing solutions such as Confluent (based on Apache Kafka) or Apache Hadoop.
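How a client library exposes such a stream can be sketched as a simple poll loop. The example below is an assumption-laden illustration rather than a platform recommendation: it expects a consumer object with a poll(timeout) method that returns either None or a message offering error() and value(), which is the shape of the Consumer in the confluent-kafka Python client. FakeConsumer, FakeMessage, consume, and max_idle_polls are hypothetical names introduced here so that the sketch runs without a broker.

```python
class FakeMessage:
    """Hypothetical stand-in for a message returned by a stream client."""

    def __init__(self, payload, err=None):
        self._payload = payload
        self._err = err

    def error(self):
        return self._err

    def value(self):
        return self._payload


class FakeConsumer:
    """Hypothetical in-memory consumer; None models a poll timeout."""

    def __init__(self, messages):
        self._messages = list(messages)

    def poll(self, timeout):
        return self._messages.pop(0) if self._messages else None


def consume(consumer, handle, max_idle_polls=3):
    """Pass each payload to `handle` until the stream has been idle for
    `max_idle_polls` consecutive polls; return the number handled."""
    handled = 0
    idle = 0
    while idle < max_idle_polls:
        msg = consumer.poll(1.0)
        if msg is None:        # nothing arrived within the timeout
            idle += 1
            continue
        idle = 0
        if msg.error():        # skip messages that carry an error
            continue
        handle(msg.value())    # process one record at a time
        handled += 1
    return handled


seen = []
count = consume(
    FakeConsumer([
        FakeMessage(b"<s> <p> <o> ."),
        None,
        FakeMessage(b"", err="broker error"),
        FakeMessage(b"<s> <p2> <o2> ."),
    ]),
    seen.append,
)
print(count, seen)
```

With the real confluent-kafka client, one would presumably create a Consumer from a configuration containing at least `bootstrap.servers` and `group.id`, call `subscribe` with the topic list, and pass that object in place of FakeConsumer. Keeping the loop independent of the concrete client also makes it easier to compare platforms later, since only the consumer object changes.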

The objective is first to show how these data streams can be accessed by implementing a simple demonstrator that uses all relevant components, and then to evaluate different platforms: how they work, what their advantages and disadvantages are, and how they can be combined with other data applications implemented, for example, in Node.js.

Other services, such as configuration management via Apache ZooKeeper or, alternatively, Consul or the Spring Cloud Config Server, may also be relevant in this context when it comes to the distributed processing of data streams.