Distributed and Self-organizing Systems

Bachelor's Thesis / Internship

Data Lake Streaming Platforms

Research Area

Intelligent Information Management

Advisers

andrelanger

Description

The traditional approach of reading an entire data set into local memory or onto a hard disk before processing it no longer works in Big Data scenarios, where several million entities have to be processed in a short amount of time. As an example, we assume XML data containing more than a million RDF triples that are to be evaluated against data quality metrics such as property completeness.
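To illustrate the difference from loading everything up front, the following is a minimal Python sketch of a streaming property-completeness check. It rests on several assumptions that are not part of the topic description: the data is RDF/XML in which every resource is an `rdf:Description` element with its properties as child elements, completeness is defined as the share of resources carrying a given property, and the function name `property_completeness` as well as the sample data are purely illustrative.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = f"{{{RDF_NS}}}Description"


def property_completeness(source, required):
    """Stream RDF/XML from a file-like object and return, for each required
    property (given as a '{namespace}local' tag), the share of resources
    that carry it. Assumes one rdf:Description element per resource."""
    total = 0
    present = Counter()
    # iterparse yields each element as soon as its closing tag has been read,
    # so the document never has to be parsed into a full tree first.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != DESCRIPTION:
            continue
        total += 1
        tags = {child.tag for child in elem}
        for prop in required:
            if prop in tags:
                present[prop] += 1
        # Drop the children and text of the finished resource. The emptied
        # element itself stays attached to the root, which is cheap.
        elem.clear()
    return {p: (present[p] / total if total else 0.0) for p in required}


# Illustrative sample data (an assumption, not taken from the description).
DC = "http://purl.org/dc/elements/1.1/"
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/a">
    <dc:title>A</dc:title>
    <dc:creator>Alice</dc:creator>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/b">
    <dc:title>B</dc:title>
  </rdf:Description>
</rdf:RDF>
"""

if __name__ == "__main__":
    wanted = [f"{{{DC}}}title", f"{{{DC}}}creator"]
    print(property_completeness(io.BytesIO(SAMPLE), wanted))
```

Because the function only needs a file-like object, the same code could in principle read from a large file or a network response; typed node elements and property attributes, which RDF/XML also allows, are not covered by this sketch.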

Data streams therefore play an important role. They can be accessed from several programming languages through dedicated libraries and data service providers. These services can build on existing solutions such as Confluent (based on Apache Kafka) or Apache Hadoop.

The objective is, first, to show how such data streams can be handled by implementing a simple demonstrator that uses all relevant components and, second, to evaluate different platforms: how they work, what their advantages and disadvantages are, and how they can be combined with other data applications implemented, for example, in Node.js.
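As a rough idea of the roles such a demonstrator has to cover, the following Python sketch wires a producer and a consumer together in one process. It is only a stand-in under stated assumptions: a bounded `queue.Queue` takes the place of a real broker such as Apache Kafka, records are plain dictionaries, and the names `produce`, `consume`, `run_demo`, and `SENTINEL` are hypothetical. An actual demonstrator would replace the queue with the client library of the platform under evaluation.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not a broker feature


def produce(records, topic):
    """Push records into the topic one at a time."""
    for record in records:
        # put() blocks while the bounded queue is full, which gives a
        # simple form of backpressure between producer and consumer.
        topic.put(record)
    topic.put(SENTINEL)


def consume(topic, required):
    """Read records until the sentinel arrives and count how many of them
    have a non-empty value for every required property."""
    total = 0
    complete = 0
    while True:
        record = topic.get()
        if record is SENTINEL:
            break
        total += 1
        if all(record.get(key) not in (None, "") for key in required):
            complete += 1
    return total, complete


def run_demo(records, required, capacity=2):
    """Run producer and consumer concurrently over a bounded in-memory queue."""
    topic = queue.Queue(maxsize=capacity)
    producer = threading.Thread(target=produce, args=(records, topic))
    producer.start()
    result = consume(topic, required)
    producer.join()
    return result
```

The consumer never sees more than `capacity` records at once, which mirrors the point of stream processing: the full data set does not have to fit into memory on the consuming side.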

Other services, such as configuration management via Apache ZooKeeper or, alternatively, Consul or the Spring Cloud Config Server, may also be important for the distributed processing of data streams.

 

