HTTP Message Classification for Content Trust Awareness in a Redecentralized Web
The next generation of the World Wide Web, the so-called decentralized web, is gaining popularity, and many researchers and institutions are working on its development. This creates a new domain of challenges for web architecture, especially in data acquisition and processing. Decentralization leads to an arbitrary number of data providers, not all of which can be known to any one web application. The provider of new information may therefore be unknown, and the transmitted data may originate from yet another source, which may itself be unknown or even hidden. Handling such data requires a trust evaluation so that trust-aware decisions can be made about it.
In the decentralized web, no one can fully infer trust on behalf of a web application. Every web application must therefore be autonomous in its trust-aware decision making and hence in the trust evaluation that precedes it. Such an evaluation should not be based only on the source of the data or the data provider; it should also be content- and context-related. The data source alone does not provide enough information for a trust-aware decision, whereas additional factors such as the content and context of an (HTTP) message can provide much more input. The evaluation, and the content classification that is part of it, cannot be done only once. It should be performed for every message to cope with the highly dynamic nature of the web, because the data of a resource can be modified over time. Moreover, another application’s behaviour or content can change without notice, since no central authority monitors it.
In the decentralized web, every HTTP application needs an autonomous way of taking message content into account in its own trust evaluation. However, there is currently no approach for obtaining this knowledge automatically, without human involvement, in the context of trust evaluation. The first information about an HTTP message is found in the HTTP header, whose Content-Type field declares the data type of the message body. But even if the system knows the data type, it knows nothing about the content. To understand the content, a system needs to classify the message with data processing algorithms; one option is information extraction using Natural Language Processing (NLP) algorithms. Such algorithms could provide information about the text such as its language, topic, and general intention. A further challenge is that these algorithms are not applicable to all data types. For example, most of them do not work with structured data such as JSON or XML. To handle such data, a system first needs to parse the document and only then process its content.
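The parse-then-process step described above can be illustrated with a minimal sketch. It is not the approach developed in this thesis; it only assumes that the data type is read from the Content-Type header and that the natural-language fragments of a JSON or XML body are pulled out before any NLP algorithm is applied. The function names media_type and extract_text, and the choice to collect only string values, are hypothetical and chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET


def media_type(content_type_header):
    """Return the bare media type, e.g. 'application/json', without parameters."""
    return content_type_header.split(";", 1)[0].strip().lower()


def _json_strings(value):
    """Recursively yield every string value found in parsed JSON data."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _json_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from _json_strings(item)


def extract_text(content_type_header, body):
    """Return the text fragments of a message body as a list of strings.

    Returns None for media types this sketch does not handle (assumption:
    such messages would need a different kind of classifier).
    """
    kind = media_type(content_type_header)
    if kind == "application/json" or kind.endswith("+json"):
        # Structured data: parse first, then keep only the string values.
        return list(_json_strings(json.loads(body)))
    if kind in ("application/xml", "text/xml") or kind.endswith("+xml"):
        # Structured data: parse first, then keep the non-empty text nodes.
        root = ET.fromstring(body)
        return [text.strip() for text in root.itertext() if text.strip()]
    if kind.startswith("text/"):
        # Unstructured text can be handed to an NLP step as it is.
        return [body]
    return None
```

For example, extract_text("application/json", '{"title": "Hello", "n": 3}') yields the single fragment "Hello", which could then be passed to a language or topic classifier. Since the sender of a message may be untrusted, a real system would probably also need a hardened XML parser, because Python's xml.etree.ElementTree is documented as not secure against maliciously constructed data.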
The objective of this thesis is to find an approach, or a combination of approaches, to the problems and tasks described above in the context of content trust in a redecentralized web. This includes in particular a review of the state of the art in content classification of web data and of how NLP can support it. The thesis also comprises a demonstration of feasibility through a prototype implementation of the concept, as well as a suitable evaluation with exemplary use cases.