Inferring schemata from semi-structured data with Formal Concept Analysis

Jan Luca Liechti. Inferring schemata from semi-structured data with Formal Concept Analysis. Bachelor’s thesis, University of Bern, May 2017.


Semi-structured data do not conform to the schematic rigor of relational databases, but still present their content in a structured way. They are called self-describing because every record carries its own schema, for example as XML elements or JSON keys. We are interested in inferring a relational schema from such semi-structured data that both preserves semantic information, i.e. keeps similar records together, and requires as little extra space as possible. We operate under the assumption that we do not know the records’ true types. We employ well-established notions of Formal Concept Analysis and use them in ways similar to basic operations on graphs and automata. Specifically, we create a formal context from the data in which records are objects and tags are attributes, and compute its concept lattice. Based on the assumption that semantically similar records are also structurally similar, i.e. have a similar set of attributes, we designed and implemented an algorithm that iteratively updates the lattice in order to obtain a partition of the data fulfilling the above-mentioned criteria. In tests with real-life data, we obtain good results for datasets that are already highly structured, i.e. that contain few outliers and many structurally equivalent records, and mixed results at best for very diverse datasets.
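
To make the first step concrete, the following is a minimal Python sketch of how a formal context could be built from records and how its formal concepts (the nodes of the concept lattice) could be enumerated. It is not the thesis's implementation: the function names, the sample records, and the naive intersection-closure enumeration are all assumptions chosen for illustration, and the iterative lattice updates described above are not shown.

```python
def build_context(records):
    """Formal context: each record (object) maps to the set of its tags (attributes)."""
    return {rid: frozenset(rec.keys()) for rid, rec in records.items()}


def extent(context, attrs):
    """All records whose tag set contains every attribute in attrs."""
    return frozenset(rid for rid, tags in context.items() if attrs <= tags)


def concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Naive approach (an illustrative choice): the concept intents are exactly
    the intersections of record tag sets, plus the full attribute set.
    """
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for tags in context.values():
        # Close the current set of intents under intersection with this record.
        intents |= {tags & known for known in intents}
    return {(extent(context, i), i) for i in intents}


# Hypothetical sample data: three person-like records and one publication-like record.
records = {
    "r1": {"name": "Ada", "email": "ada@example.org"},
    "r2": {"name": "Bob", "email": "bob@example.org"},
    "r3": {"name": "Cy", "email": "cy@example.org", "phone": "555-0100"},
    "r4": {"title": "FCA", "year": 2017},
}

lattice = concepts(build_context(records))
for ext, inte in sorted(lattice, key=lambda c: len(c[0])):
    print(sorted(ext), sorted(inte))
```

In this toy context the concept with intent {name, email} groups r1, r2 and r3, which reflects the assumption that structurally similar records belong together; how such concepts would then be merged into a partition is left open here.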

Posted by scg on 29 May 2017, 12:15 pm