Efficient, Scalable, and Provenance-Aware Management of Linked Data
The proliferation of heterogeneous Linked Data on the Web requires data management systems to constantly improve their scalability and efficiency. Despite recent advances in distributed Linked Data management, efficiently processing large amounts of Linked Data in a scalable way is still very challenging. In spite of their seemingly simple data models, Linked Data actually encode rich and complex graphs mixing both instance and schema level data. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. The heterogeneity of Linked Data on the Web also poses new challenges to database systems. The capacity to store, track, and query provenance data is becoming a pivotal feature of Linked Data Management Systems. In this thesis, we tackle issues revolving around processing queries on big, unstructured, and heterogeneous Linked Data graphs.
In the first part of this work, we introduce a new hybrid storage model for Linked Data based on recurring graph patterns and graph partitioning to enable complex cloud computation on massive webs of data. We describe the architecture of our approach, its main data structures, and the new algorithms we use to partition and allocate data. Contrary to previous approaches, our techniques perform an analysis of both instance and schema information prior to partitioning the data. We extract graph patterns from the data and combine them with workload information in order to find effective ways of co-locating, partitioning and allocating data on clusters of commodity machines. Our approaches enable efficient and scalable distributed Linked Data management in the cloud, and support both transactional and analytic queries efficiently. We perform an extensive evaluation of our techniques showing that they are between 140 and 485 times faster than state-of-the-art approaches on standard workloads.
In the second part of this work, we extend our techniques to efficiently handle the problem of storing, tracking, and querying provenance in Linked Data. We implement two different storage models to physically co-locate lineage and instance data, and for each of them we implement various algorithms to efficiently track lineage of queries at two granularity levels. Additionally, we tackle the problem of efficiently executing queries tailored with provenance data. We introduce five different query execution strategies for queries that incorporate knowledge of provenance. We also present the results of a comprehensive empirical evaluation of our provenance-aware methods over two different datasets and workloads.
The techniques we develop over the course of this thesis are instrumental in deploying Linked Data Management Systems at large on clusters of commodity machines.