FedX: Optimization Techniques for Federated Query Processing on Linked Data
FedX: Optimization Techniques for Federated Query Processing on Linked Data | |
---|---|
Bibliographical Metadata | |
Subject: | Querying Distributed RDF Data Sources |
Keywords: | Not available |
Year: | 2011 |
Authors: | Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt |
Venue: | ISWC |
Content Metadata | |
Problem: | SPARQL Query Federation |
Approach: | Join processing and optimization approaches |
Implementation: | FedX |
Evaluation: | The practicability and efficiency of FedX |
Abstract
Motivated by the ongoing success of Linked Data and the growing amount of semantic data sources available on the Web, new challenges to query processing are emerging. Especially in distributed settings that require joining data provided by multiple sources, sophisticated optimization techniques are necessary for efficient query processing. We propose novel join processing and grouping techniques to minimize the number of remote requests, and develop an effective solution for source selection in the absence of preprocessed metadata. We present FedX, a practical framework that enables efficient SPARQL query processing on heterogeneous, virtually integrated Linked Data sources. In experiments, we demonstrate the practicability and efficiency of our framework on a set of real-world queries and data sources from the Linked Open Data cloud. With FedX we achieve a significant improvement in query performance over state-of-the-art federated query engines.
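The abstract does not spell out how sources can be selected without preprocessed metadata. One approach, which we understand the FedX paper to use, is to probe each endpoint with a SPARQL ASK query per triple pattern and then group the patterns that only a single endpoint can answer, so that each group can be sent in one request. The Python sketch below illustrates that idea under simplifying assumptions: the endpoints are in-memory sets of triples, and the names `ENDPOINTS`, `ask`, `select_sources`, and `exclusive_groups`, as well as the sample data, are hypothetical. It is not the FedX implementation.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for remote SPARQL endpoints:
# endpoint name -> set of (subject, predicate, object) triples.
ENDPOINTS = {
    "drugbank": {
        ("db:drug1", "db:name", "Aspirin"),
        ("db:drug1", "db:keggId", "kegg:C01405"),
    },
    "kegg": {
        ("kegg:C01405", "kegg:mass", "180.16"),
    },
}


def is_var(term):
    """A term starting with '?' is treated as a query variable."""
    return term.startswith("?")


def ask(endpoint, pattern):
    """Emulate a SPARQL ASK: is there any triple matching the pattern?

    Simplification: a variable that occurs twice in one pattern is not
    forced to bind to the same value.
    """
    return any(
        all(is_var(p) or p == t for p, t in zip(pattern, triple))
        for triple in ENDPOINTS[endpoint]
    )


def select_sources(patterns):
    """Map each triple pattern to the endpoints that can contribute answers."""
    return {pat: sorted(ep for ep in ENDPOINTS if ask(ep, pat)) for pat in patterns}


def exclusive_groups(sources):
    """Group patterns with exactly one relevant endpoint by that endpoint."""
    groups = defaultdict(list)
    for pattern, endpoints in sources.items():
        if len(endpoints) == 1:
            groups[endpoints[0]].append(pattern)
    return dict(groups)


query = [
    ("?drug", "db:name", "?name"),
    ("?drug", "db:keggId", "?cpd"),
    ("?cpd", "kegg:mass", "?mass"),
]
sources = select_sources(query)
groups = exclusive_groups(sources)
```

In this toy federation the first two patterns can only be answered by `drugbank`, so they form one group that could be evaluated with a single remote request instead of two.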
Conclusion
In this paper, we proposed novel optimization techniques for efficient SPARQL query processing in the federated setting. As revealed by our benchmarks, bound joins combined with our grouping and source selection approaches are effective in terms of performance. By minimizing the number of intermediate requests, we are able to improve query performance significantly compared to state-of-the-art systems. We presented FedX, a practical solution that allows for querying multiple distributed Linked Data sources as if the data resides in a virtually integrated RDF graph. Compatible with the SPARQL 1.0 query language, our framework allows clients to integrate available SPARQL endpoints on-demand into a federation without any local preprocessing. While we focused on optimization techniques for conjunctive queries, namely basic graph patterns (BGPs), there is additional potential in developing novel, operator-specific optimization techniques for distributed settings (in particular for OPTIONAL queries), which we are planning to address in future work. As our experiments confirm, the optimization of BGPs alone (combined with common equivalent rewritings) already yields significant performance gains. Important features for federated query processing are the federation extensions proposed for the upcoming SPARQL 1.1 language definition. These allow data sources to be specified directly within the query using the SERVICE operator, and moreover allow mappings to be attached to the query as data using the BINDINGS operator. When implementing the SPARQL 1.1 federation extensions for our next release, FedX can exploit these language features to further improve performance. In fact, the SPARQL 1.1 SERVICE keyword is a trivial extension, which enhances our source selection approach with possibilities for manual specification of new sources and gives the query designer more control.
Statistics can influence performance tremendously in a distributed setting. Currently, FedX does not use any local statistics since we follow the design goal of on-demand federation setup. We aim at providing a federation framework in which data sources can be integrated ad hoc and used immediately for query processing. In a future release, (remote) statistics (e.g., using VoID) can be incorporated for source selection and to further improve our join order algorithm.
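To make the effect of bound joins on the number of remote requests concrete, the following Python sketch compares a nested-loop join, which sends one request per intermediate binding, with a join that ships bindings in blocks. It also builds the per-block query text using the SPARQL 1.1 `VALUES` clause, which is the keyword that replaced the draft `BINDINGS` operator in the final SPARQL 1.1 recommendation. The function names, the `example.org` IRIs, and the block size of 15 are illustrative assumptions, and the sketch does not reproduce how FedX encodes blocks in SPARQL 1.0.

```python
import math


def nested_loop_requests(num_bindings):
    """A nested-loop join sends one remote request per intermediate binding."""
    return num_bindings


def bound_join_requests(num_bindings, block_size):
    """A block-wise (bound) join sends one remote request per block."""
    return math.ceil(num_bindings / block_size)


def blocks(bindings, block_size):
    """Split the list of bindings into consecutive blocks."""
    for start in range(0, len(bindings), block_size):
        yield bindings[start:start + block_size]


def values_query(pattern, var, block):
    """Build one SPARQL 1.1 query that ships a block of IRIs via VALUES."""
    values = " ".join(f"<{iri}>" for iri in block)
    return f"SELECT * WHERE {{ VALUES {var} {{ {values} }} {pattern} }}"


# Hypothetical intermediate results: 100 drug IRIs from an earlier join step.
bindings = [f"http://example.org/drug/{i}" for i in range(100)]
pattern = "?drug <http://example.org/mass> ?mass ."

queries = [values_query(pattern, "?drug", block) for block in blocks(bindings, 15)]

print(nested_loop_requests(len(bindings)))     # requests without blocking
print(bound_join_requests(len(bindings), 15))  # requests with blocks of 15
print(queries[-1])                             # the last, partially filled block
```

With 100 intermediate bindings and a block size of 15, the block-wise variant needs 7 requests instead of 100. The saving grows with the number of intermediate results.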
Future work
While we focused on optimization techniques for conjunctive queries, namely basic graph patterns (BGPs), there is additional potential in developing novel, operator-specific optimization techniques for distributed settings (in particular for OPTIONAL queries), which we are planning to address in future work. In a future release, (remote) statistics (e.g., using VoID) can be incorporated for source selection and to further improve our join order algorithm.
Approach
Positive Aspects: {{{PositiveAspects}}}
Negative Aspects: {{{NegativeAspects}}}
Limitations: {{{Limitations}}}
Challenges: {{{Challenges}}}
Proposes Algorithm: {{{ProposesAlgorithm}}}
Methodology: {{{Methodology}}}
Requirements: {{{Requirements}}}
Implementations
Download-page: http://www.uidops.com/FedX
Access API: No data available now.
Information Representation: RDF
Data Catalogue: -
Runs on OS: Windows 2008 Server 64bit
Vendor: No data available now.
Uses Framework: No data available now.
Has Documentation URL: http://www.uidops.com/FedX
Programming Language: Java
Version: 1.0
Platform: Sesame
Toolbox: No data available now.
GUI: Yes
Research Problem
Subproblem of: Federated RDF query processing
RelatedProblem: Find optimization techniques that allow for efficient SPARQL query processing on federated Linked Data
Motivation: The lack of a global schema and the need to query data from multiple sources could be addressed by querying distributed datasets
Evaluation
Experiment Setup: All experiments are carried out on an HP Proliant DL360 G6 with a 2 GHz 4-core CPU with 128 KB L1 cache, 1024 KB L2 cache, 4096 KB L3 cache, 32 GB 1333 MHz RAM, and a 160 GB SCSI hard drive. In all scenarios we assigned 20 GB RAM to the process executing the query. In the SPARQL federation we additionally assign 1 GB RAM to each individual SPARQL endpoint process.
Evaluation Method: Compare the results to state-of-the-art federated query processing engines.
Hypothesis: -
Description: We evaluate FedX and analyze the performance of our optimization techniques. With the goal of assessing the practicability of our system, we run various benchmarks and compare the results to state-of-the-art federated query processing engines. In our benchmark, we compare the performance of FedX with the competitive systems DARQ and AliBaba since these are comparable to FedX in terms of functionality and the implemented query processing approach. Unfortunately, we were not able to obtain a prototype of the system presented in [2] for comparison.
Dimensions: Performance
Results: With our optimization techniques, we are able to reduce the number of requests significantly, e.g., from 170,579 (DARQ) and 93,248 (AliBaba) to just 23 (FedX) for query CD3.
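The request counts reported for query CD3 can be turned into reduction factors with a few lines of arithmetic. The Python sketch below uses only the three figures quoted above; the dictionary and function names are illustrative.

```python
# Remote request counts for query CD3, as reported in the results above.
requests = {"DARQ": 170_579, "AliBaba": 93_248, "FedX": 23}


def reduction_factor(baseline, optimized):
    """How many times fewer requests the optimized system sends."""
    return baseline / optimized


factors = {
    system: reduction_factor(requests[system], requests["FedX"])
    for system in ("DARQ", "AliBaba")
}

for system, factor in factors.items():
    print(f"{system}: about {factor:,.0f} times as many requests as FedX")
```

For CD3 this comes to roughly 7,400 times fewer requests than DARQ and roughly 4,050 times fewer than AliBaba.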
Benchmark: FedBench project page: http://code.google.com/p/fbench/; http://www.openrdf.org/; http://www4.wiwiss.fu-berlin.de/drugbank/; http://kegg.bio2rdf.org/sparql