Difference between revisions of "Querying the Web of Data with Graph Theory-based Techniques"
Line 10: | Line 10: | ||
|Future work=In addition to these improvements, we are planning to explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published [3], and we plan to address this issue by using our Virtual Graph approach. | |Future work=In addition to these improvements, we are planning to explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published [3], and we plan to address this issue by using our Virtual Graph approach. | ||
|Implementation=GDS | |Implementation=GDS | ||
+ | |Evaluation=evaluation of GDS, DARQ and FedX | ||
+ | |Download-page=http://code.google.com/p/gdsparal/ | ||
+ | |Framework=Jena | ||
+ | |DocumentationURL=http://code.google.com/p/gdsparal/ | ||
+ | |ProgLang=Java | ||
+ | |Platform=Java VM | ||
|GUI=No | |GUI=No | ||
+ | |ExperimentSetup=The three engines are run independently | ||
+ | on a machine having an Intel Xeon W3520 processor, 12 GB | ||
+ | memory and 1Gbps LAN. | ||
+ | |EvaluationMethod=comparing the three distributed SPARQL engines | ||
+ | |Description=We deploy 6 SPARQL endpoints (Sesame 2.4.0) on 5 remote | ||
+ | virtual machines. About 400,000 triples (generated by | ||
+ | BSBM) are distributed to these endpoints following Gaussian | ||
+ | distribution. | ||
+ | |Dimensions=Performance | ||
+ | |Benchmark=FedBench | ||
+ | Berlin SPARQL Benchmark (BSBM) | ||
}} | }} |
Revision as of 10:36, 4 July 2018
Querying the Web of Data with Graph Theory-based Techniques | |
---|---|
Querying the Web of Data with Graph Theory-based Techniques
| |
Bibliographical Metadata | |
Subject: | Querying Distributed RDF Data Sources |
Keywords: | SPARQL, Linked Data, distributed query processing. |
Year: | 2006 |
Authors: | Xin Wang, Thanassis Tiropanis, Hugh C. Davis |
Venue | Web and Internet Science |
Content Metadata | |
Implementation: | GDS |
Evaluation: | evaluation of GDS, DARQ and FedX |
Contents
Abstract
The increasing amount of Linked Data on the Web enables users to retrieve quality and complex information and to deploy innovative, added-value applications. The volume of available Linked Data and their spread across a large number of repositories make a strong case for ecient distributed SPARQL queries. However, in practice, current distributed SPARQL query processing techniques face issues on performance and scalability. In our previous work we provided initial evidence that graph theory-based techniques can address performance issues better than other approaches such as DARQ. Here we further exploit the potential of graph algorithms and we show how they can address performance and scalability for distributed SPARQL queries even better. To that end, we present an improved engine called GDS and we evaluate it by providing a detailed comparison to existing approaches for distributed queries (i.e. DARQ and FedX). By analyzing the evaluation results, we try to identify promising techniques for distributed SPARQL processing, and to outline the problems that need to be addressed in future research.
Conclusion
[[has conclusion:=In this paper we present a novel graph theory-based approach for improved performance and scalability of distributed SPARQL query processing. We propose several key techniques adopted in the distributed SPARQL query engine that we developed (GDS). First, an extended MST algorithm which supports arbitrary SPARQL queries and provides better performance. Second, a simplified cost model using run-time statistics that lows requirement of service descriptions but provides good cost estimations. Third, a combination of semi-join and bind-join along with local cache that reduce network tracffic. We also compare our approach with DARQ and FedX in terms of performance and scalability. The results suggest that graph theory-based approach using lightweight service descriptions can provide better performance and scalability over other approaches Although these results are encouraging, the potential of graph theory-based approach can be developed further. First, GDS applies MST-based algorithm to BGPs rather than the whole query. BGPs are optimized separately even they are from the same query (i.e. UNION and OPTIONAL queries). In the future we aim to work on representing the whole query as a single graph and therefore provide better optimization for queries having multiple BGPs. Second, GDS does not take advantage of FILTER in optimization, which would further improve performance. Third, accurate and detailed service descriptions matters. The more accurate statistics we have, the more optimal query plan we get. Currently collecting quality service descriptions are not feasible on a large scale since most SPARQL endpoints do not provide service descriptions. However, this situation is improving as more and more approaches coming up. For example, SPARQL 1.1 aggregation features enable us to collect service descriptions more efficiently, and VoiD encourages SPARQL endpoints to publish service descriptions as well. Fourth, aggregation features in the upcoming SPARQL 1.1 (e.g. BINDINGS) can also save much efforts of our approach. In addition to these improvements, we are planning to explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published [3], and we plan to address this issue by using our Virtual Graph approach.]]
Future work
[[has future work:=In addition to these improvements, we are planning to explore the co-reference issue in the Linked Data cloud. From the perspective of distributed SPARQL queries, this issue is getting worse as more data are published [3], and we plan to address this issue by using our Virtual Graph approach.]]
Approach
Positive Aspects: {{{PositiveAspects}}}
Negative Aspects: {{{NegativeAspects}}}
Limitations: {{{Limitations}}}
Challenges: {{{Challenges}}}
Proposes Algorithm: {{{ProposesAlgorithm}}}
Methodology: {{{Methodology}}}
Requirements: {{{Requirements}}}
Limitations: {{{Limitations}}}
Implementations
Download-page: http://code.google.com/p/gdsparal/
Access API: {{{API}}}
Information Representation: {{{InfoRepresentation}}}
Data Catalogue: {{{Catalogue}}}
Runs on OS: {{{OS}}}
Vendor: {{{vendor}}}
Uses Framework: Jena
Has Documentation URL: http://code.google.com/p/gdsparal/
Programming Language: Java
Version: {{{Version}}}
Platform: Java VM
Toolbox: {{{Toolbox}}}
GUI: No
Research Problem
Subproblem of: {{{Subproblem}}}
RelatedProblem: {{{RelatedProblem}}}
Motivation: {{{Motivation}}}
Evaluation
Experiment Setup: The three engines are run independently on a machine having an Intel Xeon W3520 processor, 12 GB memory and 1Gbps LAN.
Evaluation Method : comparing the three distributed SPARQL engines
Hypothesis: {{{Hypothesis}}}
Description: We deploy 6 SPARQL endpoints (Sesame 2.4.0) on 5 remote virtual machines. About 400,000 triples (generated by BSBM) are distributed to these endpoints following Gaussian distribution.
Dimensions: Performance
Benchmark used: FedBench Berlin SPARQL Benchmark (BSBM)
Results: {{{Results}}}
Access API | {{{API}}} + |
Event in series | Web and Internet Science + |
Has Challenges | {{{Challenges}}} + |
Has DataCatalouge | {{{Catalogue}}} + |
Has Description | We deploy 6 SPARQL endpoints (Sesame 2.4.0) on 5 remote
virtual machines. About 400,000 triples (generated by BSBM) are distributed to these endpoints following Gaussian distribution. + |
Has Dimensions | Performance + |
Has DocumentationURL | http://code.google.com/p/gdsparal/ + |
Has Downloadpage | http://code.google.com/p/gdsparal/ + |
Has Evaluation | Evaluation of GDS, DARQ and FedX + |
Has EvaluationMethod | comparing the three distributed SPARQL engines + |
Has ExperimentSetup | The three engines are run independently
on a machine having an Intel Xeon W3520 processor, 12 GB memory and 1Gbps LAN. + |
Has GUI | No + |
Has Hypothesis | {{{Hypothesis}}} + |
Has Implementation | GDS + |
Has InfoRepresentation | {{{InfoRepresentation}}} + |
Has Limitations | {{{Limitations}}} + |
Has NegativeAspects | {{{NegativeAspects}}} + |
Has PositiveAspects | {{{PositiveAspects}}} + |
Has Requirements | {{{Requirements}}} + |
Has Results | {{{Results}}} + |
Has Version | {{{Version}}} + |
Has abstract | The increasing amount of Linked Data on th … The increasing amount of Linked Data on the Web enables users to retrieve quality and complex information and to deploy innovative, added-value applications. The volume of available Linked Data and their spread across a large number of repositories make a strong case for ecient distributed SPARQL queries. However, in practice, current distributed SPARQL query processing techniques face issues on performance and scalability. In our previous work we provided initial evidence that graph theory-based techniques can address performance issues better than other approaches such as DARQ. Here we further exploit the potential of graph algorithms and we show how they can address performance and scalability for distributed SPARQL queries even better. To that end, we present an improved engine called GDS and we evaluate it by providing a detailed comparison to existing approaches for distributed queries (i.e. DARQ and FedX). By analyzing the evaluation results, we try to identify promising techniques for distributed SPARQL processing, and to outline the problems that need to be addressed in future research. t need to be addressed in future research. + |
Has authors | Xin Wang +, Thanassis Tiropanis + and Hugh C. Davis + |
Has keywords | SPARQL, Linked Data, distributed query processing. + |
Has motivation | {{{Motivation}}} + |
Has platform | Java VM + |
Has subject | Querying Distributed RDF Data Sources + |
Has vendor | {{{vendor}}} + |
Has year | 2006 + |
ImplementedIn ProgLang | Java + |
Proposes Algorithm | {{{ProposesAlgorithm}}} + |
Title | Querying the Web of Data with Graph Theory-based Techniques + |
Uses Framework | Jena + |
Uses Methodology | {{{Methodology}}} + |
Uses Toolbox | {{{Toolbox}}} + |