Latest revision as of 21:31, 11 July 2018
SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions
Bibliographical Metadata
Subject: Querying Distributed RDF Data Sources
Year: 2011
Authors: Olaf Görlitz, Steffen Staab
Venue: COLD
Content Metadata
Problem: SPARQL Query Federation
Approach: Query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions
Implementation: SPLENDID
Evaluation: Query execution performance evaluation
Abstract
In order to leverage the full potential of the Semantic Web it is necessary to transparently query distributed RDF data sources in the same way as it has been possible with federated databases for ages. However, there are significant differences between the Web of (linked) Data and the traditional database approaches. Hence, it is not straightforward to adapt successful database techniques for RDF federation. Reasons are the missing cooperation between SPARQL endpoints and the need for detailed data statistics for estimating the costs of query execution plans. We have implemented SPLENDID, a query optimization strategy for federating SPARQL endpoints based on statistical data obtained from voiD descriptions.
Conclusion
SPLENDID allows for transparent query federation over distributed SPARQL endpoints. To achieve good query execution performance, data source selection and query optimization are based on basic statistical information obtained from VOID descriptions. The use of open Semantic Web standards, such as VOID and SPARQL endpoints, allows for flexible integration of various distributed and linked RDF data sources. We have described in detail the implementation of the data source selection and the join order optimization. The evaluation shows that our approach can achieve good query performance and is competitive with other state-of-the-art federation implementations. In our analysis of the source selection we came to the conclusion that at least predicate and type statistics should be included in VOID descriptions for RDF datasets. The use of third-party sameAs links, however, can significantly increase the number of requests and thus hamper the efficiency of query execution plans. The comparison of the two employed physical join implementations has shown that network overhead plays an important role. Both hash join and bind join can significantly reduce the query processing time for certain types of queries. With SPLENDID we would also like to advocate the adoption of VOID statistics for Linked Data.
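The trade-off between the two physical joins can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: `FakeEndpoint`, `hash_join`, and `bind_join` are hypothetical names that do not come from the SPLENDID code base, and the request counter merely stands in for network round trips.

```python
class FakeEndpoint:
    """Hypothetical in-memory stand-in for a SPARQL endpoint that counts requests."""

    def __init__(self, rows):
        self.rows = rows      # list of dicts: variable name -> value
        self.requests = 0     # number of simulated network round trips

    def query(self, bindings=None):
        """Return all rows that are compatible with the given variable bindings."""
        self.requests += 1
        bindings = bindings or {}
        return [row for row in self.rows
                if all(row[k] == v for k, v in bindings.items() if k in row)]


def hash_join(left, right, var):
    """Fetch both inputs completely, then join them locally on `var`."""
    table = {}
    for row in left.query():
        table.setdefault(row[var], []).append(row)
    return [{**l, **r} for r in right.query() for l in table.get(r[var], [])]


def bind_join(left, right, var):
    """Fetch the left input, then query the right side once per distinct binding."""
    left_rows = left.query()
    results = []
    for value in sorted({row[var] for row in left_rows}):
        for r in right.query({var: value}):
            results.extend({**l, **r} for l in left_rows if l[var] == value)
    return results
```

In this sketch the hash join needs one request per endpoint but transfers every row of both inputs, while the bind join transfers only matching rows from the right side at the cost of one request per distinct binding. Which of the two is cheaper therefore depends on input sizes and request latency; a real implementation might also batch several bindings into one request.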
Future work
As next steps, we plan to investigate whether VOID descriptions can easily be extended with more detailed statistics in order to allow for more accurate cardinality estimates and, thus, better query execution plans. On the other hand, the actual query execution has not yet been optimized in SPLENDID. Therefore, we plan to integrate optimization techniques as used in FedX. Moreover, the adoption of the SPARQL 1.1 federation extension will also allow for more efficient query execution.
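To make the link between statistics and cardinality estimates concrete, here is a minimal sketch that uses per-predicate counts of the kind a VOID description can carry (number of triples, distinct subjects, distinct objects). The uniform-distribution formulas are a common textbook heuristic and are used here as an assumption; they are not necessarily the estimates SPLENDID computes. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateStats:
    """Hypothetical per-predicate statistics taken from a VOID description."""
    triples: int             # number of triples with this predicate
    distinct_subjects: int   # distinct subjects among those triples
    distinct_objects: int    # distinct objects among those triples


def estimate_cardinality(stats, subject_bound, object_bound):
    """Estimate the result size of a triple pattern with a bound predicate,
    assuming subjects and objects are uniformly distributed."""
    card = float(stats.triples)
    if subject_bound:
        card /= stats.distinct_subjects
    if object_bound:
        card /= stats.distinct_objects
    return card


def estimate_join_cardinality(card_left, card_right, selectivity):
    """Estimate join size as the product of the input sizes scaled by a
    join selectivity between 0 and 1."""
    return card_left * card_right * selectivity
```

More detailed statistics, as mentioned above, would replace the uniformity assumption with per-value or per-combination counts, which is where more accurate estimates would come from.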
Implementations
Download-page: https://github.com/semagrow/fork-splendid-server
Access API: No data available now.
Information Representation: RDF
Data Catalogue: VoID
Runs on OS: OS independent
Vendor: Open source
Uses Framework: -
Has Documentation URL: No data available now.
Programming Language: Java
Version: 1.0
Platform: Sesame
Toolbox: No data available now.
GUI: No
Research Problem
Subproblem of: Querying Distributed RDF Data Sources
Related problem: Retrieve and join the result tuples
Evaluation
Experiment Setup: Due to the unpredictable availability and latency of the original SPARQL endpoints of the benchmark dataset, we used local copies, which were hosted on five 64-bit Intel(R) Xeon(TM) CPU 3.60GHz server instances running Sesame 2.4.2, with each instance providing the SPARQL endpoint for one life science and one cross domain dataset. The evaluation was performed on a separate server instance with a 64-bit Intel(R) Xeon(TM) CPU 3.60GHz and a 100 Mbit network connection.
Evaluation Method: The goal of the evaluation is to show that SPLENDID is able to achieve good query execution performance for real-world federation scenarios.
Hypothesis: -
Description: We investigated how the information from the VOID descriptions affects the accuracy of the source selection. For each query, we look at the number of sources selected and the resulting number of requests to the SPARQL endpoints. We tested three different source selection approaches, based on 1) predicate index only (no type information), 2) predicate and type index, and 3) predicate and type index and grouping of sameAs patterns as described in Section 4.2.
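As a rough illustration of the first two approaches, the following Python sketch selects sources for a single triple pattern from a predicate index and, for `rdf:type` patterns with a bound class, from a type index. The index contents, endpoint URLs, and function names are hypothetical and not taken from SPLENDID; the fallback to all sources for an unbound predicate is an assumption of this sketch. The sameAs grouping of the third approach is not modelled.

```python
RDF_TYPE = "rdf:type"

# Hypothetical indexes derived from VOID descriptions:
# predicate -> endpoints and class -> endpoints.
PREDICATE_INDEX = {
    "rdf:type": {"http://example.org/drugbank",
                 "http://example.org/dbpedia",
                 "http://example.org/kegg"},
    "foaf:name": {"http://example.org/dbpedia"},
    "drugbank:target": {"http://example.org/drugbank"},
}
TYPE_INDEX = {
    "drugbank:drugs": {"http://example.org/drugbank"},
    "dbo:Person": {"http://example.org/dbpedia"},
}
ALL_SOURCES = set().union(*PREDICATE_INDEX.values())


def is_variable(term):
    """SPARQL variables are written with a leading question mark."""
    return term.startswith("?")


def select_sources(pattern, use_type_index=True):
    """Return the endpoints that may answer one (subject, predicate, object) pattern."""
    _subject, predicate, obj = pattern
    if is_variable(predicate):
        # Assumption: with an unbound predicate the indexes cannot prune anything.
        return set(ALL_SOURCES)
    if use_type_index and predicate == RDF_TYPE and not is_variable(obj):
        return set(TYPE_INDEX.get(obj, set()))
    return set(PREDICATE_INDEX.get(predicate, set()))
```

With these example indexes, the pattern `("?d", "rdf:type", "drugbank:drugs")` matches all three endpoints when only the predicate index is used and a single endpoint once the type index is consulted, which is the kind of reduction in selected sources and requests that the comparison measures.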
Dimensions: Performance
Benchmark used: FedBench
Results: AliBaba and DARQ fail to return results for six of the 14 queries, for different reasons. AliBaba generates malformed subqueries for CD3, CD5, LS6, and LS7. DARQ cannot handle the unbound predicate in CD1 and LS2. For CD3 and CD5, DARQ opens too many connections to GeoNames. All other unsuccessful queries take longer than the time limit of five minutes. Overall, FedX has the best query evaluation performance. The reason is its novel and efficient query execution based on block transmission of result tuples and parallelization of joins. However, there is a significant difference between FedX and SPLENDID only for CD6, CD7, LS3, and LS5-7. For the other queries SPLENDID is close to FedX, and for CD3 and CD4 it is even slightly faster, which indicates that SPLENDID, indeed, generates better query execution plans.