Avalanche: Putting the Spirit of the Web back into Semantic Web Querying

From Openresearch
Jump to: navigation, search
Avalanche: Putting the Spirit of the Web back into Semantic Web Querying
Avalanche: Putting the Spirit of the Web back into Semantic Web Querying
Bibliographical Metadata
Subject: Querying Distributed RDF Data Sources
Keywords: Not available
Year: 2010
Authors: Cosmin Basca, Abraham Bernstein
Venue ISWC
Content Metadata
Problem: SPARQL Query Federation
Approach: No data available now.
Implementation: Avalanche
Evaluation: No data available now.

Abstract

Traditionally Semantic Web applications either included a web crawler or relied on external services to gain access to the Web of Data. Recent efforts have enabled applications to query the entire Semantic Web for up-to-date results. Such approaches are based on either centralized indexing of semantically annotated metadata or link traversal and URI dereferencing as in the case of Linked Open Data. By making limiting assumptions about the information space, they violate the openness principle of the Web – a key factor for its ongoing success. In this article we propose a technique called Avalanche, designed to allow a data surfer to query the Semantic Web transparently without making any prior assumptions about the distribution of the data – thus adhering to the openness criteria. Specifically, Avalanche can perform “live” (SPARQL) queries over the Web of Data. First, it gets on-line statistical information about the data distribution, as well as bandwidth availability. Then, it plans and executes the query in a distributed manner trying to quickly provide first answers. The main contribution of this paper is the presentation of this open and distributed SPARQL querying approach. Furthermore, we propose to extend the query planning algorithm with qualitative statistical information. We empirically evaluate Avalanche using a realistic dataset, show its strengths but also point out the challenges that still exist.

Conclusion

In this paper we presented Avalanche , a novel approach for querying the Web of Data that (1) makes no assumptions about data distribution, availability, or partitioning, (2) provides up-to-date results, and (3) is flexible since it assumes nothing about the structure of participating triple stores. Specifically, we showed that Avalanche is able to execute non-trivial queries over distributed data-sources with an ex-ante unknown data-distribution. We showed two possible utility functions to guide the planning and execution over the distributed data-sources—the basic simple model and an extended model exploiting join estimation. We found that whilst the simple model found some results faster it did find less results than the extended model using the same stopping criteria. We believe that if we were to query huge information spaces the overhead of badly selected plans will be subdued by the better but slower plans of the extended utility function. To our knowledge, Avalanche is the first Semantic Web query system that makes no assumptions about the data distribution whatsoever. Whilst it is only a first implementation with a number of drawbacks it represents a first important towards bringing the spirit of the web back to triple-stores—a major condition to fulfill the vision of a truly global and open Semantic Web.

Future work

The Avalanche system has shown how a completely heterogeneous distributed query engine that makes no assumptions about data distribution could be implemented. The current approach does have a number of limitations. In particular, we need to better understand the employed objective functions for the planner, investigate if the requirements put on participating triple-stores are reasonable, explore if Avalanche can be changed to a stateless model, and empirically evaluate if the approach truly scales to large number of hosts. Here we discuss each of these issues in turn. The core optimization of the Avalanche system lies in its cost and utility function. The basic utility function only considers possible joins with no information regarding the probability of the respective join. The proposed utility extension UE estimates the join probability of two highly selective molecules. Although this improves the accuracy of the objective function, its limitation to highly selective molecules is often impractical, as many queries (such as our example query) combine highly selective molecules with non-selective ones. Hence, we need to find a probabilistic distributed join cardinality estimation for low selectivity molecules. One approach might be the usage of bloom-filter caches to store precomputed, “popular” estimates. Another might be investigating sampling techniques for distributed join estimation. In order to support Avalanche existing triple-stores should be able to: – report statistics: cardinalities, bloom filters, other future extensions – support the execution of distributed joins (common in distributed databases), which could be delegated to an intermediary but would be inefficient – share the same key space (can be URIs but would result in bandwidth intensive joins and merges) Whilst these requirements seem simple we need to investigate how complex these extensions of triple-stores are in practice. Even better would be an extension of the SPARQL standard with the above-mentioned operations, which we will attempt to propose. The current Avalanche process assumes that hosts keep partial results throughout plan execution to reduce the cost of local database operations and that result-views are kept for the duration of a query. This limits the number of queries a host can handle. We intend to investigate if a stateless approach is feasible. Note that the simple approach—the use of REST-full services—may not be applicable as the size of the state (i.e., the partial results) may be huge and overburden the available bandwidth. We designed Avalanche with the need for high scalability in mind. The core idea follows the principle of decentralization. It also supports asynchrony using asynchronous HTTP requests to avoid blocking, autonomy by delegating the coordination and execution of the distributed join/update/merge operations to the hosts, concurrency through the pipeline shown in Figure 1, symmetry by allowing each endpoint to act as the initiating Avalanche node for a query caller, and fault tolerance through a number of time-outs and stopping conditions. Nonetheless, an empirical evaluation of Avalanche with a large number of hosts is still missing—a non-trivial shortcoming (due to the lack of suitable, partitioned datasets and the significant experimental complexity) we intend to address in the near future.

Approach

Positive Aspects: {{{PositiveAspects}}}

Negative Aspects: {{{NegativeAspects}}}

Limitations: Need to better understand the employed objective functions for the planner, investigate if the requirements put on participating triple-stores are reasonable, explore if Avalanche can be changed to a stateless model, and empirically evaluate if the approach truly scales to a large number of hosts.

Challenges: the exponential complexity class of the plan composition space.

Proposes Algorithm: {{{ProposesAlgorithm}}}

Methodology: {{{Methodology}}}

Requirements: In order to support Avalanche existing triple-stores should be able to: – report statistics: cardinalities, bloom filters, other future extensions – support the execution of distributed joins (common in distributed databases), which could be delegated to an intermediary but would be inefficient – share the same key space (can be URIs but would result in bandwidth-intensive joins and merges)

Limitations: Need to better understand the employed objective functions for the planner, investigate if the requirements put on participating triple-stores are reasonable, explore if Avalanche can be changed to a stateless model, and empirically evaluate if the approach truly scales to a large number of hosts.

Implementations

Download-page: -

Access API: -

Information Representation: RDF

Data Catalogue: Search Engine

Runs on OS: -

Vendor: -

Uses Framework: -

Has Documentation URL: -

Programming Language: -

Version: -

Platform: Avalanche

Toolbox: {{{Toolbox}}}

GUI: No

Research Problem

Subproblem of: {{{Subproblem}}}

Property "Has Subproblem" (as page type) with input value "{{{Subproblem}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

RelatedProblem: {{{RelatedProblem}}}

Property "Has relatedProblem" (as page type) with input value "{{{RelatedProblem}}}" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.

Motivation: flexibility of Web of Data is paramount for future development of the Semantic Web – as it was for the WWW

Evaluation

Experiment Setup: Test Avalanche using a five-node cluster. Each machine had 2GB RAM and an Intel Core 2 Duo E8500 @ 3.16GHz

Evaluation Method : the query execution and plan convergence

Hypothesis: assumes that the distribution of triples to machines participating in the query evaluation is unknown prior to query execution.

Description: {{{Description}}}

Dimensions: Performance

Benchmark used: DBLP, IEEE triples, ACM triples

Results: Avalanche is able to successfully execute query plans and retrieves many up-to-date results without having any prior knowledge of the data distribution. We, furthermore, see that different objective functions have a significant influence on the outcome and should play a critical role when deployed on the Semantic Web.

Access API- +
Event in seriesISWC +
Has BenchmarkDBLP +, IEEE triples + and ACM triples +
Has Challengesthe exponential complexity class of the plan composition space. +
Has DataCatalougeSearch Engine +
Has Description{{{Description}}} +
Has DimensionsPerformance +
Has DocumentationURLhttp://- +
Has Downloadpagehttp://- +
Has EvaluationNo data available now. +
Has EvaluationMethodthe query execution and plan convergence +
Has ExperimentSetupTest Avalanche using a five-node cluster. Each machine had 2GB RAM and an Intel Core 2 Duo E8500 @ 3.16GHz +
Has GUINo +
Has Hypothesisassumes that the distribution of triples to machines participating in the query evaluation is unknown prior to query execution. +
Has ImplementationAvalanche +
Has InfoRepresentationRDF +
Has LimitationsNeed to better understand the employed obj
Need to better understand the employed objective functions for the planner,

investigate if the requirements put on participating triple-stores are reasonable, explore if Avalanche can be changed to a stateless model, and empirically

evaluate if the approach truly scales to a large number of hosts.
h truly scales to a large number of hosts. +
Has NegativeAspects{{{NegativeAspects}}} +
Has PositiveAspects{{{PositiveAspects}}} +
Has RequirementsIn order to support Avalanche existing tri
In order to support Avalanche existing triple-stores should be able to:

– report statistics: cardinalities, bloom filters, other future extensions – support the execution of distributed joins (common in distributed databases), which could be delegated to an intermediary but would be inefficient – share the same key space (can be URIs but would result in bandwidth-intensive

joins and merges)
t in bandwidth-intensive joins and merges) +
Has ResultsAvalanche is able to successfully execute
Avalanche is able to successfully execute query plans and retrieves many up-to-date results without having any prior knowledge of the data distribution. We, furthermore, see that different objective functions have a significant influence on the outcome and should play a critical role when deployed on the Semantic Web.
al role when deployed on the Semantic Web. +
Has Version- +
Has abstractTraditionally Semantic Web applications ei
Traditionally Semantic Web applications either included a web crawler or relied on external services to gain access to the Web of Data. Recent efforts have enabled applications to query the entire Semantic Web for up-to-date results. Such approaches are based on either centralized indexing of semantically annotated metadata or link traversal and URI dereferencing as in the case of Linked Open Data. By making limiting assumptions about the information space, they violate the openness principle of the Web – a key factor for its ongoing success. In this article we propose a technique called Avalanche, designed to allow a data surfer to query the Semantic Web transparently without making any prior assumptions about the distribution of the data – thus adhering to the openness criteria. Specifically, Avalanche can perform “live” (SPARQL) queries over the Web of Data. First, it gets on-line statistical information about the data distribution, as well as bandwidth availability. Then, it plans and executes the query in a distributed manner trying to quickly provide first answers. The main contribution of this paper is the presentation of this open and distributed SPARQL querying approach. Furthermore, we propose to extend the query planning algorithm with qualitative statistical information. We empirically evaluate Avalanche using a realistic dataset, show its strengths but also point out the challenges that still exist.
point out the challenges that still exist. +
Has approachNo data available now. +
Has authorsCosmin Basca + and Abraham Bernstein +
Has conclusionIn this paper we presented Avalanche , a n
In this paper we presented Avalanche , a novel approach for querying the Web of Data that (1) makes no assumptions about data distribution, availability, or partitioning, (2) provides up-to-date results, and (3) is flexible since it assumes nothing about the structure of participating triple stores. Specifically, we showed that Avalanche is able to execute non-trivial queries over distributed data-sources with an ex-ante unknown data-distribution. We showed two possible utility functions to guide the planning and execution over the distributed data-sources—the basic simple model and an extended model exploiting join estimation. We found that whilst the simple model found some results faster it did find less results than the extended model using the same stopping criteria. We believe that if we were to query huge information spaces the overhead of badly selected plans will be subdued by the better but slower plans of the extended utility function. To our knowledge, Avalanche is the first Semantic Web query system that makes no assumptions about the data distribution whatsoever. Whilst it is only a first implementation with a number of drawbacks it represents a first important towards bringing the spirit of the web back to triple-stores—a major condition to fulfill the vision of a truly global and open Semantic Web.
n of a truly global and open Semantic Web. +
Has future workThe Avalanche system has shown how a compl
The Avalanche system has shown how a completely heterogeneous distributed query engine that makes no assumptions about data distribution could be implemented. The current approach does have a number of limitations. In particular, we need to better understand the employed objective functions for the planner, investigate if the requirements put on participating triple-stores are reasonable, explore if Avalanche can be changed to a stateless model, and empirically evaluate if the approach truly scales to large number of hosts. Here we discuss each of these issues in turn. The core optimization of the Avalanche system lies in its cost and utility function. The basic utility function only considers possible joins with no information regarding the probability of the respective join. The proposed utility extension UE estimates the join probability of two highly selective molecules. Although this improves the accuracy of the objective function, its limitation to highly selective molecules is often impractical, as many queries (such as our example query) combine highly selective molecules with non-selective ones. Hence, we need to find a probabilistic distributed join cardinality estimation for low selectivity molecules. One approach might be the usage of bloom-filter caches to store precomputed, “popular” estimates. Another might be investigating sampling techniques for distributed join estimation. In order to support Avalanche existing triple-stores should be able to: – report statistics: cardinalities, bloom filters, other future extensions – support the execution of distributed joins (common in distributed databases), which could be delegated to an intermediary but would be inefficient – share the same key space (can be URIs but would result in bandwidth intensive joins and merges) Whilst these requirements seem simple we need to investigate how complex these extensions of triple-stores are in practice. Even better would be an extension of the SPARQL standard with the above-mentioned operations, which we will attempt to propose. The current Avalanche process assumes that hosts keep partial results throughout plan execution to reduce the cost of local database operations and that result-views are kept for the duration of a query. This limits the number of queries a host can handle. We intend to investigate if a stateless approach is feasible. Note that the simple approach—the use of REST-full services—may not be applicable as the size of the state (i.e., the partial results) may be huge and overburden the available bandwidth. We designed Avalanche with the need for high scalability in mind. The core idea follows the principle of decentralization. It also supports asynchrony using asynchronous HTTP requests to avoid blocking, autonomy by delegating the coordination and execution of the distributed join/update/merge operations to the hosts, concurrency through the pipeline shown in Figure 1, symmetry by allowing each endpoint to act as the initiating Avalanche node for a query caller, and fault tolerance through a number of time-outs and stopping conditions. Nonetheless, an empirical evaluation of Avalanche with a large number of hosts is still missing—a non-trivial shortcoming (due to the lack of suitable, partitioned datasets and the significant experimental complexity) we intend to address in the near future.
) we intend to address in the near future. +
Has keywordsNot available +
Has motivationflexibility of Web of Data is paramount for future development of the Semantic Web – as it was for the WWW +
Has platformAvalanche +
Has problemSPARQL Query Federation +
Has subjectQuerying Distributed RDF Data Sources +
Has vendor- +
Has year2010 +
ImplementedIn ProgLang- +
Proposes Algorithm{{{ProposesAlgorithm}}} +
RunsOn OS- +
TitleAvalanche: Putting the Spirit of the Web back into Semantic Web Querying +
Uses Framework- +
Uses Methodology{{{Methodology}}} +
Uses Toolbox{{{Toolbox}}} +