RANK: Top-k Query Processing
Computer Science Department
Columbia University


 

Project Summary

 

Traditionally, queries over structured and semi-structured data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy -- often expressing user preferences and not hard Boolean constraints -- and are best answered with a ranked list of the best matching objects, for some definition of degree of match. This "top-k" query model is natural in many scenarios, which we study in our RANK project:

 

 

*     Over relational databases:

 

Consider the following real-estate database with information on houses for sale:

 

 

House ID

Bedrooms

Location

Price

0001

4

Manhattan

380,000

0002

4

Seattle

650,000

0003

5

San Jose

330,000

0004

7

Queens

950,000

0005

1

Manhattan

125,000

….

 

 

 

 

 

For a query “Find 4-bedroom houses in Queens with a price tag of around $300,000”, the database system should rank the available houses according to how well they match the given user preference, and return the top houses for the user to inspect, as illustrated below for k=3:

 

 

House ID

Bedrooms

Location

Price

Score

0014

4

Manhattan

298,000

0.98

0028

5

Queens

310,000

0.93

0001

4

Brooklyn

380,000

0.91

0013

6

Jersey City

330,000

0.85

0008

3

Buffalo

350,000

0.81

….

 

 

 

 

We map a top-k query into a traditional relational range query that an unmodified RDBMS can then optimize and process efficiently (VLDB’99 paper, TODS’02 paper). For this mapping, we rely on multidimensional histograms for selectivity estimation (SIGMOD’01 paper).

 

 

 

*     Over web sources:

 

Sometimes the “attributes” of the data objects over which a top-k query is issued are handled by external, autonomous databases (or web services) available over the web. Consider a scenario where a user is interested in restaurants in the New York City area. This scenario can be modeled using a relation with information about the different restaurants. Each tuple (or object) in this relation has a number of attributes, including Address, Rating, and Price, which are handled by autonomous web databases that exhibit different access interfaces (e.g., MapQuest for Address, Zagat for Rating, and NYTimes for Price).  

 

 

 

 

 

A user who lives at 2590 Broadway and is interested in spending around $25 for a top quality restaurant might then ask a top-10 query Address=“2590 Broadway”, Price=$25, Rating=30 (on a scale from 1 to 30). The result to this query is a list of the 10 restaurants that match the user’s specification the closest.

 

In such scenarios, access to the data is limited by the interfaces the sources provide, and source access constraints and costs  have to be taken into account to achieve efficient query processing.  We designed sequential and parallel algorithms that adaptively prioritize the processing of the known data objects to identify the top-k ones for a given query, making dynamic choices of which objects to discard and which ones to process further as query processing progresses. This results in efficient top-k query executions that (1) handle databases supporting different interfaces, and (2) minimize access to remote sources (ICDE’02 paper, TODS'04 paper).

 

*     Over multimedia repositories: 

 

Queries on  multimedia objects attributes (e.g., image, text) will typically request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well the object matches the selection condition (ranking). These grades are then used to identify the top k objects returned by the query. A top-k query in this setting has the form:

 

SELECT object_id

FROM Repository

ORDER [k] by Ranking_expression

 

We present techniques that map the top-k query processing problem over multimedia repositories  to a traditional selection query for which efficient techniques for query processing and optimization using available indexes may be adapted. Our proposed optimization techniques lead to the efficient evaluation of selection conditions over multimedia repositories (SIGMOD’96 paper, TKDE’04 paper).

 

*     Over XML data:

 

In XML integration applications, XML data comes from heterogeneous sources, and therefore may not share the same schema. In this scenario, exact query matches are too rigid, so XML query answers are ranked based on their "similarity" to the queries, in terms of both content and structure.

 

Consider the following three XML fragments below, taken from a heterogeneous collection of streaming news (RSS) documents:

 

 

 

The following query:

 

only produces an exact answer for XML fragment (a). However, both fragments (b) and (c) are acceptable answers to the query and could be of interest to the user.

 

Processing top-k queries efficiently in such a scenario is challenging, as the number of candidate answers increases dramatically with the query size. (XML queries are, in effect, joins.) By adaptively selecting query evaluation plans individually for each candidate answer, focusing on pruning irrelevant candidate answers as early as possible, our algorithms minimize the number of candidate answers considered during query evaluation (ICDE ’05 paper). We studied the impact of a variety of query execution and selectivity estimation strategies, as well as the effect of parallelism on the proposed techniques.

 

 

  

   People


At Columbia:

*     Amélie Marian (contact)

*     Luis Gravano

*     Nicolas Bruno (now at Microsoft Research)


External Collaborators:

*     Sihem Amer-Yahia (AT&T Research)

*     Surajit Chaudhuri (Microsoft Research)

*     Nick Koudas (AT&T Research)

*      Divesh Srivastava (AT&T Research)

 

 

   Publications

*     Adaptive Processing of Top-k Queries in XML.
A. Marian, S. Amer-Yahia, N. Koudas, and D.Srivastava.
Proc. of the 21st IEEE International Conference on Data Engineering (ICDE’05), 2005.

*     Optimizing Top-k Selection Queries over Multimedia Repositories.
S. Chaudhuri, L. Gravano, and A. Marian.
IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, Aug. 2004.

*     Evaluating Top-k Queries over Web-Accessible Databases.
A. Marian, N. Bruno, and L. Gravano
ACM Transactions on Database Systems, vol. 29, no. 2, June 2004.

*     Top-k Selection Queries over Relational Databases: Mapping Strategies and Performance Evaluation.
N. Bruno, S. Chaudhuri, and L. Gravano.
ACM Transactions on Database Systems, vol. 27, no.2, Jun. 2002.

*     Evaluating Top-k Queries over Web-Accessible Databases.
N. Bruno, L. Gravano, and A. Marian.
Proc. of the 18th IEEE International Conference on Data Engineering (ICDE’02), 2002.

*     STHoles: A Multidimensional Workload-Aware Histogram.
N. Bruno, S. Chaudhuri, and L. Gravano.
Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, 2001.

*     Evaluating Top-k Selection Queries.
S. Chaudhuri and L. Gravano.
Proc. of the 25th International Conference on Very Large Data Bases (VLDB’99), 1999.

*     Merging Ranks from Heterogeneous Internet Sources.
L. Gravano and H. Garcia-Molina.
Proc. of the 23rd International Conference on Very Large Data Bases (VLDB’97), 1997.

*     Optimizing Queries over Multimedia Repositories.
S. Chaudhuri and L. Gravano.
Proc. of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.


   Conference Talks

 

*     Adaptive Processing of Top-k Queries in XML..
A. Marian, S. Amer-Yahia, N. Koudas, and D.Srivastava.
Proc. of the 21st IEEE International Conference on Data Engineering (ICDE’05), 2005.

*     Evaluating Top-k Queries over Web-Accessible Databases.
N. Bruno, L. Gravano, and A. Marian.
Proc. of the 18th IEEE International Conference on Data Engineering (ICDE’02), 2002.

*     STHoles: A Multidimensional Workload-Aware Histogram.
N. Bruno, S. Chaudhuri, and L. Gravano.
Proc. of the 2001 ACM SIGMOD International Conference on Management of Data, 2001.

*     Evaluating Top-k Selection Queries.
S. Chaudhuri and L. Gravano.
Proc. of the 25th International Conference on Very Large Data Bases (VLDB'99), 1999.

*     Merging Ranks from Heterogeneous Internet Sources.
L. Gravano and H. Garcia-Molina.
Proc. of the 23rd International Conference on Very Large Data Bases (VLDB’97), 1997.

*     Optimizing Queries over Multimedia Repositories.
S. Chaudhuri and L. Gravano.
Proc. of the 1996 ACM SIGMOD International Conference on Management of Data, 1996.

   Data Sets

                       

 

*     Synthetic data generation code

 

*     Synthetic and real data sets used in experimental evaluations.

         

amelie@cs.columbia.edu