Traditionally,
queries over structured and semi-structured data identify the exact matches for the queries. This
exact-match query model is not appropriate for many database applications and
scenarios where queries are inherently fuzzy -- often expressing user
preferences and not hard Boolean constraints -- and are best answered with a
ranked list of the best matching objects, for some definition of degree of
match. This "top-k" query model is natural in many scenarios,
which we study in our RANK project:
Over
relational databases:
Consider the following real-estate database with
information on houses for sale:
|
House ID |
Bedrooms |
Location |
Price |
|
0001 |
4 |
|
380,000 |
|
0002 |
4 |
|
650,000 |
|
0003 |
5 |
|
330,000 |
|
0004 |
7 |
|
950,000 |
|
0005 |
1 |
|
125,000 |
|
…. |
|
|
|
For a query “Find 4-bedroom houses in Queens with a price tag of around
$300,000”, the database system should rank the available houses
according to how well they match the given user preference, and return the top
houses for the user to inspect, as illustrated below for k=3:
|
House ID |
Bedrooms |
Location |
Price |
Score |
|
0014 |
4 |
|
298,000 |
0.98 |
|
0028 |
5 |
|
310,000 |
0.93 |
|
0001 |
4 |
|
380,000 |
0.91 |
|
0013 |
6 |
|
330,000 |
0.85 |
|
0008 |
3 |
|
350,000 |
0.81 |
|
…. |
|
|
|
… |
We map a top-k query into a
traditional relational range query that an unmodified RDBMS can then optimize
and process efficiently (VLDB’99 paper,
TODS’02
paper). For this mapping, we rely on multidimensional histograms for
selectivity estimation (SIGMOD’01
paper).
Over
web sources:
Sometimes
the “attributes” of the data objects over which a top-k query is issued are handled by external, autonomous databases (or web services) available
over the web. Consider a
scenario where a user is interested in restaurants in the

A user who lives at 2590 Broadway and is interested in
spending around $25 for a top quality restaurant might then ask a top-10 query Address=“2590
Broadway”, Price=$25, Rating=30 (on a scale from 1 to 30). The result to this query is a
list of the 10 restaurants that match the user’s specification the closest.
In such scenarios, access to the data is limited by the
interfaces the sources provide, and source access constraints and costs have to be taken into account to achieve
efficient query processing. We designed sequential and parallel algorithms that adaptively prioritize the processing of the known data objects to identify the top-k ones for a given query, making dynamic choices of which objects to discard and which ones to process further as query processing progresses. This results in efficient top-k query executions that (1)
handle databases supporting different interfaces, and (2) minimize access to
remote sources (ICDE’02 paper, TODS'04 paper).
Over
multimedia repositories:
Queries
on multimedia objects attributes (e.g.,
image, text) will typically request not just a set of objects, as in the
traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well
the object matches the selection condition (ranking). These grades are then used to identify the top k objects returned by the query. A top-k
query in this setting has the form:
SELECT object_id
FROM Repository
ORDER [k] by Ranking_expression
We
present techniques that map the top-k
query processing problem over multimedia repositories to a traditional selection query for which
efficient techniques for query processing and optimization using available
indexes may be adapted. Our proposed optimization techniques lead to the efficient evaluation of selection conditions over
multimedia repositories (SIGMOD’96
paper, TKDE’04
paper).
Over XML data:
In XML integration applications, XML data comes from heterogeneous sources, and therefore may not share the same schema. In this scenario, exact query matches are too rigid, so XML query answers are ranked based on their "similarity" to the queries, in terms of both content and structure.
Consider the following three XML fragments below, taken
from a heterogeneous collection of streaming news (RSS) documents:

The
following query:

only produces
an exact answer for XML fragment
(a). However, both fragments (b) and (c) are acceptable answers to the query
and could be of interest to the user.
Processing top-k queries efficiently in such a scenario is challenging, as the number of candidate answers increases dramatically with the query size. (XML queries are, in effect, joins.) By adaptively selecting query evaluation plans individually for each candidate answer, focusing on pruning irrelevant candidate answers as early as possible, our algorithms minimize the number of candidate answers considered during query evaluation (ICDE ’05 paper). We studied the impact of a variety of query execution and selectivity estimation strategies, as well as the effect of parallelism on the proposed techniques.
Amélie Marian (contact)
Nicolas Bruno (now at Microsoft
Research)
Sihem Amer-Yahia (AT&T
Research)
Surajit Chaudhuri
(Microsoft Research)
Nick Koudas (AT&T Research)
Divesh Srivastava (AT&T Research)
Adaptive
Processing of Top-k Queries in XML.
A. Marian, S. Amer-Yahia, N. Koudas,
and D.Srivastava.
Proc. of the 21st IEEE International Conference on Data Engineering
(ICDE’05), 2005.
Optimizing
Top-k Selection Queries over Multimedia Repositories.
S. Chaudhuri, L. Gravano, and A.
Marian.
IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, Aug. 2004.
Evaluating
Top-k Queries over Web-Accessible
Databases.
A. Marian, N. Bruno, and L. Gravano
ACM Transactions on Database Systems, vol. 29, no. 2, June 2004.
Top-k
Selection Queries over Relational Databases: Mapping Strategies and Performance
Evaluation.
N. Bruno, S. Chaudhuri, and L.
Gravano.
ACM Transactions on Database Systems, vol. 27, no.2, Jun. 2002.
Evaluating
Top-k Queries over Web-Accessible Databases.
N. Bruno, L. Gravano, and A. Marian.
Proc. of the 18th IEEE International Conference on Data Engineering (ICDE’02),
2002.
STHoles: A
Multidimensional Workload-Aware Histogram.
N. Bruno, S. Chaudhuri, and L.
Gravano.
Proc. of the 2001 ACM SIGMOD International Conference on Management of Data,
2001.
Evaluating Top-k Selection Queries.
S. Chaudhuri and L. Gravano.
Proc. of the 25th International Conference on Very Large Data Bases (VLDB’99),
1999.
Merging Ranks
from Heterogeneous Internet Sources.
L. Gravano and H. Garcia-Molina.
Proc. of the 23rd International Conference on Very Large Data Bases (VLDB’97),
1997.
Optimizing
Queries over Multimedia Repositories.
S. Chaudhuri and L. Gravano.
Proc. of the 1996 ACM SIGMOD International Conference on Management of Data,
1996.
Adaptive
Processing of Top-k Queries in XML..
A. Marian, S. Amer-Yahia, N. Koudas,
and D.Srivastava.
Proc. of the 21st IEEE International Conference on Data Engineering
(ICDE’05), 2005.
Evaluating Top-k Queries over Web-Accessible
Databases.
N. Bruno, L. Gravano, and A. Marian.
Proc. of the 18th IEEE International Conference on Data Engineering (ICDE’02),
2002.
STHoles: A Multidimensional Workload-Aware Histogram.
N. Bruno, S. Chaudhuri, and L.
Gravano.
Proc. of the 2001 ACM SIGMOD International Conference on Management of Data,
2001.
Evaluating Top-k Selection Queries.
S. Chaudhuri and L. Gravano.
Proc. of the 25th International Conference on Very Large Data Bases (VLDB'99),
1999.
Merging Ranks from Heterogeneous Internet Sources.
L. Gravano and H. Garcia-Molina.
Proc. of the 23rd International Conference on Very Large Data Bases (VLDB’97),
1997.
Optimizing Queries over Multimedia Repositories.
S. Chaudhuri and L. Gravano.
Proc. of the 1996 ACM SIGMOD International Conference on Management of Data,
1996.
Synthetic
data generation code
Synthetic
and real data sets used in experimental evaluations.