Speaker:  Dragomir R. Radev (University of Michigan)
Title: Social network analysis of text
Time: Friday(Dec. 15th) 1:00-2:00pm
Room: BH4750
Abstract
 Textual data is everywhere, in email and scientific papers, in
online newspapers and e-commerce sites. The Web contains more than 200
terabytes of text not even counting the contents of dynamic textual
databases. This enormous source of knowledge is seriously
underexploited. Textual documents on the Web are very hard to model
computationally: they are unstructured, time-dependent, collectively
authored, and of uneven importance.  Traditional grammar-based
techniques don't scale up to address such problems. Novel
representations and analytical tools are needed.

   NewsInEssence (www.newsinessence.com) is a system that crawls the
Web for news, automatically clusters them by topic, and produces
user-defined extractive summaries of each cluster. A recent addition
to the battery of summarization algorithms available to NewsInEssence
is the Cosine Centrality method.  In this talk I will describe how one
can apply the theory of social networks and stochastic processes (in
particular rank-based prestige and random walks on undirected graphs)
to multi-document text summarization.

   (I will begin my talk with a short tutorial on the mathematics
needed for the rest of the talk.)

   If time permits, at the end of the talk, I will quickly describe
two recent ongoing projects in my research group: one on machine
learning for object classification using random walks on bipartite
(feature-object) graphs and another on using phylogenetic techniques
for fact tracking in evolving multi-document summarization.

Short Biography:

Dragomir R. Radev is Assistant Professor of Information, Electrical
Engineering and Computer Science, and Linguistics at the University of
Michigan, Ann Arbor.  He leads the CLAIR (Computational Lingusitics
And Information Retrieval) group which currently includes 12
undergraduate and graduate students.  Dragomir holds a Ph.D. in
Computer Science from Columbia University.  Before joining Michigan,
he was a Research Staff Member at IBM's TJ Watson Research Center in
Hawthorne, NY.  He is the author of more than 45 papers on information
retrieval, text summarization, graph models of the Web, question
answering, machine translation, text generation, and information
extraction.  Dr. Radev's current research on probabilistic and
link-based methods for exploiting very large textual repositories,
representing and acquiring knowledge of genome regulation, and
semantic entity and relation extraction from Web-scale text document
collections is supported by NSF and NIH.  Dragomir serves on the
HLT-NAACL advisory committee, was recently reelected as treasurer of
NAACL, is a member of the editorial boards of JAIR and Information
Retrieval, and is a four-time finalist at the ACM international
programming finals (as contestant in 1993 and as coach in
1995-1997). Dragomir received a graduate teaching award at Columbia
and recently, the U. of Michigan award for Outstanding Research

Speaker: Zografoula Vagena, Ph.D. Candidate,  UCR CS  
Title: Optimization of XML Path Queries
Time: Thursday(Dec. 15th) 12:30-2:00pm
Room: BH4750
Abstract
Due to its inherent complexity the problem of path query processing over
XML data has received a lot of attention and numerous relevant access
methods have already been proposed. With such an abundance of proposals,
none of which has proven its total superiority, the need to integrate them
under a common optimization framework becomes compelling. Traditional
solutions emanating from the relational world do not seem promising due to
their inherent problems, namely large search spaces, dependence on
intermediate result size estimation and propagation of estimation errors,
which remain and are exacerbated in the XML world, where not only the
value but also the structure of the data becomes important. In the current
work, we attempt to overcome the previous problems by exploring a
different direction: We identify as 'bad quality' plans the ones which
produce unnecessary large intermediate results and we set the optimization
objective as to avoid such plans instead of identifying the best plan.
We then propose holistic algorithms to answer path queries which achieve
worst case linear performance guarantees. We combine the proposed methods
using a detailed cost model that we have developed for the purpose. The
holistic nature of our algorithms, permits the definition of a cost model
that avoids the propagation of estimation errors. Moreover it enables
global optimization while examining a small search space that explores
only local decisions.
Speaker:  Yi Xia, Ph.D. Candidate Computer Science 
Department, UCLA
Title: High-Dimensional OLAP: A Minimal Cubing Approach
Time: Friday(Dec. 3rd) 12:30-2:00pm
Room: BH4549
Abstract
Data cube has been playing an essential role in fast OLAP (online 
analytical processing) in many multi-dimensional data warehouses. However, 
there exist data sets in applications like bioinformatics, statistics, and 
text processing that are characterized by high dimensionality, e.g., over 
100 dimensions, and moderate size, e.g., around 10 6 tuples. No feasible 
data cube can be constructed with such datasets. In this paper we will 
address the problem of developing an efficient algorithm to perform OLAP on such data sets.
Experience tells us that although data analysis tasks may involve a high
dimensional space, most OLAP operations are performed only on a small 
number of dimensions at a time. Based on this observation, we propose a
novel method that computes a thin layer of the data cube together with
associated value-list indices. This layer, while being manageable in 
size, will be capable of supporting flexible and fast OLAP operations in 
the original high dimensional space. Through experiments we will show that 
the method has I/O costs that scale nicely with dimensionality. 
Furthermore, the costs are comparable to that of accessing an existing 
data cube when full materialization is possible.
Speaker:  Raymond K. Pon, Ph.D. Student Computer Science Department, UCLA
Title: Report on the Dagstuhl Seminar - Data Quality on the Web
Time: Friday(Nov. 5th) 12:30-2:00pm
Room: BH4549
Abstract
M. Gertz, M.T. Ozsu, G. Saake, K-U Sattler, "Report on the Dagstuhl Seminar
- Data Quality on the Web," SIGMOD Record, vol. 33, no. 1, Mar. 2004.
Seminar website: http://www.dagstuhl.de/03362/
(http://sirius.cs.ucdavis.edu/Dagstuhl03/)
Seminar report: ftp://ftp.dagstuhl.de/pub/Reports/03/03362.pdf
SIGMOD Record paper: http://portal.acm.org/citation.cfm?id=974121.974144

Over the past few years, techniques for managing, querying, and integrating
data on the Web have significantly matured. Well-founded and practical
approaches to assess or even guarantee a required degree of quality of the
data in these frameworks, however, are still missing. This can be
contributed to the lack of well defined data quality metrics and assessment
techniques, and the difficulty of handling information about data quality
during data integration and query processing. Data quality problems arise in
many settings, such as the integration of business data, in Web mining, data
dissemination, and in querying the Web using search engines. Data quality
(DQ) addresses various forms of data, including structured and
semistructured data, text documents, multimedia, and streaming data.
Different forms of metadata describing the quality of data is becoming
increasingly important since they provide applications and users with
information about the value and reliability of (integrated) data on the Web.
 
The Dagstuhl Seminar "Data Quality on the Web", organized by Michael Gertz,
Tamer ¨ Ozsu, Gunter Saake, and Kai-Uwe Sattler, took place between August
31st and September 5th 2003 at Schloss Dagstuhl,Germany. The objective of
the seminar was to (1) foster collaboration among researchers that deal with
DQ in different areas, (2) assess existing results in managing the quality
of data, and (3) establish a framework for future research in the area of
DQ. The application contexts considered during the seminar included in
particular (Web-based) data integration and information retrieval scenarios,
scientific databases, and application domains in the computational sciences
and Bioinformatics. In all these areas, data quality plays a crucial role
and therefore different, tailored solutions have been developed. Sharing and
exchanging this knowledge could result in significant synergy effects.

Speaker: Xin Zhou, Ph.D. Candidate Computer Science Department, UCLA
Title:XBiT: An XML-based Bitemporal Data Model
Time: Friday(Oct. 29th) 12:30-2:00pm
Room: BH4549
Abstract
Practice talk for ER2004

Presentation Title: XBiT: An XML-based Bitemporal Data Model [1]

Abstract. Past research work on modeling and managing temporal information has, so far, failed to elicit support in commercial database systems. The
increasing popularity of XML o®ers a unique opportunity to change this situation, inasmuch as XML and XQuery support temporal information much better
than relational tables and SQL. This is the important conclusion claimed in this paper where we show that valid-time, transaction-time, and bitemporal
databases can be naturally viewed in XML using temporally-grouped data models. Then, we show that complex historical queries, that would be very di±cult
to express in SQL on relational tables, can now be easily expressed in standard XQuery on such XML-based representations. We ¯rst discuss the management
of transaction-time and valid-time histories and then extend our approach to bitemporal histories. The approach can be generalized naturally to support
the temporal management of arbitrary XML documents and queries on their version history.

-------------------------------------------------------------------------------------------------------------

Demo Title: Temporal Information Management using XML

We will show that (i) XML hierarchical structure can naturally represent the history of databases and XML documents via temporally-grouped data models,
and (ii) powerful temporal queries can be expressed in XQuery without requiring any extension to current standards. This approach is quite general and,
in addition to the evolution history of databases, it can be used to support the version history of XML documents for transaction-time, valid-time, and
bitemporal chronicles [1]. We will demo the queries discussed in [1] and show that this approach leads to simple programming environments that are
fully-integrated with current XML tools and commercial DBMSs.

Speaker: Yun Chi, Ph.D. Candidate Computer Science Department, UCLA
Title:Catch the Moment: Maintaining Closed Frequent Itemsets over a Data Stream Sliding Window (ICDM practice talk)
Time: Friday(Oct. 22th) 12:30-2:00pm
Room: BH4549
Abstract
This paper considers the problem of mining closed frequent itemsets over a data stream sliding window using limited memory space.  We design a synopsis
data structure to monitor transactions in the sliding window so that we can output the current closed frequent itemsets at any time.  Due to time and
memory constraints, the synopsis data structure cannot monitor all possible itemsets. However, monitoring only frequent itemsets will make it impossible
to detect new itemsets when they become frequent. In this paper, we introduce a compact data structure, the closed enumeration tree (CET), to maintain a
dynamically selected set of itemsets over a sliding window.  The selected itemsets contains of a boundary between closed frequent itemsets and the rest
of the itemsets. Conceptual drifts in a data stream are reflected by boundary movements in the CET. In other words, a status change of any itemset (e.g.,
from non-frequent to frequent) must occur through the boundary.  Because the boundary is relatively stable, the cost of mining closed frequent itemsets
over a sliding window is dramatically reduced to that of mining transactions that can possibly cause boundary movements in the CET.  Our experiments show
that our algorithm performs much better than a representative algorithm for the sate-of-the-art approaches.
Speaker: Victor Liu, Ph.D. Candidate Computer Science Department, UCLA
Title:Automatic identification of search goals behind a user's Web query
Time: Friday(Oct. 15th) 12:30-2:00pm
Room: BH4549
Abstract
Understanding the search goals of a user's Web query is
critical for a search engine to improve its performance.
Existing research has typically classified search goals as
either navigational (to reach a particular Website the user
has in mind) or informational (to collect information about
a query topic).  Knowing the probable search goal behind a
query allows the search engine to tailor its ranking methods
to present the ``best'' result ranking that satisfies the
user's need.  In this talk we investigate automatic
mechanisms to identify search goals, which has not been
fully studied in the past.  In addition, to evaluate the
effectiveness of the developed mechanisms, we have performed
human subject study involving 28 participants from our
department, to obtain ``accurate'' search goals for a
standard set of queries.  The results of this human subject
study along with several interesting findings will also be
presented in this talk.

Speaker: Qinghua Zou, Ph.D. Candidate Computer Science Department, UCLA
Title:Using a Compact Tree to Index and Query XML Data 
Time: Friday(Oct. 8th) 12:30-2:00pm
Room: BH4549
Abstract
Indexing XML is crucial for efficient XML query processing. We propose a compact tree (Ctree) for XML indexing, which provides not only concise path
summaries at group level but also detailed child-parent relationships at element level. Based on Ctree, we are able to measure how well XML data is
structured. We also propose a three-step query processing method. Its efficiency is achieved by: (1) summarizing large XML data structures into a
condensed Ctree; (2) pruning irrelevant groups to significantly reduce the search space; (3) eliminating join operations between the matches for value
predicates and those for structure constraints and (4) using Ctree properties such as regular groups to reduce query processing time. Our experiments
reveal that Ctree is an effective data structure for managing XML data.

Speaker: Yun Chi, Ph.D. Candidate Computer Science Department, UCLA
Title:Loadstar: A Load Shedding Scheme for Classifying Data Streams
Time: Friday(Oct. 1st) 12:30-2:00pm
Room: BH4549
Abstract
 We consider the problem of resource allocation in mining data
  streams. Due to the complexity of the mining problem, the large
  volume, and the high speed of streaming data, mining algorithms must
  be resistant to the effects of system overload.  Thus, how to
  realize maximum benefits under various resource constraints becomes
  an important topic. In this paper, we propose a load shedding scheme
  for classifying data streams.  We introduce QoD (quality of
  decision) specifications to measure the quality of classification
  results.  To guide load shedding, the QoD measures are applied on
  predicted feature values in the future so that we know what data can
  be safely discarded without much negative impact on the benefits of
  mining. For this purpose, we propose a Markov model based feature
  value predictor.  Furthermore, our load shedding scheme is able to
  learn and adapt to the changes of the data characteristics of the
  stream.  Experiments on both synthetic data and real-life data have
  shown that our load shedding scheme is effective in improving the
  benefits of data stream classification under resource constraints.
Speaker: Themis Palpanas, PostDoc Researcher, University of California, Riverside
Title:Online Amnesic Approximation of Streaming Time Series

Time: Friday(July 2nd) 12:30-2:00pm
Room: BH4760
Abstract
The past decade has seen a wealth of research on time series representations, because the manipulation, storage, and indexing of large volumes of raw time series data is impractical. The vast majority of research has concentrated on representations that are calculated in batch mode and represent each value with approximately equal fidelity.

However, the increasing deployment of mobile devices and real time sensors has brought home the need for representations that can be incrementally updated, and can approximate the data with fidelity proportional to its age.

The latter property allows us to answer queries about the recent past with greater precision, since in many domains recent information is more useful than older information. We call such representations amnesic.

While there has been previous work on amnesic representations, the class of amnesic functions possible was dictated by the representation itself. In this work, we introduce a novel representation of time series that can represent arbitrary, user-specified amnesic functions. For example, a meteorologist may decide that data that is twice as old can tolerate twice as much error, and thus, specify a linear amnesic function.

In contrast, an econometrist might opt for an exponential amnesic function. We propose online algorithms for our representation, and discuss their properties.

Finally, we perform an extensive empirical evaluation on 40 datasets, and show that our approach can efficiently maintain a high quality amnesic approximation.

Short Bio:

Themis Palpanas is currently a PostDoctorate fellow at the University of California, Riverside. He holds a PhD and MSc in Computer Science from the University of Toronto, and a BSc in Electrical and Computer Engineering from the National Technical University of Athens, Greece.

His research interests include data summarization, approximate answering, OLAP, streaming data, data mining, view maintenance and caching. He has worked for Microsoft Research and IBM Almaden Research Center, and holds two US patents.

Speaker: Laura Yu Chen, Ph.D. Student Computer Science Department, UCLA
Title:

Time: Friday(June 11th) 12:30-2:00pm
Room: BH4549
Abstract
We are working jointly with the UCLA Urology clinic to analyze the past
seven years of urology surgery data. The data are presented in tabular
form, which consists of 52 attributes and is divided into three
categories: patient demographies, pre-operation data (e.g. urodynamics)
and post-operation condition (e.g complication, urine continence results,
etc.). The physicians are interested in learning under what conditions,
certain surgery operations will be successful with high probability, and
under what type of conditions augmented surgery operations are effective.
Our data mining techniques are able to select a "minimum" set of key
attributes that yield rules with confidence and support. Based on this set
of attributes, statistical classification techniques such as CART can be
used to classify the training set and determine the optimal number of
cells and cell sizes for each attribute that minimizes the classification
error. After adjusting cell sizes on key attributes, association rules
with higher confidence can be generated by data mining algorithms. Such
high-quality rules can provide guidelines for the physicians to decide
whether to perform the surgery.

Speaker: Uichin (Eugene) Lee, Ph.D. Student Computer Science Department, UCLA
Title: 
Understanding user goals in web search, Daniel E. Rose and Danny
Levinson(Yahoo! Inc.), WWW04
http://portal.acm.org/citation.cfm?id=988675&jmp=cit&coll=portal&dl=ACM

Time: Friday(June 4th) 12:30-2:00pm
Room: BH4549
Abstract
Previous work on understanding user web search behavior has focused on how
people search and what they are searching for, but not why they are
searching. In this presentation we'll see the change of user goal in
between the traditional IR (Information Retrieval) system and the web
search engine. Classic IR is inherently predicated on users searching for
information, so called "information need." But the need behind a web
search is often not informational; it might also be navigational (give me
the URL of the site I want to reach) or resource seeking (obtain a
resource, not information, available on web pages). In this paper, the
authors describe a framework for understanding the underlying goals of
user searches and experience in using the framework to manually classify
queries from a web search engine.  Results suggest that so-called
"navigational" searches are less prevalent than generally believed, while
previously unexplored "resource seeking" goal may account for a large
fraction of web searches.

Speaker: Yi Xia, Ph.D. Candidate Computer Science Department, UCLA
Title: Mining Association Rules with Non-uniform Privacy Concerns
(Accepted at the 9th ACM SIGMOD Workshop on Research Issues in Data Mining
and Knowledge Discovery (DMKD-2004), the technical report is available at:
ftp://ftp.cs.ucla.edu/tech-report/2004-reports/040015.pdf)

Time: Friday(May. 28) 12:30-2:00pm
Room: BH4549
Abstract
Privacy concerns have become an important issue in data mining. A popular
way to preserve privacy is to randomize the dataset to be mined in a
systematic way and mine the randomized dataset instead. On the other hand,
people usually have different privacy concerns for different attributes in
data. E.g., in survey data, the sensitivity of questions varies.
Appropriate use of this information can lead to more accurate data mining
results. However, this information has not been fully utilized by many
privacy preserving association rule mining algorithms.

In this paper, we generalize the privacy preserving association rule
mining problem by allowing different attributes to have different levels
of privacy, that is, using different randomization factors for values of
different attributes in the randomization process. We also propose an
efficient algorithm RE (Recursive Estimation) to estimate the support of
itemsets under this framework. Both theoretical and empirical results show
that the use of non-uniform randomization factors improves the accuracy of
the support estimates, compared to the use of one conservative
randomization factor.

Speaker: Yan-Nei Law, Ph.D. Student Computer Science Department, UCLA
Title: Dynamic Weighted Majority: A New Ensemble Method for Tracking Concept Drift (ICDM2003) Jeremy Z. Kolter and Marcus A. Maloof Georgetown
University, Washington, DC

Time: Friday(May. 14) 12:30-2:00pm
Room: BH4549
Abstract
Algorithms for tracking concept drift are important for many
applications. We present a general method based on the Weighted
Majority algorithm for using any on-line learner for concept
drift. Dynamic Weighted Majority (DWM) maintains an ensemble of
base learners, predicts using a weighted-majority vote of these
"experts", and dynamically creates and deletes experts in response
to changes in performance. We empirically evaluated two experimental
systems based on the method using incremental naive Bayes and
Incremental Tree Inducer (ITI) as experts. For the sake of comparison,
we also included Blum's implementation of Weighted Majority. On the
STAGGER Concepts and on the SEA Concepts, results suggest that the
ensemble method learns drifting concepts almost as well as the base
algorithms learn each concept individually. Indeed, we report the best
overall results for these problems to date.
Speaker: Yirong Yang, Ph.D. Candidate Computer Science Department, UCLA
Title: HybridTreeMiner: An Efficient Algorithm for Mining Frequent Rooted Trees
and Free Trees Using Canonical Forms & CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees
Time: Friday(May. 07) 12:30-2:00pm
Room: BH4549
Abstract
HybridTreeMiner: An Efficient Algorithm for Mining Frequent Rooted Trees
and Free Trees Using Canonical Forms
(Yun Chi, Yirong Yang and Richard R. Muntz, SSDBM'04)

Abstract

Tree structures are used extensively in domains such as computational
biology, pattern recognition, XML databases, computer networks, and so on.
In this paper, we present HybridTreeMiner, a computationally ef- ficient
algorithm that discovers all frequently occurring subtrees in a database
of rooted unordered trees. The algorithm mines frequent subtrees by
traversing an enumeration tree that systematically enumerates all
subtrees. The enumeration tree is defined based on a novel canonical form
for rooted unordered trees¨Cthe breadth-first canonical form (BFCF). By
extending the definitions of our canonical form and enumeration tree to
free trees, our algorithm can efficiently handle databases of free trees
as well. We study the performance of our algorithms through extensive
experiments based on both synthetic data and datasets from real
applications. The experiments show that our algorithm is competitive in
comparison to known rooted tree mining algorithms and is faster by one to
two orders of magnitudes compared to a known algorithm for mining frequent
free trees. keywords: frequent subtree, canonical form, tree isomorphism,
enumeration tree, rooted unordered tree, free tree.

--------------------------------------------------------------------
CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees
(Yun Chi, Yirong Yang, Yi Xia and Richard R. Muntz, PAKDD'04)

Abstract. Tree structures are used extensively in domains such as com-
putational biology, pattern recognition, XML databases, computer net-
works, and so on. One important problem in mining databases of trees is
to ¡¥nd frequently occurring subtrees. However, because of the combina-
torial explosion, the number of frequent subtrees usually grows exponen-
tially with the size of the subtrees. In this paper, we present CMTreeM-
iner, a computationally e¡Àcient algorithm that discovers all closed and
maximal frequent subtrees in a database of rooted unordered trees. The
algorithm mines both closed and maximal frequent subtrees by traversing
an enumeration tree that systematically enumerates all subtrees, while
using an enumeration DAG to prune the branches of the enumeration tree
that do not correspond to closed or maximal frequent subtrees. The enu-
meration tree and the enumeration DAG are de¡¥ned based on a canonical
form for rooted unordered trees{the depth-¡¥rst canonical form (DFCF).
We compare the performance of our algorithm with that of PathJoin, a
recently published algorithm that mines maximal frequent subtrees.
keywords: frequent subtree, closed subtree, maximal subtree, enumera-
tion tree, rooted unordered tree.
Speaker: Andrea Fang Chu, Ph.D. Candidate Computer Science Department, UCLA
Title: Fast and Light Boosting for Adaptive Mining of Data Streams
Time: Friday(Apr. 23) 12:30-2:00pm
Room: BH4549
Abstract
Supporting continuous mining queries on data streams requires algorithms
that (i) are fast, (ii) make light demands on memory resources, and (iii)
easily adapt to concept drift. We propose a novel boosting ensemble method
that achieves these objectives. The technique is based on a dynamic
sample-weight assignment scheme that achieves the accuracy of traditional
boosting without requiring multiple passes through the data. The technique
assures faster learning and competitive accuracy using simpler base
models. The scheme is then extended to handle concept drift via change
detection. The change detection approach aims at significant data changes
that could cause serious deterioration of the ensemble performance, and
replaces the obsolete ensemble with one built from scratch. Experimental
results confirm the advantages of our adaptive boosting scheme over
previous approaches.
Speaker: Zhenyu (Victor) Liu, Ph.D. Candidate Computer Science Department, UCLA
Title: Approximate Aggregation Techniques for Sensor Database
Time: Friday(Apr. 16) 12:30-2:00pm
Room: BH4549
Abstract
To compute aggregates such as SUM in a sensor network, we typically want
to achieve two goals: energy-efficiency and fault tolerance (or resilience
to communication failure).  A straight-forward approach may achieve both
goals at the same time, but introducing great errors.  To be
energy-efficient, we typically use in-network aggregation (similar to
"early-return" in User-Defined-Aggregation), which requires a parent
sensor aggregate the values from its children branches on the fly, instead
of forwarding the long lists of values all the way to the root sensor
before they are aggregated.  To be fault-tolerant, we let a child sensor
send its result to multiple parents, so that with a much higher chance
this "partial" result will finally reach the root sensor.  However, when
we compute aggregates such as SUM or AVG, combining the above two
techniques (i.e. in-network aggregation and duplication) may introduce
errors.  This happens because SUM and AVG are "duplication-sensitive."
The paper that I will present utilizes theoretical results from the sketch
theory to solve this problem.  The proposed method leads to approximate
answers, and the errors are evaluated experimentally.

* Related Reference:
P. Flajolet and G. Nigel Martin, Probabilistic Counting Algorithms for
Data Base Applications, J. of Comp. and Sys. Sci., Vol. 31, 182-209, 1985

Speaker: Yannis Papakonstantinou, Associate Prof, Computer Science & Eng, UCSD
Title: A Logical Framework for XML-Based Mediation
Time: Friday(Apr. 02) 12:00-2:00pm, There will be PIZZA, served at 12:00pm
Room: BH4760
Abstract
Business and scientific applications often
require access, filtering, integration and
transformation of the data of multiple distributed
and heterogeneous information sources.
Mediator systems address the need by providing
query-able integrated views of the source data.
We focus on mediators providing virtual XML views.
When such a mediator receives an XQuery
on the XML view, it decomposes it into queries
and requests that are sent to the underlying
sources and are compatible with their abilities.
The mediator presents a virtual query result to
the client. Consequently, client navigations in the virtual
query result are reduced into queries and navigations
in the sources.
We present three mediator architecture, built around
the LAV, GAV and combined LAV/GAV paradigms.
Logical optimizations and query answering are based on
the NEsted Xml Tree patterns (NEXT) framework,
which we present along with its applications to query
answering and optimization.

Speaker: Victor Liu
Title A Probabilistic Approach to Metasearching with Adaptive Probing 
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
An ever-increasing amount of valuable information is stored in Web
databases, ``hidden'' behind search interfaces.  To save the
user's effort in manually exploring each database, metasearchers
automatically select the most relevant databases to a user's
query. In this talk, we focus on one of the technical challenges in
metasearching, namely database selection. Past research uses a
pre-collected summary of each database to estimate its
``relevancy'' to the query, and in many cases make incorrect
database selection. In this paper, we propose two techniques:
probabilistic relevancy modelling and adaptive probing. First, we
model the relevancy of each database to a given query as a
probabilistic distribution, derived by sampling that database.
Using the probabilistic model, the user can explicitly specify a
desired level of certainty for database selection.  The adaptive
probing technique decides which and how many databases to contact
in order to satisfy the user's requirement. Our experiments on
real Hidden-Web databases indicate that our approach significantly
improves the accuracy of database selection at the cost of a small
number of database probing.
Speaker:  Alexandros Ntoulas
Title: What's New on the Web? The Evolution of the Web from a Search Engine Perspective
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
We seek to gain improved insight into how Web search engines should cope
with the evolving Web, in an attempt to provide users with the most
up-to-date results possible. For this purpose we collected weekly
snapshots of some 150 Web sites over the course of one year, and measured
the evolution of content and link structure. Our measurements focus on
aspects of potential interest to search engine designers: the evolution of
link structure over time, the rate of creation of new pages and new
distinct content on the Web, and the rate of change of the content of
existing pages under search-centric measures of degree of change. Our
findings indicate a rapid turnover rate of Web pages, i.e., high rates of
birth and death, coupled with an even higher rate of turnover in the
hyperlinks that connect them. For pages that persist over time we found
that, perhaps surprisingly, the degree of content shift as measured using
TF.IDF cosine distance does not appear to be consistently correlated with
the frequency of content updating. Despite this apparent noncorrelation,
the rate of content shift of a given page is likely to remain consistent
over time. That is, pages that change a great deal in one week will likely
change by a similarly large degree in the following week. Conversely,
pages that experience little change will continue to experience little
change. We conclude the paper with a discussion of the potential
implications of our results for the design of effective Web search
engines.

This is a talk for a paper that was accepted in the WWW2004 conference.
The camera-ready version of the paper can be found here:
http://oak.cs.ucla.edu/~ntoulas/pubs/ntoulas_new.pdf

Speaker:  Cheng-Jia Lai
Title: Scalable Overlay Networks in Peer-to-Peer Systems
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
In this talk, I will discuss about peer-to-peer (P2P)
overlay networks on large scales. I will review the
basics of Distributed Hash Tables (DHT), and introduce
a new proposal, Generic Identifier Network (GIN), which
has an improved capacility of merging DHTs.

A P2P overlay network is an interconnection of computers
in the application layer and provides a search function
for the objects that are registered on them, based on an
underlying network that provides the connectivity of all
those computers. A fundamental function of the overlay is
to resolve object locations based on application-layer
object identifiers (OID). OIDs are typically not related
to the network localities where the objects are stored.
Thus, the most difficult design issue may perhaps be
reducing the average routing table size per node while
retaining high routing performance. In practice, a P2P
system evolves with dynamic membership of its nodes and
objects, in a complex way that may involve mergers of
existing P2P overlays, to extend the coverage and user
population. However, the merger has not been addressed
by previously proposed approaches. In this talk, I will
introduce GIN, a new proposal that addresses all those
issues and is capable of merging P2P overlays in a very
efficient and scalable way.

Speaker:  Richard Luo
Title: Efficient Support for Window Aggregates on Data Streams
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
Window construct plays a critical role in advanced query languages for
data streams and in the OLAP functions recently introduced in SQL:1999 for
business intelligence applications. Their efficient implementation poses
challenges, particularly for data streams where sort-based implementation
are no longer usable because of their blocking behavior. We provide
a general solution to this problem based on delta-maintenance techniques
for aggregates.

Our study of alternative  implementation strategies, focuses on
the operators to be used for delta-maintenance and the techniques used for
memory buffer management. We propose specialized buffering techniques
for logical windows and physical ones, and, via several experiments, show
that the proposed solutions are efficient
in  terms of execution speed and memory consumption.

Many O-R DBMSs now support user-defined aggregates, and
window aggregates in  SQL:1999 OLAP functions, but they do
not support user-defined aggregates with windows. The delta-maintenance
primitives proposed in this paper provide a simple solution to remove this
limitation. Indeed their combined power can be beneficial in many applications,
for both data warehouses and data streams.  For instance, we show
how an adaptive classifiers can be expressed using windows on
user-defined-aggregates.
Speaker:  Henning Muller
Title: medGIFT - retrieving medical images based on their content
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
The rising amount of medical images produced in hospitals and at
practitioners creates the need to develop new tools that allow more than
simple access by textual patient identification. Content-based visual
image retrieval is one method that allows browsing of large image
repositories by their visual content.

The medGIFT project is based on the Open Source image retrieval engine GNU
Image Finding Tool (GIFT, http://www.gnu.org/software/gift/) that makes
this research area also accessible for institutions that do not have the
money to fund there own large projects. Small modifications adapt the tool
for retrieval in the medical domain.

A connection with the case database system casimage
(http://www.casimage.com/) allows to get access to the textual data of the
stored cases in connection with the found images.

Currently a project is underway to evaluate the performance that
content-based medical image retrieval systems can reach. Several project
are planned to specialize the projects for certain sorts of images such as
high resolution CTs of the lung or dermatologic images. This will create a
need for higher-level visual features that consequently will need to be
created.

Speaker's CV
Henning Müller studied Medical Informatics with a specialization in image
and signal processing at the University of Heidelberg, Germany from 1992
to 1997.
From 1989-1998 he worked on several research and development projects in
the medical field at the GSbG (http://www.gsbg.de/, Gesellschaft für
Systemberatung im Gesundheitswesen).
His diploma thesis (1996-1997) was on the development of DICOM interfaces
for a teleradiology system at the German Cancer Research Center
(http://mbi.dkfz-heidelberg.de/), Heidelberg, Germany.
From 1997-1998 he worked at the DaimlerChrysler Research and Technology
North America (RTNA, http://www.rtna.daimlerchrysler.com/) in Portland,
Oregon, USA with a scholarship of the German government from the Carl
Duisberg Society.
From 1998 to 2002 he performed research on his PhD. in performance
evaluation and relevance feedback in content-based visual information
retrieval at the Computer Vision and Multimedia Lab at the University of
Geneva.
Within this research project he spend two months in 2001 at Monash
University, Melbourne, Australia to work on long-term user behavior from
usage log files.
Since 2002, he is working on the indexation of medical images at the
University Hospitals of Geneva. The main research interests are in the
fields of image analysis, medical image retrieval, user interaction and
performance evaluation of retrieval systems.

Speaker:  Yijian Bai
Title: Preference queries, skyline queries, and their optimization
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
The skyline of a set of d-dimensional points contains the points that are
not dominated by any other point on all dimensions. Skyline computation
has recently received considerable attention in the database community,
especially for progressive (or online) algorithms that can quickly return
the first skyline points without having to read the entire data file.
Currently, the most efficient algorithm is NN (nearest
neighbors), which applies the divide-and-conquer framework on
datasets indexed by R-trees. Although NN has some desirable features (such
as high speed for returning the initial skyline points, applicability to
arbitrary data distributions and dimensions), it also presents several
inherent disadvantages (need for duplicate elimination if d>2, multiple
accesses of the same node, large space overhead). In this paper we develop
BBS (branch-and-bound skyline), a progressive
algorithm also based on nearest neighbor search, which is IO optimal,
i.e., it performs a single access only to those R-tree nodes that may
contain skyline points. Furthermore, it does not retrieve duplicates and
its space overhead is significantly smaller than that of NN. Finally, BBS
is simple to implement and can be efficiently applied to a variety of
alternative skyline queries. An analytical and experimental comparison
shows that BBS outperforms NN (usually by orders of magnitude) under all
problem instances.
Speaker:  Raymond K. Pon
Title: CQL, a Continuous Query Language, is supported by the STREAM prototype Data Stream
Time: 12:30 - 2:00pm
Room: BH4549
Abstract
CQL, a Continuous Query Language, is supported by the STREAM prototype
Data Stream
Management System at Stanford. CQL is an expressive SQL-based
declarative language for
registering continuous queries against streams and updatable relations.
We begin by presenting
an abstract semantics that relies only on "black box" mappings among
streams and relations.
>From these mappings we define a precise and general interpretation for
continuous queries.
CQL is an instantiation of our abstract semantics using SQL to map from
relations to relations,
window specifications derived from SQL-99 to map from streams to
relations, and three new
operators to map from relations to streams. Most of the CQL language is
operational in the
STREAM system. We present the structure of CQL's query execution plans
as well as details
of the most important components: operators, inter-operator queues,
synopses, and sharing of
components among multiple operators and queries. Examples throughout the
paper are drawn
from the Linear Road benchmark recently proposed for Data Stream
Management Systems. We
also curate a public repository of data stream applications that
includes a wide variety of queries
expressed in CQL.
Speaker: Yirong Yang 
Title: Learning Naive Bayesian Classifier from Noisy Data
Time: 12:00 - 1:00pm
Room: BH4549
Abstract
Classification is one of the major tasks in knowledge discovery and data
mining. Naive Bayes classifier, in spite of its simplicity, has proven
surprisingly effective in many practical applications. In real datasets,
noise is inevitable, because of the imprecision of measurement or privacy
preserving mechanisms. In this paper, we develop a new approach,
LinEar-Equation-based noise-aWare bAYes classifier (LEEWAY ), for learning
the underlying naive Bayes classifier from noisy observations. Using
linear equation systems and optimization methods, LEEWAY reconstructs the
underlying probability distributions of the noise-free dataset based on
the given noisy observations. By incorporating the noise model into the
learning process, we improve the classification accuracy. Several
experiments are presented to evaluate the performance of LEEWAY. The
experimental results show that LEEWAY is an effective technique to handle
noisy data and it provides higher classification accuracy than the
traditional approach.
Speaker: Prof. Junghoo Cho
Title: Search Engines Considered Harmful: In Search of Unbiased Web Ranking
Time: 12:30 - 2:00pm
Room: BH4549

Abstract:
In this talk, we discuss the widespread use of Web search engines and
its potential impact on the ecology of the Web.  Recent studies show
that a significant portion of Web accesses are referred by search
engines. Furthermore, the Web-search market is increasingly
dominated by a few number of key players.  What are the implications
of this heavy reliance of Web users on search engines in their pursuit
of information?  For example, given that search engines return
currently ``popular'' pages at the top of search results, are we
somehow penalizing newly-created pages that are not very well known
yet?  Are popular pages getting even more popular while new pages are
being completely ignored?  We first show that this unfortunate trend
indeed exists on the Web through an experimental study based on real
Web data.  We then analytically estimate how much longer it takes for
a new page to attract a large number of Web users when search engines
return only popular pages at the top of search results.  Finally, we
develop new ways of ranking Web pages that can significantly reduce
the potential bias of existing ranking metrics.  We show that our
proposed ranking metric can alleviate the "rich-get-richer"
phenomenon and help new and high-quality pages get the attention that
they deserve.
Speaker: Fusheng Wang
Title: ArchIS: An Efficient Transaction-Time Temporal Database System 
       Built on Relational Databases and XML
Host:  Yun Chi
Time: 12:30 - 2:00pm
Room: BH4549

Abstract:
Better support for temporal applications by database systems represents an
important technical objective that is difficult to achieve; indeed it
requires devising integrated solutions for technical problems that are
challenging on their own, including (i) expressive temporal
representations and data models, (ii) powerful languages for temporal
queries and snapshot queries, (iii) indexing, clustering and query
optimization techniques for managing temporal information efficiently, and
(iv) architectures that bring together different pieces of enabling
technology into a robust system that is appealing to users.
In this paper, we describe the design and implementation of a transaction-time
DBMS called  ArchIS that achieves these objectives by supporting a
temporally grouped data model on top of relational RDBMS. ArchIS'
architecture uses  (a) XML to support temporally grouped (virtual)
representations of the database history, (b) XQuery to express powerful
temporal queries on such views, (c) temporal clustering and indexing
techniques for managing the actual historical data in a RDBMS, and (d)
SQL/XML for executing the queries on the XML views as equivalent queries
on the relational DB. The performance studies presented in the paper show
that ArchIS is quite effective at storing, and retrieving under complex
query conditions, the transaction-time history of relational databases. By
supporting database compression as an option, ArchIS also achieves
excellent storage efficiency for archived histories. This approach
achieves full-functionality transaction-time databases without requiring
temporal extensions in XML or database standards.