4/12: Giovanni Giuffrida
Name: Giovanni Giuffrida
Time: 4/12 Firday 12-2pm
Room: 4760 Commons
Title: Experiences in mining commercial databases
Abstract
Knowledge Discovery from Databases and Data Mining (KDD/DM) is a young
multidisciplinary area that combines experiences from, besides others,
statistics, machine learning (ML), databases and, data visualization.
KDD/DM grew at breakneck pace in recent years driven by the needs of
an industry which, over the past decades, accumulated tremendous
amounts of data and, now, lacks the capability of effectively (and
efficiently) gathering relevant information from such a vast amount of
data. The relational model has largely shown its strength in
structuring and retrieving data when the type of information we are
looking for is well known. So, while a question like "How much did my
customers spend on product X in region Y?" is a straightforward task
for a relational database system, the same does not hold true for a
question like: "What are the reasons for the strong sales of product
X in region Y?" The industry has largely recognized the value of a
system able to "answer" the second type of question; the new wave of
KDD/DM applications addresses this issue.
KDD/DM is mostly rooted in the machine learning discipline and,
consequently, inherited many legacies that not necessarily fit in the
domain of large databases. This is mostly due to the memory-bound
nature of machine learning algorithms that was the de-facto choice
given the reduced size of the used databases. Also, KDD/DM grew in a
sort of uncoordinated way fueled by fast growing commercial interests
and good successes in the research community. Even though nowadays
many tools and algorithms can be found, in both commercial and
research environments, no real standards have been yet proposed. For
instance, there is no standard way of structuring the database or,
similarly, there is no standard language for data mining. We believe
that the integration of data mining and databases has a lot to offer
in tackling some of these issues.
In this talk we address the following points:
- Prove that efficient and effective data mining can be achieved
on top of standard DBMS. We do so by presenting an algorithm
tightly integrated with a commercial DBMS. We apply this algorithm
to a commercial dataset and compare it with other algorithms.
- Introduce a couple of general heuristics that help to reduce the
search space when mining large datasets. We do this by presenting an
algorithm to discover classification rules that implements such
heuristics and comparing it with other mining tools over a large
commercial dataset.
- Promote integration of data mining and statistics. We do this by
presenting two applications on real marketing problems on real data.
In one case we prove that by combining data mining and statistics we
achieve a result that is superior to the ones achieved when data
mining and statistics are applied in isolation. In the other case we
prove that, other than groundwork costs, data mining models and
statistical models are comparable in terms of predictive
performances for the application at hand.
3/8: Shu-Yao Chien
Name: Shu-Yao Chien
Time: 3/8 Firday 12-2pm
Room: 4732C
Title: Efficient Complex Query Support For Multiversion XML Documents
Abstract
Managing multiple versions of XML documents represents a critical
requirement for many applications. Also, there has been much recent
interest in supporting complex queries on XML data (e.g., regular path
expressions, structural projections, DIFF queries). In this talk, I
will examine the problem of supporting efficiently complex queries on
multiversioned XML documents. Our approach relies on a scheme based
on durable node numbers (DNNs) that preserve the order among the XML
tree nodes and are invariant with respect to updates. Using the
document's DNNs various complex queries are reduced to combinations of
partial version retrieval queries. We examine three indexing
schemes to efficiently evaluate partial version retrieval queries in
this environment. A thorough performance analysis is then presented to
reveal the advantages of each scheme.
11/30: Chen Li
Name: Chen Li
Time: 11/30 Firday 12-2pm
Room: 4760 Commons
Title: Query Processing and Optimization in Data Integration
Abstract:
The rapid growth of the Internet gives us access to an unprecedented
number of autonomous and heterogeneous information sources. The goal
of data integration is to support seamless access to these data
sources. Many data-integration systems use a mediation architecture,
in which a wrapper is built on the top of each source. A mediator
accepts a user query, and answers the query by accessing relevant
sources through wrappers. In this talk I will first give an overview
of data integration, and discuss two approaches taken by several
systems: the query-centric approach (e.g., TSIMMIS) and the
source-centric approach (e.g., Information Manifold). I will show
research problems in both approaches. In particular, I will focus on
how to do query processing and optimization in the first approach,
especially when sources have diverse and limited capabilities to
answer queries. I will also talk about how to answer a query
efficiently using source data represented as views in the second
approach.
11/16: Cyrus Shahabi
Name: Cyrus Shahabi
Time: 11/16 Firday 12-2pm
Room: 4760 Commons
Title: Mining Multidimensional Databases
Abstract:
I will start by reviewing some examples of multidimensional data sets
and their corresponding applications. Subsequently, I will focus on
one of our research projects and provide some in-depth explanations.
I will present our technique for evaluating range aggregate queries
(e.g., summation, count and variance) on multidimensional data sets,
termed Progressive Polynomial Analytical Processing (PROPOLYNE).
Many data mining and analysis applications rely on evaluation of range
aggregate functions. While aggregations such as COUNT and SUM are
important, improving support for range multivariate statistics can
significantly extend the reach of existing technology and open the
door to new methods. We show that these and many other queries can be
derived from polynomial range-sum queries, and propose a novel
pre-aggregation method called PROPOLYNE to support progressive
evaluation of arbitrary polynomial range-sums. PROPOLYNE is
wavelet-based, but uses query approximation rather than data
compression. PROPOLYNE is an excellent exact algorithm because
well-chosen wavelets approximate range aggregate queries extremely
well. At its foundation, PROPOLYNE is a data independent progressive
query evaluation algorithm. At each step it makes the best possible
wavelet approximation of the submitted query, and retrieves only the
data needed to evaluate this approximation. Our experiments
demonstrate that the approximate results produced by PROPOLYNE for
several empirical datasets are very accurate after a small number of
I/O's and the error becomes negligible long before the exact
computation is complete. We also show that the accuracy of results
produced by typical wavelet-based data compression techniques vary
wildly with the dataset, and at its best is always significantly worse
than that of PROPOLYNE.
Bio:
Dr. Cyrus Shahabi is currently an Assistant Professor and the Director
of the Information Laboratory (http://infolab.usc.edu/) at the
Computer Science Department and the Integrated Media Systems Center
(IMSC) at the University of Southern California. He has more than
fifty articles, book chapters, and conference papers in the areas of
databases and multimedia. Dr. Shahabi's current research interests
include Multidimensional Databases, Multimedia Servers, and Data
Mining. He was the program committee chair of ACM WIDM'99 workshop and
is currently serving as a program committee member for ACM Multimedia
2001 and VLDB E-commerce workshop 2001.
11/9: Reza Sadri
Name: Reza Sadri
Time: 11/9 Firday 12-2pm
Room: 4760 Commons
Title: Sequence Queries in Database Systems
Abstract:
The need to search for complex and recurring patterns in database
sequences is shared by many applications. In this talk, we discuss
how to express and support efficiently sophisticated sequential
pattern queries in database systems. Thus, we first introduce simple
extension of SQL, to express these patterns, and then we study how to
optimize search for sequential patterns. We take the optimal text
search algorithm of Knuth, Morris and Pratt, and generalize it to
handle complex queries on sequences. Our algorithm exploits the
inter-dependencies between the elements of a sequential pattern to
minimize repeated passes over the same data. Experimental results on
typical sequence queries, such as double bottom queries, confirm that
substantial speedups are achieved by this new optimization technique.
10/12: Andrea Chu
Name: Andrea Chu
Time: 10/12 Firday 12-2pm
Room: 4760 Commons
Title: Pruning Cost-Sensitive Ensemble
Abstract:
Cost-sensitive problems have been more and more attractive recently
due to its promising application in domains like fraud detection in
financial information systems, intrusion detection in network based
systems, etc. Here we present a recent work in cost-sensitive domain:
pruning cost-sensitive ensembles.
There has been intensitive research on how to learn classifier models
efficiently and accurately, and how to use them to make efficient
predictions. it is already a common practice to train multiple models
and then prune them for the purpose above. Although pruning is not a
new notion, we attempt to find new consideration to guide pruning in
cost-sensitive context.
In addition, we explored a new line of dynamical pruning, in which we
dynamically decide how many classifiers are need to make a confident
prediction. Given an upper bound of how many classifiers are
affordable, the dynamic method leads to an average number much
smaller, hence further speeds up the prediction.
Last modified: Tue May 7 09:43:41 PDT 2002