5/17: Dongwon Lee's Ph.D. Defense.
Name: Dongwon Lee
Date: Friday, May 17, 2002
Time: 2:00 - 4:00 pm
Place: 4760 Boelter Hall
Advisor: Prof. W. Chu
Title: Query Relaxation for XML Model
Abstract
XML (eXtensible Markup Language) is the new universal format for
structured documents and data on the World Wide Web. As the Web
becomes a major means of disseminating and sharing information and as
the amount of XML data increases substantially, there are increased
needs to manage and query such XML data in a novel yet efficient way.
In this talk, I will particularly focus on one query processing
technique called "Query Relaxation" in the context of the XML model.
Unlike relational databases where the schema is relatively small and
fixed, the XML model allows varied/missing structures and values,
which make it difficult for users to ask questions precisely and
completely. To address such problems, the query relaxation technique
enables systems to automatically weaken the given user query, when its
answers are not satisfactory, to a less restricted form that also
permits "approximate" answers.
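As a rough illustration of the idea (not the framework presented in the talk), relaxation can be sketched as trying a sequence of progressively weaker path queries until enough answers come back. The catalog document, its element names, and the particular relaxation steps below are all assumptions made up for this example; the sketch uses the limited XPath subset of Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element names are invented for illustration.
DOC = ET.fromstring("""
<catalog>
  <book><title>XML Basics</title><author>Kim</author><year>2001</year></book>
  <book><title>Query Processing</title><author>Lee</author><year>1999</year></book>
  <article><title>Relaxed Queries</title><author>Lee</author></article>
</catalog>
""")

# Ordered from most to least restrictive; each step loosens one condition.
RELAXATIONS = [
    "./book[author='Lee'][year='2001']/title",  # original query
    "./book[author='Lee']/title",               # drop the year condition
    "./*[author='Lee']/title",                  # relax the element name
]

def relaxed_query(root, paths, min_answers=1):
    """Return (path, titles) for the first path that yields enough answers."""
    for path in paths:
        titles = [e.text for e in root.findall(path)]
        if len(titles) >= min_answers:
            return path, titles
    return None, []

path, titles = relaxed_query(DOC, RELAXATIONS)
print(path, titles)
```

Here the original query has no exact answer, so the first relaxation that produces a result is used instead. Raising min_answers pushes the search further down the list of relaxed forms.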
To support query relaxation for XML, I first present a formal
framework where users can express the precise semantics and behaviors
of query relaxation. This framework can also be used as the basis for
designing and implementing the eventual relaxation-enabled query
language. Secondly, I describe an array of issues related to
supporting query relaxation using native XML engines. In particular, I
describe the notion of similarity between XML data trees using tree
edit distance and the issue of selectivity estimation for a set of
relaxed XML queries. Lastly, I present issues involved in converting
data between the XML and relational models. This is a necessary step to
support query relaxation for the XML model by way of mature
relational database systems.
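To make the XML-to-relational conversion step concrete, here is a minimal sketch of one generic approach: storing every element as a row in a single "edge" table and answering a path query with self-joins. This is not necessarily the conversion method presented in the defense; the table layout (node, parent_id, tag, value) and the sample document are assumptions chosen for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document.
XML = ("<catalog><book><title>XML Basics</title><year>2001</year></book>"
       "<book><title>Query Processing</title></book></catalog>")

def shred(xml_text):
    """Store each element as a row (id, parent_id, tag, value) in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "tag TEXT, value TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        value = (elem.text or "").strip() or None  # NULL for non-leaf elements
        conn.execute("INSERT INTO node VALUES (?, ?, ?, ?)",
                     (node_id, parent_id, elem.tag, value))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return conn

conn = shred(XML)
# The path query /catalog/book/title becomes a chain of self-joins.
rows = conn.execute("""
    SELECT t.value
    FROM node AS c
    JOIN node AS b ON b.parent_id = c.id AND b.tag = 'book'
    JOIN node AS t ON t.parent_id = b.id AND t.tag = 'title'
    WHERE c.parent_id IS NULL AND c.tag = 'catalog'
    ORDER BY t.id
""").fetchall()
print(rows)
```

With the data in this form, relaxing a query (for example, dropping the tag condition on one join) becomes an ordinary rewrite of the SQL statement.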
2/21: Giovanni Giuffrida's Ph.D. Defense.
Name: Giovanni Giuffrida
Date: Thursday, February 21, 2002
Time: 4:00 - 6:00 pm
Place: 4760 Boelter Hall
Advisor: Prof. W. Chu
Title: Data Mining of Large Relational Databases
Abstract
Knowledge Discovery from Databases and Data Mining (KDD/DM) is a young
multidisciplinary area that combines experience from, among others,
statistics, machine learning (ML), databases, and data visualization.
KDD/DM has grown at a breakneck pace in recent years, driven by the
needs of an industry that, over the past decades, has accumulated
tremendous amounts of data and now lacks the capability to effectively
(and efficiently) gather relevant information from such a vast amount
of data. The relational model has largely shown its strength in
structuring and retrieving data when the type of information we are
looking for is well known. So, while a question like "How much did my
customers spend on product X in region Y?" is a straightforward task
for a relational database system, the same does not hold true for a
question like: "What are the reasons for the strong sales of product
X in region Y?" The industry has largely recognized the value of a
system able to "answer" the second type of question; the new wave of
KDD/DM applications addresses this issue.
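The first kind of question maps directly onto a single aggregate query, which is why it is routine for a relational system. A small sketch follows; the sales table, its columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, invented for this example.
conn.execute(
    "CREATE TABLE sales (customer TEXT, product TEXT, region TEXT, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("alice", "X", "Y", 120.0),
    ("bob", "X", "Y", 80.0),
    ("alice", "X", "Z", 50.0),
    ("carol", "W", "Y", 30.0),
])

# "How much did my customers spend on product X in region Y?"
total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE product = ? AND region = ?",
    ("X", "Y"),
).fetchone()[0]
print(total)
```

No comparable single statement exists for the second kind of question, since "the reasons" are not stored anywhere as a column to be selected.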
KDD/DM is mostly rooted in the machine learning discipline and has,
consequently, inherited many legacies that do not necessarily fit the
domain of large databases. This is mostly due to the memory-bound
nature of machine learning algorithms, which was the de facto choice
given the small size of the databases used. Also, KDD/DM grew in a
somewhat uncoordinated way, fueled by fast-growing commercial interests
and good successes in the research community. Even though many tools
and algorithms can be found nowadays, in both commercial and
research environments, no real standards have yet been proposed. For
instance, there is no standard way of structuring the database or,
similarly, there is no standard language for data mining. We believe
that the integration of data mining and databases has a lot to offer
in tackling some of these issues.
We tried to address these issues by setting the following three
objectives for this dissertation:
- Prove that efficient and effective data mining can be achieved
on top of a standard DBMS. We do so by presenting an algorithm
tightly integrated with a commercial DBMS. We apply this algorithm
to a commercial dataset and compare it with other algorithms.
- Introduce a couple of general heuristics that help to reduce the
search space when mining large datasets. We do this by presenting an
algorithm to discover classification rules that implements such
heuristics and comparing it with other mining tools over a large
commercial dataset.
- Promote integration of data mining and statistics. We do this by
presenting two applications to real marketing problems using real data.
In one case we prove that by combining data mining and statistics we
achieve a result that is superior to the ones achieved when data
mining and statistics are applied in isolation. In the other case we
prove that, other than groundwork costs, data mining models and
statistical models are comparable in terms of predictive
performance for the application at hand.
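As a loose illustration of the first two objectives, the sketch below lets the DBMS itself score candidate classification rules with one GROUP BY query and prunes weak candidates with thresholds. It is not the dissertation's algorithm or its heuristics: the customers table, the rule form "attribute = value -> responded = 1", and the minimum-coverage and minimum-confidence thresholds are all assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical marketing data, invented for this example.
conn.execute("CREATE TABLE customers (region TEXT, segment TEXT, responded INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    ("west", "retail", 1), ("west", "retail", 1), ("west", "retail", 0),
    ("west", "corporate", 1), ("east", "retail", 0), ("east", "retail", 0),
    ("east", "corporate", 1), ("east", "corporate", 0),
])

def candidate_rules(conn, attribute, min_coverage=3, min_confidence=0.6):
    """Score rules 'attribute = value -> responded = 1' inside the database.

    Returns (value, coverage, confidence) for each value that passes both
    thresholds; groups below the thresholds are pruned by the HAVING clause.
    """
    # Column names cannot be bound as SQL parameters, so whitelist them.
    if attribute not in ("region", "segment"):
        raise ValueError(attribute)
    query = f"""
        SELECT {attribute}, COUNT(*) AS coverage, AVG(responded) AS confidence
        FROM customers
        GROUP BY {attribute}
        HAVING COUNT(*) >= ? AND AVG(responded) >= ?
        ORDER BY confidence DESC
    """
    return conn.execute(query, (min_coverage, min_confidence)).fetchall()

print(candidate_rules(conn, "region"))
print(candidate_rules(conn, "segment"))
```

Because the counting happens in SQL, the data never has to fit in the mining program's memory, which is the main attraction of running this kind of step on top of a DBMS.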
Last modified: Tue May 7 09:50:25 PDT 2002