Speaker: Victor Liu
Host:  Richard Luo
Time:  12:30 - 2:00pm
Room:  BH4549

Today's distributed systems, such as sensor networks, have
tremendously extended our capability to sense and interact with the
outside world.  However, the wide application of such systems is
hindered by their potentially high operation cost.  For example,
extending battery life remains one of the biggest concerns in sensor
network research.  In this talk, we will focus on how to spend the
minimum operation cost to solve one common problem: searching for
extremes among distributed data sources with uncertainty.  In
particular, because we cannot afford constant communication with all
data sources, our knowledge about their latest values is incomplete.
This incomplete knowledge results in uncertainty in our final answer
about the extreme sources (the ones with the highest or lowest
values).  In our model, we reduce the uncertainty by probing a few
data sources and forcing them to update.  Thus, the key question is:
how can we spend the minimum probing cost to gain sufficient
knowledge and reduce the uncertainty to a user-accepted level?
Analytically, we study how optimal probing can be achieved under
various problem settings.  Experimentally, we study the behavior of
the optimal policies on a real sensor network dataset.
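
To make the probing question concrete, here is a minimal Python sketch
(not the authors' algorithm; the interval model, the toy data, and the
greedy widest-interval rule are all assumptions for illustration).  Each
source is cached as an interval of possible values, and we probe until
only one candidate for the maximum remains:

    # Cached knowledge: each source is known only up to an interval [lo, hi].
    intervals = {"s1": (10, 40), "s2": (25, 35), "s3": (0, 20), "s4": (30, 50)}
    true_vals = {"s1": 22, "s2": 33, "s3": 5, "s4": 45}  # hidden latest values

    def probe(src):
        """Force a source to report its latest value (the costly operation)."""
        v = true_vals[src]
        intervals[src] = (v, v)        # the interval collapses after a probe

    probes = 0
    while True:
        best_lo = max(lo for lo, hi in intervals.values())
        # Only sources whose upper bound reaches the best lower bound
        # could still be the maximum:
        cands = [s for s, (lo, hi) in intervals.items() if hi >= best_lo]
        done = all(intervals[s][0] == intervals[s][1] for s in cands)
        if len(cands) == 1 or done:
            print("max source(s):", cands, "after", probes, "probes")
            break
        # Greedy heuristic: probe the widest remaining candidate interval.
        widest = max(cands, key=lambda s: intervals[s][1] - intervals[s][0])
        probe(widest)
        probes += 1
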
Speaker: Richard Luo
Time: 12:30pm - 2:00pm, Nov. 17th
Place: BH4549

Practice talk for Ph.D. qualifying exam
Title: Query Languages and Systems for Advanced Data-Stream Applications

There is emerging research interest in data streams and continuous queries.
Current projects have fully recognized the significant changes in execution
models and architectures needed to support continuous queries on data streams,
but have grossly underestimated the changes and extensions in database
languages required to support advanced data stream applications. We
investigate these requirements by studying typical queries used in online
auction and network monitoring applications. We show that current database
languages are severely limited in their ability to support (1) nonblocking
queries, (2) complex applications, and (3) even simple computations requiring
efficient use of main memory.
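
The "nonblocking" limitation can be illustrated with a small Python
analogy (illustrative only; the thesis itself concerns SQL-level
language extensions): a nonblocking operator emits results as input
arrives, while a blocking operator such as sort can never produce
output on an unbounded stream.

    import itertools

    def stream():
        """An unbounded data stream."""
        n = 0
        while True:
            yield n
            n += 1

    def nonblocking_filter(s):
        """Emits each result as soon as its input arrives --
        fine on an unbounded stream."""
        for x in s:
            if x % 2 == 0:
                yield x

    def blocking_sort(s):
        """Must consume ALL input before emitting anything --
        it never produces output on an unbounded stream."""
        return iter(sorted(s))

    print(list(itertools.islice(nonblocking_filter(stream()), 5)))
    # -> [0, 2, 4, 6, 8]; blocking_sort(stream()) would never return.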

Therefore, the objective of this thesis is to design and develop database
language extensions to overcome these limitations, and to devise and demonstrate
execution strategies and system architectures that enable their very efficient
implementation.
Speaker: Yi Xia
Time: 12:30-2:00pm
Room: BH4549

Practice Talk for upcoming ICDM workshop.

Title: Mining Frequent Itemsets in Uncertain Datasets

Abstract:
Data in the real world are usually noisy or uncertain. However, traditional
data mining algorithms either ignore the uncertainty in data or take it into
consideration in a very limited way. In this paper, we define a relatively
generic model for uncertainty in data, in which each data item comes with a
``tag'' that defines the degree of confidence in that value. This is more
realistic in many cases where the data items are derived from other evidence
or more basic data. As an example problem, we study frequent itemset mining
in such uncertain data.

With uncertain data, finding frequent itemsets cannot be perfect: there will
be false positives and false negatives. We consider several intuitive
approaches and propose a new scheme that significantly reduces the total
number of false positives and false negatives.
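
To give a feel for the setting, here is a sketch of one intuitive
baseline in Python (an assumption for illustration, not the scheme
proposed in the paper): if every item in a transaction carries a
confidence tag in [0, 1], an itemset can be scored by its expected
support, treating the tags as independent probabilities.

    from itertools import combinations

    # Each transaction tags every item with a confidence in [0, 1].
    transactions = [
        {"a": 0.9, "b": 0.8, "c": 0.4},
        {"a": 0.7, "c": 0.9},
        {"a": 1.0, "b": 0.5, "c": 0.6},
    ]

    def expected_support(itemset):
        """Expected number of transactions containing the itemset."""
        total = 0.0
        for t in transactions:
            if all(i in t for i in itemset):
                p = 1.0
                for i in itemset:
                    p *= t[i]          # independence assumption
                total += p
        return total

    minsup = 1.0
    items = sorted({i for t in transactions for i in t})
    for k in (1, 2):
        for iset in combinations(items, k):
            s = expected_support(iset)
            if s >= minsup:
                print(iset, round(s, 2))

Thresholding a score like this is exactly where false positives and
false negatives arise: an itemset can clear the threshold in expectation
yet be absent in the true data, and vice versa.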

The paper can be downloaded from
http://www.cs.ucla.edu/~xiayi/doc/fdm.pdf

Speaker: Qinghua Zou
Host: Richard C. Luo
Time: 12:30pm - 2:00pm, Oct. 31st
Place: BH 4549

Practice talk for AMIA 03

IndexFinder: A Method of Extracting Key Concepts from Clinical Texts for
Indexing

Extracting key concepts from clinical texts for indexing is an important
task in implementing a medical digital library.  Several methods have been
proposed for mapping free text into standard terms defined by the Unified
Medical Language System (UMLS). For example, natural language processing
techniques are used to map identified noun phrases into concepts. They
are, however, not appropriate for real-time applications.  Therefore, in
this paper we present a new algorithm that generates all valid UMLS
concepts by permuting the set of words in the input text and then
filtering out the irrelevant concepts via syntactic and semantic
filtering. We have implemented the algorithm as a web-based service that
provides a search interface for researchers and computer programs. Our
preliminary experiments show that the algorithm is effective at
discovering relevant UMLS concepts while achieving a throughput of 43K
bytes of text per second. The tool can thus extract key concepts from
clinical texts for indexing.
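
The word-permutation idea amounts to matching concepts regardless of word
order. A simplified Python sketch (the concept entries here are
hypothetical, and the real system works over the UMLS tables with the
additional syntactic and semantic filters described above):

    # Toy dictionary: concepts as sets of normalized words.
    concepts = {
        "C0018802": {"congestive", "heart", "failure"},
        "C0018787": {"heart"},
        "C0231528": {"chest", "pain"},
    }

    def extract_concepts(text):
        """Return every concept all of whose words occur in the input,
        in any order -- far cheaper than full NLP parsing."""
        words = set(text.lower().replace(",", " ").split())
        return [cui for cui, cws in concepts.items() if cws <= words]

    print(extract_concepts("Patient admitted with heart failure, congestive"))
    # -> ['C0018802', 'C0018787']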

Speaker: Richard C. Luo 
Time: 12:30pm - 2:00pm, Oct. 25th, 2003
Room: BH4549


Continuous queries on data streams represent a vibrant area of
current research.  The growing interest in this area is due to
the emergence of important applications that need to monitor continuous
streams of data, including network traffic, sensor networks, financial data,
online auctions, telecommunication records, web logs, and click-streams.

The need to support queries that span both databases and data streams
has motivated researchers to seek DBMS extensions to support continuous
queries on data streams--in addition to the traditional user-issued
queries on stored data. This objective  brings major technical challenges,
inasmuch as traditional DBMS were designed for transient user-issued queries
on persistent data, rather than for persistent queries on transient data.

In this talk, I will present how data stream techniques are applied in a
real-time network traffic monitor and analyzer, particularly in Bluewave
Networks' NAMS product.  I will also demonstrate the product, which is
running at an ISP site.

About Bluewave Networks
----------------------------------------
Founded in 2001, Bluewave Networks has developed one of the most comprehensive
and advanced real-time network analytics products in the industry, to help
enterprises and service providers proactively detect network anomalies
that impact business-critical applications and employee productivity. The
company's flagship product, the NAMS(TM) (Network Analytics and Management
System), provides comprehensive route and traffic analytics and automates many
of the labor-intensive tasks typically associated with maintaining today's
complex networks. Bluewave Networks' NAMS product has been validated by large
enterprises and top-tier service providers.

www.bluewavenetworks.com

DBUCLA Seminar: XML Normalization
Speaker: Dr. Murali Mani
Time: 12:00pm - 2:00pm
Room: BH4549

Abstract:

We will talk about normalizing XML. The specific problem we study is the
following: suppose a database designer comes up with an XML schema for his
application and also specifies the functional dependencies and multivalued
dependencies that hold in the schema. The question he now has to
answer is: "Is my XML schema good?"

We will help such database designers using traditional DB theory. An XML
schema is good if it does not have insert, delete and update anomalies. We
study how database designers can remove such anomalies when they are
present in their design.

The talk will briefly cover the specification of functional and multivalued
dependencies. The major part of the talk will be on:
(a) What is the semantics of "unnesting" in XML? There are two semantics
discussed in the XML standards community; what is the crux of these
two semantics, and does either of them make sense?
(b) Once we know that anomalies exist, how can we come up with a
design that does not have them?

To give an example of (a), suppose you have a DTD in which a book
element contains a title and one or more author subelements. What,
then, is the meaning of the functional dependency author -> title?

Primary Reference:
M. Arenas, L. Libkin, "A Normal Form for XML Documents", PODS 2002

Time:    Friday, 12:30pm-2pm, August 8
Place:   BH 4549

Speaker: Yi Xia
Title:   Privacy Preserving Association Rule Mining

Host:    Andrea Chu

Privacy issues are gaining more and more attention. In this talk,
the speaker will discuss two recent papers about association rule mining
that incorporate privacy concerns.
The two papers are:
 1. Maintaining Data Privacy in Association Rule Mining ( Shariq Rizvi,
    Jayant R. Haritsa, VLDB 2002)
    Download:
    http://www.cse.iitb.ac.in/~rizvi/files/vldb.ppt (slides)

http://citeseer.nj.nec.com/rd/68802721%2C542393%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/26589/http:zSzzSzwww.cs.ust.hkzSzvldb2002zSzVLDB2002-paperszSzS19P03.pdf/rizvi02maintaining.pdf

 2. Privacy Preserving Mining of Association Rules (Evfimievski, R.
    Srikant, R. Agrawal and J. Gehrke, KDD 2002)
    Download: http://www.almaden.ibm.com/cs/people/srikant/papers/kdd02.pdf

Time:  noon, August 1st
Place: BH 4549
(Bring your own lunch please)

Speaker: Andrea Chu
Title:   Mining Stream Data

In this one-hour talk, the speaker will focus on two representative
mining techniques for data streams: one-pass incremental decision tree
construction, and an ensemble method that captures concept drifts
in data streams. The pros and cons of the two approaches will be discussed.

References:

1. Mining High-Speed Data Streams
   Pedro Domingos et al., SIGKDD 00
2. Mining Concept-Drifting Data Streams Using Ensemble Classifiers.
   Haixun Wang et al., SIGKDD 03.

Abstracts:

1. Mining High-Speed Data Streams
Many organizations today have more than very large databases; they
have databases that grow without limit at a rate of several million
records per day. Mining these continuous data streams brings unique
opportunities, but also new challenges. This paper describes and
evaluates VFDT, an anytime system that builds decision trees using
constant memory and constant time per example. VFDT can incorporate
tens of thousands of examples per second, using Hoeffding bounds to
guarantee that its output is asymptotically nearly identical to that
of a conventional learner. We study VFDT's properties and demonstrate
its utility through an extensive set of experiments on synthetic
data. We apply VFDT to mining the continuous stream of Web access
data from the whole University of Washington main campus.
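
The Hoeffding bound at the heart of VFDT is easy to state: after n
observations of a variable with range R, the observed mean is within
epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability
1 - delta. A minimal Python sketch of the resulting split test
(simplified; parameter names and numbers are illustrative):

    import math

    def hoeffding_bound(R, delta, n):
        """Error margin after n observations of a variable with range R."""
        return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

    def should_split(gain_best, gain_second, n, R=1.0, delta=1e-7):
        """Split once the best attribute's gain beats the runner-up's by
        more than the Hoeffding margin: with probability 1 - delta this
        is the same choice infinite data would make."""
        return (gain_best - gain_second) > hoeffding_bound(R, delta, n)

    # Gains of 0.30 vs 0.25 become distinguishable only with enough examples:
    for n in (100, 1000, 10000):
        print(n, should_split(0.30, 0.25, n))   # False, False, True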


2. Mining Concept-Drifting Data Streams Using Ensemble Classifiers

Recently, mining data streams with concept drifts for actionable
insights has become an important and challenging task for a wide
range of applications including credit card fraud protection, target
marketing, network intrusion detection, etc. Conventional knowledge
discovery tools face two challenges: the overwhelming volume of the
streaming data, and the concept drifts. In this paper,
we propose a general framework for mining concept-drifting data
streams using weighted ensemble classifiers. We train an ensemble
of classification models, such as C4.5, RIPPER, naive Bayesian,
etc., from sequential chunks of the data stream. The classifiers in
the ensemble are judiciously weighted based on their expected
classification accuracy on the test data under the time-evolving
environment. Thus, the ensemble approach improves both the efficiency in
learning the model and the accuracy in performing classification.
Our empirical study shows that the proposed methods have a substantial
advantage over single-classifier approaches in prediction
accuracy, and the ensemble framework is effective for a variety of
classification models.
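
A condensed sketch of the chunk-and-weight idea, using scikit-learn
decision trees as base learners (an approximation for illustration: the
paper weights classifiers by expected accuracy estimated from their error,
while here each member is simply scored on the newest chunk):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def make_chunk(n=500, drift=0.0):
        """Synthetic 2-feature chunk whose class boundary drifts over time."""
        X = rng.uniform(-1, 1, size=(n, 2))
        y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
        return X, y

    ensemble = []                  # list of (classifier, weight) pairs
    K = 5                          # keep only the K best-weighted members

    for step in range(10):
        X, y = make_chunk(drift=step / 10.0)
        ensemble.append((DecisionTreeClassifier(max_depth=5).fit(X, y), 0.0))
        # Re-weight every member by its accuracy on the newest chunk.
        ensemble = [(c, (c.predict(X) == y).mean()) for c, _ in ensemble]
        ensemble = sorted(ensemble, key=lambda cw: -cw[1])[:K]

    def predict(x):
        """Accuracy-weighted vote over the surviving classifiers."""
        score = sum(w if c.predict([x])[0] == 1 else -w for c, w in ensemble)
        return int(score > 0)

    print(predict([0.5, -0.2]))
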
DBUCLA Seminar: XML: THE EXCHANGE TAIL AND THE MANAGEMENT DOG
Speaker: Fabian Pascal
Time:    11:30am-1:30pm, July 18th, 2003
Room:    BH 4549

Pizza served.

Mr. Pascal has also offered to bring signed copies of his books
"PRACTICAL ISSUES IN DATABASE MANAGEMENT" and "UNDERSTANDING RELATIONAL
DATABASES". For people interested in acquiring a copy, this would be a
very good chance to buy one with the author's signature.
Information about the speaker and the books is available at
http://www.dbdebunk.com/.

Abstract:

The computer industry operates like the fashion industry: it is driven by
fads. With help from the trade media, it constantly hypes new fads,
obscuring their lack of soundness and their imperfections. Every fad paves
the way for a subsequent fad by failing to solve old problems (and creating
new ones!), a quite profitable system predicated on the accelerated
obsolescence that underlies all fads. Many of these fads are, however,
not even new, but previously failed and discarded fads under other names,
often worse than the original. Sound and correct principles, on the other
hand, which do provide the right solutions, are ignored and flouted.

The most recent evangelism revolves around XML: conferences and articles
in the trade press proliferate and vendors step over each other--even
Microsoft got religion--to announce all sorts of XML support, the
hallmarks of a fad.  Is it truly the holy grail of data management in the
Internet age, or just another fad? This seminar offers a critical
assessment of XML from the only perspective that counts, one that you do
not hear in the industry: that of data/database management.


1. The Problem
2. The Argument
3. The XML Solution

4. XML Data Exchange
   - The Claim
   - Physical Format
   - Performance
   - Self-Describing
   - Reality Check
5. XML Data Management
   - Models, Models...
   - Meaning?
   - Reinventing The Wheel
   - ... Nor Any Time To Think
   - Back Up Into Trees
6. The Framework Applied
   - Completeness
   - Generality
   - Formality
   - Simplicity
   - Data Independence
7. Conclusions
   - Nails To A Hammer
   - The Logical-Physical Confusion
   - DX Tail & DM Dog
   - Good Luck
   - Round In Circles
   - Too Much Rethinking

Fabian Pascal has a national and international reputation as an
independent technology analyst, consultant, author and lecturer
specializing in data management. He was affiliated with Codd & Date and
for 20 years held various analytical and management positions in the
private and public sectors, has taught and lectured at the business and
academic levels, and advised vendor and user organizations on data
management technology, strategy and implementation. Clients include IBM,
Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder,
editor and publisher of DATABASE DEBUNKINGS, a web site dedicated to
dispelling persistent fallacies, flaws, myths and misconceptions prevalent
in the IT industry (Chris Date is a senior contributor). Author of three
books, he has published extensively in most trade publications, including
DM Review, Database Programming and Design, DBMS, Byte, Infoworld and
Computerworld. He is the author of the contrarian columns Against the Grain
and Setting Matters Straight, and writes for The Journal of Conceptual
Modeling. His
third book, PRACTICAL ISSUES IN DATABASE MANAGEMENT serves as text for his
seminars.

Time: Friday, June 13, 12:00pm-2:00pm

Location: BH4750

Title: "Exploring and Integrating the Deep Web: Observations, Implications, and Evidence"

Speaker: Kevin Chen-Chuan Chang, Assistant Professor in the Department of
Computer Science, University of Illinois at Urbana-Champaign

Abstract:
Over the past few years, the Web has deepened dramatically: a
significant and increasing amount of information is hidden on the
"deep" Web, behind the query interfaces of searchable databases. There
are numerous such autonomous and heterogeneous sources, each with a
different schema and native query constraints. Our MetaQuerier
project, the context of this talk, proposes to build a "metaquery"
system, to help users in both finding and querying online databases
effectively and uniformly.

This talk will discuss our ongoing work in exploring and integrating
the deep Web, presenting a recent (December 2002) extensive survey of
this "new frontier."  First, I will report our observations from this
survey, on both the "macro" and "micro" aspects of the databases on
the Web. (How many databases are there? How are they covered by search
engines? How "complex" are they? ...) Second, based on our findings, I
will suggest several likely implications: We believe that both
holistic (context-driven) and divide-and-conquer (component-driven)
approaches are likely to enable "shallow" integration for the deep Web
at a large scale. Finally, as evidence of such approaches, I will
briefly discuss our work on statistical schema matching, which aims at
solving the schema matching problem with a new holistic paradigm.
Project URL: http://eagle.cs.uiuc.edu/metaquerier/

Bio:

Kevin Chen-Chuan Chang is an Assistant Professor in the Department of
Computer Science, University of Illinois at Urbana-Champaign. He
received a PhD in Electrical Engineering in 2001 from Stanford
University. His research interests are in databases and Internet
information access, with emphasis on information integration and top-k
ranked query processing. He is the recipient of an NSF CAREER Award in
2002 and was named an NCSA Faculty Fellow in 2003.
URL: http://www-faculty.cs.uiuc.edu/~kcchang/

Time: Thursday, June 5, 12:00pm-2:00pm

Location: BH4549

Title: "Clio:  Schema Mapping and Data Exchange"

Speaker: Prof. Renee Miller, University of Toronto

Abstract:
We present a novel framework for creating mappings between any
combination of XML and relational data sources.  In our approach,
attribute correspondences (the result of "schema matching") are
translated into a set of mappings that capture the semantics of the
source and target schemas (including their hierarchical organization
as well as their nested referential constraints).  These mappings are
then translated into queries over the source schema(s) that produce
data satisfying the referential constraints and structure of the
target schema.  These queries preserve the semantic relationships of
the source.  The mapping algorithm is complete in that it produces all
mappings that are consistent with the semantics of the schemas.  We
have implemented the translation algorithm in Clio, a schema mapping tool,
and present our experience using Clio on life science data.
The mappings produced by Clio can be used both within data
integration, where source data is queried through a virtual target view,
and for data exchange, including the exchange of data in P2P data
sharing applications.  We discuss the often subtle difference between
the semantics of data integration and that of data exchange.
This is joint work with Ron Fagin, Mauricio Hernandez, Phokion
Kolaitis, Lucian Popa, and Yannis Velegrakis.

Time: 12:00pm - 2:00pm, May 2 (pizza will be served at 12:00pm)
Location: BH4750

Title: Towards a Content and Load Adaptive Distributed Infrastructure for
Mobile/Pervasive Computing

Speaker: Paul Castro, IBM Research.
Abstract:
Pervasive computing applications, such as Internet-based control systems
and large-scale mobile asset management, require the monitoring and
processing of data streams from highly distributed, mobile data sources.
Middleware to support data stream processing must scale to potentially
millions of streams. In many pervasive applications, streams represent
rapid updates to stored data, which precludes the use of traditional
scaling strategies such as mirroring or caching. In this talk, we present
work in developing a scalable, decentralized architecture for processing
data streams based on Distributed Hash Table (DHT) overlay networks. DHTs
potentially provide efficient and robust techniques for wide-area data
storage and query processing, but skews in application workload can result
in bottlenecks and failures that limit the scalability of the approach.

The primary focus of this talk is on early work with the Content and Load
Aware Scalable Hashing (CLASH) protocol, which provides a flexible,
adaptive mechanism for scaling DHTs to skewed workloads by controlling
per-node utilization levels in an overlay network for a given workload.  CLASH
is part of the IBM ContextSphere project, an effort to provide
programmable and distributed middleware for data composition from
pervasive data sources. As part of this talk, we provide an overview of
ContextSphere and introduce two examples of emerging Internet-scale
application environments that require large-scale stream processing. We
then describe the basic CLASH protocol and its application to
ContextSphere. We present simulation results that show the benefits of our
approach and explain ongoing work.
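
For readers new to DHTs, the substrate that CLASH adapts can be sketched
as consistent hashing (a generic Python illustration, not the CLASH
protocol itself; node names and parameters are made up):

    import bisect
    import hashlib

    def h(key):
        """Map a key onto a 32-bit hash ring."""
        return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** 32)

    class Ring:
        """Plain consistent hashing with virtual nodes. A load-aware scheme
        like CLASH additionally moves or splits ranges when a node's
        measured utilization exceeds its target."""
        def __init__(self, nodes, vnodes=4):
            self.points = sorted((h(f"{n}#{i}"), n)
                                 for n in nodes for i in range(vnodes))
            self.hashes = [p for p, _ in self.points]

        def lookup(self, key):
            idx = bisect.bisect(self.hashes, h(key)) % len(self.points)
            return self.points[idx][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    for k in ("stream/42", "stream/43", "sensor/7"):
        print(k, "->", ring.lookup(k))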


dbUCLA seminar 11/1(Fri): Programming with Logic and Objects

Speaker: Prof. Michael Kifer, Computer Science Dept, Stony Brook University
Time: Nov 1, Friday, 12-2pm
Room: 4760 Commons
Title: FLORA-2: Programming with Logic and Objects

Pizza served.

Abstract
This talk is about a marriage of object-based and logic-based paradigms for
programming knowledge-intensive applications.

The product of this marriage is FLORA-2, which is both a seamless
integration of Frame Logic, HiLog and Transaction Logic in a single
formalism, and an implementation that adds important pragmatic
extensions. Together they make a powerful knowledge programming language.

Frame Logic relates to the object-oriented data model as classical
predicate calculus relates to the relational data model.  HiLog adds
meta-programming, and Transaction Logic adds dynamics to the mix.

Although FLORA-2 has been released only in its alpha form, it is already
very usable and has a following of dedicated users in the areas of
information integration, semantic web, information systems design, agent
building, etc.

Bio
Prof. Kifer received his Ph.D. in Computer Science from the Hebrew University of
Jerusalem in 1985.  He has been a professor at Stony Brook University since
1994. His research interests include declarative languages for data and
knowledge manipulation, integration of object-oriented and deductive
paradigms, object-oriented databases, query optimization and logic
programming.
http://www.cs.sunysb.edu/~kifer/
dbUCLA seminar 10/25 (Fri)

Speaker:   Qinghua Zou (Ph.D. student)
Advisor:   Wesley W. Chu
Time:      Oct 25, Friday 12-2pm
Room:      4760 Commons
Title:     SmartMiner: A new algorithm for mining maximal frequent itemsets
           (paper accepted by ICDM'02 - IEEE Int'l Conf. on Data Mining)

Abstract
Maximal frequent itemsets (MFI) are crucial to many tasks in data mining.
Since the MaxMiner algorithm first introduced enumeration trees for mining
MFI in 1998, several methods have been proposed that use depth-first search
to improve performance.  To further improve the performance of mining MFI,
we propose a technique that takes advantage of the information gathered in
previous steps to discover new MFI. More specifically, our algorithm, called
SmartMiner, gathers and passes tail information and uses a heuristic
selection function that uses the tail information to choose the next node to
explore.  Compared with Mafia and GenMax, SmartMiner generates a smaller
search tree, requires fewer support-counting operations, and does not require
superset checking.  Using the Mushroom and Connect datasets, our experimental
study reveals that SmartMiner generates the same MFI as Mafia and GenMax, but
yields an order of magnitude improvement in speed.
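
For readers unfamiliar with the problem, a naive depth-first MFI miner fits
in a few lines of Python (the baseline idea only -- without SmartMiner's
tail information or any of the pruning used by Mafia and GenMax; the toy
data is made up):

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                    {"b", "c"}, {"a", "b", "c"}]
    minsup = 2
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    mfi = []

    def dfs(current, tail):
        """Depth-first extension of 'current' by items from its tail."""
        extended = False
        for k, item in enumerate(tail):
            child = current | {item}
            if support(child) >= minsup:
                extended = True
                dfs(child, tail[k + 1:])
        # A node with no frequent extension is maximal unless it is
        # contained in an MFI found by an earlier branch.
        if not extended and current and not any(current < m for m in mfi):
            mfi.append(current)

    dfs(set(), items)
    print(mfi)                          # -> [{'a', 'b', 'c'}]
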
dbUCLA Seminar on Oct 11th: Effective Change Detection Using Sampling


Speaker:   Alexandros Ntoulas
Time:      Oct 11th, Friday noon-2pm
Room:      4760 Commons
Title:     Effective Change Detection Using Sampling

Pizza will be provided.

Abstract:
For a large-scale data-intensive environment, such as the World-Wide Web
or data warehousing, we often make local copies of remote data sources.
Due to limited network and computational resources, however, it is often
difficult to monitor the sources constantly to check for changes and to
download changed data items to the copies. In this scenario, our goal is
to detect as many changes as we can using the fixed download resources
that we have.

In this presentation we propose three sampling-based download policies
that can identify changed data items more effectively. In our
sampling-based approach, we first sample a small number of data items from
each data source and download more data items from the sources with more
changed samples. We analyze the effectiveness of the sampling-based
policies and compare our proposed policies to existing ones, including the
state-of-the-art frequency-based policy.

Our experiments on synthetic and real-world data show the relative merits
of various policies and the great potential of our sampling-based policy.
In certain cases, our sampling-based policy could download twice as many
changed items as the best existing policy.
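
The core allocation idea is easy to sketch in Python (a simplification of
one such policy; the change rates, sample size, and budget are made-up
numbers):

    import random
    random.seed(42)

    # Hypothetical (and unknown to us) change rates of three sources.
    sources = {"siteA": 0.9, "siteB": 0.3, "siteC": 0.05}
    SAMPLE, BUDGET = 20, 600

    def sample_changes(rate, k):
        """Probe k random items; return how many turned out changed."""
        return sum(random.random() < rate for _ in range(k))

    # Phase 1: spend a small part of the budget sampling every source.
    changed = {s: sample_changes(r, SAMPLE) for s, r in sources.items()}

    # Phase 2: spend the rest in proportion to the changed-sample counts,
    # so sources that appear to change more get more downloads.
    remaining = BUDGET - SAMPLE * len(sources)
    total = sum(changed.values()) or 1
    for s, c in changed.items():
        print(s, "gets", SAMPLE + remaining * c // total, "downloads")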




DBUCLA Seminar: "Virtual Suffix Tree for XML Indexing" this Friday


Time and Place:
         Friday August 16, 11:00 AM, BH 4549

Speaker:
         Haixun Wang,  IBM  T.J. Watson Research Labs

Title:
         ViST: Virtual Suffix Tree for XML Indexing

Abstract:

With the growing importance of XML in data exchange, much research
has been done on providing flexible query facilities to extract data
from structured XML documents.  In this paper, we propose ViST, a
novel index structure for searching XML documents. By representing
both XML documents and XML queries as structure-encoded sequences,
we show that querying XML data is equivalent to finding subsequence
matches. Unlike index methods that disassemble a query into multiple
sub-queries and then join the results of these sub-queries to
produce the final answers, ViST uses tree structures
as the basic unit of query to avoid expensive join operations.
Furthermore, ViST provides a unified index on both the content and the
structure of XML documents; hence it has a performance advantage over
methods that index just content or just structure.  ViST supports dynamic
index update, and it relies solely on B+-trees without using any
specialized data structures that are not well supported by DBMSs.
Our experiments show that ViST is effective, scalable, and efficient
in supporting structural queries.
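
The structure-encoding step is easy to illustrate (a loose Python sketch
of the idea; the exact encoding in the paper differs in its details): each
node becomes a (symbol, prefix-path) pair in preorder, so tree queries
become subsequence matches over these pairs.

    # A toy XML tree: (tag, [children]).
    doc = ("book", [("title", []),
                    ("author", [("name", []), ("email", [])])])

    def encode(node, prefix=""):
        """Preorder walk emitting (symbol, prefix-path) pairs."""
        tag, children = node
        pairs = [(tag, prefix or "/")]
        for child in children:
            pairs += encode(child, prefix + "/" + tag)
        return pairs

    for symbol, pre in encode(doc):
        print(symbol, pre)
    # Output:
    #   book /
    #   title /book
    #   author /book
    #   name /book/author
    #   email /book/author
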
Speaker: Mark Emerson, CEO, Information Resonance Corporation
Venue:   4760 Boelter Hall
Date:    June 7, 2002
Time:    12:00 - 2 pm

Lunch will be provided at 12:00 noon.


TITLE:

AngelBase: a Powerful New Database Paradigm That (1) Dramatically
Empowers Users and (2) Promises to Make Massively Parallel Hardware a
Reality.

ABSTRACT:

The computing industry currently rests on four database technologies:
hierarchical, network, relational and object.  None is patented.
AngelBase(TM) is a fifth database technology that is vastly superior to
the existing four.  And AngelBase is patented.  AngelBase represents a
paradigm shift that is anticipated to affect the entire computing
industry, from microchips to global systems.  For a preview, see
http://angelbase.com.

The lecture will give an overview of AngelBase, including its extended
dimensionality, its metaphor (angels + base = data village), its
declarative programming paradigm (via the Angel Language(TM)), its
explicit meta-data, its seven levels of user-data (datum, rec, record,
lattice, database, realm, and World), its dynamic operation (i.e. no
rollovers), and its promise to "break" the von Neumann barrier and make
massively parallel processing a reality.  These new computers will be
called Angel Machines(TM).

After more than a decade of theoretical research, it's now time to build
AngelBase with a  "dream team" of just four players.  We will first
build a "Virtual Angel Machine" (VAM) which runs AngelBase on existing
(von Neumann) machines.  Then, in a few years, we'll set up a hardware
laboratory and build the first real Angel Machine.

Students (both graduate and upper division) are encouraged to attend,
because we are looking to recruit an outstanding student as the fourth
and final member of our VAM development team.

SPEAKER:

AngelBase inventor, Mark Emerson (51), is the CEO of Information
Resonance Corporation, which owns AngelBase.  Mark has been working on
AngelBase (either part-time or full-time) since 1991.  Before conceiving
AngelBase, he spent 13 years as a software engineer and data quality
expert at Hughes Aircraft Company.  Prior to that he spent 5 years
teaching mathematics and founding a school.  Mark received his B.A. in
Mathematics from U.C.L.A. in 1973, graduating Magna cum Laude.