Optimizing Database Systems for Large Main Memories

Brad Adelberg
Computer Science Department
Northwestern University

Contact Information

Brad Adelberg
1890 Maple Avenue
Evanston, IL 60201
Phone: (847) 467-2129
Fax : (847) 491-5258
Email: adelberg@cs.nwu.edu

WWW PAGE

www.cs.nwu.edu/~adelberg/mmdb/index.html

Project Award Information

Keywords

Main memory database system, cache alignment, locality, performance.

Project Summary

Plummeting memory prices combined with the availability of 64-bit architectures has made computer systems with multigigabytes of main memory a reasonable platform for many production systems. As a result, many applications find that all of their data, or a least the most frequently accessed tables, can be kept entirely in main memory. This represents a significant change from previous generation database systems which expected that most of the application data would reside on disk. This research focuses on two aspects of adapting database systems for platforms with large main memories.

First, it has been observed that during transaction processing applications, the CPU wastes approximately 80% of its cycles due to stalls [cvetanovic94]. Thus a 5 fold speed up is possible if cache misses are eliminated entirely. One focus of this research is on how to modify the data structures and algorithms used by database systems to minimize the cache miss rate and thereby increase system performance. Although all of the cache misses will never be eliminated, even modest reductions in the cache miss rate will yield large performance gains. Further, with clock speeds continuing to grow, the available speed up will grow as well.

The second focus of the research is how to best integrate main memory database technology with disk-based database systems in the event that all of the needed data will not fit in main memory. A unified architecture will be designed that extends traditional DBMSs with a new main memory component. We will determine which functions need to be handled within the new component (i.e. table scanning, transaction management, concurrency control) and how this optimal division of work is supported by the different commercially available extensibility proposals and interfaces (i.e. OLE DB, datablades, cartridges). Finally, we will develop algorithms for database administrators (DBAs) to determine, for a given amount of memory, which tables to store in memory versus on disk.

Goals, Objectives, and Targeted Activities

In the next year we will instrument our current code base using both Speedshop and SimOS to perform detailed runtime analysis of the interaction of the database system with the processor and memory system. We will focus on transaction processing workloads, specifically TPC-B and TPC-C, and will analyze cache miss performance by type of query as well as by system component. For instance, we will examine how concurrency control and recovery, which typically have less regular access patterns than some query execution operators (i.e. sorting), interfere with components that are more regular.

On the educational side of the proposal, the project required in the second database course (focusing on database system implementation) will be modified. It currently requires writing pieces of a reduced functionality standalone database system. The new project will require student to write components using the OLE DB interface and then integrating them with commercial code.

Indication of Success

During the last six months, we have instrumented our code for use in SGI’s performance package called speedshot. Preliminary experiments have been performed and indicate that the cache performance of main memory database systems, even those not specifically designed with cache cognizance, is better than that of disk-based commercial systems reported in the literature. There is still room for improvement, however, and further experiments on more realistic workloads are necessary.

Project Impact

Currently one undergraduate is working on the project and a graduate student will soon be added. The effects on education have already been dramatic. The undergraduate database curriculum has been redesigned as a two quarter sequence, the first focusing on using a database system and second on building one. The removal of implementation issues in the first quarter has significantly broadened its appeal. Enrollment has tripled and many students have found the material useful for both undergraduate projects as well as in their jobs post-graduation. In particular, two students projects are now being used in production by the University.

GPRA Outcome Goals

This project has only been underway for 6 months, but if successful will be useful in improving the performance of main memory database systems. This is particularly important in electronic commerce applications since the volume of transactions seen by servers is much higher than traditional sales systems whose input flow is gated by human cash register operators. The techniques may also improve the performance of disk-based database systems that have large buffer pools or that can pin an important subset of their tables in memory.

Project References

Area Background
The vast majority of commercial databases have been too large to fit in memory. Hence traditional database systems have had to access the data from disks. Since a disk is many orders of magnitude slower to access than RAM, much of the work in database system performance has ignored memory access cost and instead focused on how to minimize requests to the disk or to access it in the most efficient way possible (i.e. replacing random with sequential accesses). A common theme is to try to place related information on a single disk block so a single disk access will satisfy multiple disk accesses. Recently used blocks are also caches in a buffer pool to take advantage of temporal locality of access.

If a database completely fits in main memory, queries can be answered without I/O and hence memory access time becomes important. Since accessing data in the cache is two to three orders of magnitude faster than reading from main memory, organizing data to minimize cache misses becomes important. Thus just as disk-based DBMSs carefully arrange information on disk blocks, main memory DBMSs must carefully arrange information on cache line boundaries. This task is more difficult since cache lines tend to be either 64 or 128 bytes as opposed to disk pages which are many thousands of bytes long. Despite this difficulty, preliminary results are promising --- in [nyberg94], the authors reported a 300% speed up in a sorting algorithm due to careful consideration of cache effects.

Area References

Potential Related Projects

The best source of related projects would be in the computer architecture area. We are interested in working with any group trying to optimize the performance database systems on standard hardware.
 
 


Geospatial Database-Driven Extraction of Information from Digital Aerial Imagery

Peggy Agouris
Department of Spatial Information Science & Engineering
University of Maine
348 Boardman Hall
Orono, ME 04469-5711
Phone: (207) 581-2180, Fax : (207) 581-2206
E-Mail: peggy@spatial.maine.edu

WWW Page: www.spatial.maine.edu/~peggy/CAREER.html

Project Award Information:

Award Number: 9702233
Duration: 48 months, currently on Year 2
Title: CAREER/EPSCoR: Geospatial Database-Driven Extraction of Information from Digital Aerial Imagery

Keywords:

Digital Image Analysis, Spatial Databases, Change Detection, Matching, Scale Space

Project Summary

This CAREER project will advance the ability to extract spatial information from digital aerial imagery for which prior or complementary information already exists. Examples of such information are pre-existing digital maps, digital terrain models, and spatial information systems in general. We use geospatial databases to support and guide digital image analysis methods for object extraction. The work will accomplish both research and educational objectives. The research program develops novel concepts and algorithms for the extraction of information from digital images. Emphasis is put on matching images to abstract representations of reality, change detection, analysis of scale differences between various images and images and databases, and structuring the extracted information in order to be suitable for incorporation into geospatial databases for their updating. A workshop is also planned with the participation of leading experts from a variety of disciplines relevant to this project. The educational initiatives will incorporate the research advancements made in this project into our graduate and undergraduate curriculum through the development of a new course and the modification of two existing courses, and in our high school outreach program.

Goals, Objectives, and Targeted Activities

Our objective is to improve information extraction processes from digital aerial images by having image analysis methods supported and guided by pre-existing/complementary spatial information.

The research program develops novel concepts and algorithms for the extraction of information from digital images. The research plan is organized in complementary tasks. More specifically:

By embedding object extraction within the framework of spatial information systems, digital image processing and analysis will be able to exploit the advantages offered by the availability of various sources and formats of spatial data.

This year (June 14 and 15, 1999) we will hold a 2-day single track international workshop with the participation of experts from the fields of digital image analysis, photogrammetry, GIS, and databases. The proceedings will be fully refereed and will be published by Springer Verlag in their Lecture Notes in Computer Science (LNCS) series.

Parallel to the aforementioned research activities, this project includes educational initiatives, designed to take advantage of and incorporate the research advancements made in this project. A new undergraduate course is being developed. On graduate level, experiences acquired in this project are being incorporated in two current courses taught by the PI. Also, the high school outreach program of our Department will be enriched and expanded, exposing junior classes to the challenging and evolving role of digital images in spatial information engineering.

Indication of Success

As mentioned before, we are currently halfway through year 2 of this four-year project. Years 1 and 2 focus on the development of a matching scheme which uses as input database object models and is applied on digital imagery for object extraction and change detection. We have developed the basic theoretical part of our matching scheme and have began its implementation and relevant experiments. We have been able to extend matching from an image-to-image operation into an image-to-information operation. This research continues according to schedule and our progress is consistent with the goals of the project. Our advancements in this direction so far have enabled us to also consider the potential use of the developed concepts and techniques in image retrieval scenaria.

Project Impact and Output

Human Resources: This project supports the graduate studies of two Ph.D. candidates. Also, educational developments resulting from this work affect one undergraduate and two graduate courses, with an average yearly participation of approx. 15 students per class. Work covered by this project is also part of the high school outreach program ("Spatial Horizons") of our department. This program is typically attended by 100 or more high school students per year. Another high school outreach program affected by this project is the University-organized "Expanding your Horizons" program. It addresses directly female high school students, attempting to increase their participation in science and engineering disciplines.

Education and curriculum development: This project includes educational initiatives, designed to take advantage of and incorporate the research advancements made in this project. A new undergraduate course is under development, tentatively titled "Integration of Digital Image Processing and Geospatial Databases". On graduate level, experiences acquired in this project are incorporated in the current courses of Digital Image Processing & Analysis, and Analytical & Digital Photogrammetry, taught by the PI. The high school outreach program of the Department of Spatial Information Science and Engineering is enriched and expanded as a result of this project, to expose high school junior classes to the challenging and evolving role of digital images in spatial information engineering.

Department/institution infrastructure: The project further supports collaboration between the PI and GIS-oriented faculty within the PI's Department, and GIS-relevant faculty within the University of Maine. Last year, one such collaboration resulted in the awarding of a Center of Excellence in Remote Sensing Applications grant by NASA ($300,000 from NASA, plus software donations valued at approximately $150,000). This year, the PI collaborated with another faculty member within the same Department and they were awarded a project of approx. $320,000 from the National Imagery and Mapping Agency. This new research project complements the CAREER project of the PI by extending certain concepts towards spatiotemporal reasoning and the modeling of change.

Industry: We are still at an early stage regarding this project. However, some initial discussions with leading industrial firms (e.g. Autometric and LHS) in the fields of digital image analysis and softcopy photogrammetry have confirmed their interest in technology transfer of our work.

GPRA Outcome Goals

This project contributes towards the following 4 out of the 5 GPRA Outcome Goals:

Goal 1: Discoveries at and across the frontier of science and engineering.

The new concepts which are under development during this project will contribute towards the seamless integration of digital images and geographic information systems (GIS). They are expected to improve and automate the geospatial information production, management, and analysis cycles.

Goal 2: Connections between discoveries and their use in service to society.

The objective of this project is the development of a seamlessly integrated geospatial information environment. This will allow various types of users (e.g. scientists, engineers, practitioners, private citizens, students) to have easier access to better and more accurate multifaceted geospatial information. In doing so, we permit various users to take full advantage of modern technological achievements (e.g. GPS, digital cameras and video, Web).

Goal 3: A diverse, globally-oriented workforce of scientists and engineers.

The project encourages and facilitates the exchange and integration of geospatial information between various sources and/or users, thus contributing to this goal.

Goal 5: Timely and relevant information on the national and international science and engineering enterprise.

Please refer to goals 2 and 3 above.

Project References

Area Background

Digital photogrammetry is an important field of digital image analysis. It deals with the processing of digital images for the extraction of metric quality information on objects/scenes depicted in them, without having to get into physical contact with the object/phenomenon under study. Typical photogrammetric applications employ aerial, space, or even close-range imagery. Applications range from mapping and environmental monitoring to medical and biological image processing, intelligence and defense activities, and even flow monitoring. In this project we deal with aerial images representing scenes for which prior or complementary information already exists in geospatial databases. Geospatial databases (e.g. digital maps, GIS databases) contain qualitative and quantitative information on the precise location, function, and interrelationships of man-made objects (e.g. roads, buildings) and natural objects (e.g. bodies of water, or the terrain itself). Aerial images are ideal complements to geospatial databases as data capture sources, since image-captured terrain scenes can be used to extract accurate and up-to-date qualitative and quantitative spatial information. Object extraction from digital images is a major research topic in computer vision and digital photogrammetry.

Area References

Potential Related Projects

General topics:

Retrieval of digital images and other types of GIS and cartographic data from integrated spatial databases using matching techniques, topological relations, metadata, and database management systems.

Object extraction from digital images in medical/biological applications, using CAT scans, mammograms, X-rays etc.

Spatio-temporal modeling, especially considering novel sensors and wireless communications; rapid scene modeling and updating using hand-held cameras in dynamic environments.

Other on-going IDM projects:

The projects of Arif Ghafoor, Shashi Shekhar, Wesley Chu appear to be good candidates for collaboration
 
 


 

 

A Family of the ODMG Object Models

Principal Investigator: Suad Alagic
Affiliation: Department of Computer Science, Wichita State University

Contact Information

Suad Alagic
Department of Computer Science
Wichita State University
Wichita, KS 67260-0083
Phone: (316) 978 3916
Fax: (316) 978 3984
E-mail: alagic@cs.twsu.edu

WWW Page

http://www.cs.twsu.edu/~alagic/nsf_ODMG.html

Project Award Information

Award Number: IIS-9811452
Duration: 9/15/1998 - 8/31/2001
Title: A Family of the ODMG Object Models

Keywords

Object-oriented, ODMG Standard, type systems, persistence, constraints, Java.

Project Summary

The goal of this project is to provide research and technical solutions for major problems in the ODMG Standard, the currently proposed industrial standard for object-oriented databases. The main expected contributions of this project to the ODMG Standard are in the underlying type systems, the model of persistence, and the constraint language. The approach is based on establishing a well-defined family of object models replacing the existing single and inadequate ODMG Object Model. The expected results also include some implementation models and techniques applicable to the technology proposed by the ODMG Standard. As the ODMG Standard contains a binding for the Java programming language, some results of this project will also be applicable to the Java technology in general. The research team will play an active role in transferring results of this project to the ODMG by making specific proposals to the future releases of the Standard. The results of this project are intended to contribute to proliferation of a whole generation of commercial object-oriented database management systems compliant with a well-defined Standard that reflects properly the state of the object-oriented research.

Goals, Objectives and Targeted Activities

Specific research and technical goals and the expected results are:

Indication of Success

There are two measures of success of the activities in this project. One of them is the technical validity of the research and technical solutions as evaluated by the research and technical community. The other measure is the actual influence on the new releases of the ODMG Standard.

Project Impact

This project is expected to influence the future releases of the ODMG Standard, either directly or indirectly. The direct influence is expected from the direct work of the PI with the ODMG. Indirect influence is expected from the published papers and the expected response of the research and technical community to those papers.

Human Resources: One PhD student will be supported by this grant. MS level students will be involved in more pragmatic, system-oriented issues. PI will gain valuable industrial insights and interactions.

Education and curriculum development: A top-level graduate course on object-oriented databases is being offered on a regular basis. This course addresses the issues of the ODMG Standard and involves a non-trivial database project implemented using an ODMG compliant database management system.

Department infrastructure: Research facilities for this project are located in the Object-Oriented Research Lab. equipped with UNIX and SGI workstations and NCD X-terminals. The Lab is running a variety of software systems: object-oriented programming environments (for Java, Eiffel and C++), object-oriented database management systems (O2 and Ode) and persistent-object systems (PJama and GemstoneJ)

Industry: The ODMG Standard has been developed by the vendors of object-oriented database management systems. Other related and interested industrial partners are also involved in the consortium. The PI of this project is participating actively in the ODMG meetings contributing proposals for solving the problems in the Standard. Research results from this project are thus directly transferred to the industrial partners of the ODMG.

GPRA Outcome Goals

1.
Discoveries at and across the frontier of science and engineering.
The ODMG Standard is in fact an engineering document, supported and implemented by software and database engineering professionals. The role of the PI is to contribute the Standard in such a way that the Standard reflects properly the state of the object-oriented research. This project thus aims at the research results that are acceptable from the viewpoint of database and software systems engineering.
2.
Connections between discoveries and their use in service to society
This project is addressing research and technical problems of an industrial standard. The results are transferred directly to the ODMG. Those proposals that are in fact accepted by the ODMG are expected to be incorporated in the new releases of the Standard, and then implemented by the vendors of object-oriented database management systems.

Project References

The following reports have been produced so far:

The above papers have been submitted for publication and will be available from the project web page.

Area Background

Object-oriented database technology is a database technology based on the paradigm of object-oriented programming languages such as C++ or Java. As such, object-oriented database technology unifies two core technologies: the database technology and the technology of programming languages and systems. Because of this, the object-oriented database technology is very different from the relational technology which currently dominates the market.

Object-oriented database technology is well-suited for complex application (for example, engineering) environments involving objects with complex structure and complex behavior. Object-oriented database technology is appealing to the users that rely on object-oriented application modeling techniques and/or object-oriented programming languages and other object-oriented software tools in developing their applications.

Area References


 
 


CAREER: Supporting Workflow, Long Duration and Nested Transaction Models in a Multilevel Secure Database Environment

Vijay Atluri
MS/IS Department
Rutgers University

Contact Information

180 University Avenue, Newark, NJ 07102
Phone: (973) 353-1642, Fax : (973) 353-5003
Email: atluri@andromeda.rutgers.edu, Web page: http://cimic.rutgers.edu/~atluri

WWW PAGE

http://cimic.rutgers.edu/~atluri/nsf-career.html

List of Supported Students

Wei-Kuang Huang, Sept 1996 - Aug 1998
Soon Ae Chun, Sept 1998 - Aug 1999

Project Award Information

Keywords

Advanced transaction models, workflows, security, authorization model, digital libraries, real-time information services.

Project Summary

This project has two main goals: (1) to investigate how multilevel security constraints impact transaction processing in advanced database applications, and (2) to investigate how the properties of the advanced transaction models can be utilized to solve the secure concurrency control problems in traditional databases. In particular, it develops multilevel secure workflow and multilevel secure long duration transaction models as well as workflow authorization models. It also investigates whether nested transaction models can be more suitable for multilevel secure (MLS) database management systems (DBMSs) than the traditional transaction model and develop the necessary extensions to the nested model required by MLS DBMSs. The results of this project enable incorporating security into advanced database applications such as engineering design and long running activities, and provide efficient secure transaction processing protocols for conventional database applications. The education component of this project develops new courses at both undergraduate and graduate level and student involvement in the research.

Goals, Objectives, and Targeted Activities

Our goals are to develop techniques and algorithms to incorporate both mandatory and discretionary security into workflow management systems. We have identified two new advanced application domains, digital libraries and real-time information services, for which we intend to develop authorization models. During this year and next year, our focus is on (1) analysis of multilevel secure workflows and workflow authorization models (2) developing a workflow execution model for a heterogeneous autonomous distributed workflow environment, (3) implementing a web-based secure workflow system using Perl and Java, (4) developing a multilevel secure nested transaction model, (5) developing a security model suitable for a digital library environment, and (6) developing an authorization model for real-time information services. Our research is moving towards new and interesting application domains, as can be seen from the last two goals, which were not part of the original proposal.

Indication of Success

During the past two years, we have successfully developed an execution model for multilevel secure workflows, an authorization model suitable for WFMS and a tool based on Petri nets to analyze the properties of workflows such as safety and consistency. Our progress is consistent with the goals of the project. We have identified several new problems in the WFMS area and therefore continuing to work in this area. We are currently developing a web-based secure workflow management system and extending our work on WFMS to heterogeneous autonomous distributed environments. Workflow management systems are increasingly being used today in numerous application domains. While protecting the information from unauthorized and malicious users is important, there has been no research addressing these issues in WFMSs. Our research identifies the security requirements in WFMS and provides suitable models and algorithms to meet these requirements. In addition to the objectives stated in the original proposal, we have identified new application domains that were not part of the original proposal. We are currently developing authorization models for digital library environment and for real-time information services.

Project Impact

GPRA Outcome Goals

  1. Discoveries at and across the frontier of science and engineering: The results of this project enable incorporating security measures (both mandatory and discretionary) in advanced database applications including engineering design, long running activities such as workflow systems, digital library environments and real-time information services. We have successfully developed an execution model for multilevel secure workflows, an authorization model suitable for workflow management systems, a tool based on Petri nets to analyze the properties of workflows such as safety and consistency, and a web-based secure workflow management system.
  2. Connections between discoveries and their use in service to society: Workflow systems are today used in numerous application domains including office automation, finance and banking, healthcare, telecommunications, manufacturing and production, to run their day-to-day applications. In such an environment, it is important to protect information against both direct and indirect leakage to unauthorized users. The models and tools developed in this project enable enforcing security policies that are typical of commercial applications, as well as military applications that call for more stringent security requirements. As an example, our models and tools enable enforcing security policies such as separation of duties in a paperless office environment, in which electronic documents are exchanged among individuals and processes. Our authorization model for digital libraries (DL) enables enforcement of access control to DL objects based on their content and the qualifications of users rather than object and user identities. This project also has an impact on education and curriculum development. See ``Project Impact'' above.
  3. A diverse, globally-oriented workforce of scientists and engineers: Human Resources Impact: See ``Project Impact'' above.

Project References

Area Background

The first two years of this project focuses on security issues in the area of workflow management systems (WFMSs). WFMSs are today used in numerous application domains including office automation, finance and banking, healthcare, telecommunications, manufacturing and production, to run their day-to-day applications. A workflow separates the various activities of a given organizational process into a set of well defined tasks, which are carried out by specified users or processing entities. In such an environment, it is important to protect information against either by direct or by indirect leakage to unauthorized users. The security policies specified by the organization typically reflect its security requirements. These requirements can be broadly classified into two categories: (1) discretionary access control (DAC), in which users, at their discretion, are allowed to grant permissions on the data they own, and (2) mandatory access control (MAC), in which all data are labeled based on their level of sensitivity and all users are given clearances, and users are allowed to access data with a given label only if their level of clearance permits them. While DAC is suitable for some commercial applications, they are not adequate where more stringent security requirements are needed, because they are vulnerable to sophisticated attacks (e.g. Trojan Horse Attack). In this project we develop the necessary models and algorithms to incorporate both DAC and MAC into workflow management systems.

Area References

Potential Related Projects

Flexible Transactions in DBMSs: Formalism and Support Using Active Capability, PI: Sharma Chakravarthy. Others to be determined.
 
 
 


Reliable Data Sharing in Wide Area Gigabit Networked Databases

Sujata Banerjee

Dept. of Info. Sci & Telecommunication

University of Pittsburgh

 

Panos K. Chrysanthis

Dept.  of Computer Science

University of Pittsburgh

Contact Information

Panos K. Chrysanthis 
University of Pittsburgh
Pittsburgh, PA 15260
Phone: (412) 624-8924 
Fax:    (412) 624-8854
Email: panos@cs.pitt.edu

Sujata Banerjee
University of Pittsburgh
Pittsburgh, PA 15260
Phone: (412) 624-9470
Fax:     (412) 624-2788
Email: sujata@tele.pitt.edu
URL: http://www.pitt.edu/~sujata

WWW PAGE
 URL:  http://www.cs.pitt.edu/admt/projects/wands.html

List of Supported Students

Project Award Information

Keywords

Data servers, Caching, Transaction Processing, High speed networks, Data consistency and recovery

Project Summary

This project's goal is to study the impact of the new capabilities of high-speed networks on the design and performance of data management protocols. In particular, the focus is on high-speed wide-area networks (WANs) where the propagation latency is the dominant communication cost, the migration of large amounts of data is not an issue, and efficient multicasting and guarantees on network Quality of Service (QoS) parameters are available. In this project, existing concurrency control, commit and recovery protocols are being evaluated with respect to performance scalability under the new network assumptions. Further, new protocols are being developed that utilize the huge bandwidth and minimize the number of sequential phases of messages exploiting the multicasting capability and the provision of QoS guarantees such as bounded end-to-end packet delay and packet loss. At the same time, because of the need to scale to thousands of clients and servers, not all of which may be operational at any given time, the new WAN data management protocols are designed to achieve higher site autonomy in a cooperative manner, reducing global synchronization and supporting disconnected operations. The various protocols are empirically evaluated using computer simulation under different scenarios. The results of this research are expected to provide a better insight into the synergy between database and network technologies. This, in turn, will facilitate the deployment of future high performance WAN data server systems which are part of the necessary infrastructure that will support important nation- and world-wide applications such as electronic commerce.

Goals, Objectives, and Targeted Activities

This year our goals and objectives are as given below.

Indication of Success

We have made substantial progress on the first three objectives stated above. Work on the read-only optimization, comparison with other caching protocols and enhancement of the existing simulation testbed are progressing on schedule. Preliminary results indicate that the results are in line with our expectations. A significant side effect of this work so far has been to investigate the nature of the network traffic generated by the transaction processing system, which is essential in deploying future data server systems on next generation WANs. Also, the specific application area of E-Commerce is under exploration and some preliminary ideas on dynamic workflow management have emerged. This work will contribute the understanding of high performance data server systems and their interaction with high speed networks operating in the guaranteed QoS paradigm. Further, several insights into performance tuning of these systems will be obtained for a variety of workloads, transaction models and network environments.

Project Impact

GPRA Outcome Goals

The goals of this project are in consensus with the long-term vision of the computer and communications industry to provide fast and reliable information access to the masses, taking into account the individual user's specific needs. Every attempt is being made to develop theoretical frameworks and then transfer them into pragmatic environments so their impact is far-reaching in the commercial sector. The project thus far has attracted a diverse body of students, including women, who are particularly under-represented in this research community.

Project References

These and other papers are available from the project WWW page.

Area Background

The same reasons that caused the emergence of the client-server database systems to become the predominant distributed database architecture in local area networks (LANs), combined with ever faster and cheaper computers, cheaper stable memory, high-speed networks and network-aware applications, are now causing unprecedented growth in information/distributed database systems over WANs. The WWW is a prime example of a WAN data-server system which is primarily read-only for the time being. In the future, it is expected that general data-server systems (also called data shipping or enhanced client-server systems) in which clients perform much of their query and transaction processing locally, will be deployed over wide-area networks (WANs). It is also expected that multiple WAN servers will operate cooperatively and that WAN clients will evolve and function as data-servers over local-area networks that will also need to cooperatively process transactions across multiple WAN servers, creating an even more powerful distributed database environment of multiple data-servers and collaborating-servers over WANs. It is within such a database environment that we envision that nation- and world-wide enterprises, for example, will be operating and electronic commerce will be carried out. In order to realize such powerful distributed database environments, there is a need to study the impact of the new capabilities of high speed networks on the design, the performance and the scalability of data management protocols, specifically the availability of huge bandwidth, of efficient multicasting and of Quality of Service (QoS) guarantees.

Area References


 
 
 


Quasi-cubes: Exploiting Approximations in Multidimensional Data Sets

Daniel Barbará
Information and Software Engineering Department
George Mason University

Contact Information

Daniel Barbará
George Mason University
ISE Dept. Mail Stop 4A4
Fairfax, VA 22030
Phone: (703) 993-1627
Fax : (703) 993-1638
Email: dbarbara@gmu.edu
WWW page: http://www.isse.gmu.edu/~dbarbara/quasi.html

Project Award Information

Award Number: IIS-9732113
Duration: 09/01/1998 – 8/31/2001
Title: QUASI-CUBES: Exploiting Approximations in Multidimensional Data Sets

Keywords: Multidimensional data, datacubes, query approximation, data mining, statistics.

Project Summary

A datacube is a multidimensional structure that contains at each point an aggregate value, i.e., the result of applying an aggregate function to an dataset of records. The purpose of this project is to develop, implement and experimentally evaluate techniques to reduce the storage needs in a multidimensional cube by substituting some of the data by more succinct models that characterize it. By doing this, errors are introduced when the data is reconstructed. The tradeoffs between space savings and reconstruction errors are studied, along with tradeoffs between response time to queries and reconstruction errors. Understanding these tradeoffs make it possible to efficiently design and implement multiresolution systems where query answers can be incrementally refined as the user looks at them. Finally, the models are a good starting point to mine the cube data since they show valuable patterns and associations among the different dimensions. The results of this project will provide valuable techniques and software to efficiently store and analyze large multidimensional data sets, such as those that are common in health, financial and scientific applications.

Goals, Objectives, and Targeted Activities

We are currently involved in the following activities:

Indication of Success

We have successfully demonstrated the feasibility of using statistical models for characterizing datacubes. A preliminary report shows that substantial savings in space can be achieved at the expense of reasonable accuracy in the query answers. In less than half of the first year of the grant, we are very close to the completion of the first three prototypes of the systems mentioned above. We expect to complete them in a short period and to submit the results to conferences and journals. Our research has attracted the attention of the information technology department in the Department of Justice and we expect to collaborate with them in the near future.

Project Impact

GPRA Outcome Goals

Project References

Area Background

As the data collected by organizations grows in size, the capacity to extract information from all this data has yet to catch up. A first step in understanding data is to organize it in suitable ways for analysis. This factor has prompted the development of {\it data warehouses} as repositories of integrated information. Within a data warehouse, a very useful data organization, well suited for analysis is the dimensional model, which presents the data as a set of cells in a multidimensional space. A popular abstraction of this model is the datacube: a multidimensional structure that contains at each point an aggregate value, i.e., the results of applying an aggregate function for an underlying relation. Implementing datacubes can be achieved by materialization of the cells or by computation of them on demand. Both strategies have shortcomings: the staggering amount of space needed and the potential for long delays to obtain the answers. Our research addresses the investigation of techniques for efficiently scaling datacubes. These techniques make use of a variety of statistical models to provide characterizations of portions of the datacube. We use these characterizations to provide datacube compression or to provide quick and approximate answers that can be refined on-line. Also, the usage of this models enables the scaling of a variety of data mining algorithms.

Area References


 
 


Similarity Based Retrieval from Video and Pictorial Databases

A. Prasad Sistla and Jezekiel Ben-Arie


Contact Information

Jezekiel Ben-Arie 
EECS Dept. (M/C 154) 851 South Morgan Street 
University of Illinois at Chicago, Chicago, IL 60607 
Ph/Fax: (312) 996-2648 Email: benarie@eecs.uic.edu
URL: http://www.eecs.uic.edu/~benarie

A. Prasad Sistla 
EECS Dept. (M/C 154) 851 South Morgan Street 
University of Illinois at Chicago, Chicago, IL 60607 
Ph: (312) 996-8779 Email: sistla@eecs.uic.edu
URL: http://www.eecs.uic.edu/~sistla

WWW PAGE
IDM'99 Report: http://www.eecs.uic.edu/~benarie/video_db/
Web Based Demo of Digital Video Library: http://arik.eecs.uic.edu/cgi-bin/vdsearch.cgi

Project Award Information
Award Number: 9711925
Duration: 09/01/1997 - 08/31/2000
Project Title: "Similarity Based Retrieval from Video and Pictorial Databases"

Keywords
Content Based Retrieval, Video and Pictorial Databases, Multi-media databases, Shot and image segmentation, Motion and activity recognition, Object recognition.

Project Summary
This project proposes research towards the development of a system for similarity based retrieval from video and pictorial databases. The salient features of the proposed system are :
An expressive query language, called Hierarchical Temporal Logic (HTL), for expressing spatio-temporal queries on video databases. A modular similarity based retrieval system for the temporal part of the query which can be used on top of any suitable picture retrieval system. An image understanding system for extracting the objects and activities in the videos in the database which uses local and global color characteristics, motion signatures, shape characteristics and texture. An activity recognition module that recognizes human actions will be used to interpret video shots with respect to temporal events. A dictionary based system for translating the parts of the query into feature vectors which are searched in a space consisting of the feature vectors corresponding to the videos present in the database. An incremental learning module which expands the dictionary from examples.
HTL uses the classical temporal operators to specify temporal properties of videos (i.e. the sequencing of video segments), and it employs level modal operators to specify such properties at different levels in the video hierarchy. At the atomic predicate level, the language allows specification of properties on the contents of a single video segment including objects and motions (activities). For a given user query, each atomic predicate in the query is translated into feature vectors (or hyper regions in feature space) using a dictionary based scheme; these feature vectors are then searched using multi-dimensional indexing in the database consisting of the feature vectors of all the video segments in the video database; the result of this is a similarity list containing entries where each entry contains the id of a relevant video segment and a similarity value denoting how closely it satisfies the atomic predicate. Such similarity lists for the atomic predicates can also be obtained by a using a picture retrieval system. The similarity lists for the atomic predicates are combined together to obtain a similarity list for the main HTL query.The unique aspect of our feature vector extraction is the combination of the static generic object characterization in pictures and the motion/activity recognition in videos using a unified approach based on multidimensional indexing both in the image domain and in the action domain.
The project proposes extensions to the HTL query language, by incorporating additional temporal operators in order to make it more expressive. A graphical user-interface for specifying HTL queries will be developed. To improve the accuracy of the similarity based retrieval, alternate similarity functions will be investigated. Techniques such as temporal indexing (i.e. indices on the time dimension) will be investigated in order to enhance the performance of the retrieval system. Also, extensions to the currently implemented picture retrieval system are proposed.
The project also proposes development of new methods for segmentation of a video stream into shots using edge detection technique based on feature vectors of the frames. Also proposed are novel image segmentation and methods for person detection based on novel approach of model based segmentation and color characteristics. The research proposes generic recognition of objects and activities employing a flexible dictionary approach that translates atomic predicates into feature vectors which are then matched with corresponding features of video frames/shots. The dictionary can represent also generic inanimate or animate objects and activities by hyper-regions in the feature vector space. It is proposed to develop learning techniques which incrementally expand the dictionary with additional entries and generalizes existing entries.

Goals, Objectives, and Targeted Activities

Indication of Success The Similarity list generator for a subclass of HTL queries, called conjunctive queries, has been implemented. Preliminary results are encouranging. We have succeeded in segmentation of human faces using color and frequency characteristics. We have completed a model-based segmentation scheme that detects man-made objects in cluttered scenes. A robust scheme for head detection is now in development with successful results for various poses of human head. Also, a robust scheme for tracking human body parts for the purpose of activity recognition is under construction. A web demo is available at http://arik.eecs.uic.edu/cgi-bin/vdsearch.cgi.

Project Impact and Output Human Resources: Minlin Deng: Ph. D. student, Tao Hu, Ph.D. student, Zhiqian Wang: Ph.D. student, Dibyendu Nandy: Ph.D. student, Xiao Du, Ph.D. student, R. Venkatasubramanian: M. S. student. Education and curriculum development at all levels. The PI, Prof. Sistla teaches graduate and undergraduate courses in Databases and other areas of Computer Science. The Co-PI, Prof. Ben-Arie teaches graduate and undergraduate courses on image understanding, image analysis and image processing. His research has influenced and contributed to the contents of these courses.

GPRA Outcome Goals Discovered a novel Volumetric Frequency Representation that encapsulates both the 3D structure of objects along with a continuum of their views. This establishes a new approach in computer vision which can be for many applications such as pose-invariant object/face recognition,shape reconstruction from single images, 3-D motion recovery. Discovered and developed a novel method for segmentation of objects based on their generic models. This method is very useful in detection of objects and persons in cluttered scenes.

Project References

Area Background
Content Based retrieval from Video and Pictorial Databases is an active area of research. The main research goal is to facilitate retrieval of video and pictorial objects based on their content. The area spans multiple disciplines of which the Database and Information Retrieval and the Vision Understanding and Image Processing Disciplines. The Database area is concerned about languages for specifying queries, efficient retrieval mechanisms. The Vision Understanding and Image processing areas are concerned about processing raw pictures and videos, and extracting quantitative and qualitative information from them.

Area References

Potential Related Projects
Any project that is concerened with information retrieval from pictorial and video data.
 
 
 


Adaptive Multiresolution Data Representation

R. Daniel Bergeron (*)
John P. McHugh (**)

(*) Department of Computer Science
(**) Department of Mechanical Engineering

University of New Hampshire
Kingsbury Hall
Durham, NH 03824

Phone:    (603) 862-3778
Fax:        (603) 862-3493
Email:     rdb@cs.unh.edu
                 jpm@alma.unh.edu
www:       www.cs.unh.edu/~rdb

WWW PAGE
www.cs.unh.edu/projects/vis/mdr

Project Award Information

Keywords
multiresolution data, hierarchical data, level of detail, error representation, orthogonal wavelets.

Project Summary
This project is based on the development and evaluation of a multiresolution data representation model for interactive visualization and exploration of very large multidimensional multivariate scientific data. The combination of wavelets and multidimensional visualization techniques provides a powerful and effective basis for real-time visualization of very large data sets. We believe that compactly supported orthogonal wavelets can be used effectively to furnish an authentic representation of very large scientific data, and at the same time, provide a fine to coarse hierarchy for detail investigations. An essential component of our wavelet-based multiresolution data representation model is the ability to provide a meaningful localized error estimation for each level of the representation. Preliminary results show that large data sets can be effectively viewed at low resolutions. A scientist can search the coarse representation for interesting patterns, and can use the local error display to identify areas of the coarse representation where the visual representation has too much error to be trusted. In both cases, the scientist can then explore the identified areas at a higher resolution.  The current research is an interdisciplinary effort aimed at  developing and evaluating tools to support the interactive exploration of computational fluid dynamics (cfd) output representing the turbulence generated by gravity waves in the atmospheres of the Earth and the outer planets.

Goals, Objectives, and Targeted Activities
The principal immediate goal for this project is to develop the basic software that allows us to use the multiresolution data representation with the computational fluid dynamics (cfd) simulation output. This effort requires adaptations to the existing multiresolution data generation code and to the simulation code as well as the development of new code to support interactive access to the multiresolution hierarchy over the net. A critical component of the current research will be the development and evaluation of appropriate error criteria and presentation for cfd data of this type.

Indication of Success
We expect this research to result in contributions to both the theory and practice of multiresolution data representations. We will be exploring the time/space/accuracy trade-offs that are inherent in the use of our multiresolution data representation in the context of computational fluid dynamics. In addition, we will be developing a set of software tools to utilize our representation for interactive access to and visualization of extremely large cfd data sets. Such an environment will significantly improve a scientist's ability to manipulate and interpret very large data sets.  Although we are focusing on cfd datasets, we believe that much of the experience and knowledge we gain from this effort and most of the software we develop will be applicable to other scientific domains.

Project Impact
This project is just beginning, but we expect the effort to contribute to the educational experience of a number of graduate and undergraduate students. In addition to the one graduate project assistant funded by the grant, we expect to involve at least 4 computer science and mechanical engineering graduate students by identifying master's thesis topics associated with this project. In addition, we plan to request an REU supplement to support undergraduate students on the project.

GPRA Outcome Goals
This project addresses very real and very critical problems associated with a large segment of current scientific research -- the difficulty of dealing effectively with extremely large data sets. Our major goal is to improve the scientist's ability to understand and interpret very large datasets and, therefore, to help scientists to conduct their own research more effectively.

Project References

Area Background
Scientific data visualization [Nielson and Shriver, 1990; Rosenblum et al., 1994 ] is an emerging multidisciplinary research area that incorporates aspects of da ta modeling, data access and storage, interactive techniques, understanding of h uman perception and considerable emphasis on domain-specific needs and visualiza tion goals.  The notion of exploratory data analysis [Tukey, 1977] has been a fundamental component of much scientific data visualization research and corresponds closely to recently developed ideas of data mining  and knowledge-directed data discovery.  In recent years, significant research efforts have been devoted to development of multiresolution data representations including notions of multiple levels of detail for rendering purposes.  One particularly significant development in this area is the use of wavelet transformations [Strang, 1989; Stollnitz et al. 1995] which were originally used primarily for data compression.  Mallat's seminal paper [Mallat, 1989] was a primary motivation for much of the later use of wavelets for volume rendering and surface representation and for our data modeling work.  A basic one-dimensional wavelet transform of n data elements produces two sets of n/2 coefficients, called the summary and detail components.  The operation is invertible, so that the original signal can be reconstructed from the coefficients.  The coefficients are essentially weights applied to wavelet basis functions which have compact and limited support.  Consequently,  small coefficients can be discarded (treated as 0) without major impact on the reconstructed signal.  This characteristic makes wavelets useful for data compression.  The multiresolution application of wavelets arises because the summary coefficients essentially act as a lower resolution approximation to the original data and recursive applications of the wavelet transform produce a hierarchy of resolutions.  For our purposes, however, it is just as important that (for orthogonal wavelets) we can  treat the detail coefficients as representing the locally-defined error of the associated summary coefficients at each level of the hierarchy.

Area References

Potential Related Projects
The data modeling aspect of this project relates to ongoing research based on a comprehensive data model for scientific databases.  This approach uses metric space and topological foundations to combine conventional numeric data with categorical data into a single metric space.  We are also developing formal models for distributed adaptive multiresolution data representations.  This work is being carried out in collaboration with Ted M. Sparr and several computer science graduate students at UNH.


Automated Acquisition of Domain Knowledge and Language Patterns, with Application to Language Metadata

Lois Boggess and Julia Hodges, Principal Investigators
Department of Computer Science
Mississippi State University

Contact Information
Lois Boggess or Julia Hodges
Department of Computer Science
Drawer 9636
Mississippi State, MS 39762
Phone: (601) 325-2756
Fax : (601) 325-8997
Email: lboggess@cs.msstate.edu, hodges@cs.msstate.edu

 KUDZU/WebDoc home page

List of Supported Students and Staff
Bo Tang is a Computer Science doctoral student funded by this grant.
Lynellen Perry is a Computer Science doctoral student whose work for the project has been funded in part by this grant and in part by a grant from the Mississippi State University Office of Research.
Janna Shaffer has just joined the CS doctoral program and is funded by this project beginning this January.

Project Award Information

Keywords
 Content-based document indexing, natural language processing, information retrieval, text categorization.

Project Summary
This system maps World Wide Web documents to Library of Congress subject headings. Specifically, it is designed to automatically classify Web documents into subject headings of the Science (Q) and Technology (T) portions of the Library of Congress classification system, after being trained and tested on thousands of documents previously classified by library scientists and expert document analysts.  The project represents a scaling up of knowledge-based methods and natural language processing techniques which have been successful in providing more than 80% of the descriptive phrases produced by expert document analysts for journal articles in a specific scientific domain.

Goals, Objectives, and Targeted Activities
This research furthers concept-based (in contrast to word-based) indexing of documents.  In the months since we have started, we have concentrated on subjects related to the Science (Q) sections of the Library of Congress classification system - specifically Astronomy, Physics, Geology, and Mathematics.  We have downloaded and stored the contents of several hundreds of web sites per major subject area, including ftp sites containing large collections of documents.  These web sites have been classified into the Library of Congress subject headings by human specialists.  The collection of data for training and testing purposes continues:  in the coming year we will download and store sites in Biology, Computer Science, and most of the Engineering fields.

Our initial efforts in index generation have focused on building a knowledge base that includes a structured representation of the subject headings used by the Library of Congress and synonyms for those headings.  In our earlier work with scientific journal articles, we found that some correct indexes were not recognized if the phrases that would normally serve as indicators for those indexes were buried in compound noun phrases. We are building a more structured representation of the phrases and synonym information, to make use of important sub-phrases within phrases.  We have almost completed the development of the initial knowledge base.  Once this is complete, we will test the effectiveness of the additional structure for the phrases and synonyms by measuring improvement in the recall and precision rates. We will also modify our statistical program to allow for dynamic updating of the knowledge base so that the index generation component will become more precise in its generation of the Library of Congress subject headings as it processes more and more documents.

We also target natural language processing errors which have serious impact on knowledge extraction, especially errors in determining the parts of speech of unrestricted text.  We believe that reducing these errors will require expanding the context examined by the classifiers and that hybrid classifiers combining more than one kind of classification technique are the most likely sources of breakthrough in reducing remaining errors.  To deal with context, we are training very large neural networks, with emphasis on distributing attention across the networks.  In so doing, we have opened the context window from the three or four words typical for current part of speech taggers to as much as twelve words.  Our best result so far reduces our most frequent serious error by 13%.

Our formal goals for the first year include the following:

Indication of Success
One unexpected early development is the possibility of developing a more reliable body of training and test data for machine learning projects based on classified web sites.  OCLC (Online Computer Library Center, Inc., a service organization for libraries around the world, and the "parent" organization for the Dewey Decimal Classification System) made available to us thousands of records of web sites classified by Library of Congress subject headings.  Web sites are notoriously volatile:  Files move or disappear, and sites that remain in place undergo modification, so that classifications are no longer relevant.  OCLC is looking into working independently and/or with us on establishing a corpus of accurately classified Web documents which are captured and will not undergo modification.

In the first months a prototype system has been constructed which incorporates subsets of the Library of Congress subject headings into a database that comprises a roughly hierarchical network.  This database also contains Library of Congress information about synonymous relationships between phrases related to the subject headings. Study of large neural networks incorporating contexts of up to 12 words has resulted in comparison of approaches to evolving such networks in parallel.  A paper detailing a promising parallel approach to evolving these large networks has been submitted (Boggess, 1999, in review).  Recurrent neural networks are also being explored.  A hybrid stochastic/neural network part of speech tagger is in progress.

Project Impact
Two of the three doctoral students participating are women, and both women have targeted international conferences to which they will submit papers for which they will be first author, during this semester. Issues related to the project are being incorporated into a graduate level course on natural language processing this semester. The students of that course have the opportunity to implement experiments related to language processing, terminology, and/or classification of documents, using data collected by the project.

This project is possible thanks to donations of data and time from OCLC, a not-for-profit service and research organization affiliating more than 60,000 libraries worldwide.  We have regular conference calls with two OCLC researchers.  This exchange of ideas has led to plans to partition effort on joint goals.  For example, it appears to be the case that OCLC will develop a repository of classified "web sites" (downloaded sites that will not change with time).  They offer vast expertise in human document classification, and we have mutually exchanged advice, information, and software tools for language processing and classification.

GPRA Outcome Goals
Connections between discoveries and their use in service to society.

A predecessor to this project was successful in providing 85% of the descriptive phrases indexing journal articles by content that highly trained document analysts provided, for the field of physical chemistry.  This project seeks to perform the same task, for the fields represented by Library of Congress Q and T call letters (most of science and technology) for much more diverse documents (journal articles are highly constrained as to format, in comparison with web pages).  Library of Congress subject headings range from broad (e.g., "Astronomy", "Tobacco", "Volcanoes") to specific (e.g., "Actinomycetales, Research", "Aeronautics in astronomy, United States [and] Infrared Astronomy, Research").  Success of this research will in effect allow web documents to be indexed by a long-established, widely-adopted controlled vocabulary.

Timely and relevant information on the national and international science and engineering enterprise.
If paired with a web crawler, the final product of this research will be a system with the capability to produce far fewer spurious returns from searches for information on the web, without requiring the user to know the exact words that indicate relevance in the target document, and with a high percentage of the relevant sites presented to the user.  It should be noted, however, that while the system will have the described capability, it is not within the resources of the research team to collect and establish the training data required for a system to respond to all existing Library of Congress subject headings in the sciences and technology.

Project References

Area Background
Our work is a confluence of two areas of expertise: corpus-based natural language processing and knowledge bases. It can be seen from at least three perspectives: information retrieval, broad-based study of the characteristics of language, and knowledge bases themselves. Our focus has been on extraction of information from a body of text from some particular domain.  Sometimes the purpose of the extraction has been to add to or update existing knowledge already embedded in a knowledge base.  Sometimes, as in the present classification task, the purpose has been to identify the most relevant concepts associated with that text.  In either case, there are many ways that the same concepts can be expressed by writers who have the full freedom of a natural language such as English in which to express their ideas. Consequently, the system must handle vocabulary and grammar that has not been anticipated, and relate previously unseen entities in a text to known concepts. In doing so, we examine the properties of language in the aggregate. We use clustering algorithms to examine the contexts in which words are used, with the result that we group words and concepts by "the company they keep".  An important aspect of the work is to weigh the evidence that favors one classification or interpretation over competing classifications and interpretations.

Area References


Unsupervised Document Set Exploration Using Divisive Partitioning

Daniel Boley
Department of Computer Science and Engineering
University of Minnesota

Contact Information

4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: boley@cs.umn.edu

WWW Page

Project URL: "http://www.cs.umn.edu/~boley/research/idm99.html" contains this short report. A set of papers on this algorithm as well as prototype software can be found in "http://www.cs.umn.edu/~boley/PDDP.html".

Project Award Information

Award Number: IIS-9811229.
Duration: September 15, 1996 to August 31, 2001.
Title: Unsupervised Document Set Exploration Using Divisive Partitioning.

List Of Supported Students And Staff (Optional)

  1. Daniel Boley, principal investigator
  2. Vivian Borst, graduate assistant

Keywords

Unsupervised clustering, data mining, classification, hierarchical taxomonies, Web agents.

Project Summary

This project is devoted to the research and development of a hierarchical div isive clustering algorithm. The basic underlying activity is the basic research needed to fully develop the divisive partitioning method and the applied research needed to demonstrate its use on specific applications and to make it easy for humans to use. The conceptual framework for the basic divisive partitioning method has been developed, and an efficient prototype implementation has been installed locally, on which all experimental results have been based. The research issues to be addressed in this project include completion of the theoretical foundations of the algorithm, experiments on larger datasets to measure the scalability of the methods, experiments on a datasets from a variety of sources as a way to gain understanding about the theory behind the method, extensions to the algorithm capabilities, extensions to other applications beyond text document clustering, and validation of algorithm results. Technology transfer forms an additional part of this project, and will be addressed by applying the methods developed as part of the project to other application datasets (e.g. genome, toxicology, medical literature datasets), as well as extending it to other paradigms such as hierarchical taxonomies, document rating aids and collaborative filtering. By incorporating it into a Web agent, the methods will be made available to a wide audience. All application datasets have either been collected by colleagues and will be shared as part of ongoing projects, or are available over the Web. 

Goals, Objectives And Targeted Activities

The project goals are to develop the Principal Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a variety of applications.

The specific goals can be grouped into three main categories:
(1) demonstrate and test the applicability and effectiveness of the PDDP algorithm to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for automatically determining the number of clusters or automatically producing labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the algorithm, allowing users to apply it to a variety of datasets in a straightforward way.

The research plan included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.

Indication Of Success

Success is measured in the progress made within each task in the research plan. This project has been in progress for only 6 months during which we have actively pursued all 5 tasks mentioned above. In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to incorporate essential new features of the methods. The features are necessary to make the method a practical and useful method in the domain of Web exploration and include an automatic stopping test and a dynamic incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating this algorithm as a tool to automatically [i.e. without user participation] organize a large collection of Web pages visited by a user using a normal Web browser. The feasibility and practicality of the design has been demonstrated [Ref. 3], but robustness issues still need to be addressed.

Project Impact And Outcome

Human Resources: The project is currently supporting the research work of Vivian Borst working toward a Ph.D.
Education and Curriculum Development: The project is supporting development of the regular semester course CSci 5521 "Pattern Recognition" to be offered next year.
Infrastructure: The highest priority item purchased so far with the modest equipment budget has been a dedicated disk partition for large datasets. The second highest priority will be the purshase of a high performance computational engine for the computational experiments.
Industry: Not Applicable.
Activities Spawned: This activity is leading to the exploration of divisive partitioning methods for other applications taking place within the Dept, including collaborative filtering, genome mapping, vision-based defect detection. All of these are still in the preliminary stages.

GPRA Outcome Goals

  1. Discoveries at and across the frontier of science and engineering.

  2. We have developed a new paradigm for partitioning large datasets and the conceptual framework to incorporate new features. This framework has been made possible only by a synergy of a variety of different scientific disciplines, with the result that methods have new capabilities not possible by using methods from only one scientific discipline.
  3. Connections between discoveries and their use in service to society.

  4. The methods begin investigated here are of immediate and practical value to the analysis of large data sets from different applications: text documents, texture images, chemical toxicity parameters, etc. The automated analysis of large bodies of data from a variety of sources is a critical issue in the age of digital technology, and automated and unsupervised analysis tools are essential in this effort.
  5. A diverse, globally-oriented workforce of scientists and engineers.

  6. Even at the early stages of this project, we have promoted the education and incorporation of women in the workforce of scientists and engineers. The interdisciplinary nature of the project also encourages exposure and learning of concepts across many scientific disciplines, leading to a more well-rounded preparation to be members of the scientific workforce.

Project References

  1. D. Boley and V. Borst, Unsupervised Updating of a Classification Tree in a Dynamic Environment, in Autonomous Agents'99 Conference, 1999, to appear.
  2. D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Partitioning-Based Clustering for Web Document Categorization, Decision Support Systems, 1999, to appear. ("dss99.ps.gz").
  3. D. Boley, M. Gini, R. Gross, E.-H. (Sam) Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, Document Categorization and Query Generation on the World Wide Web Using WebACE, AI Review, 1999, to appear. ("aireview.ps.gz").

Area Background

This project depends on the combination of several widely different fields which have come together to produce methods which are extremely successful. The foundation of the techniques -- the ideas that lead to the practicality and the scalability -- are methods for large sparse eigenvalue problems in numerical linear algebra. Specifically, the core computation is a so-called Lanczos-based method for computing the leading principal component for a large collection of sparse vectors each of which represents a document (or other data sample). The methods preserve the inherent sparsity making it possible to represent large data collections in relatively little space, and also making it possible to carry out the computations in a relatively short time. These underlying methods serve as a foundation for a variety of divisive partitioning techniques, which are analogous to the classical agglomerative clustering techniques. Instead of coalescing the individual data samples (documents) one by one into clusters (e.g. with a nearest neighbor criterion), we start with all the data samples in one big cluster and split it repeatedly along the principal component. The splitting process yields weights indicating which features are most critical in placing data samples in one cluster as opposed to another cluster, whose utility is currently underdeveloped. 

Area References

  1. Golub and Van Loan, Matrix Computations, 3rd edition, Johns Hopkins Univ. Press, 1996.
  2. D. L. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2(4), 1999, to appear. ("PDDP.ps.gz").
  3. Gose, Johnsonbaugh, Jost, Pattern Recognition and Image Analysis, Prentice Hall,1996.

Potential Related Projects

To be added in future years, once greater familiarity with other projects is obtained.


Content-Based Image Retrieval for Medical Databases

Carla E. Brodley (*), Avi C. Kak (*), Lynn S. Broderick (**) and Alex Aisen (***)

(*) School of Electrical and Computer Engineering, Purdue University
(**) Radiology Department, University of Wisconsin
(***) School of Medicine, Indiana University

Contact Information

Carla E. Brodley
School of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907
Phone: (765) 494-0635
FAX: (765) 494-6440
Email: brodley@ecn.purdue.edu

WWW PAGE

Content-Based Image Retrieval of Medical Images -- IRI-9711535

Keywords

Content-Based Image Retrieval, Machine Learning, Medical Images, Computer Vision

Project Award Information

Project Summary

Increasing computerization of the process of acquiring and storing diagnostic images requires more sophisticated approaches to image retrieval in order to optimally use the images for improved health-care delivery, physician education and research. A content-based image retrieval system is being developed that takes a human-in-the-loop approach, because completely automated approaches are not feasible for radiological images in which the clinically useful information may consist of gray level and texture deviations in highly localized but difficult-to-segment regions of an image. In this approach, a physician manually delineates the pathology-bearing regions in an image. The CBIR system then analyzes these regions for their properties to index the images into a database and retrieve similar images typically with known diagnoses. This project focuses on the domain of high resolution computed tomography (HRCT) images in patients with a variety of lung diseases.

Goals, Objectives, and Targeted Activities

In the first year of our project we collected a database of roughly 400 HRCT images of the lung, investigated methods for indexing and retrieving images, and developed a user interface that allows physicians to mark the PBR's, pose queries to the system and view the retrieval results. In the second year of our project we will focus on the following objectives:

Indication of Success

We have demonstrated that a human-in-the-loop approach to CBIR of medical images is a critical factor in the success of our approach. In our approach the physician delineates the pathology bearing regions (PBR) and a set of anatomical landmarks of the image at the time the image is entered into the database. From the marked regions, our approach applies low-level computer vision and image processing algorithms to extract attributes related to the variations in gray scale, texture, shape, wavelet coefficients, etc. We have demonstrated experimentally that for our domain, HRCT of the lung, local image characterization using PBRs is far more accurate than any method that creates an index based on a global characterization of the lung.

Recently we have developed a new approach to image retrieval, which we call ``customized queries''. This approach automatically customizes the retrieval index (the computer vision features that characterize the image) to the query image. The underlying rational of the approach is that the features that retrieve the most visually similar images within a disease class differ from those that discriminate among disease classes. During retrieval, to obtain the customized feature vectors for each query, our approach first classifies the query using the features that best discriminate the disease classes (we apply a standard machine learning technique of forward sequential selection to determine these features). Next the corresponding feature set for that class is used for retrieval within the subset of the database that has that disease. Customized queries has increased retrieval precision (the number of retrieved images visually similar to the query image as judged by a experienced lung radiologist) to approximately 90%. This is a radical increase over the 50% realized by the traditional approach to CBIR, which uses the same set of image features for every query.

Project Impact

Human Resources: The project currently supports two Ph.D. students (Chi Ren Shyu and Jennifer Dy). Education and Curriculum development at all levels: The PI, Prof Brodley teaches graduate and undergraduate courses in artificial intelligence and machine learning. The Co-PI, Prof Kak teaches graduate and undergraduate courses on image understanding, image analysis and image processing. Both Prof Brodley's and Prof Kak's research has influenced and contributed to the content of these courses.

GPRA Outcome Goals

1. Discoveries at and across the frontier of science and engineering: Our approach to CBIR for medical databases relies on human input, machine learning and computer vision. Specifically, we apply expert-level human interaction for solving that aspect of the problem which cannot yet be automated, we use computer vision for only those aspects of the problem to which it lends itself best -- image characterization -- and we employ machine learning algorithms to allow the system to be adapted to new clinical domains. As such we have made a bridge across the disciplines of medicine, machine learning and computer vision.

2. Connections between discoveries and their use in service to society: The shift in technology from analog film-based methodologies to computer based digital technologies is creating large digital image repositories. CBIR provides an opportunity to tap the expertise contained in these databases in the following way: observing an abnormality in a diagnostic image, the physician can query a database of known cases to retrieve images (and associated textual information) that contain regions with features similar to what is in the image of interest. With the knowledge of disease entities that match features of the selected region, the physician can be more confident of the diagnosis.

3. A diverse, globally-oriented workforce of scientists and engineers: Our project is training two graduate students in the diverse areas of computer vision, machine learning and medical informatics. For our project the personnel consist of 50% male and 50% female participants.

Project References

Area Background

Content-based image retrieval (CBIR) refers to the ability to retrieve images on the basis of image content, as opposed to on the basis of some textual description of the images. CBIR uses image content directly. An example of a retrieval query of this sort would be ``show me images from a given database that are similar to a particular image.'' A key element of the approach revolves around the types of patterns that can be recognized by the computer and that can serve as the indices of the retrieval.

Area References


Database Programming Languages and Environments

Peter Buneman*
Co-PIs: Alex Borgida, Luca Cardelli, Richard Hull, David Maier, David Stemple, Victor Vianu, Stan Zdonik**

*Department of Computer and Information Science
University of Pennsylvania

**Respectively: Rutgers University; Microsoft Cambridge; Lucent; Oregon Research Institute; University of Massachussets, Amherst; University of California, San Diego; Brown University.

Contact Information

Peter Buneman
CIS, University of Pennsylvania
200 South 33rd Street
Philadelphia, PA 19104
Phone: (215) 898-7703
Fax : (215) 898-0587
Email: peter@cis.upenn.edu

WWW PAGE

http://db.cis.upenn.edu

Keywords

Databases, Programming Languages, Persistent Object Systems, Heterogeneous Databases, Scientific Databases, Database Integration, Constraint Databases, Semistructured Data.

Project Award Information

Project Summary

This project sponsors a series of workshops in collaboration with EU on various aspects of databases and programming languages. It is intended to initiate workshops in various emerging areas of database research. Over the past two years it has been used to sponsor workshops related to constraint databases, semistructured databases and scientific databases.

Goals, Objectives, and Targeted Activities

The goals of the project are to foster international exchanges, and to encourage the development of new areas of database research.

Indication of Success

The workshops have all resulted in some form of publicly disseminated proceedings or report. The success of the workshops should be measured by the results of the report. )

Project Impact

Perhaps the most important recent impact of these workshops has been their contribution to the field of semistructured data. This grant has funded the two most important workshops in this area and has been, in part, responsible for the fusion of database research and web document technology, especially in the relationship between XML and semistructured data.

Project References

A brief synopsis of some of the recent workshops is given here. The publications in many cases are electronic. Since it is highly topical, The proceedings of the workshop on "Query Languages for semistructured data and non-standard data formats" are given in some detail.

  1. Workshop on Constraint Databases and their Applications

  2. Organizers: Victor Vianu (UCSD) M. Wallace (Imperial College)
    Time and Place: January 11-12 1997, Delphi, Greece. Following the ICDT '97 conference.
    Publications/reports arising from workshop: Constraint Databases and Applications, G. Kuper and M. Wallace (Eds.) LNCS 1191, Springer Verlag 1997.

  3. Workshop on Management of Semistructured Data

  4. Organizers: Dan Suciu (AT&T), Tova Milo, (Tel Aviv University)
    Time and Place: Friday, May 16, 1997 Tucson, Arizona (Adjacent to SIGMOD)
    Publications/reports arising from workshop: All 13 papers were published in an informal proceedings and are available electronically. The best five papers were selected for publication in Sigmod Record, vol. 26, no. 4, 12/1997
    URLs etc.: A workshop description and the papers presented are available from http://www.research.att.com/~suciu/workshop-papers.html or by e-mail to suciu@research.att.com
     
  5. Metadata for Data Integration

  6. Organizers: Rick Hull (Lucent), Sophie Cluet (INRIA).
    Time and Place: Aug 19, 1997, Estes Park, Colorado (Adjacent to DBPL)
    Approximate attendance: 35
    Publications/reports arising from workshop:"Metadata for Database Interoperation", to appear in Proceedings of 1997 Workshop on Database Programming Languages, Springer Verlag Lecture Notes in Computer Science. This will also be submitted to Sigmod record.
    URLs etc.: A workshop description and report are available from http://www.cis.upenn.edu/~peter/mddi or by e-mail to hull@research.bell-labs.com.
     
  7. Workshop on Distributed Information, Computation, and Process Management for Scientific and Engineering Environments.

  8. Organizer: Nicholas M. Patrikalakis (MIT)
    Approximate attendance: 50
    Time and Place: Herndon, Virginia, May 15-16, 1998
    Publications/reports arising from workshop: in preparation.
    Workshop description from http://deslab.mit.edu/DesignLab/dicpm/ or email to DICPM@deslab.mit.edu.