Brad Adelberg
Computer Science Department
Northwestern University
Contact Information
Brad Adelberg
1890 Maple Avenue
Evanston, IL 60201
Phone: (847) 467-2129
Fax : (847) 491-5258
Email: adelberg@cs.nwu.edu
WWW PAGE
www.cs.nwu.edu/~adelberg/mmdb/index.html
Project Award Information
Keywords
Main memory database system, cache alignment, locality, performance.
Project Summary
Plummeting memory prices combined with the availability of 64-bit architectures has made computer systems with multigigabytes of main memory a reasonable platform for many production systems. As a result, many applications find that all of their data, or a least the most frequently accessed tables, can be kept entirely in main memory. This represents a significant change from previous generation database systems which expected that most of the application data would reside on disk. This research focuses on two aspects of adapting database systems for platforms with large main memories.
First, it has been observed that during transaction processing applications, the CPU wastes approximately 80% of its cycles due to stalls [cvetanovic94]. Thus a 5 fold speed up is possible if cache misses are eliminated entirely. One focus of this research is on how to modify the data structures and algorithms used by database systems to minimize the cache miss rate and thereby increase system performance. Although all of the cache misses will never be eliminated, even modest reductions in the cache miss rate will yield large performance gains. Further, with clock speeds continuing to grow, the available speed up will grow as well.
The second focus of the research is how to best integrate main memory database technology with disk-based database systems in the event that all of the needed data will not fit in main memory. A unified architecture will be designed that extends traditional DBMSs with a new main memory component. We will determine which functions need to be handled within the new component (i.e. table scanning, transaction management, concurrency control) and how this optimal division of work is supported by the different commercially available extensibility proposals and interfaces (i.e. OLE DB, datablades, cartridges). Finally, we will develop algorithms for database administrators (DBAs) to determine, for a given amount of memory, which tables to store in memory versus on disk.
Goals, Objectives, and Targeted Activities
In the next year we will instrument our current code base using both Speedshop and SimOS to perform detailed runtime analysis of the interaction of the database system with the processor and memory system. We will focus on transaction processing workloads, specifically TPC-B and TPC-C, and will analyze cache miss performance by type of query as well as by system component. For instance, we will examine how concurrency control and recovery, which typically have less regular access patterns than some query execution operators (i.e. sorting), interfere with components that are more regular.
On the educational side of the proposal, the project required in the second database course (focusing on database system implementation) will be modified. It currently requires writing pieces of a reduced functionality standalone database system. The new project will require student to write components using the OLE DB interface and then integrating them with commercial code.
Indication of Success
During the last six months, we have instrumented our code for use in SGI’s performance package called speedshot. Preliminary experiments have been performed and indicate that the cache performance of main memory database systems, even those not specifically designed with cache cognizance, is better than that of disk-based commercial systems reported in the literature. There is still room for improvement, however, and further experiments on more realistic workloads are necessary.
Project Impact
Currently one undergraduate is working on the project and a graduate student will soon be added. The effects on education have already been dramatic. The undergraduate database curriculum has been redesigned as a two quarter sequence, the first focusing on using a database system and second on building one. The removal of implementation issues in the first quarter has significantly broadened its appeal. Enrollment has tripled and many students have found the material useful for both undergraduate projects as well as in their jobs post-graduation. In particular, two students projects are now being used in production by the University.
GPRA Outcome Goals
This project has only been underway for 6 months, but if successful will be useful in improving the performance of main memory database systems. This is particularly important in electronic commerce applications since the volume of transactions seen by servers is much higher than traditional sales systems whose input flow is gated by human cash register operators. The techniques may also improve the performance of disk-based database systems that have large buffer pools or that can pin an important subset of their tables in memory.
Project References
Area Background
The vast majority of commercial databases have been too large to fit in memory. Hence traditional database systems have had to access the data from disks. Since a disk is many orders of magnitude slower to access than RAM, much of the work in database system performance has ignored memory access cost and instead focused on how to minimize requests to the disk or to access it in the most efficient way possible (i.e. replacing random with sequential accesses). A common theme is to try to place related information on a single disk block so a single disk access will satisfy multiple disk accesses. Recently used blocks are also caches in a buffer pool to take advantage of temporal locality of access.
If a database completely fits in main memory, queries can be answered without I/O and hence memory access time becomes important. Since accessing data in the cache is two to three orders of magnitude faster than reading from main memory, organizing data to minimize cache misses becomes important. Thus just as disk-based DBMSs carefully arrange information on disk blocks, main memory DBMSs must carefully arrange information on cache line boundaries. This task is more difficult since cache lines tend to be either 64 or 128 bytes as opposed to disk pages which are many thousands of bytes long. Despite this difficulty, preliminary results are promising --- in [nyberg94], the authors reported a 300% speed up in a sorting algorithm due to careful consideration of cache effects.
Area References
Potential Related Projects
The best source of related projects would be in the computer architecture area. We are interested in working with any group trying to optimize the performance database systems on standard hardware.
Geospatial Database-Driven Extraction of Information from Digital Aerial Imagery
Peggy Agouris
Department of Spatial Information Science & Engineering
University of Maine
348 Boardman Hall
Orono, ME 04469-5711
Phone: (207) 581-2180, Fax : (207) 581-2206
E-Mail: peggy@spatial.maine.edu
WWW Page: www.spatial.maine.edu/~peggy/CAREER.html
Project Award Information:
Award Number: 9702233
Duration: 48 months, currently on Year 2
Title: CAREER/EPSCoR: Geospatial Database-Driven Extraction of Information from Digital Aerial Imagery
Keywords:
Digital Image Analysis, Spatial Databases, Change Detection, Matching, Scale Space
Project Summary
This CAREER project will advance the ability to extract spatial information from digital aerial imagery for which prior or complementary information already exists. Examples of such information are pre-existing digital maps, digital terrain models, and spatial information systems in general. We use geospatial databases to support and guide digital image analysis methods for object extraction. The work will accomplish both research and educational objectives. The research program develops novel concepts and algorithms for the extraction of information from digital images. Emphasis is put on matching images to abstract representations of reality, change detection, analysis of scale differences between various images and images and databases, and structuring the extracted information in order to be suitable for incorporation into geospatial databases for their updating. A workshop is also planned with the participation of leading experts from a variety of disciplines relevant to this project. The educational initiatives will incorporate the research advancements made in this project into our graduate and undergraduate curriculum through the development of a new course and the modification of two existing courses, and in our high school outreach program.
Goals, Objectives, and Targeted Activities
Our objective is to improve information extraction processes from digital aerial images by having image analysis methods supported and guided by pre-existing/complementary spatial information.
The research program develops novel concepts and algorithms for the extraction of information from digital images. The research plan is organized in complementary tasks. More specifically:
By embedding object extraction within the framework of spatial information systems, digital image processing and analysis will be able to exploit the advantages offered by the availability of various sources and formats of spatial data.
This year (June 14 and 15, 1999) we will hold a 2-day single track international workshop with the participation of experts from the fields of digital image analysis, photogrammetry, GIS, and databases. The proceedings will be fully refereed and will be published by Springer Verlag in their Lecture Notes in Computer Science (LNCS) series.
Parallel to the aforementioned research activities, this project includes educational initiatives, designed to take advantage of and incorporate the research advancements made in this project. A new undergraduate course is being developed. On graduate level, experiences acquired in this project are being incorporated in two current courses taught by the PI. Also, the high school outreach program of our Department will be enriched and expanded, exposing junior classes to the challenging and evolving role of digital images in spatial information engineering.
Indication of Success
As mentioned before, we are currently halfway through year 2 of this four-year project. Years 1 and 2 focus on the development of a matching scheme which uses as input database object models and is applied on digital imagery for object extraction and change detection. We have developed the basic theoretical part of our matching scheme and have began its implementation and relevant experiments. We have been able to extend matching from an image-to-image operation into an image-to-information operation. This research continues according to schedule and our progress is consistent with the goals of the project. Our advancements in this direction so far have enabled us to also consider the potential use of the developed concepts and techniques in image retrieval scenaria.
Project Impact and Output
Human Resources: This project supports the graduate studies of two Ph.D. candidates. Also, educational developments resulting from this work affect one undergraduate and two graduate courses, with an average yearly participation of approx. 15 students per class. Work covered by this project is also part of the high school outreach program ("Spatial Horizons") of our department. This program is typically attended by 100 or more high school students per year. Another high school outreach program affected by this project is the University-organized "Expanding your Horizons" program. It addresses directly female high school students, attempting to increase their participation in science and engineering disciplines.
Education and curriculum development: This project includes educational initiatives, designed to take advantage of and incorporate the research advancements made in this project. A new undergraduate course is under development, tentatively titled "Integration of Digital Image Processing and Geospatial Databases". On graduate level, experiences acquired in this project are incorporated in the current courses of Digital Image Processing & Analysis, and Analytical & Digital Photogrammetry, taught by the PI. The high school outreach program of the Department of Spatial Information Science and Engineering is enriched and expanded as a result of this project, to expose high school junior classes to the challenging and evolving role of digital images in spatial information engineering.
Department/institution infrastructure: The project further supports collaboration between the PI and GIS-oriented faculty within the PI's Department, and GIS-relevant faculty within the University of Maine. Last year, one such collaboration resulted in the awarding of a Center of Excellence in Remote Sensing Applications grant by NASA ($300,000 from NASA, plus software donations valued at approximately $150,000). This year, the PI collaborated with another faculty member within the same Department and they were awarded a project of approx. $320,000 from the National Imagery and Mapping Agency. This new research project complements the CAREER project of the PI by extending certain concepts towards spatiotemporal reasoning and the modeling of change.
Industry: We are still at an early stage regarding this project. However, some initial discussions with leading industrial firms (e.g. Autometric and LHS) in the fields of digital image analysis and softcopy photogrammetry have confirmed their interest in technology transfer of our work.
GPRA Outcome Goals
This project contributes towards the following 4 out of the 5 GPRA Outcome Goals:
Goal 1: Discoveries at and across the frontier of science and engineering.
The new concepts which are under development during this project will contribute towards the seamless integration of digital images and geographic information systems (GIS). They are expected to improve and automate the geospatial information production, management, and analysis cycles.
Goal 2: Connections between discoveries and their use in service to society.
The objective of this project is the development of a seamlessly integrated geospatial information environment. This will allow various types of users (e.g. scientists, engineers, practitioners, private citizens, students) to have easier access to better and more accurate multifaceted geospatial information. In doing so, we permit various users to take full advantage of modern technological achievements (e.g. GPS, digital cameras and video, Web).
Goal 3: A diverse, globally-oriented workforce of scientists and engineers.
The project encourages and facilitates the exchange and integration of geospatial information between various sources and/or users, thus contributing to this goal.
Goal 5: Timely and relevant information on the national and international science and engineering enterprise.
Please refer to goals 2 and 3 above.
Project References
Area Background
Digital photogrammetry is an important field of digital image analysis. It deals with the processing of digital images for the extraction of metric quality information on objects/scenes depicted in them, without having to get into physical contact with the object/phenomenon under study. Typical photogrammetric applications employ aerial, space, or even close-range imagery. Applications range from mapping and environmental monitoring to medical and biological image processing, intelligence and defense activities, and even flow monitoring. In this project we deal with aerial images representing scenes for which prior or complementary information already exists in geospatial databases. Geospatial databases (e.g. digital maps, GIS databases) contain qualitative and quantitative information on the precise location, function, and interrelationships of man-made objects (e.g. roads, buildings) and natural objects (e.g. bodies of water, or the terrain itself). Aerial images are ideal complements to geospatial databases as data capture sources, since image-captured terrain scenes can be used to extract accurate and up-to-date qualitative and quantitative spatial information. Object extraction from digital images is a major research topic in computer vision and digital photogrammetry.
Area References
Potential Related Projects
General topics:
Retrieval of digital images and other types of GIS and cartographic data from integrated spatial databases using matching techniques, topological relations, metadata, and database management systems.
Object extraction from digital images in medical/biological applications, using CAT scans, mammograms, X-rays etc.
Spatio-temporal modeling, especially considering novel sensors and wireless communications; rapid scene modeling and updating using hand-held cameras in dynamic environments.
Other on-going IDM projects:
The projects of Arif Ghafoor, Shashi Shekhar, Wesley Chu appear to be good candidates for collaboration
Principal Investigator: Suad Alagic
Affiliation: Department of Computer Science, Wichita State University
Contact Information
Suad Alagic
Department of Computer Science
Wichita State University
Wichita, KS 67260-0083
Phone: (316) 978 3916
Fax: (316) 978 3984
E-mail: alagic@cs.twsu.edu
WWW Page
http://www.cs.twsu.edu/~alagic/nsf_ODMG.html
Project Award Information
Award Number: IIS-9811452
Duration: 9/15/1998 - 8/31/2001
Title: A Family of the ODMG Object Models
Keywords
Object-oriented, ODMG Standard, type systems, persistence, constraints, Java.
Project Summary
The goal of this project is to provide research and technical solutions for major problems in the ODMG Standard, the currently proposed industrial standard for object-oriented databases. The main expected contributions of this project to the ODMG Standard are in the underlying type systems, the model of persistence, and the constraint language. The approach is based on establishing a well-defined family of object models replacing the existing single and inadequate ODMG Object Model. The expected results also include some implementation models and techniques applicable to the technology proposed by the ODMG Standard. As the ODMG Standard contains a binding for the Java programming language, some results of this project will also be applicable to the Java technology in general. The research team will play an active role in transferring results of this project to the ODMG by making specific proposals to the future releases of the Standard. The results of this project are intended to contribute to proliferation of a whole generation of commercial object-oriented database management systems compliant with a well-defined Standard that reflects properly the state of the object-oriented research.
Goals, Objectives and Targeted Activities
Specific research and technical goals and the expected results are:
Indication of Success
There are two measures of success of the activities in this project. One of them is the technical validity of the research and technical solutions as evaluated by the research and technical community. The other measure is the actual influence on the new releases of the ODMG Standard.
Project Impact
This project is expected to influence the future releases of the ODMG Standard, either directly or indirectly. The direct influence is expected from the direct work of the PI with the ODMG. Indirect influence is expected from the published papers and the expected response of the research and technical community to those papers.
Human Resources: One PhD student will be supported by this grant. MS level students will be involved in more pragmatic, system-oriented issues. PI will gain valuable industrial insights and interactions.
Education and curriculum development: A top-level graduate course on object-oriented databases is being offered on a regular basis. This course addresses the issues of the ODMG Standard and involves a non-trivial database project implemented using an ODMG compliant database management system.
Department infrastructure: Research facilities for this project are located in the Object-Oriented Research Lab. equipped with UNIX and SGI workstations and NCD X-terminals. The Lab is running a variety of software systems: object-oriented programming environments (for Java, Eiffel and C++), object-oriented database management systems (O2 and Ode) and persistent-object systems (PJama and GemstoneJ)
Industry: The ODMG Standard has been developed by the vendors of object-oriented database management systems. Other related and interested industrial partners are also involved in the consortium. The PI of this project is participating actively in the ODMG meetings contributing proposals for solving the problems in the Standard. Research results from this project are thus directly transferred to the industrial partners of the ODMG.
GPRA Outcome Goals
Project References
The following reports have been produced so far:
The above papers have been submitted for publication and will be available from the project web page.
Area Background
Object-oriented database technology is a database technology based on the paradigm of object-oriented programming languages such as C++ or Java. As such, object-oriented database technology unifies two core technologies: the database technology and the technology of programming languages and systems. Because of this, the object-oriented database technology is very different from the relational technology which currently dominates the market.
Object-oriented database technology is well-suited for complex application (for example, engineering) environments involving objects with complex structure and complex behavior. Object-oriented database technology is appealing to the users that rely on object-oriented application modeling techniques and/or object-oriented programming languages and other object-oriented software tools in developing their applications.
Area References
Vijay Atluri
MS/IS Department
Rutgers University
Contact Information
180 University Avenue, Newark, NJ 07102
Phone: (973) 353-1642, Fax : (973) 353-5003
Email: atluri@andromeda.rutgers.edu, Web page: http://cimic.rutgers.edu/~atluri
WWW PAGE
http://cimic.rutgers.edu/~atluri/nsf-career.html
List of Supported Students
Wei-Kuang Huang, Sept 1996 - Aug 1998
Soon Ae Chun, Sept 1998 - Aug 1999
Project Award Information
Keywords
Advanced transaction models, workflows, security, authorization model, digital libraries, real-time information services.
Project Summary
This project has two main goals: (1) to investigate how multilevel security constraints impact transaction processing in advanced database applications, and (2) to investigate how the properties of the advanced transaction models can be utilized to solve the secure concurrency control problems in traditional databases. In particular, it develops multilevel secure workflow and multilevel secure long duration transaction models as well as workflow authorization models. It also investigates whether nested transaction models can be more suitable for multilevel secure (MLS) database management systems (DBMSs) than the traditional transaction model and develop the necessary extensions to the nested model required by MLS DBMSs. The results of this project enable incorporating security into advanced database applications such as engineering design and long running activities, and provide efficient secure transaction processing protocols for conventional database applications. The education component of this project develops new courses at both undergraduate and graduate level and student involvement in the research.
Goals, Objectives, and Targeted Activities
Our goals are to develop techniques and algorithms to incorporate both mandatory and discretionary security into workflow management systems. We have identified two new advanced application domains, digital libraries and real-time information services, for which we intend to develop authorization models. During this year and next year, our focus is on (1) analysis of multilevel secure workflows and workflow authorization models (2) developing a workflow execution model for a heterogeneous autonomous distributed workflow environment, (3) implementing a web-based secure workflow system using Perl and Java, (4) developing a multilevel secure nested transaction model, (5) developing a security model suitable for a digital library environment, and (6) developing an authorization model for real-time information services. Our research is moving towards new and interesting application domains, as can be seen from the last two goals, which were not part of the original proposal.
Indication of Success
During the past two years, we have successfully developed an execution model for multilevel secure workflows, an authorization model suitable for WFMS and a tool based on Petri nets to analyze the properties of workflows such as safety and consistency. Our progress is consistent with the goals of the project. We have identified several new problems in the WFMS area and therefore continuing to work in this area. We are currently developing a web-based secure workflow management system and extending our work on WFMS to heterogeneous autonomous distributed environments. Workflow management systems are increasingly being used today in numerous application domains. While protecting the information from unauthorized and malicious users is important, there has been no research addressing these issues in WFMSs. Our research identifies the security requirements in WFMS and provides suitable models and algorithms to meet these requirements. In addition to the objectives stated in the original proposal, we have identified new application domains that were not part of the original proposal. We are currently developing authorization models for digital library environment and for real-time information services.
Project Impact
GPRA Outcome Goals
Project References
Area Background
The first two years of this project focuses on security issues in the area of workflow management systems (WFMSs). WFMSs are today used in numerous application domains including office automation, finance and banking, healthcare, telecommunications, manufacturing and production, to run their day-to-day applications. A workflow separates the various activities of a given organizational process into a set of well defined tasks, which are carried out by specified users or processing entities. In such an environment, it is important to protect information against either by direct or by indirect leakage to unauthorized users. The security policies specified by the organization typically reflect its security requirements. These requirements can be broadly classified into two categories: (1) discretionary access control (DAC), in which users, at their discretion, are allowed to grant permissions on the data they own, and (2) mandatory access control (MAC), in which all data are labeled based on their level of sensitivity and all users are given clearances, and users are allowed to access data with a given label only if their level of clearance permits them. While DAC is suitable for some commercial applications, they are not adequate where more stringent security requirements are needed, because they are vulnerable to sophisticated attacks (e.g. Trojan Horse Attack). In this project we develop the necessary models and algorithms to incorporate both DAC and MAC into workflow management systems.
Area References
Potential Related Projects
Flexible Transactions in DBMSs: Formalism and Support Using Active Capability, PI: Sharma Chakravarthy. Others to be determined.
Reliable Data Sharing in Wide Area Gigabit Networked Databases
|
Dept. of Info. Sci & Telecommunication |
|
University of Pittsburgh |
|
Dept. of Computer Science |
|
University of Pittsburgh |
Contact Information
Panos K. Chrysanthis
University of Pittsburgh
Pittsburgh, PA 15260
Phone: (412) 624-8924
Fax:    (412) 624-8854
Email: panos@cs.pitt.edu
Sujata Banerjee
University of Pittsburgh
Pittsburgh, PA 15260
Phone: (412) 624-9470
Fax: (412) 624-2788
Email: sujata@tele.pitt.edu
URL: http://www.pitt.edu/~sujata
WWW PAGE
URL: http://www.cs.pitt.edu/admt/projects/wands.html
List of Supported Students
Project Award Information
Keywords
Data servers, Caching, Transaction Processing, High speed networks, Data consistency and recovery
Project Summary
This project's goal is to study the impact of the new capabilities of high-speed networks on the design and performance of data management protocols. In particular, the focus is on high-speed wide-area networks (WANs) where the propagation latency is the dominant communication cost, the migration of large amounts of data is not an issue, and efficient multicasting and guarantees on network Quality of Service (QoS) parameters are available. In this project, existing concurrency control, commit and recovery protocols are being evaluated with respect to performance scalability under the new network assumptions. Further, new protocols are being developed that utilize the huge bandwidth and minimize the number of sequential phases of messages exploiting the multicasting capability and the provision of QoS guarantees such as bounded end-to-end packet delay and packet loss. At the same time, because of the need to scale to thousands of clients and servers, not all of which may be operational at any given time, the new WAN data management protocols are designed to achieve higher site autonomy in a cooperative manner, reducing global synchronization and supporting disconnected operations. The various protocols are empirically evaluated using computer simulation under different scenarios. The results of this research are expected to provide a better insight into the synergy between database and network technologies. This, in turn, will facilitate the deployment of future high performance WAN data server systems which are part of the necessary infrastructure that will support important nation- and world-wide applications such as electronic commerce.
Goals, Objectives, and Targeted Activities
This year our goals and objectives are as given below.
Indication of Success
We have made substantial progress on the first three objectives stated above. Work on the read-only optimization, comparison with other caching protocols and enhancement of the existing simulation testbed are progressing on schedule. Preliminary results indicate that the results are in line with our expectations. A significant side effect of this work so far has been to investigate the nature of the network traffic generated by the transaction processing system, which is essential in deploying future data server systems on next generation WANs. Also, the specific application area of E-Commerce is under exploration and some preliminary ideas on dynamic workflow management have emerged. This work will contribute the understanding of high performance data server systems and their interaction with high speed networks operating in the guaranteed QoS paradigm. Further, several insights into performance tuning of these systems will be obtained for a variety of workloads, transaction models and network environments.
Project Impact
GPRA Outcome Goals
The goals of this project are in consensus with the long-term vision of the computer and communications industry to provide fast and reliable information access to the masses, taking into account the individual user's specific needs. Every attempt is being made to develop theoretical frameworks and then transfer them into pragmatic environments so their impact is far-reaching in the commercial sector. The project thus far has attracted a diverse body of students, including women, who are particularly under-represented in this research community.
Project References
These and other papers are available from the project WWW page.
Area Background
The same reasons that caused the emergence of the client-server database systems to become the predominant distributed database architecture in local area networks (LANs), combined with ever faster and cheaper computers, cheaper stable memory, high-speed networks and network-aware applications, are now causing unprecedented growth in information/distributed database systems over WANs. The WWW is a prime example of a WAN data-server system which is primarily read-only for the time being. In the future, it is expected that general data-server systems (also called data shipping or enhanced client-server systems) in which clients perform much of their query and transaction processing locally, will be deployed over wide-area networks (WANs). It is also expected that multiple WAN servers will operate cooperatively and that WAN clients will evolve and function as data-servers over local-area networks that will also need to cooperatively process transactions across multiple WAN servers, creating an even more powerful distributed database environment of multiple data-servers and collaborating-servers over WANs. It is within such a database environment that we envision that nation- and world-wide enterprises, for example, will be operating and electronic commerce will be carried out. In order to realize such powerful distributed database environments, there is a need to study the impact of the new capabilities of high speed networks on the design, the performance and the scalability of data management protocols, specifically the availability of huge bandwidth, of efficient multicasting and of Quality of Service (QoS) guarantees.
Area References
Quasi-cubes: Exploiting Approximations in Multidimensional Data Sets
Daniel Barbará
Information and Software Engineering Department
George Mason University
Contact Information
Daniel Barbará
George Mason University
ISE Dept. Mail Stop 4A4
Fairfax, VA 22030
Phone: (703) 993-1627
Fax : (703) 993-1638
Email: dbarbara@gmu.edu
WWW page: http://www.isse.gmu.edu/~dbarbara/quasi.html
Project Award Information
Award Number: IIS-9732113
Duration: 09/01/1998 – 8/31/2001
Title: QUASI-CUBES: Exploiting Approximations in Multidimensional Data Sets
Keywords: Multidimensional data, datacubes, query approximation, data mining, statistics.
Project Summary
A datacube is a multidimensional structure that contains at each point an aggregate value, i.e., the result of applying an aggregate function to an dataset of records. The purpose of this project is to develop, implement and experimentally evaluate techniques to reduce the storage needs in a multidimensional cube by substituting some of the data by more succinct models that characterize it. By doing this, errors are introduced when the data is reconstructed. The tradeoffs between space savings and reconstruction errors are studied, along with tradeoffs between response time to queries and reconstruction errors. Understanding these tradeoffs make it possible to efficiently design and implement multiresolution systems where query answers can be incrementally refined as the user looks at them. Finally, the models are a good starting point to mine the cube data since they show valuable patterns and associations among the different dimensions. The results of this project will provide valuable techniques and software to efficiently store and analyze large multidimensional data sets, such as those that are common in health, financial and scientific applications.
Goals, Objectives, and Targeted Activities
We are currently involved in the following activities:
Indication of Success
We have successfully demonstrated the feasibility of using statistical models for characterizing datacubes. A preliminary report shows that substantial savings in space can be achieved at the expense of reasonable accuracy in the query answers. In less than half of the first year of the grant, we are very close to the completion of the first three prototypes of the systems mentioned above. We expect to complete them in a short period and to submit the results to conferences and journals. Our research has attracted the attention of the information technology department in the Department of Justice and we expect to collaborate with them in the near future.
Project Impact
GPRA Outcome Goals
Project References
Area Background
As the data collected by organizations grows in size, the capacity to extract information from all this data has yet to catch up. A first step in understanding data is to organize it in suitable ways for analysis. This factor has prompted the development of {\it data warehouses} as repositories of integrated information. Within a data warehouse, a very useful data organization, well suited for analysis is the dimensional model, which presents the data as a set of cells in a multidimensional space. A popular abstraction of this model is the datacube: a multidimensional structure that contains at each point an aggregate value, i.e., the results of applying an aggregate function for an underlying relation. Implementing datacubes can be achieved by materialization of the cells or by computation of them on demand. Both strategies have shortcomings: the staggering amount of space needed and the potential for long delays to obtain the answers. Our research addresses the investigation of techniques for efficiently scaling datacubes. These techniques make use of a variety of statistical models to provide characterizations of portions of the datacube. We use these characterizations to provide datacube compression or to provide quick and approximate answers that can be refined on-line. Also, the usage of this models enables the scaling of a variety of data mining algorithms.
Area References
A. Prasad Sistla and Jezekiel Ben-Arie
Contact Information
|
Jezekiel Ben-Arie |
A. Prasad Sistla |
WWW PAGE
IDM'99 Report: http://www.eecs.uic.edu/~benarie/video_db/
Web Based Demo of Digital Video Library: http://arik.eecs.uic.edu/cgi-bin/vdsearch.cgi
Project Award Information
Award Number: 9711925
Duration: 09/01/1997 - 08/31/2000
Project Title: "Similarity Based Retrieval from Video and Pictorial Databases"
Keywords
Content Based Retrieval, Video and Pictorial Databases, Multi-media databases, Shot and image segmentation, Motion and activity recognition, Object recognition.
Project Summary
This project proposes research towards the development of a system for similarity based retrieval from video and pictorial databases. The salient features of the proposed system are :
An expressive query language, called Hierarchical Temporal Logic (HTL), for expressing spatio-temporal queries on video databases. A modular similarity based retrieval system for the temporal part of the query which can be used on top of any suitable picture retrieval system. An image understanding system for extracting the objects and activities in the videos in the database which uses local and global color characteristics, motion signatures, shape characteristics and texture. An activity recognition module that recognizes human actions will be used to interpret video shots with respect to temporal events. A dictionary based system for translating the parts of the query into feature vectors which are searched in a space consisting of the feature vectors corresponding to the videos present in the database. An incremental learning module which expands the dictionary from examples.
HTL uses the classical temporal operators to specify temporal properties of videos (i.e. the sequencing of video segments), and it employs level modal operators to specify such properties at different levels in the video hierarchy. At the atomic predicate level, the language allows specification of properties on the contents of a single video segment including objects and motions (activities). For a given user query, each atomic predicate in the query is translated into feature vectors (or hyper regions in feature space) using a dictionary based scheme; these feature vectors are then searched using multi-dimensional indexing in the database consisting of the feature vectors of all the video segments in the video database; the result of this is a similarity list containing entries where each entry contains the id of a relevant video segment and a similarity value denoting how closely it satisfies the atomic predicate. Such similarity lists for the atomic predicates can also be obtained by a using a picture retrieval system. The similarity lists for the atomic predicates are combined together to obtain a similarity list for the main HTL query.The unique aspect of our feature vector extraction is the combination of the static generic object characterization in pictures and the motion/activity recognition in videos using a unified approach based on multidimensional indexing both in the image domain and in the action domain.
The project proposes extensions to the HTL query language, by incorporating additional temporal operators in order to make it more expressive. A graphical user-interface for specifying HTL queries will be developed. To improve the accuracy of the similarity based retrieval, alternate similarity functions will be investigated. Techniques such as temporal indexing (i.e. indices on the time dimension) will be investigated in order to enhance the performance of the retrieval system. Also, extensions to the currently implemented picture retrieval system are proposed.
The project also proposes development of new methods for segmentation of a video stream into shots using edge detection technique based on feature vectors of the frames. Also proposed are novel image segmentation and methods for person detection based on novel approach of model based segmentation and color characteristics. The research proposes generic recognition of objects and activities employing a flexible dictionary approach that translates atomic predicates into feature vectors which are then matched with corresponding features of video frames/shots. The dictionary can represent also generic inanimate or animate objects and activities by hyper-regions in the feature vector space. It is proposed to develop learning techniques which incrementally expand the dictionary with additional entries and generalizes existing entries.
Goals, Objectives, and Targeted Activities
Indication of Success The Similarity list generator for a subclass of HTL queries, called conjunctive queries, has been implemented. Preliminary results are encouranging. We have succeeded in segmentation of human faces using color and frequency characteristics. We have completed a model-based segmentation scheme that detects man-made objects in cluttered scenes. A robust scheme for head detection is now in development with successful results for various poses of human head. Also, a robust scheme for tracking human body parts for the purpose of activity recognition is under construction. A web demo is available at http://arik.eecs.uic.edu/cgi-bin/vdsearch.cgi.
Project Impact and Output Human Resources: Minlin Deng: Ph. D. student, Tao Hu, Ph.D. student, Zhiqian Wang: Ph.D. student, Dibyendu Nandy: Ph.D. student, Xiao Du, Ph.D. student, R. Venkatasubramanian: M. S. student. Education and curriculum development at all levels. The PI, Prof. Sistla teaches graduate and undergraduate courses in Databases and other areas of Computer Science. The Co-PI, Prof. Ben-Arie teaches graduate and undergraduate courses on image understanding, image analysis and image processing. His research has influenced and contributed to the contents of these courses.
GPRA Outcome Goals Discovered a novel Volumetric Frequency Representation that encapsulates both the 3D structure of objects along with a continuum of their views. This establishes a new approach in computer vision which can be for many applications such as pose-invariant object/face recognition,shape reconstruction from single images, 3-D motion recovery. Discovered and developed a novel method for segmentation of objects based on their generic models. This method is very useful in detection of objects and persons in cluttered scenes.
Project References
Area Background
Content Based retrieval from Video and Pictorial Databases is an active area of research. The main research goal is to facilitate retrieval of video and pictorial objects based on their content. The area spans multiple disciplines of which the Database and Information Retrieval and the Vision Understanding and Image Processing Disciplines. The Database area is concerned about languages for specifying queries, efficient retrieval mechanisms. The Vision Understanding and Image processing areas are concerned about processing raw pictures and videos, and extracting quantitative and qualitative information from them.
Area References
Potential Related Projects
Any project that is concerened with information retrieval from pictorial and video data.
Adaptive Multiresolution Data Representation
R. Daniel Bergeron (*)
John P. McHugh (**)
(*) Department of Computer Science
(**) Department of Mechanical Engineering
University of New Hampshire
Kingsbury Hall
Durham, NH 03824
Phone: (603) 862-3778
Fax: (603) 862-3493
Email: rdb@cs.unh.edu
jpm@alma.unh.edu
www: www.cs.unh.edu/~rdb
WWW PAGE
www.cs.unh.edu/projects/vis/mdr
Project Award Information
Keywords
multiresolution data, hierarchical data, level of detail, error representation, orthogonal wavelets.
Project Summary
This project is based on the development and evaluation of a multiresolution data representation model for interactive visualization and exploration of very large multidimensional multivariate scientific data. The combination of wavelets and multidimensional visualization techniques provides a powerful and effective basis for real-time visualization of very large data sets. We believe that compactly supported orthogonal wavelets can be used effectively to furnish an authentic representation of very large scientific data, and at the same time, provide a fine to coarse hierarchy for detail investigations. An essential component of our wavelet-based multiresolution data representation model is the ability to provide a meaningful localized error estimation for each level of the representation. Preliminary results show that large data sets can be effectively viewed at low resolutions. A scientist can search the coarse representation for interesting patterns, and can use the local error display to identify areas of the coarse representation where the visual representation has too much error to be trusted. In both cases, the scientist can then explore the identified areas at a higher resolution. The current research is an interdisciplinary effort aimed at developing and evaluating tools to support the interactive exploration of computational fluid dynamics (cfd) output representing the turbulence generated by gravity waves in the atmospheres of the Earth and the outer planets.
Goals, Objectives, and Targeted Activities
The principal immediate goal for this project is to develop the basic software that allows us to use the multiresolution data representation with the computational fluid dynamics (cfd) simulation output. This effort requires adaptations to the existing multiresolution data generation code and to the simulation code as well as the development of new code to support interactive access to the multiresolution hierarchy over the net. A critical component of the current research will be the development and evaluation of appropriate error criteria and presentation for cfd data of this type.
Indication of Success
We expect this research to result in contributions to both the theory and practice of multiresolution data representations. We will be exploring the time/space/accuracy trade-offs that are inherent in the use of our multiresolution data representation in the context of computational fluid dynamics. In addition, we will be developing a set of software tools to utilize our representation for interactive access to and visualization of extremely large cfd data sets. Such an environment will significantly improve a scientist's ability to manipulate and interpret very large data sets. Although we are focusing on cfd datasets, we believe that much of the experience and knowledge we gain from this effort and most of the software we develop will be applicable to other scientific domains.
Project Impact
This project is just beginning, but we expect the effort to contribute to the educational experience of a number of graduate and undergraduate students. In addition to the one graduate project assistant funded by the grant, we expect to involve at least 4 computer science and mechanical engineering graduate students by identifying master's thesis topics associated with this project. In addition, we plan to request an REU supplement to support undergraduate students on the project.
GPRA Outcome Goals
This project addresses very real and very critical problems associated with a large segment of current scientific research -- the difficulty of dealing effectively with extremely large data sets. Our major goal is to improve the scientist's ability to understand and interpret very large datasets and, therefore, to help scientists to conduct their own research more effectively.
Project References
Area Background
Scientific data visualization [Nielson and Shriver, 1990; Rosenblum et al., 1994
] is an emerging multidisciplinary research area that incorporates aspects of da
ta modeling, data access and storage, interactive techniques, understanding of h
uman perception and considerable emphasis on domain-specific needs and visualiza
tion goals. The notion of exploratory data analysis [Tukey, 1977] has been a fundamental component of much scientific data visualization research and corresponds closely to recently developed ideas of data mining and knowledge-directed data discovery. In recent years, significant research efforts have been devoted to development of multiresolution data representations including notions of multiple levels of detail for rendering purposes. One particularly significant development in this area is the use of wavelet transformations [Strang, 1989; Stollnitz et al. 1995] which were originally used primarily for data compression. Mallat's seminal paper [Mallat, 1989] was a primary motivation for much of the later use of wavelets for volume rendering and surface representation and for our data modeling work. A basic one-dimensional wavelet transform of n data elements produces two sets of n/2 coefficients, called the summary and detail components. The operation is invertible, so that the original signal can be reconstructed from the coefficients. The coefficients are essentially weights applied to wavelet basis functions which have compact and limited support. Consequently, small coefficients can be discarded (treated as 0) without major impact on the reconstructed signal. This characteristic makes wavelets useful for data compression. The multiresolution application of wavelets arises because the summary coefficients essentially act as a lower resolution approximation to the original data and recursive applications of the wavelet transform produce a hierarchy of resolutions. For our purposes, however, it is just as important that (for orthogonal wavelets) we can treat the detail coefficients as representing the locally-defined error of the associated summary coefficients at each level of the hierarchy.
Area References
Potential Related Projects
The data modeling aspect of this project relates to ongoing research based on a comprehensive data model for scientific databases. This approach uses metric space and topological foundations to combine conventional numeric data with categorical data into a single metric space. We are also developing formal models for distributed adaptive multiresolution data representations. This work is being carried out in collaboration with Ted M. Sparr and several computer science graduate students at UNH.
Automated Acquisition of Domain Knowledge and Language Patterns, with Application to Language Metadata
Lois Boggess and Julia Hodges, Principal Investigators
Department of Computer Science
Mississippi State University
Contact Information
Lois Boggess or Julia Hodges
Department of Computer Science
Drawer 9636
Mississippi State, MS 39762
Phone: (601) 325-2756
Fax : (601) 325-8997
Email: lboggess@cs.msstate.edu, hodges@cs.msstate.edu
List of Supported Students and Staff
Bo Tang is a Computer Science doctoral student funded by this grant.
Lynellen Perry is a Computer Science doctoral student whose work for the project has been funded in part by this grant and in part by a grant from the Mississippi State University Office of Research.
Janna Shaffer has just joined the CS doctoral program and is funded by this project beginning this January.
Project Award Information
Keywords
Content-based document indexing, natural language processing, information retrieval, text categorization.
Project Summary
This system maps World Wide Web documents to Library of Congress subject headings. Specifically, it is designed to automatically classify Web documents into subject headings of the Science (Q) and Technology (T) portions of the Library of Congress classification system, after being trained and tested on thousands of documents previously classified by library scientists and expert document analysts. The project represents a scaling up of knowledge-based methods and natural language processing techniques which have been successful in providing more than 80% of the descriptive phrases produced by expert document analysts for journal articles in a specific scientific domain.
Goals, Objectives, and Targeted Activities
This research furthers concept-based (in contrast to word-based) indexing of documents. In the months since we have started, we have concentrated on subjects related to the Science (Q) sections of the Library of Congress classification system - specifically Astronomy, Physics, Geology, and Mathematics. We have downloaded and stored the contents of several hundreds of web sites per major subject area, including ftp sites containing large collections of documents. These web sites have been classified into the Library of Congress subject headings by human specialists. The collection of data for training and testing purposes continues: in the coming year we will download and store sites in Biology, Computer Science, and most of the Engineering fields.
Our initial efforts in index generation have focused on building a knowledge base that includes a structured representation of the subject headings used by the Library of Congress and synonyms for those headings. In our earlier work with scientific journal articles, we found that some correct indexes were not recognized if the phrases that would normally serve as indicators for those indexes were buried in compound noun phrases. We are building a more structured representation of the phrases and synonym information, to make use of important sub-phrases within phrases. We have almost completed the development of the initial knowledge base. Once this is complete, we will test the effectiveness of the additional structure for the phrases and synonyms by measuring improvement in the recall and precision rates. We will also modify our statistical program to allow for dynamic updating of the knowledge base so that the index generation component will become more precise in its generation of the Library of Congress subject headings as it processes more and more documents.
We also target natural language processing errors which have serious impact on knowledge extraction, especially errors in determining the parts of speech of unrestricted text. We believe that reducing these errors will require expanding the context examined by the classifiers and that hybrid classifiers combining more than one kind of classification technique are the most likely sources of breakthrough in reducing remaining errors. To deal with context, we are training very large neural networks, with emphasis on distributing attention across the networks. In so doing, we have opened the context window from the three or four words typical for current part of speech taggers to as much as twelve words. Our best result so far reduces our most frequent serious error by 13%.
Our formal goals for the first year include the following:
Indication of Success
One unexpected early development is the possibility of developing a more reliable body of training and test data for machine learning projects based on classified web sites. OCLC (Online Computer Library Center, Inc., a service organization for libraries around the world, and the "parent" organization for the Dewey Decimal Classification System) made available to us thousands of records of web sites classified by Library of Congress subject headings. Web sites are notoriously volatile: Files move or disappear, and sites that remain in place undergo modification, so that classifications are no longer relevant. OCLC is looking into working independently and/or with us on establishing a corpus of accurately classified Web documents which are captured and will not undergo modification.
In the first months a prototype system has been constructed which incorporates subsets of the Library of Congress subject headings into a database that comprises a roughly hierarchical network. This database also contains Library of Congress information about synonymous relationships between phrases related to the subject headings. Study of large neural networks incorporating contexts of up to 12 words has resulted in comparison of approaches to evolving such networks in parallel. A paper detailing a promising parallel approach to evolving these large networks has been submitted (Boggess, 1999, in review). Recurrent neural networks are also being explored. A hybrid stochastic/neural network part of speech tagger is in progress.
Project Impact
Two of the three doctoral students participating are women, and both women have targeted international conferences to which they will submit papers for which they will be first author, during this semester. Issues related to the project are being incorporated into a graduate level course on natural language processing this semester. The students of that course have the opportunity to implement experiments related to language processing, terminology, and/or classification of documents, using data collected by the project.
This project is possible thanks to donations of data and time from OCLC, a not-for-profit service and research organization affiliating more than 60,000 libraries worldwide. We have regular conference calls with two OCLC researchers. This exchange of ideas has led to plans to partition effort on joint goals. For example, it appears to be the case that OCLC will develop a repository of classified "web sites" (downloaded sites that will not change with time). They offer vast expertise in human document classification, and we have mutually exchanged advice, information, and software tools for language processing and classification.
GPRA Outcome Goals
Connections between discoveries and their use in service to society.
A predecessor to this project was successful in providing 85% of the descriptive phrases indexing journal articles by content that highly trained document analysts provided, for the field of physical chemistry. This project seeks to perform the same task, for the fields represented by Library of Congress Q and T call letters (most of science and technology) for much more diverse documents (journal articles are highly constrained as to format, in comparison with web pages). Library of Congress subject headings range from broad (e.g., "Astronomy", "Tobacco", "Volcanoes") to specific (e.g., "Actinomycetales, Research", "Aeronautics in astronomy, United States [and] Infrared Astronomy, Research"). Success of this research will in effect allow web documents to be indexed by a long-established, widely-adopted controlled vocabulary.
Timely and relevant information on the national and international science and engineering enterprise.
If paired with a web crawler, the final product of this research will be a system with the capability to produce far fewer spurious returns from searches for information on the web, without requiring the user to know the exact words that indicate relevance in the target document, and with a high percentage of the relevant sites presented to the user. It should be noted, however, that while the system will have the described capability, it is not within the resources of the research team to collect and establish the training data required for a system to respond to all existing Library of Congress subject headings in the sciences and technology.
Project References
Area Background
Our work is a confluence of two areas of expertise: corpus-based natural language processing and knowledge bases. It can be seen from at least three perspectives: information retrieval, broad-based study of the characteristics of language, and knowledge bases themselves. Our focus has been on extraction of information from a body of text from some particular domain. Sometimes the purpose of the extraction has been to add to or update existing knowledge already embedded in a knowledge base. Sometimes, as in the present classification task, the purpose has been to identify the most relevant concepts associated with that text. In either case, there are many ways that the same concepts can be expressed by writers who have the full freedom of a natural language such as English in which to express their ideas. Consequently, the system must handle vocabulary and grammar that has not been anticipated, and relate previously unseen entities in a text to known concepts. In doing so, we examine the properties of language in the aggregate. We use clustering algorithms to examine the contexts in which words are used, with the result that we group words and concepts by "the company they keep". An important aspect of the work is to weigh the evidence that favors one classification or interpretation over competing classifications and interpretations.
Area References
Unsupervised Document Set Exploration Using Divisive Partitioning
Daniel Boley
Department of Computer Science and Engineering
University of Minnesota
4-192 EE/CS, 200 Union St. SE,
Minneapolis, MN 55455
Phone: (612) 625-3887
Fax: (612) 625-0572
Email: boley@cs.umn.edu
Project URL: "http://www.cs.umn.edu/~boley/research/idm99.html" contains this short report. A set of papers on this algorithm as well as prototype software can be found in "http://www.cs.umn.edu/~boley/PDDP.html".
Award Number: IIS-9811229.
Duration: September 15, 1996 to August 31, 2001.
Title: Unsupervised Document Set Exploration Using Divisive Partitioning.
Unsupervised clustering, data mining, classification, hierarchical taxomonies, Web agents.
This project is devoted to the research and development of a hierarchical div isive clustering algorithm. The basic underlying activity is the basic research needed to fully develop the divisive partitioning method and the applied research needed to demonstrate its use on specific applications and to make it easy for humans to use. The conceptual framework for the basic divisive partitioning method has been developed, and an efficient prototype implementation has been installed locally, on which all experimental results have been based. The research issues to be addressed in this project include completion of the theoretical foundations of the algorithm, experiments on larger datasets to measure the scalability of the methods, experiments on a datasets from a variety of sources as a way to gain understanding about the theory behind the method, extensions to the algorithm capabilities, extensions to other applications beyond text document clustering, and validation of algorithm results. Technology transfer forms an additional part of this project, and will be addressed by applying the methods developed as part of the project to other application datasets (e.g. genome, toxicology, medical literature datasets), as well as extending it to other paradigms such as hierarchical taxonomies, document rating aids and collaborative filtering. By incorporating it into a Web agent, the methods will be made available to a wide audience. All application datasets have either been collected by colleagues and will be shared as part of ongoing projects, or are available over the Web.
The project goals are to develop the Principal Direction Divisive Partitioning (PDDP) Algorithm and demonstrate its effectiveness in a variety of applications.
The specific goals can be grouped into three main categories:
(1) demonstrate and test the applicability and effectiveness of the PDDP algorithm to new application areas, including very large data sets;
(2) extend the capabilities of the algorithm to include features for automatically determining the number of clusters or automatically producing labels to help users navigate through large datasets;
(3) develop user interfaces to allow easy access to the functionality of the algorithm, allowing users to apply it to a variety of datasets in a straightforward way.
The research plan included the following tasks:
(a) Demonstrate the scalability compared to older methods,
(b) Demonstrate the effectiveness in terms of quality of resulting clusters,
(c) Incorporate an autonomous stopping test,
(d) Develop methods for producing hierarchical taxonomic labels,
(e) Develop a user interface based on a Web agent.
Success is measured in the progress made within each task in the research plan. This project has been in progress for only 6 months during which we have actively pursued all 5 tasks mentioned above. In this period we have accomplished the following:
(i) We have developed the conceptual framework necessary to incorporate essential new features of the methods. The features are necessary to make the method a practical and useful method in the domain of Web exploration and include an automatic stopping test and a dynamic incremental updating feature [Ref. 1].
(ii) We have completed the design of a client-side Web agent incorporating this algorithm as a tool to automatically [i.e. without user participation] organize a large collection of Web pages visited by a user using a normal Web browser. The feasibility and practicality of the design has been demonstrated [Ref. 3], but robustness issues still need to be addressed.
Human Resources: The project is currently supporting the research work of Vivian Borst working toward a Ph.D.
Education and Curriculum Development: The project is supporting development of the regular semester course CSci 5521 "Pattern Recognition" to be offered next year.
Infrastructure: The highest priority item purchased so far with the modest equipment budget has been a dedicated disk partition for large datasets. The second highest priority will be the purshase of a high performance computational engine for the computational experiments.
Industry: Not Applicable.
Activities Spawned: This activity is leading to the exploration of divisive partitioning methods for other applications taking place within the Dept, including collaborative filtering, genome mapping, vision-based defect detection. All of these are still in the preliminary stages.
This project depends on the combination of several widely different fields which have come together to produce methods which are extremely successful. The foundation of the techniques -- the ideas that lead to the practicality and the scalability -- are methods for large sparse eigenvalue problems in numerical linear algebra. Specifically, the core computation is a so-called Lanczos-based method for computing the leading principal component for a large collection of sparse vectors each of which represents a document (or other data sample). The methods preserve the inherent sparsity making it possible to represent large data collections in relatively little space, and also making it possible to carry out the computations in a relatively short time. These underlying methods serve as a foundation for a variety of divisive partitioning techniques, which are analogous to the classical agglomerative clustering techniques. Instead of coalescing the individual data samples (documents) one by one into clusters (e.g. with a nearest neighbor criterion), we start with all the data samples in one big cluster and split it repeatedly along the principal component. The splitting process yields weights indicating which features are most critical in placing data samples in one cluster as opposed to another cluster, whose utility is currently underdeveloped.
Area References
To be added in future years, once greater familiarity with other projects is obtained.
Content-Based Image Retrieval for Medical Databases
Carla E. Brodley (*), Avi C. Kak (*), Lynn S. Broderick (**) and Alex Aisen (***)
(*) School of Electrical and Computer Engineering, Purdue University
(**) Radiology Department, University of Wisconsin
(***) School of Medicine, Indiana University
Contact Information
Carla E. Brodley
School of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907
Phone: (765) 494-0635
FAX: (765) 494-6440
Email: brodley@ecn.purdue.edu
WWW PAGE
Content-Based Image Retrieval of Medical Images -- IRI-9711535
Keywords
Content-Based Image Retrieval, Machine Learning, Medical Images, Computer Vision
Project Award Information
Project Summary
Increasing computerization of the process of acquiring and storing diagnostic images requires more sophisticated approaches to image retrieval in order to optimally use the images for improved health-care delivery, physician education and research. A content-based image retrieval system is being developed that takes a human-in-the-loop approach, because completely automated approaches are not feasible for radiological images in which the clinically useful information may consist of gray level and texture deviations in highly localized but difficult-to-segment regions of an image. In this approach, a physician manually delineates the pathology-bearing regions in an image. The CBIR system then analyzes these regions for their properties to index the images into a database and retrieve similar images typically with known diagnoses. This project focuses on the domain of high resolution computed tomography (HRCT) images in patients with a variety of lung diseases.
Goals, Objectives, and Targeted Activities
In the first year of our project we collected a database of roughly 400 HRCT images of the lung, investigated methods for indexing and retrieving images, and developed a user interface that allows physicians to mark the PBR's, pose queries to the system and view the retrieval results. In the second year of our project we will focus on the following objectives:
Indication of Success
We have demonstrated that a human-in-the-loop approach to CBIR of medical images is a critical factor in the success of our approach. In our approach the physician delineates the pathology bearing regions (PBR) and a set of anatomical landmarks of the image at the time the image is entered into the database. From the marked regions, our approach applies low-level computer vision and image processing algorithms to extract attributes related to the variations in gray scale, texture, shape, wavelet coefficients, etc. We have demonstrated experimentally that for our domain, HRCT of the lung, local image characterization using PBRs is far more accurate than any method that creates an index based on a global characterization of the lung.
Recently we have developed a new approach to image retrieval, which we call ``customized queries''. This approach automatically customizes the retrieval index (the computer vision features that characterize the image) to the query image. The underlying rational of the approach is that the features that retrieve the most visually similar images within a disease class differ from those that discriminate among disease classes. During retrieval, to obtain the customized feature vectors for each query, our approach first classifies the query using the features that best discriminate the disease classes (we apply a standard machine learning technique of forward sequential selection to determine these features). Next the corresponding feature set for that class is used for retrieval within the subset of the database that has that disease. Customized queries has increased retrieval precision (the number of retrieved images visually similar to the query image as judged by a experienced lung radiologist) to approximately 90%. This is a radical increase over the 50% realized by the traditional approach to CBIR, which uses the same set of image features for every query.
Project Impact
Human Resources: The project currently supports two Ph.D. students (Chi Ren Shyu and Jennifer Dy). Education and Curriculum development at all levels: The PI, Prof Brodley teaches graduate and undergraduate courses in artificial intelligence and machine learning. The Co-PI, Prof Kak teaches graduate and undergraduate courses on image understanding, image analysis and image processing. Both Prof Brodley's and Prof Kak's research has influenced and contributed to the content of these courses.
GPRA Outcome Goals
1. Discoveries at and across the frontier of science and engineering: Our approach to CBIR for medical databases relies on human input, machine learning and computer vision. Specifically, we apply expert-level human interaction for solving that aspect of the problem which cannot yet be automated, we use computer vision for only those aspects of the problem to which it lends itself best -- image characterization -- and we employ machine learning algorithms to allow the system to be adapted to new clinical domains. As such we have made a bridge across the disciplines of medicine, machine learning and computer vision.
2. Connections between discoveries and their use in service to society: The shift in technology from analog film-based methodologies to computer based digital technologies is creating large digital image repositories. CBIR provides an opportunity to tap the expertise contained in these databases in the following way: observing an abnormality in a diagnostic image, the physician can query a database of known cases to retrieve images (and associated textual information) that contain regions with features similar to what is in the image of interest. With the knowledge of disease entities that match features of the selected region, the physician can be more confident of the diagnosis.
3. A diverse, globally-oriented workforce of scientists and engineers: Our project is training two graduate students in the diverse areas of computer vision, machine learning and medical informatics. For our project the personnel consist of 50% male and 50% female participants.
Project References
Area Background
Content-based image retrieval (CBIR) refers to the ability to retrieve images on the basis of image content, as opposed to on the basis of some textual description of the images. CBIR uses image content directly. An example of a retrieval query of this sort would be ``show me images from a given database that are similar to a particular image.'' A key element of the approach revolves around the types of patterns that can be recognized by the computer and that can serve as the indices of the retrieval.
Area References
Peter Buneman*
Co-PIs: Alex Borgida, Luca Cardelli, Richard Hull, David Maier, David Stemple, Victor Vianu, Stan Zdonik**
*Department of Computer and Information Science
University of Pennsylvania
**Respectively: Rutgers University; Microsoft Cambridge; Lucent; Oregon Research Institute; University of Massachussets, Amherst; University of California, San Diego; Brown University.
Peter Buneman
CIS, University of Pennsylvania
200 South 33rd Street
Philadelphia, PA 19104
Phone: (215) 898-7703
Fax : (215) 898-0587
Email: peter@cis.upenn.edu
Databases, Programming Languages, Persistent Object Systems, Heterogeneous Databases, Scientific Databases, Database Integration, Constraint Databases, Semistructured Data.
This project sponsors a series of workshops in collaboration with EU on various aspects of databases and programming languages. It is intended to initiate workshops in various emerging areas of database research. Over the past two years it has been used to sponsor workshops related to constraint databases, semistructured databases and scientific databases.
The goals of the project are to foster international exchanges, and to encourage the development of new areas of database research.
The workshops have all resulted in some form of publicly disseminated proceedings or report. The success of the workshops should be measured by the results of the report. )
Perhaps the most important recent impact of these workshops has been their contribution to the field of semistructured data. This grant has funded the two most important workshops in this area and has been, in part, responsible for the fusion of database research and web document technology, especially in the relationship between XML and semistructured data.
A brief synopsis of some of the recent workshops is given here. The publications in many cases are electronic. Since it is highly topical, The proceedings of the workshop on "Query Languages for semistructured data and non-standard data formats" are given in some detail.
Electronic Information Literacy for Educational Environments
Jamie Callan
Center for Intelligent Information Retrieval
Computer Science Department
University of Massachusetts
Jamie Callan
Univ. of Mass., Computer Science Dept.
740 N. Pleasant St., LGRC, Room A243
Amherst, MA 01003-4610
Phone: (413) 545-4878
Fax : (413) 545-1249
Email: callan@cs.umass.edu
WWW: http://www.cs.umass.edu/~callan/
http://ciir.cs.umass.edu/research/IIS-9812358
Information retrieval, information literacy, K-12 education.
The goal of this research is to improve the abilities of K-12 students to find, evaluate, and organize information available on the Internet. These skills comprise a significant subset of the Information Literacy skills that Library Science teaches. The approach consists of building a Web search interface in which Information Literacy skills are matched to Information Retrieval (IR) tools in a way that teaches skills while helping students locate information on the Internet. Improved queries are created from the student’s information need, supporting information from the surrounding educational environment, and query expansion from educationally-focused databases. Information filtering techniques identify, and if desired eliminate, retrieved information at the wrong grade-level or containing inappropriate content. Analysis of results is supported by extracting best document passages, frequent concepts, and proper names. Result sets are organized by clustering similar documents. Content evaluation is supported by citation analysis, and additional searches for supporting information.
In this first year of the project the emphasis is on extending the initial prototype search interface and gaining experience in working with students. IR software modules developed in other projects for adults (e.g., query expansion) are being customized for use by young students. New components, for example, assessment of reading level, are being developed. The search interface has been installed in the library at the Neil Pepin Elementary School in Easthampton, Massachusetts. Tutorials are being developed that introduce teachers and students to Internet searching and Information Literacy skills. The software and tutorials are being tested in four fourth grade classes (approximately 100 students). An evaluation methodology is being developed for use in future years that will indicate the extent to which students are learning the desired Information Literacy skills.
Professor David Raker leads the evaluation component of this project. An evaluation methodology is being developed for use in the next two years that will indicate the extent to which students learn the desired Information Literacy skills. The evaluations will be based on tests with measurable outcomes, for example, finding specific information on the Internet, or identifying document characteristics that indicate whether the author is an authority or trusted source of information on a topic. Questionnaires and other more subjective tools will be used to measure student and teacher satisfaction.
Software components will be tested with component-level tests that indicate their contribution to overall results. For example, the effectiveness of customized query expansion facilities, or software for screening material with inappropriate content, will be tested with data gathered for that purpose.
The results of the research will be tools, techniques, and tutorials that improve the Information Literacy skills of K-12 students, specifically their abilities to locate, evaluate, and organize information available in a disorganized and noisy medium (the Internet). Students will view evaluation as an integral part of the information seeking process, and will know how to evaluate every piece of information they find to determine whether it meets their needs with respect to recency, authority, and other Information Literacy criteria. These skills are essential in a society based on free speech and many communication channels. The underlying assumption, adopted from the Library Science community, is that the earlier these skills are taught, the better they will be learned, and the more likely they are to be retained. Educators and librarians believe that these skills are a key component of independent, critical thinking, and lifelong learning.
This year we are training about 100 fourth grade students how to use the Internet to find and evaluate information for history projects (ancient Egypt, ancient Rome, Mesopotamia), because this meets an immediate need of the school in which we are working. However, nothing in either the software or the tutorials is subject-specific. We expect to broaden the project in later years, to larger numbers of fourth grade students, and to students in either a middle school or high school.
This project is a joint effort between researchers in Computer Science (Callan), Education (Raker), and Library Science (Lewellen). One of its goals is to establish a long-term research relationship to address the use of information technology and the Internet in K-12 education.
One of the project goals is to investigate how the very substantial body of Information Retrieval tools and techniques developed for adult information analysts can be adapted for use in K-12 educational environments. Much of the software base for this project was developed under other funding from NSF, DARPA, and industrial sponsors.
Information Retrieval (IR) software helps a person find information in a large, usually unstructured, text database. The best known part of the process is the search engine that determines which documents match the query that specifies an information need. However, equally important are tools that help a person specify the information need (e.g., by suggesting related words and phrases), browsing tools, and tools that summarize, organize and visualize the documents that are returned.
Information Literacy is the set of skills required to find, evaluate, organize, and synthesize (use) information. Information Literacy has traditionally been studied as a part of Library Science. Tools for finding information in a library include card catalogs, and indexing by a controlled vocabulary. Information used by librarians for evaluation includes citation patterns, the publisher's reputation, the author's credentials, and date of publication.
KMeD: Knowledge-based Image Retrieval with Spatial and Temporal Constructs
Wesley W. Chu*, Alfonso Cardenas*, Ricky Taira**
*Computer Science Department
University of California, Los Angeles
**University of Washington
Contact Information
Wesley W. Chu, P.I.
Computer Science Department
University of California, Los Angeles
Los Angeles, CA 90095
Phone: (310) 825-2047
Fax: (310) 825-7578
Email: wwc@cs.ucla.edu
Alfonso F. Cardenas, Co-P.I.
Computer Science Department
University of California, Los Angeles
Los Angeles, CA 90095
Phone: (310) 825-7550
Fax: (310) 825-7578
Email: cardenas@cs.ucla.edu
Ricky Taira, Co-P.I.
Children’s Hospital
University of Washington
Seattle, WA
Phone: (206) 528-2744
Fax: (206) 528-2730
Email: rtaira@u.washington.edu
WWW Page
http://www.kmed.cs.ucla.eduList of Supported Students and Staff
Graduate Students (past and current): Alexander Bui, Jason B. Borja, Christine Chih, Chih-Cheng Hsu, Sang-Hyun Park, and Roderick Son
Keywords
Content-based retrieval Medical image database system
Knowledge-based query processing Multimedia database system
Knowledge-based spatial temporal query language Iconic query language
Project Award Information
Project Summary
A knowledge-based approach to retrieve medical images by feature and content with spatial and temporal constructs is being developed. Selected objects of interest in a medical image (e.g. x-ray, MR image) are segmented, and contours are generated from these objects. Features (e.g. shape, size, texture) and content (e.g. spatial relationships among objects) are extracted and stored in a feature and content database. Knowledge about image features can be expressed as a hierarchical structure called a Type Abstraction Hierarchy (TAH). TAHs can be generated automatically by clustering algorithms based on feature values in the databases and hence are scalable to large collections of image features. Further, since TAHs are generated based on user classes and applications, they are context- and user-sensitive.
A knowledge-based semantic image model which provides a mechanism for accessing and processing spatial, evolutionary, and temporal indexing techniques is also being developed. A knowledge-based spatial temporal query language (KSTL) will be developed which supports operators for approximate matching via feature and content, conceptual terms, and temporal logic predicates. Further, a visual query language is also proposed and is being implemented that will accept point-click-and-drag iconic input on the screen that maps into KSTL. The proposed system has been implemented in a testbed to evaluate the functionality of the proposed tasks. The results from this research are applicable to other multimedia information systems as well.
Goals, Objectives, and Targeted Activities
Our research is focused on the retrieval of medical images by feature and content represented by spatial and temporal constructs. We have developed a comprehensive data model methodology and indexing technique for sequence databases for knowledge-based query processing. We are implementing a subset of the visual query language Mquery and of the underlying KSTL language, and a knowledge-based query processing technique in the Knowledge-based Medical Image Databases (KMeD) testbed. We are collaborating and planning with projects at the School of Medicine proof of concept of the applicability and attractiveness of our advances in the actual medical domain.
One important aspect of our content-based image retrieval challenge is to perform automatic image segmentation, so as to then capture this knowledge in our database. More specifically, we want to generate the contour of an object (e.g. tumor) in an image. We will continue collaborating with other colleagues focusing on automated segmentation techniques for our applications. We are now testing segmentation advances to incorporate into our testbed.
Indication of Success
We have progressed as we projected. This progress has earned publication/presentation of our work in several journals, proceedings, and conferences. We have attracted the interest of actual medical users to the point of working closely together through this grant and in larger grants from NIH. This has also led to recent proposals to NSF and NIH in which our project advances are an important element to build on.
Project Impact
Our Knowledge-based databases modeling and query processing technology is being transferred to the on-going NIH Program Project Grant at the Radiology Department in the UCLA School of Medicine for application in thoracic oncology and other medical domains. Our project and other NIH grants have had much synergism, with visible impact.
We have supported three graduate research assistantships. The project has led to four Master degrees, one Doctoral degree and two on-going doctoral students; three undergraduate students have also participated on the project via special studies courses.
GPRA Outcome Goals
1. Discoveries at and across the frontier of science and engineering.Project References
Area Background
Database systems, knowledge-based systems
Area References
H. K. Huang and R. K. Taira. Infrastructure Design of a Picture Archiving and Communication System. American Journal of Roentgenology, 158:742-749, 1992.
Potential Related Projects
There are several content-based image retrieval projects (e.g. Faloustos, Ghafoor and Yu) in the IDM that could be leveraged on each other’s research work.
Dr. Shi-Kuo Chang
University of Pittsburgh
Department of Computer Science
Contact: Dr. Shi-Kuo Chang
Dept. of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
Phone: (412) 624-8423
Fax : (412) 624-8465
Email: chang@cs.pitt.edu
Project URL: http://www.cs.pitt.edu/~chang
An image information system is an information system for the input, storage, processing, output and communication of a large quantity of images. An image information system typically has the following five components: image input subsystem, image processing subsystem, image output subsystem, image database subsystem and image communications subsystem. Recently, advances in image storage technologies have made the creation of very large image databases feasible. Multimedia communications also greatly facilitate the distribution of images across communication networks. Parallel computers lead to faster image processing systems. High resolution graphics and dedicated co-processors enable the design of image output subsystems with superior image quality. The state-of-art image information systems have found their way into many application areas, including geographical information systems (GIS), office automation (OA), medical pictorial archiving and communications systems (PACS), computer-aided design (CAD), computer-aided manufacturing (CAM), computer-aided engineering (CAE), robotics, and scientific databases (SD) applications.
S. K. Chang and E. Jungert, Symbolic Projection for Image Information Retrieval and Spatial Reasoning, Academic Press, London, 1996
|
Albert Mo Kim Cheng |
http://www.cs.uh.edu/~acheng/acheng.html
real-time systems, expert systems, rule-based systems, analysis, optimization, synthesis
3 years, year 3, ``Optimization of Real-Time Rule-Based Expert Systems.''
The major focus of this research is to develop the scientific foundation and to implement practical tools for optimizing and synthesizing rule-based expert systems to meet specified response time constraints. Real-time rule-based expert systems are embedded artificial intelligence (AI) systems increasingly used in many applications, such as airplane avionics, medical monitoring instruments, smart robots, space vehicles, and other safety critical applications. In addition to functional correctness, these systems must also satisfy stringent timing constraints. The result of missing a deadline in these systems may be catastrophic. If a given rule-based system cannot deliver an adequate performance in bounded time, then it has to be optimized or resynthesized. The first part of the project investigates several approaches (state-space-based and semantics-based) for optimizing the rule base of expert systems. The second part of the project investigates the optimization of the match phase, which has a highly unpredictable runtime.
Several steps for rule base optimization are accomplished this year (1998-1999): (1) development of a new run-time dynamic optimizer for real-time EQL rule-based systems to complement our static optimizer, (2) more refinement of our static EQL optimization algorithm with bidirectional search; (3) implementation of parallel sorting algorithms on the IBM SP2 supercomputer for use in parallel matching of condition elements in rules; and (4) derivation of new heuristics for rearranging the condition elements in rules to reduce the match time.
To complement our static optimizer [Zupan & Cheng 98], we have developed a new run-time dynamic optimizer for real-time EQL rule-based systems to satisfy the specified response time constraints. Unlike the static optimization algorithm which optimizes a program before its execution, the dynamic algorithm tries to minimize the number of rules to be executed at run-time. The new run-time optimizer constructs a predicate dependency list, which consists of a predicate, an active rule list, and an inactive rule list, for each predicate in a rule-based program. It then generates an inactive rule list by putting all false rules in a predicate dependency list together, and dynamically selects rules for execution at run-time.
The optimized program executes only selected rules at run-time but has the same semantics as the original program. We implemented and evaluated the performance of the proposed algorithm using both synthetic and industrial real-time rule-based programs. Though the dynamic optimizer can be applied to any real-time rule-based system, we use EQL real-time systems in the implementation so that we can compare the performance with our static algorithm. The performance evaluation shows that the dynamic optimizer reduces response time as well as the number of rules and predicates significantly more than that achieved by using the static optimizer alone in many cases.
We have modified the optimization method [Zupan & Cheng 98] that we have developed earlier for EQL systems to optimize MRL (Macro Rule-based Language) systems. In particular, we implemented this method to optimize MRL systems after their corresponding state space graphs are constructed. Since the time and space complexity of bidirectional search (O(b^(d/2))) is better than breadth-first search's (O(b^d)), we use bidirection and bidirectional breadth-first search strategies instead of the original bottom-up and breadth-first search strategies employed earlier.
To further reduce the execution time of rule-based systems, we have been investigating parallelization of these systems. Earlier, we have developed techniques for rule-splitting to achieve a higher degree of parallelism while avoiding conflicts caused by parallel rule-firing. During this past year, we have studied the problem of implementing parallel OPS5 on the IBM SP2 supercomputer. Since up to 90% of the execution time is spent on matching (in the Rete network in the case of OPS5), improvement in the parallelization of the match phase is essential.
There are many comparisons in this match phase, especially in the two-input nodes. We first experimented with a good sorting algorithm to reduce the comparison time in two-input nodes. Performing comparison in two-input nodes is similar to performing a joint of two tables of data. If each class of working memory elements can be sorted in advance, then we can save more time when searching the required data. We implemented several alternatives of parallel sorting. We are attempting to find a good sorting algorithm which will give the best performance on IBM SP2 and can be used in the OPS5's rule interpreter.
Our research considers the case where an expert system forms the decision module of a real-time monitoring and controlling system. This real-time system takes sensor input readings periodically, and the embedded expert system must produce, based on these input values and state values from previous invocations of the expert system, an output decision which ensures the safety and progress of the real-time system and its environment prior to the taking of the next sensor input values. Thus, the upper bound on the expert system's execution time cannot exceed the length of the period between two consecutive sensor input readings. Therefore, the first objective is to determine before run-time a tight upper bound on the execution time of the expert system in every invocation. The second objective is to optimize the expert system by modifying or resynthesizing the rule base if the execution time is found to exceed the specified timing constraint. The most related area to our work is database systems, especially those with timing constraints or temporal characteristics (real-time, active, or temporal databases). Certain techniques for the analysis and optimization of rules can be applied to database queries and vice versa.
I would like to collaborate with group(s) working on the analysis and optimization of real-time expert and database systems.
Principal Investigator: Tzi-cker Chiueh
Affiliation: State University of New York
Department: Computer Science
Institution: SUNY at Stony Brook
Contact Information
Contact Name: Tzi-cker Chiueh
Address: Computer Science Department, SUNY
City: Stony Brook, NY 11794-4400
Phone: (516) 632-8449
Fax : (516) 632-8334
Email: chiueh@cs.sunysb.edu
URL: http://www.ecsl.cs.sunysb.edu/~chiueh
WWW PAGE
Project URL: http://www.ecsl.cs.sunysb.edu/codir.html
List of Supported Students and Staff (optional)
Project Award Information
Keywords
Text Compression, Inverted Indexing, Lazy Index Updates, Query Result Caching, On-Line Document Update Consistency, Information Retrieval Session Management
Project Summary
This project originally focused on the low-level storage and indexing engine of text retrieval systems, and now evolves to lazy index update to support on-line document database changes and query result caching to speed up IR query processing.
Goals, Objectives, and Targeted Activities
Activities in the Year 1998-1999:
Target activities in Year 1999-2000:
Indication of Success
The project has deviated a little bit from the original project plan, in that it now evolves towards high-level text retrieval technologies, such as lazy index update and query result caching. In the following year, we plan to tackle the document classification problem and the user interface problem specifically designed to help individual users to explore the Web. The implementation efforts for this project proceed as scheduled and we expect the first release of Codir should be ready by the end of this year (8/31/99).
Project Impact
Include a brief discussion on the impact of the project on:
GPRA Outcome Goals
The Codir project fulfills the GPRA Outcome Goals by developing fundamental systems implementation technologies for text retrieval systems, training graduate students in this important area, and publishing results in international academic conferences.
Project References
Area Background
This project studies how to build more efficient and effective text retrieval systems, which form the backbone of Internet search engines such as Alta Vista and directory services such as Yahoo!. We originally focused on compression (to save space) and retrieval (to save time) techniques used in text retrieval systems, and then move on to study software algorithms to extend read-only document databases to support both reads and writes. More recently, we investigate the idea of query result caching, which reuses results from previous queries to service subsequent queries, to optimize the performance of information retrieval query processing. Finally, we are exploring the concept of information retrieval session management that focuses on the computer support for individual IR users rather than on the text retrieval server.
Area References
Jan Chomicki
Dept. of Computer Science
Monmouth University,
Jan Chomicki, Ph.D.
Department of Computer Science
Howard Hall
Monmouth University
West Long Branch, NJ 07764
PHONE: (732) 571-4457
FAX: (732) 263-5202
chomicki@moncol.monmouth.edu
www.monmouth.edu/~chomicki
www.monmouth.edu/~chomicki/prog99.html
John Lu, Research Assistant (01/98-12/98)
This project is also supported by a parallel grant IRI-9632871 (with the same title) to Peter Z. Revesz, Dept. of Computer Science, University of Nebraska - Lincoln.
interoperability; data models; data translation; spatial, temporal and spatiotemporal databases.
The goal of this project is to explore the theoretical foundations of constraint-based database interoperability and its practical applications. The project aims to facilitate data sharing and query reuse among different temporal, spatial and spatiotemporal database applications.
We have pursued several ways to extend the relational model of data to accommodate spatiotemporal database applications admitting discrete or continuous change. We have developed a novel concrete data model for spatiotemporal databases. Essentially, a spatiotemporal object is viewed as consisting of a temporal domain, a spatial reference object, a reference time instant, and a geometrical transformation parameterized by time. We have identified many natural classes of such objects and have been studying their properties, in particular closure w.r.t. set-theoretic operations. Some preliminary results are reported in [5].
We have applied a different, previously developed by us spatiotemporal data model [2] based on parametric polygons to the problem of animation of spatiotemporal databases. Our experimental data indicate that such representation can be much more efficiently animated than constraint databases. This work, whose implementation part was done mainly by Peter Revesz and his students at the University of Nebraska, is reported in [4]. We are planning to compare our approach with a traditional database approach that precomputes all the snapshots of an animation and stores them in a relational database. Using our approach, we have implemented several real-life spatiotemporal database applications: urban growth and bird migration.
Jan Chomicki and his students at Monmouth University have designed and implemented a query processor for a subset of relational algebra over two-dimensional spatial databases.
Database interoperability is a big area. Our results indicate that when the issues in this area are sufficiently restricted they can be solved. We have shown that different spatial, temporal and spatiotemporal data models can be mutually translated. That gives the user the flexibility of using different data models for different tasks. For instance, constraint databases are well suited for general querying, while time-parameterized representations are superior for animation.
While there are well established concrete representations for temporal and spatial databases (intervals, polygons), there are none for spatiotemporal ones. It is a challenge to model continuous change in databases. Our foundational work in this area has brought forward several proposals that need to be further studied and more widely applied. We also believe that animation of spatiotemporal databases - an area we have started to explore - is a very promising research direction that could lead to cross-fertilization of database and computer graphics research.
Discoveries at and across the frontier of science and engineering. This project has led to the definition of new spatiotemporal data models that may expand the scope of existing database management systems. For example, building on our work, specialized spatiotemporal extensions can be added to object-relational systems like DB2 or Informix. Such extensions can provide a novel way of interacting with a database through an animation. Also, our work provides a foundation for building interoperation tools for spatial, temporal and spatiotemporal databases.
Spatial databases contain information about phenomena that have a spatial extent. Examples include geographical features (e.g., rivers, mountains), administrative entities (cities, states), as well as areas defined by arbitrary properties (the areas of frost or high disease incidence). Temporal databases contain information about phenomena with a temporal extent. Examples include employment and credit records. Spatiotemporal databases accommodate both spatial and temporal aspects, and have applications in land use management, ecology, the study of urban growth and species migration, and other areas dealing with temporal evolution of spatial entities. Spatiotemporal databases should provide constructs to model both discrete and continuous change.
Many different data models have been developed in the areas of spatial and temporal databases (spatiotemporal databases have only recently received broader attention). It is essential that different models be made to interoperate. We contend that interoperation is only possible with a common semantic basis. We think that constraint databases, a natural generalization of relational databases in which restricted logic formulas are stored in a database, provide such a basis.
(1)Specific applications of spatiotemporal databases: study of climatic patterns, land use management, urban growth, species migration etc.; (2) Resolution of feature conflicts in spatial images and maps; (3) Indexing and query optimization in spatial, temporal and spatiotemporal databases.
Contact Information
Panos K. Chrysanthis
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
Phone: (412) 624-8924
Fax : (412) 624-8854
Email: panos@cs.pitt.edu
URL: http://www.cs.pitt.edu/~panos
WWW PAGE
http://www.cs.pitt.edu/admt/projects/mobile.html
List of Supported Students
Stavros Papastavrou, Graduate Student Researcher.
Gloria A. Santin, REU.
Gary D Walborn, Graduate Student Researcher.
Susan Weissman-Lauzac, Graduate Student Researcher.
Project Award Information
Keywords
Mobile Computing, Transaction Processing, Disconnected Operations, Data Replication and Caching, Materialized Views.
Project Summary
The objective of this research is to develop innovative techniques that would make access to shared data by mobile computer users possible, anywhere and at anytime. In order to overcome the inherent limitations of mobile environment, we have developed PRO-MOTION, a flexible and adaptive infrastructure to support disconnected transaction processing in a mobile, client-server operating environment. By means of negotiated agreements, PRO-MOTION supports data caching and dynamic data replication on mobile clients and realizes both traditional as well as application specific data and transaction correctness criteria. We have also developed a prototype of a query processing facility that supports the exploration and query of databases from a mobile computer. A user can be involved in the formulation of queries while the mobile computer is disconnected from the network. Complete queries are formulated on metadata stored on the mobile computer in an incremental manner and without involving access to the actual data in the remote database to materialize intermediate steps. In the same context, we explored the idea of maintaining cache data on a mobile client as versions of materialized views and proposed an extension to SQL that allows a mobile computer to program these versions. In the course of our investigation, several commit protocols applicable in distributed database systems in general were also developed.
Goals, Objectives, and Targeted Activities
This year our goals and objectives are:
Indication of Success
The project has already met most of its initially stated goals and went even beyond them. Further, it has already had an influence on research work by others. The initial focus on data management in mobile environments had been on supporting efficient data retrieval and attempting to minimize energy consumption by the mobile computer. This project raised the issue of local updates on shared data on a mobile computer and proposed the first mobile transaction model. The first indication of its success is its impact on most of the subsequently developed mobile transaction models.
PRO-MOTION is one of the three different infrastructures proposed for mobile transaction processing and, in conjunction with our work on customizable views mechanism, is expected to contribute to the better understanding of (1) disconnected and autonomous mobile database operations, (2) data caching and dynamic data replication, and (3) the mechanism needed to support them. The fundamental building blocks of PRO-MOTION are compacts, which are the realization of our proposed notion of agreements and function as the basic unit of caching and control. Compacts encapsulate access methods, state information, consistency constraints (including temporal ones), restrictions, obligations, and privileges along with the shared data. Interestingly, the first impact of compacts went beyond the area of mobile computing as it can been seen in the work of J. Waesch, T. Tesch and W. Klas, researchers at GMD at Darmstadt, in semantics-based transaction management for cooperative applications.
Project Impact
GPRA Outcome Goals
The goals of this project are in agreement with the long-term vision of universal access, providing data access anywhere and at anytime while taking into account the individual user's specific needs. Every attempt is being made to develop theoretical frameworks and then transfer them into pragmatic environments and so their impact is far-reaching in the commercial sector. The project thus far has attracted a diverse body of students, including women, who are particularly under-represented in this research community.
Project References
Area Background
Several exciting advances in mobile computers and wireless technology have made database access by mobile computer users possible anywhere and at anytime. This, in turn, is enabling the development of an unprecedented number of new database applications that are not only affecting the way that we compute but, more significantly, they are changing the way we do business. Mobile and wireless computing facilitates more collaborative activities as well as opens enormous possibilities for information and data sharing.
However, the network and system assumptions underlying the traditional approach to distributed data management are no longer applicable in these new mobile database applications. For example, greater mobility implies a higher rate of spurious disconnection, and portability suggests more accidental disruptions. In contrast to traditional assumptions, disconnections in mobile environments do not necessarily imply failure of the disconnected mobile computer. The part of a computation executing on a mobile computer should be able to continue while the mobile computer is moving and not connected to the network. In these new mobile, distributed environments, therefore, it is important to study the issues of data consistency and sharing, and develop innovative methods that satisfy their requirements while overcoming the inherent limitations of mobile and wireless computing (e.g., frequent disconnection, limited battery life, restricted storage capacity, and low-bandwidth wireless network links).
Area References
Potential Related Projects
Crafting Recovery for Advanced Transaction Models and Applications: Mobile transactions are instances of advanced transaction models.
Transaction Management in Multidatabase Systems: One can view mobile data management system as a special case of a multidatabase system. For example, the notion of local autonomy in mobile environments is manifested in the ability of mobile computers to continue to operate in an independent fashion when disconnected.
Distributed Constraint Maintenance: Efficient maintenance of integrity constraints that will require little, if any, coordination between mobile computers and database servers, will both facilitate disconnected operations as well as reduce the cost of communication during periods of adequate network connection.
Mobile computers and wireless networks are becoming an integral part of distributed computing environments which means that there is great potential for collaboration with projects focusing on wide-area database systems and heterogeneous and semi-structured data management.
M. Isabel F. Cruz
Computer Science Department
Worcester Polytechnic Institute
Contact Information
100 Institute Rd
Worcester, MA 01609-2280
Phone: (508) 831-5621
Fax : (508) 831-5776
Email: ifc@cs.wpi.edu
WWW PAGE
http://www.cs.wpi.edu/~ifc/grants/career98.htm
Keywords
Query languages, Visual languages, Digital libraries, Multimedia, Information visualization, Web retrieval, Constraint languages, Human-Computer Interaction
Project Award Information
Project Summary
New technologies such as multimedia, digital libraries, and electronic publishing require large databases and powerful query languages. This project investigates a database management system that supports a meta-query language with which users can design their own visual query languages to specify both the data to retrieve and the display format. Theoretical aspects of the research address the characterization of classes of visual queries that can be evaluated with guaranteed time complexity by providing a careful design of the query evaluation engine. Practical goals include the implementation of a database management system supporting visual queries, the dissemination of results using the WWW, and the transfer of technology. The project has a strong educational component, seeking the involvement of graduate and undergraduate students and the inclusion of prototypes in the classroom so as to reflect the more interactive and visual aspects of today's computer science. Visual query languages will be key components of the next generation of declarative database interfaces because they take advantage of the user's visual perception to convey information efficiently. Their successful implementation will provide database systems with fundamental capabilities not currently available.
Goals, Objectives, and Targeted Activities
We are currently involved in several projects. We describe their current status and plans for coming year.
Indication of Success
The list of different projects that we are currently investigating and implementing are an indication of the success of the current grant. Along with theoretical work, other results such as the ones obtained through usability studies on our prototypes demonstrate the practical relevance of our research. Several projects have been extensively used and tested (e.g., Mocha and DelaunayMM ) and are already available or will be shortly available on the Web. Several projects were started about a year ago, and have already produced results, whereas previous projects such as Delaunay are being considerably extended. We collaborate with other universities (Brown University, SUNY Buffalo, University of Rome, University of Bari, IRISA) and our work has been published in conferences and journals in the areas of Databases, Visualization, Multimedia, and Computational Geometry. Several new students have joined the different projects in the last six months, and their contributions are already noteworthy. Our research work has directly influenced the curriculum at WPI, including the development of a new undergraduate course called "Webware: Computational Technology for Network Information Systems."
Project Impact
Selected Project References
Judith Bayard Cushing
Elizabeth Kutter**
Affiliation:
Computer Science
The Evergreen State College
Olympia, WA 98505
**Molecular Biology, The Evergreen State College
Contact Information
LabI, The Evergreen State College
Olympia WA 98505
Phone: (360) 866-6000 x-6652
Fax: (360) 866-6794
Email: judyc@evergreen.edu; kutterb@evergreen.edu
WWW PAGE
http://www.evergreen.edu/scidb
Project Award Information
Supported Students and Staff (optional)
Michael Crozier and Tory Ringer, Undergraduate Research Assistants; Justin Laird, Programmer.
Keywords
Scientific Databases, Object Databases, Middleware, Interoperability, Data Mining.
Molecular Biology, Computational Chemistry.
Project Summary
Scientific researchers often expend significant effort using their program with others’ data, their data with others’ programs or linking their data to others’ data to gain new understanding. Recent increased physical interoperability has exacerbated conceptual differences between applications and data sources. We use object technology to develop infrastructure for computational science experiments – domain-specific data structures and object-oriented (database) systems to represent inputs and outputs for experiments that use different applications. For computational chemistry, the middleware dubbed computational proxy facilitated experiment invocation, monitoring, and results capture by modeling application programs and processes. For molecular biology, we addressed integrating and reasoning about multi-tiered computational results. One issue involves providing adequate information (metadata) about computational results so that those results are more readily reproducible, and tracking operations on these results such that those operations can be repeated for other sets of results or modified. A second issue involves declarative representations of scientific data, applications and computations such that the data structures can be extended, metadata can be captured, and program inputs generated and outputs parsed semi-automatically.
Goals, Objectives, and Targeted Activities
Molecular biology applications, as other scientific domains, need to store and provide views of large amounts of specialized quantitative information. High speed sequencing technology, considerable funding to map the genomes of key biological organisms, and public databases such as GenBank, PDB, EMBL, JIPID, and SwissProt make available millions of genetic sequences. In addition, university and industry laboratories maintain private sequence databases for preliminary or proprietary research. The need for common interfaces and query languages to exploit these heterogeneous databases is well documented, and several such systems exist or are under development. Other efforts address conceptual data model standards. Our own work on database and program interoperability has shown that providing scientists a single interface is but the first step towards making the databases really useful. (1) Molecular biologists need to analyze their own sequences as well as sequences from public databases, and to manage the results of computations on data from multiple sources. (2) A single conceptual data model is too limiting – in some communities the data structures or functional requirements differ because the organisms differ, and advances in the science require changes to the model. We focused on the first of these issues during the past two years, but turn to the issue of extensible conceptual data models in the next, and to ecological and geophysical domains as well.
Molecular biologists often split the output from computational biology programs into many (and relatively small) data items so that they can reason about them with other similar items. For example, a researcher might make hundreds of sequence comparison searches for a scientific study, each of which generates many separate items, and might be faced with thousands of items to organize, compare, reason about and keep track of. Our research aimed to develop conceptual structures and implement prototypes for integrating, analyzing and tracking inputs and results from numerous computational biology programs.
Working with molecular biologists at the University of Washington Biotechnology Center, Zymogenetics, and initially with bioinformaticians at Cold Spring Harbor, we developed a conceptual data model and prototype that helps researchers organize result items from sequence comparisons into clusters that can be marked, named, annotated, and manipulated. Researchers can apply simple operations to result elements, create new elements, and compare results across different programs. Our goals for this year were to evaluate the effectiveness of the current prototype, and make software and operational changes to bring the technology into the hands of practicing scientists. The prototype was deployed in three sequencing laboratories to gain direct experience with scientists. Sample datasets were articulated and made available as examples in the user’s guide. The software was evaluated and changes needed for immediate use effected; additional changes were articulated in a design document. Functional specifications for a scripting language for reasoning about results were developed. We evaluated the work in terms of what additional computer science research is needed, and what additional research could be applied.
Indication of Success
This is going to be the research assessment part of your report (accomplishments, success stories). In addition, include an analysis, prediction, and/or expectations of the long-range results of the award. Degree of success -- have the stated objectives been accomplished thus far in the life of the project? Note that the reported progress does NOT have to reflect the initial research plan. On the contrary, innovative deviations are always welcome.
The computational proxy work and our work as a proof of concept of the idea of computational proxies sparked a significant development effort for an operational system at the Pacific Northwest National Lab. Interest in our work with molecular biologists has literally exceeded our time to respond, and demonstrates the value of close collaboration between domain scientists and computer scientists to build databases and prototypes of "real world" scientific problems. For example, presentation at the ’97 SIGMOD data mining workshop of how our prototype could be used for data mining the genome databases prompted requests in ’98 for our presence at data mining workshops and collaboration with researchers developing algorithms and architectures for data mining. Presentation of our work at the ’97 Scientific and Statistical Database Workshop led to a request for us to write a chapter in Shasha and Wang’s Pattern Discovery in Molecular Biology Data.
In the past six years, we note considerably increased interest and expertise within small molecular biology communities for collaboration and developing extensions to existing tools for particular needs arising from differences in genetic structure of specific families of organisms. At the semi-annual meeting of the T4-phage community (Summer ’98), the database interest group session numbered almost 40 attendees, up from 5 only four years earlier. Perhaps more significant than numbers was the nature of the suggestions from the community about what kinds of tools were needed and how the community could go about developing those tools. The community did not merely describing the problem in general terms to computer scientists, but proposed working themselves to develop specifications and writing prototype scripts and collaborative self-funded development with the computer scientists. While such collaborations were common in large laboratories (and still are), the willingness of smaller laboratories to get involved and their increased technical expertise is in part an indication of the success of interdisciplinary research projects such as this. NSF initiatives have been key in building both infrastructure and technical sophistication among the molecular biology community.
One strong indicator of success has been the acquisition of additional funds to support the work: (1) from the Washington Technology Center to subsidize the prototype development and deployment. The WTC recognizes needs within critical high tech fields within the state. (2) from the NSF Research Experiences for Undergraduates, for undergraduate students to participate in the interdisciplinary research conducted by the PI’s under this grant. Finally, Tim Hunkapiller and others, our major collaborators at the University of Washington, received considerable funding from drug companies to establish a genome data mining venture, where many findings funded by this NSF project are put into daily practice discovering new drugs and specific gene functions.
Project Impact
Project impact has been in several areas, including human resources, education, institutional infrastructure and technology transfer. The human resource aspect is particularly important, since the development of bioinformatics expertise has been noted as a national priority.
Noted for innovative education, Evergreen faculty make considerable effort involving undergraduates in research, and integrating research results into the curriculum. Thus, more than the three to four active student participants gain an appreciation for the nature of scientific research. This grant has involved two computer science students each year directly in its work; those students have been encouraged to work collaboratively with molecular biology students. One success story is a 1998 interdisciplinary student project – the Bioinformatics Automated Research Toolkit (BART). Students worked with molecular biology researchers at Evergreen, University of Washington Biotechnology Center and Zymogenetics, Inc., to design and implement a toolkit for researchers less technically expert than those using our prototype. The system aimed to integrate web-accessible sequence analysis programs (e.g., BLAST, FASTA) into a single computational environment and to help researchers reason about those results. It was implemented in Sun’s Java Development Toolkit and could run on any platform with that environment (e.g., Sun Solaris, NT Windows, Linux). BART was demonstrated at the International T4Phage meeting the following summer. The molecular biology student went on for a summer internship at UCSD with bio-informatics and the computer science students continued with computer science aspects.
This grant has contributed to one Ph.D. dissertation and two Master's Theses (in conjunction with Oregon Graduate Institute), and six undergraduate senior projects. Two of the projects involved women. The project has placed four baccalaureate graduates into projects involving molecular biology systems, first at Cold Spring Harbor and now at startup firms in Boulder, and the Fred Hutchinson Cancer Research Hospital. Each of the last three summers, two to four Evergreen students have worked as interns in scientific databases at the Environmental and Molecular Sciences Laboratory in Richland, WA. Those students have gone on to work at other National Labs, and at small high-tech ventures in Seattle and Portland. One student has gone on to graduate school, working in domain specific languages and interoperability. Three more students soon graduate.
The project has provided an exemplar project for computer science and molecular biology curricula at Evergreen, demonstrating interdisciplinary research and the integration of research with education. More specifically, the project’s interoperability and conceptual data model problems are used as real world examples in software engineering and database classes at the college. Cushing has even taught rudimentary sequence analysis to 100 freshmen and Kutter has developed labs in molecular modeling and analysis for her upper division classes.
The grant had a important impact on institutional infrastructure, with the establishment of a scientific database laboratory at Evergreen. Prior to this there were no Unix workstations available for undergraduate or faculty projects. The lab now provides equipment and software for student and faculty research and projects in computational chemistry, ecology, geology, and medical imaging, as well as molecular biology. As a direct result of work accomplsihed with this grant, four other grants have been awarded, including a major (in excess of $1 million)
research experience for undergraduate research in molecular biology.
The project’s technology transfer to the molecular biology community has been accomplished through collaboration with University of Washington molecular biologists who formed a genome data mining company that works closely with pharmaceutical companies and genome sequencing companies such as EBT and Zymogenetics. Initial work and placement of graduates (more demand than we can meet) transferred interoperability and data modeling technology to bioinformaticians at Cold Spring Harbor who later formed a second startup company and the Pacific Northwest National Laboratory, where a multi-year, 15-person-year software development effort was launched.
GPRA Outcome Goals
This project has integrated computer science research with molecular biology research needs. More than "applied" research, the effort has extended prior research to new domains and has furthermore identified additional research needed to solve unearthed problems in interoperability of programs and data. Developing prototypes and contributing personnel for further development in both established industry and start-up firms demonstrates the frontier nature of the work. The work is = used to complete the sequencing of the T4-phage and other genes, which are being used for novel medical purposes (delivery of drugs to targeted cells).
Through educational outreach at Evergreen and such organizations as the National Collegiate Inventors and Innovators Alliance (NCIIA) and other professional societies, Evergreen faculty have helped to improve achievement in mathematics and science skills. Thus, for example, one of the undergraduates working with the computational chemistry aspects of the project received the "Best Project" award at the regional American Chemical Society Student Research Meeting.
Our prototype was deployed not only at labs in Seattle and Olympia, but also in Germany, and the prototype effected additional systems that increase the flow of timely genomic information and new computational applications to researchers involved in data mining the genome databases. Feedback and ongoing discussion with these researchers enhances the research of our research.
Project References
Area Background
Solving problems in scientific databases offers opportunities to bring together research issues in computer science applicable to database systems, programming languages, software engineering, and distributed systems. Scientists are excellent collaborators, generally with a deep understanding of their data structures, system requirements, and technology limitations. Scientific domains offer wide variety of examples for which analogies can be drawn to larger issues where a general solution may initially be too difficult. Just as the sciences offer significant opportunities to identify computer science research issues, computer scientists can offer significant help to various scientific disciplines in building the infrastructure necessary to forward critical research in key areas of national impact such as molecular biology, global environmental change, or species diversity.
Within the mainstream of database research, semantic and object-oriented data modeling are important because of the importance of representing complex scientific objects independently of the physical schema and in an extensible manner. Distributed databases and collaboratory technology is relevant because of the interplay between an individual scientist's private databases and laboratory-wide or public databases. Programming language research, particularly as it applies to the development of domain specific languages, extensible data structures and processing semi-structured data, provides help on how we might develop tools that are flexible enough to support active scientific endeavor, and rigorous enough to satisfy need for high trust in results. Object and component technologies can be applied to promote reuse, portability and data distribution. Distributed software architectures, providing what Wiederhold calls mediation services. To make this technology work for the scientist, scientists must agree as to the meaning of the "mediated" data and services. Examples include that of individuals (e.g., Markowitz; Rieche and Dittrich; Kemp et al) and communities (e.g., the Life Sciences Research Interest Group of the OMG).
In addition to research in computer science, research in the specific domains is critical to forwarding the use of effective scientific database systems. Molecular biologists, computational chemists (witness the Nobel Award in 1998), ecologists, geographers and geologists all contribute to metadata standards, database archives, domain glossaries and thesauri, toolkits, algorithms, etc.
Area References
Potential Related Projects
Work on this project could be extended with further collaboration in domain specific languages, conceptual data modeling, schema migration, data mining, human computer interaction, metadata standards and visual programming. I am particularly interested in working with people with previous expertise in spatial data structures for new work in ecology and geosciences.
Alex Delis
Computer and Information Science
Polytechnic University
Contact Information
6 Metrotech Center
Brooklyn, NY 11201
Phone: 718 260.3313
FAX: 718 260.3074
E-mail: ad@naxos.poly.edu
http://naxos.poly.edu/~ad
Keywords
Scalable Distributed Database Architectures, Flexible Transaction Processing Models, Collaborative Query Processing and Client/Server Indexing Techniques.
Companies and organizations continuously deploy computing systems to handle ever increasing volumes of data. The management of such massive data is often carried out by networks of data servers and workstations/PCs (clients). Operating in such networked environments, Client-Server Database Systems (C/S DBSs) have shown reduced response times for client queries over their centralized counterparts, while keeping infrastructure costs low by using inexpensive hardware. C/S DBSs offer distributed processing with the overall control coordination often provided by the servers. Recent work on C/S DBSs has addressed system design, concurrency control, data caching, and performance issues. For limited number of clients, C/S DBSs have demonstrated promising performance rates. However, once the number of clients attached per server becomes very large, C/S DBS clusters fail to guarantee satisfactory response times for all participating clients (``scalability'' problem). In an effort to address the scalability problem, we study cooperative processing methods and explore the options that emerging network technology offers in order to achieve improved client response characteristics. More specifically, we are studying the effects of high-speed networks on the design of C/S DBSs; access methods for spatial and multi-dimensional data in C/S environments; and techniques that facilitate query materialization through multi-client cooperation.
We are currently investigating: (a) the impact that switched networks will have on the development of C/S and distributed transaction protocols as most client sites will be capable to efficiently exploit their long-term memory, (b) the effect that client-clustering may play in the realization of very scalable architectures and the constraints that such clustering may impose on the dynamic reconfiguration of systems, (c) the feasibility of query handling via multi-client cooperation, and (d) the deployment of spatial indexing techniques that may be suitable for very large data-sets stored in shared nothing database architectures.
We have:
Area Background
Conventional database systems strive for fast and convenient access of data. Modern C/S DBS proposals offer reduced response times by off-loading the top level functionalities of database servers to clients. Transient data caching in client short-term memory and the provision of global memories has helped considerably in this direction. Aggressive and application specific concurrency control mechanisms for database servers have contributed as well. Our work aims at extending the success of these earlier efforts and provide a basis for further improving the scalability aspects of C/S DBSs.
Area References
Potential Related Projects
S. Banerjee amd P. Chrysanthis, Univ. of Pitt. ; L. Rashid, Univ. of Maryland.
|
|
|
|
Department of Computer Science |
|
|
Carnegie Mellon Univ. |
|
|
Pittsburgh, PA 15213 |
|
|
Phone: (412)-268.1457 |
|
|
Fax : (412)-268.5576 |
|
|
Email: christos@cs.cmu.edu |
|
|
|
Image, video, sequence indexing; searching by content
We are studying a domain-independent method to search multimedia databases by content. Examples of these searches include `find all images that look like this graphic drawing, which is a photograph of a sunset' in a collection of color images; `find stocks that move like Motorola's' in a collection of stock price movements. The target method should be much faster than sequential scanning, but it should return the same response.
We are focusing on two new settings: (a) for natural and medical images, with sub-pattern matching and state-of-the-art signal processing methods, like wavelets and (b) for voice and related time sequences that include time-warping and Hidden Markov Models (HMM), which makes it challenging to find good features.
We have already developed :
Long range results: The project couples modern database techniques (like R-trees and variants) with signal processing and machine learning, to solve large scale, real problems. We have successful collaborations with Electrical Engineering (morphology, wavelets), Medicine (tumor-like shape indexing) and Statistics/Numerical analysis (Singular Value Decomposition), and storage devices (Active Disks).
Activities enabled: the design of 15-828A; the AT&T patent filing [Korn97].
The following refereed references mention the above NSF support:
Database research focuses on fast and convenient access to information. For example, given a collection of bank records, we want to find quickly the balance of our account. Supported datatypes have traditionally been strings and numbers. Our work aims at extending the success of commercial databases to additional datatypes, like video, audio, time series, images.
Christos Faloutsos, Searching multimedia databases by content, Kluwer Academic Publishers, Norwell, MA, 1996. ISBN 0-7923-9777-0.
Discoveries at and across the frontiers of science and engineering: Active Disks for data mining; Singular Value Decomposition for compression and for ``Ratio Rules''; relevance feedback for multimedia querying (MindReader).
Connections between discoveries and their use in the service of society: Our SVD-based compression method will allow faster data mining, bringing the relevant data higher up in the storage hierarchy, and allowing fast, approximate rule detection. The MindReader system allows for a user-friendly querying of multimedia and mixed-media databases by multiple examples.
A diverse globally-oriented workforce: Our efforts are inherently cross-disciplinary, straddling databases, signal processing, medicine, machine learning and statistics.
Leonidas Fegaras
Department of Computer Science and Engineering
The University of Texas at Arlington
Leonidas Fegaras
Department of Computer Science and Engineering
The University of Texas at Arlington
416 Yates Street, 301 Nedderman Hall, P.O. Box 19015
Arlington, TX 76019
Phone: (817) 272-3629, Fax: (817) 272-3784
Email: fegaras@cse.uta.edu, WWW Page: http://lambda.uta.edu/
Query Optimization, Object-Oriented Databases, Nested-Relational Algebra, Query Unnesting, Database Programming Languages, Database Physical Design
One of the key factors for object-oriented database (OODB) systems to successfully compete with relational systems as well as to meet the performance requirements of many non-traditional applications is the development of an effective query optimizer. This research addresses efficient query optimization for OODB languages. The research framework is based on a calculus, called the monoid comprehension calculus, which has already been shown to capture most features of modern OODB query languages. This research concentrates on two very important optimization problems and proposes practical, effective, and general solutions. The first problem is query unnesting, an optimization that, even though improves performance considerably, is not treated properly by most OODB systems. A framework is developed that generalizes many unnesting techniques proposed recently in the literature and is capable of removing any form of query nesting using a simple and efficient algorithm. The second problem is query optimization in the presence of side effects. A practical method is developed that allows the same optimization techniques proposed for regular queries to be used with minimal changes for OODB queries with side effects. A prototype query optimizer is constructed that shows that these optimization techniques improve performance considerably. The results of this research will help OODB vendors build better query optimizers, leading to better system performance. This research is carried out in collaboration with David Maier at the Oregon Graduate Institute.
During the first year, Fegaras and Maier will develop a framework for query unnesting in the presence of methods. Maier's long experience with the "Revelation" project will be a valuable resource in solving this problem. At the same time, Fegaras and the UTA students will implement the query unnesting algorithm and incorporate it into the prototype OQL optimizer already in existence at UTA. We will evaluate the merit of query unnesting by measuring the performance improvement gained by this method. If the integration of our system with the SHORE evaluation engine is stable at that time, the performance improvement measurements will be done using real data and real query response times; otherwise, the measurements will be done using cost estimations.
During the second year, Fegaras and Maier will refine the framework for handling object identity and adapt it to OQL. At the same time, Fegaras and the UTA students will implement the framework extensions described in the first year. A realistic application domain will be chosen to evaluate the effectiveness of our query optimizer. At that time, we expect that we will have a full integration of our optimizer with the SHORE evaluation engine. As a first measure of plan quality, we will use this evaluator to get real performance measures for our optimizer and compare them with those of a commercial OODB system, such as O2.
During the third year, Fegaras and the UTA students will implement the framework extensions described in the second year. Since no other OODB query optimizer handles object updates during query optimization, we will evaluate the merit of our method by measuring the performance improvement gained when our method is compared against the naive interpretation of updates. At the same time, Fegaras and Maier will work on more advanced theoretical issues and extensions to the basic model, such as query optimization in the presence of materialized views and the incremental maintenance of materialized views.
By the end of this project, we will have a fully functional object-oriented database management system built on top of the SHORE query evaluation engine. A web page with online demos is being constructed to let other researchers use and test our work by submitting ad-hoc queries to a fixed database. The source programs of our optimizer, the benchmark queries, and the data used in our performance results are available to other researchers through the web at http://lambda.uta.edu/lambda-DB/manual/. Currently, our system can handle most ODMG ODL declarations and can process most ODMG OQL queries. It also supports embedded OQL in C++ with a very low impedance mismatch, transactions, updates, index creation & maintenance, macros, and methods with OQL body. Our query optimizer unnests all nested queries, materializes path expressions into pointer joins, performs some forms of semantic optimizations, uses a cost-based polynomial-time heuristic for join ordering, and uses a rule-based cost-driven optimizer to produce physical plans. The generated physical plans are translated into C++ code. Our query evaluation engine is stream-based and supports many evaluation algorithms, including indexed nested loop, pointer join, and sort-merge join.
We will evaluate the merit of our query unnesting algorithm by measuring the performance improvement gained by this method. We would consider our work successful if we gain at least 50% performance improvement for deeply nested queries. As a measure of plan quality, we will get real performance measures for our optimizer and compare them with those of a commercial OODB system, such as O2. Even though O2 does not use the same evaluation engine, we would consider our work successful if our measurements indicate at least 20% performance improvement for deeply nested OQL queries.
We expect many OODB vendors as well as researchers in academia and industry to use our work to improve the performance of their systems. In addition, the results of this project will be published in conference proceedings and journals and will be available through the web. Finally, we expect that this project will lead to a PhD thesis and to 3-4 Masters Projects at UTA.
Our system will be the first ODMG-compliant object-oriented database management system to be freely available to other researchers. It will be open-source so that other researchers can improve it and extend it. We expect that our theoretical work as well as the freely-available source code will have an impact on the research on object-oriented database implementation and will improve the quality of the next-generation database systems.
One of the reasons for the commercial success of the relational database systems is that they offer good performance to many business applications, mostly due to the use of sophisticated query processing and optimization techniques. However, many non-traditional database applications require more advanced data structures and more expressive operations than those provided by the relational model. Currently, there are two major approaches for addressing the needs of these applications. One approach is object-relational databases (ORDBs), which extend the relational model with object-oriented features, and the other is object-oriented databases (OODBs), which provide database capabilities to object-oriented programming languages. OODBs provide powerful data abstractions and modeling facilities but they usually lack a suitable framework for query processing and optimization. One of the key factors for OODB systems to successfully compete with relational systems as well as to meet the performance requirements of many non-traditional applications is the development of an effective query optimizer. Even though there are many aspects to the OODB query optimization problem that can benefit from the, already successful, relational query optimization research, there many key features of OODB languages that make this problem unique and hard to solve. These features include object identity, methods, encapsulation, user-defined type constructors, large multimedia objects, multiple collection types, arbitrary nesting of collections, and nesting of query expressions.
Michael Freeston1, 2 Terence Smith1
1Alexandria Digital Library Project
Department of Computer Science
University of California, Santa Barbara
2Department of Computing Science
Kings College
Aberdeen
Scotland
Contact Information
Michael Freeston
Alexandria Digital Library Project
Department of Computer Science
UCSB
Santa Barbara CA 93106
Phone: (805) 893-8589
Fax : (805) 893-3045
Email: freeston@alexandria.ucsb.edu
http://www.alexandria.ucsb.edu/~freeston
WWW Page
www.alexandria.ucsb.edu/~freeston/nsf/project_97
List of Supported Students and Staff
Mike Hoerhammer Research Assistant
Project Award Information
Award number: IRI-9619915
Duration: 09/15/97 - 08/31/00
Title: A New Generic Indexing Technology for Digital Library Support
Keywords
access, BANG, BV-tree, database, index, multi-dimensional
Project Summary
This project aims to develop a new, generic indexing technology founded on multi-dimensional indexing. The theoretical and practical basis of the technology has been developed by the project PI in a research effort extending over the past ten years, culminating recently in a number of key theoretical advances. The most important of these is a general solution of the n-dimensional B-tree problem [Fre95] – a problem which McCreight (co-inventor of the B-tree) has described as "…one of the persistent puzzles of computer science". The challenge now is to test these advances in a practical implementation of the whole technology. The philosophy behind the approach is in direct contrast to the current trend towards ‘data blades’ or ‘data cartridges’ i.e. the idea of plugging in a different index method for each data type. By studying the underlying principles of indexing techniques, we have been able to extend the worst-case performance characteristics of the B-tree to a method of indexing any data structures which can be expressed as directed graphs. The resulting technology thus has a very wide scope, ranging from conventional relations to logic clauses, and from spatial object vectors to pictorial image rasters. The central objective of the project is to develop support for all these areas within one integrated, generic indexing system. The aim is to maximize functionality and internal compatibility while maintaining low software complexity.
Goals, Objectives, and Targeted Activities
The application focus of the project is the Alexandria Digital Library (ADL) at the University of California, Santa Barbara. At present, ADL relies for its indexing support on existing commercial database systems. However, the indexing functionalities required by ADL are well beyond those which existing systems support. The new project is running in parallel with ADL, monitoring its indexing needs and devising solutions within the new technology.
Indication of Success
During the first year of the project we implemented an enhanced form of point and spatial BANG indexing (see project references below). Subsequent refinements made it possible to re-implement the system entirely in Java with very little sacrifice in performance. In particular, a modification of the partitioning algorithm for spatial extent indexing resulted in a spectacular improvement, outperforming the R*-tree significantly in both containment and intersection queries, and dynamic and bulk-loading index build operations.
We have recently turned our attention to scalable queries, in response to the need of ADL to return spatial query results of a magnitude proportional to that of the query window. By introducing an additional scale dimension to the index, we have been able to eliminate a subsequent filter step, thereby making query performance almost independent of the query window size for typical geographic data sets. For large windows, this can improve performance by two orders of magnitude or more. We are also re-attacking the problem of providing both attribute-based and text-based indexing efficiently within a dynamic database system - an urgent need for digital libraries. Our approach is based on the close relationship between Patricia trees and BANG index structures and is yielding encouraging results.
Project Impact
It has become increasingly clear that ADL requires a level of functionality and performance which commercial database systems and GIS systems are currently unable to provide. The success of the ADL project has now led to an ambitious research and development plan for the next five years which will almost certainly widen this gap further. Urgent attention is therefore now being given to transferring the new indexing technology into robust support systems for ADL components.
GPRA Outcome Goals
Put at its simplest, improvements in indexing speed save time - and time is money. We have no way of quantifying the total time currently spent each day in the US alone by computers performing index build and search operations, but we are confident that the value of, say, a 5% speed improvement in these operations would be worth billions of dollars per day.
As an enabling technology for ADL, this project is also helping to solve the efficiency and scalability problems which digital libraries must overcome before they can serve a wider public need. The future ADL research and development plan directly addresses this need: it aims to provide a global infrastructure of georeferenced digital libraries and computational modeling environments, for every educational level and for research.
Project References
Area Background
Although there is always a session devoted to indexing techniques at the major database conferences, it is sometimes said we should stop doing research into this area as we have plenty of good indexing techniques already. In fact this is not the case. There have been many design proposals, but very few have stood the test of time. The database industry still relies almost exclusively on the B-tree - an indexing method devised over twenty-five years ago. It is also sometimes said that indexing techniques are now about as good as they are going to get - that because we have relied on the B-tree for so long, there probably isn't anything better. This is nonsense. Just try recalling any incident from any moment in your life, however long ago. You will be able to retrieve a complete visualization in your mind within a fraction of a second - certainly within the time of 50 disk accesses at most. Current storage and indexing methods don't come anywhere near such performance. This illustrates a further point: that multi-media is not just changing the content of database systems, it is blurring the boundaries with other areas of computer science: image recognition, for example. For this reason, the study of indexing methods today is a wider, more interesting and much more challenging field than ever before.
Area References
Potential Related Projects
As pointed out in the project summary, the scope of the indexing technology under development in this project is very wide. Potential applications which we have identified for study include: the replacement of 'conventional' relational database indexing by multi-dimensional indexing; complex structure indexing in object-oriented databases; spatial indexing in GIS systems (including large-scale persistent quad-trees); substring indexing in text retrieval systems; constraint database indexing; multiple-resolution image indexing; image feature indexing; and logic clause indexing in large-scale deductive database systems.
Ophir Frieder
Department of Computer Science
Illinois Institute of Technology
Chicago, Illinois 60616
Phone: (312) 567-4496
Fax : (312) 567-4543
Email: ophir@cs.iit.edu
http://www.cs.iit.edu/~ophir
http://www.cs.iit.edu/~ophir/nsf.html
Information retrieval, text databases, parallel processing, relevance feedback, structured and unstructured data integration, gene sequencing, multi-sequence alignment
Duration: September 1993 - February 2000
Current year: 6
Title: NYI: Large Scale Information Processing
Data are electronically available throughout and in a diversity of formats. To efficiently process and extract information from these data requires distributed, efficient, portable, high-performance information processing engines. Depending on the type of data to be processed and the application demands, examples of such processing engines are: gene sequencing engines, information filtering or retrieval systems, database engines, etc. To support the distributed nature involved, reliable, verified, and efficient communication protocols must be developed. It is within this context that my group has and continues to focus our research efforts. Topics we focus on include:
In the 1998-1999 research year, we will continue our involvement with NCR personnel as part of our NIST TREC activities; results from TREC-7 were encouraging, and we hope to further improve in TREC-8. An additional doctoral student is likely to complete her research prior to the completion of this project.
Our ongoing project focuses on the design, development, and evaluation of large scale information systems. The likelihood of technology transfer to the general user community is improved by close interaction with either an industrial partner that will commercialize the technology or a scientific agency that maintains the critical data repository. Such a scientific agency, if available, typically directs and supports the tools that are commonly used by the scientific community to process the data. Therefore, continuously, throughout the entire duration of all of our efforts, we interacted with our industrial partner when developing our information retrieval technology and with scientists and upper management at the National Institutes of Health when we developed the high performance gene sequencing efforts.
Our greatest focus centered around structured and text database integration. Traditionally, database applications stored predominantly structured data; today, however, the focus has shifted towards the integration of multiple data types. Jointly with an industrial partner, we integrated structured and text data using strictly the relational model. By developing a set query templates, we developed an approach that integrates both structured data and text using standard, unchanged SQL. In contrast to traditional information retrieval systems, the described approach provides for portability across platforms and for the opportunity to exploit parallelism without additional development costs. The approach was implemented on both serial and parallel platforms. Using the benchmark TREC data and query sets, we evaluated the performance of our system. Evaluation consisted of accuracy assessments and runtime and scalability measures. The results demonstrated both high accuracy and near linear speedups using a 24 node system. Portability was demonstrated by the success execution on multiple platforms using database software from multiple vendors. Currently our system is in commercial use by an information systems vendor and is also used in the Center for Complimentary and Alternative Medicine at the National Institutes of Health.
In the Human Genome area, we developed parallel computational methods for both retrieving similar sequences as well as aligning multiple sequences from large genetic and protein databases. In sequence searching, we combined the merits of prior differing load balancing approaches (static and dynamic) to develop a static partitioning scheme whose performance favorably compared against the prior state of the art. In terms of multiple sequence alignment, using a modified simulated annealing algorithm, we developed the first scalable, iterative multiple sequence alignment algorithm. This algorithm was based on the sequential Berger-Munson algorithm. Both parallel approaches are now in operation at the National Institutes of Health.
Finally, to sustain high communications reliability as required by large scale information systems, we developed fault tolerant protocols for intranet server systems. These protocols are deployed nationwide by MCI.
Per our proposal, we have designed, developed, and evaluated several information processing systems. We have transferred our technology both to commercial and research organizations. We have also developed a fault tolerant communication architecture to support reliable information retrieval services. This technology has also been adopted and deployed by the commercial sector.
* Human Resources
* Education and curriculum development at all levels.
The PI teaches graduate and undergraduate courses whose content is influenced by the research conducted in this project.
* Industry -- collaborations, transfer of technology, patents.
We developed several information processing systems some of which are presently in daily use at the National Institutes of Health, NCR, and at Harris Corporation. Efficient multicast and fault-tolerant protocols were also designed, verified, and developed. Some of the technology is currently deployed by MCI in, at least, 27 major cities nationwide.
Our systems are described in a variety of publications, two books, and three US issued patents. Three additional US patent applications are currently under evaluation.
Information retrieval is the selection of documents that are potentially relevant to a user's information need. Given the vast volume of, understanding the data requires the use of visual aids and searching the document database demands vast computational resources. As part of our study, we developed both visual data understanding tools as well as related parallel information systems. Our systems were customized for the needs of a diversity of domains including traditional text databases and biological gene sequences.
Richard Furuta and Frank M. Shipman III
Affiliation:
Department of Computer Science
Texas A&M University
Richard Furuta and Frank Shipman
Department of Computer Science
Texas A&M University
College Station, TX 77843-3112
Phone: (409) 845-3839 (Furuta) / (409) 862-3216 (Shipman)
Fax : (409) 847-8578
Email: furuta@cs.tamu.edu and shipman@cs.tamu.edu
URL: http://www.csdl.tamu.edu/~furuta/ and http://www.csdl.tamu.edu/~shipman/
http://www.csdl.tamu.edu/walden/
Award Number: IIS-9812040
Duration: 10/01/1998 - 9/30/2001
Title: Information Discovery and Communication with Meta-Documents over the World-Wide Web
Hypertext/hypermedia, guided paths, computers and education, document structure
This project investigates use of an additional layer of structure, called a meta-document, to organize decentralized network-accessible information, such as that found on the World-Wide Web. By developing meta-documents, documents whose components are themselves documents, discovered information can be placed into context and communicated to others. Consequently meta-documents hold promise of being useful in applications where information, created and maintained by others, needs to be transferred; educational applications at all levels have this characteristic. To make meta-documents effective, the project will study basic issues in information representation, presentation, personalization, structuring, location, and exploration. In addition, practical study will be required of the means for authoring meta-documents and the means for keeping them in synchrony with the rapidly shifting contents of the Web. The resulting concepts have theoretical implications in understanding how to structure and communicate network-based digital information. They also have immediate practical application in producing materials that can be used in teaching and in providing overviews that will help the general public better identify relevant Web-based information on selected topics.
The research program will use the current implementation of the Walden's Paths system, which will provide a testbed for implementing and evaluating new ideas in prototype implementation. We plan to focus on qualitative rather than quantitative evaluation in educational settings by involving local teachers and their classes. We plan to gain access to teachers in two ways. The first is to draw from contacts made by the Cognition and Instructional Technologies Laboratory to build alliances with teachers in the Bryan and College Station Independent School Districts (K-12). The second is by working with instructors in our College of Education to incorporate versions of Walden's Paths into their graduate classes, thereby gaining access to feedback from the already practicing teachers (again K-12) in those classes who have returned to Texas A&M for advanced degrees. Our ultimate goal in evaluation is to gain insights that will permit enhancement of our project's prototypes in two specific areas: first, easing the organization of Internet-based materials by teachers (questions of authoring) and second, structuring the students' browsing of Internet materials in order to remove impediments to initial location and contextualization of information while not discouraging the free-form browsing that leads to developing effective skills for learning through focused exploration.
Our goals for year one are to accomplish the following tasks:
The project includes continuing funding for two graduate students. We expect that over the lifetime of the grant, that we will graduate at least three individuals: two at the Master's level and one at the PhD level.
A focus of the project is organizing Web-based materials, with one of the natural applications being curriculum development. Based on our earlier experience, we expect that the results of the project will find useful application in curriculum development at the K-12 and also at the college undergraduate levels.
In addition to the project's Web pages, referenced above, see the following published articles:
Our investigation of meta-documents, as encapsulated into the Walden's Paths project, draws from a long-standing area of investigation in Hypertext/Hypermedia systems. Vannevar Bush's pioneering article "As We May Think", published in 1945 in in both the Atlantic Monthly and in Life Magazine, introduces the notion of paths for two purposes: as a personal means for remembering and organizing found information, and second, to provide a means to communicate that information to friends and associates. Paths were incorporated into stand-alone hypertext systems beginning around 1988--Randy Trigg and Polle Zellweger describe two implementations carried out on systems at Xerox PARC at that time.
The Web raises novel issues, not faced by these earlier projects. Although the technology underlying the Web is largely foreshadowed by earlier hypertext applications, its use environment is new. Unlike its predecessors, the Web is highly heterogeneous, both in readers but also in information provided. Readers range from highly-educated academicians to elementary school students. Ages range from retirees to pre-schoolers. People make material available on the Web for all conceivable reasons--to communicate, educate, persuade, promote, defraud, and delude, to name just a few of the possibilities.
Many issues are technological (seeking ways to provide stability in an inherent unstable environment) but others relate to the contextualization of the material. This is especially evident in the classroom, where it becomes noticable that the encyclopedic materials available on the Web often must first be placed in broader context before they are accessible to students.
Pointers to the literature may be found as references to the papers identified above. Of continuing interest is Vannevar Bush's essay, also available online at The Atlantic Monthly
Bush, V. "As we may think." The Atlantic Monthly (July 1945), 101-108.
OBIWAN: Ontology Based Informing Web Agent Navigation
Susan Gauch
Electrical Engineering and Computer Science
415 Snow Hall
University of Kansas
Lawrence, KS 66045
Phone: (785) 864-7755
Fax : (785)864-3226
Email: sgauch@ittc.ukans.edu
Home Page: http://www.ittc.ukans.edu/~sgauch
Project Award Information
CAREER/EPSCOR: Cooperative Agents for Conceptual, Search and Browsing of World Wide Web Resources 97-03307, 08/15/97 - 07/31/01.
Keywords
information agents, distributed search, ontologies, world wide web, navigation, visualization
Project Summary
As the World Wide Web (WWW) continues to grow, the tools needed to manage that information need to grow as well. The focus of this project is to study the effectiveness of cooperating, distributed agents on categorization, browsing, searching, and visualization of information on the WWW. The effects are studied on single server (local) and multiserver (regional) levels. In particular, we are developing automatic site categorization techniques with respect to a shared ontology to create conceptual meta-information that characterizes a web site. We are exploring the use of this conceptual meta-information for query brokering, local and regional browsing and local and regional visualization.
Goals, Objectives, and Targeted Activities
Our primary goals are four fold:
Indication of Success
We have been able to characterize the contents of a local site with greater accuracy than naive Bayes, as validated by a controlled experiment. We have developed agents which allow users to browse and search multiple sites simultaneously, through the use of a shared ontology. Finally, we have been publishing papers in technical reports, refereed conferences and journals based on this research.
Project Impact
GPRA Outcome Goals
We have discovered a new, more accurate way to automatically categorize web pages. This should help in organizing information and impact the scientists, businesses and individuals who use information to further their personal and professional goals. In addtion, by providing a system which will organize and present information in a consistent organization, we can support users while they simultaneously browse information located in multiple locations with idiosyncratic local organization.
Project References
Area Background
Initially, information on the World Wide Web was found by random browsing, a time consuming and ineffective method. Now, all-encompassing on-line search engines make finding information much easier. In fact, it has become too easy. It is not unusual to find 4,000 items or more which match a given query. What is needed is not just a search engine that produces better results but rather an organization of the search results based on the concepts contained in the various Web pages. The search process must also become more distributed, removing the incredible demand placed on the handful of popular search sites. In addition, searching is not the only desirable way to access information - users should be able to browse through the Web in an organized manner. Several search engines provide subject hierarchies which can be browsed, but the associated Web pages are manually placed in the categories which limits the amount of information available. In addition, each time the user visits a new node in the ontology, a page must be sent from the centralized server, making browsing a tedious activity. Finally, both searching and browsing allow users to view only a tiny portion of the World Wide Web. They are effective means of getting a detailed, partial picture but do not provide an abstraction or overview of the Web as a whole from a conceptual level: What kind of information is available today? How is the information available on the Web changing? These broader questions are essentially unanswerable today. The main open questions that this area related to this research are the following:
How can ontologies be built and/or modified to reflect local site contents?
How can ontologies be merged to reflect regional contents?
What should the topology of the local and regional agents be?
How does distributing the search process affect the quality of the retrieval sets?
What is the quality of the automatically built browsing structure for the Web?
How can we visualize large portions of the Web?
How can we visualize the adaptation of the Web over time?
To address these issues, we need to learn more about agent communication, information summarization and information presentation. In general, an active area of research is the use of search agents. Some search agents are specifically designed to locate information available from various home pages on the World Wide Web. These agents mask the complexity of the Information Superhighway and filter the exploding amount of information available. ProFusion [Gauch et al, 1996], for example, contains a mediator agent which interacts with information agents representing underlying Web search engines. The mediator selects which information agents/search engines are best for individual queries, and fuses the multiple search results.
Area References
Potential Related Projects
Judith Klavans and Nina Wacholder, Columbia University, Automatic Identification of Significant Topics in Domain Independent Full Text Analysis.
Award and Contact Information
NSF Program Information and Data Management
NSF Award No: 9630765
PI Name Fredric C. Gey
Award Period August 1, 1996 - July 31, 1999
PI Organization: University of California, Berkeley
Department Name: UC Data Archive and Technical Assistance
PI Address: 2538 Channing Way, \#5100, Berkeley, CA 94720-5100
PI Phone: (510) 642-6571
PI FAX: (510) 643-8292
PI e-mail: gey@ucdata.berkeley.edu
Project Web Site: http://ucdata.berkeley.edu/nsfprob.html
Keywords
Information retrieval, cross-language retrieval, probabilistic document retrieval, search engines, digital libraries, Chinese/Japanese retrieval, logistic regressionProject Summary
This project develops and tests probabilistic approaches to text retrieval. In addition to English document collections, the project has been researching methods for cross-language retrieval where queries are expressed in one language searching for documents in multiple languages. The project has matched English queries against English, French, German and Italian document collections. For the past few months, the project has been working on Japanese text retrieval and Japanese to English cross-language retrieval.
Using the statistical technique of logistic regression, documents are ranked in order of estimated probability of relevance with respect to a query. The methods are subjected to rigorous performance experiments with the collections of documents and queries of the TREC (Text Retrieval Conference) series of conferences sponsored by NIST (National Institute of Standards and Technology) and DARPA (Defense Advanced Research Projects Agency). The research contributes to understanding of the mechanisms of foreign language retrieval by applying its methodologies to queries and document collections in Chinese, Japanese, and European languages.
Goals, Objectives, and Targeted Activities
The past year's and current research has been targeted toward:
Project Impact
The project has supported, fully or partially, 5 graduate student researchers in developing and testing information retrieval algorithms. Our principal graduate student researcher, Aitao Chen, completed and filed his dissertation A Comparative Study of Pattern Recognition, Neural Network, and Statistical Regression Approaches to Information Retrieval at the end of fall semester 1998. Two other graduate students from the School of Information Management and Systems have joined the project and been trained in developing and maintaining text retrieval software during the past year.
Two former graduate students who were partially supported for Chinese document retrieval, are now working for the Internet search engine company Excite, where they are developing Excite’s Chinese language search engine, and implementing language identification software for Excite.
Indications of success
As we enter the final year of this grant, progress on this research has begun to show fruition in both results and technology transfer. Several years of experiments reported in the TREC series of conferences have shown that the regression algorithms which were developed for the English language work equally well in other languages—Spanish, German, and Chinese. Our work on cross-language retrieval with English queries matched against English, French, German and Italian documents has shown the centrality of phrase discovery for effective cross-language information retrieval. The development of a tutorial on the fundamental models in information retrieval performs a service by introducing non-IR computer specialists to the basic concepts of this research area; the tutorial has been well-received at several conferences.
One of our studied logistic regression algorithms forms the fundamental ranking algorithm for the Hotbot search engine and for the Berkeley NSF Digital Library project.
GPRA Outcome Goals
This project has supported the PhD activities of five graduate students in information science, all from the Peoples Republic of China, and two of whom are women. The project has provided valuable technology transfer of its fundamental algorithms to digital library and internet search engine implementations. Through two graduate students now working in industry, the project has aided in U.S. industry development of a Chinese language internet search development.
Project references
Area Background and Reference
Information Retrieval algorithms support the computerized search of large document collections (millions of documents) to retrieval small subsets of documents relevant to a user’s information need. Such algorithms are the basis for internet search engines and digital library catalogues. The fundamental models for retrieval are Boolean/logic (including fuzzy logic), geometric/vector space similarity, and probabilistic document retrieval. Application areas include foreign language and cross-language retrieval, text categorization, text summarization, speech and broadcast retrieval. Performance can be subject to unbiased objective testing against test collections of hundreds of queries matched to millions of documents.
Reference: Readings in Information Retrieval K. Sparck-Jones and P. Willet eds., published by Morgan Kaufmann, 1997.
Figure : Japanese Full-Text Retrieval System
Potential Related Project
I am co-principal investigator of the DARPA research contract "Search Support for Unfamiliar Metadata Vocabularies," which attempts to bridge between ordinary natural language search expressions to domain-specific classification indexing languages. More information can be found at the project web site http://www.sims.berkeley.edu/research/metadata.
Arif Ghafoor (PI), Rangasami L. Kashyap (Co-PI), Shankar Moni (Co-PI)
Purdue University, West Lafayette, IN, 47907
Arif Ghafoor
School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, 47907
Phone: (765) 494-0638 Fax: (765) 494-3371 Email: ghafoor@ecn.purdue.edu
Distributed Multimedia Systems Laboratory (URL: http://shay.ecn.purdue.edu/~dmultlab/nsf_proj/)
multimedia data representation, fuzzy queries, spatio-temporal modeling, content-based retrieval, image processing
The goal of this project is to develop a multimedia database system with capabilities to handle heterogeneous media queries. This system caters to the computational and storage requirements while accommodating and exploiting the inevitable semantic and representational imprecisions. The design of this system is based on multilevel data models and search mechanisms. These methodologies facilitate the users for posing various types of queries, including: (i) low-level, such as finding objects in a multimedia database, (ii) mid-level, based on spatio-temporal semantics, such as locating events associated with multimedia data, and (iii) high-level, targeted towards searching pre-composed multimedia documents, based on constituent mono-media data, their spatio-temporal dimensions, and logical structure. The multi-level search mechanisms are tightly interlinked. Imprecision in the search results is modeled using a set of fuzzy parameters. The results of this research are helping develop a comprehensive framework for building wide variety of multimedia applications in commercial, educational, governmental and military sectors.
Our long term goals include the development of a generic multi-level representation of multimedia data that can overcome several challenges faced by the database community. In particular, for low-level data modeling our research will focus on evaluating the unsupervised classification approach for image segmentation using a log-likelihood function. Two image segmentation models, namely the facet model and the texture model, are being evaluated in terms of computation and exactness of match. We are also planning to develop and evaluate other models. For mid-level multimedia data representation, we are planning to develop indexing, Petri-net and neighborhood graph models for semantic modeling and managing fuzziness in query formulation.
So far the stated objectives for this project have been met quite well, as demonstrated by the concrete research results produced to date. Success of this project in terms of quality research results has been aided by the collaboration between two researchers with backgrounds in image processing and multimedia databases. This collaboration has been instrumental in providing a sharp understanding of research challenges astride these areas. As a result of this unique collaboration, several research ideas and publications are being produced. The following research results has been obtained:
The project is expected to provide viable solutions for developing a general framework for developing multimedia databases needed for a broad range of applications. The NSF funding has provided an opportunity for this collaboration which would have been difficult, otherwise.
Two students are currently pursuing their doctoral studies as part of this project. One student has completed his doctoral thesis in this area. The research results of this project have been incorporated in a graduate level course on multimedia systems (EE 624), which is taught by the PI. The project and its planned implementation is a cornerstone of the cutting-edge research being carried out in our lab. We anticipate interest from industrial organizations as the project matures and implementation-worthy results are produced.
Our project is concerned with data management and information retrieval technologies essential for developing future multimedia systems. The emphasis of our research is on developing an automated system to allow multi-level data representation to assist query processing at different levels of abstraction. It will allow users to access images and video data based on appearance of objects as well as events surrounding these objects. The key tradeoff for users is between the accuracy of matching and the computational cost of the query. The framework developed for this project will provide solutions for challenging problems in multimedia data organization and integration, indexing and retrieval mechanisms, intelligent searching techniques, information browsing, content-based query processing and so forth. A large variety of potential applications will benefit from this framework.
There are several journals and conferences which have excellent coverage of issues in this area. They include: IEEE Multimedia; ACM Journal on Multimedia Systems; IEEE Trans. on Knowledge and Data Engineering; ACM Multimedia Conference; IEEE Int. Conf. on Multimedia Computing and Systems. Several special issues from IEEE Computer and ACM Journal on Multimedia Systems have been specifically devoted to this topic. Several industrial projects undertaken by IBM, Siemens, NEC, Oracle, Fuji Electric Co., etc., are focused on this topic.
Within the NSF IDM program, several projects related to multimedia data modeling and management are being conducted in UCLA, Case Western Reserve University, University of Maryland, University of Nevada, University of Illinois at Chicago, University of California at Santa Barbara, University of Pittsburgh, University of Maine, and University of Washington.
PI / Contact Information
Dina Q Goldin, Assistant Professor
Mathematics and Computer Science Department
University of Massachusetts at Boston
Address: 100 Morrissey Blvd., Boston, MA 02125
Phone: (617) 287-6483
Fax : (617) 287-6499
Email: dqg@cs.umb.edu
URL: http://www.cs.umb.edu/~dqg
Project WWW Page
http://www.cs.umb.edu/~dqg/cdb
List of Supported Students
Michael Uchitelev <uchitelm@cs.umb.edu, PhD student (as of 1/17/1999)
Project Award Information
Keywords
constraint databases, constraint query languages, query algebras, spatiotemporal databases, geographic information systems
Project Summary
Constraint databases (CDBs) extend the relational database paradigm to allow working with infinite amounts of data, as long as it is finitely representable (via constraints). We will work on the algebraic layer of CDB systems, i.e., the layer underneath the user interface layer and above the disk access layer, where the query is represented as an operator-based expression. In relational databases, this layer is instrumental in achieving efficient query optimization and evaluation. The aim of the proposed research is to show that Constraint Query Algebras are an extension of relational query algebras that:
Indication of Success
Last semester, we have performed a comparative analysis of commercial systemsARC/View, Arc/Info, and Oracle's Spatial Cartridge, confirming that there are problems with the underlying technology which cannot be fixed without a fundamental change to the underlying data model. In addition, we have analyzed the DEDALE Constraint Query Language, which is based on a complex-object data model. We have concluded that its restrictions on data and operator formats prevent it from providing the general solution for spatiotemporal querying that we believe possible with constraint technology. The resulting technical report is in preparation.
Furthermore, we have written a book chapter on Constraint Query Algebras (see Project References), currently in review.
Project Impact
GPRA Outcome Goals
The data of scientists and engineers is often spatiotemporal, i.e., consisting of points in time and/or in space. Relational database technology, though vastly successful, cannot handle such non-traditional data. The current state-of-the-art spatiotemporal databases do not admit the same extensive data querying capabilities as do relational databases. Furthermore, for the queries that are allowed, optimization facilities are ad-hoc, resulting in sporadic performance problems. By applying Constraint Database principles, these shortcomings may be overcome, allowing scientists and engineers who work with spatiotemporal information timely and flexible access to their data.
- D.Q. Goldin. Constraint Query Algebras. Book chapter, in Constraint Databases, L. Libkin, J. Paredaens, G. Kuper editors, to be published by Springer Verlag.
- D.Q. Goldin, P.C. Kanellakis. Constraint Query Algebras. Constraints Journal , E. Freuder editor, 1st issue, August 1996.
- J. Chomicki, D.Q. Goldin, G. Kuper. Variable Independence and Aggregation Closure. 15th ACM Symposium on the Principles of Database Systems, Montreal Canada, June 1996.
- P.C. Kanellakis, D.Q. Goldin. Constraint Programming and Database Query Languages. Symposium on Theoretical Aspects of Computer Software, LNCS 789, pp. 96-120, Sendai Japan, April 1994
Area Background
With the proliferation of the Internet and the unprecedented accessibility of desktop computing, the users are quickly accumulating non-traditional data, such as for medical, scientific, or geographic applications. More and more, they are starting to expect the same kind of data manipulation facilities for this data as MIS departments have enjoyed for theirs.
Constraint databases (CDBs) extend the relational database paradigm to allow working with infinite amounts of data, as long as it is finitely representable (via constraints). A perfect example is spatiotemporal data, which consists of points in time and/or in space, typical of the applications mentioned above. Besides having constraints for data representation, constraint databases also have them as query language primitives. Therefore, CDBs provide a strictly more expressive query paradigm than relational databases, even for ``administrative'' data.
This project focuses on the internal query representation of CDBs queries, i.e. Constraint Query Algebra (CQA). CQA forms the ``middle layer'' of CDBs, underneath the user interface layer and above the disk access layer, providing a homogenous set of operators for expressing queries over any data. It is expected that the algebraic framework will enable powerful query optimization strategies for all CDB queries, similar to those for the relational model. When coupled with user-friendly front-end query languages, it should provide many advantages over the current generation of spatiotemporal systems.
Area References
- P.C. Kanellakis, G.M. Kuper, and P.Z. Revesz. Constraint Query Languages. Journal of Computer and System Sciences, 51(1):26-52, August 1995.
- M. F. Worboys. GIS: a Computing Perspective. Taylor & Francis, 1995.
Potential Related Projects
If there are researchers that regularly make use of a GIS or another spatiotemporal database in their work, and are interested in trying to model their data and their queries with Constraint Database technology, there is potential for very productive collaboration.
Clement Yu and Ardeshir Goshtasby
Contact Information
Clement Yu
Dept. of EECS, University of Illinois at Chicago
Chicago, IL 60607-7053
Phone: (312) 996-23 18
Fax: (312) 413-0024
Email: yu@eecs.uic.edu
Ardeshir Goshtasby
Dept. of CSE, Wright State University
Dayton, OH 45435
Phone: (937) 775-5170
Fax: (937) 775-5133
Email: agoshtas@eecs.uic.edu
WWW Pages
http://www.cs.wright.edu/~agoshtas/nsf.htznI
http://www.cs.wright.edu/~agoshtas/faceseg.html
Project Award Information
Award Number: IRI-9509253 (UIC), HU-9529045 (WSU)
Duration: October 1995 September 1998 (UIC), October 1995 - September 1999 (WSU)
Title: A Picture-Retrieval System Based on Contents
Keywords
Retrieval by content, picture retrieval, electronic dictionary,
semantic model, lesion diagnosis, image segmentation
Project Summary
Pictorial systems which retrieve pictures based on contents have
wide applications. For example, learning can be facilitated by retrieving
pictures from electronic encyclopedia; radiologists can be assisted in
diagnosis; satellite images of desired contents can be retrieved by earth
scientists to perform research; appropriate garments can be viewed by
customers before deciding where they go shopping, etc. The purpose of this
project is to construct a generic picture analysis and retrieval system
which can be easily adapted to various applications, including the
construction of digital libraries and their use in retrieval of i
nformation. Similarity based retrieval involving spatial relationships, and
fuzzy and precise matching will be performed. Other techniques that will
play a significant role are: multi-dimensional searching, parameterized
similarity functions, visual query tools, image analysis techniques such as
segmentation, object recognition, color and shape analysis. The usefulness
of the system is being demonstrated for the following applications:
retrieval of color photographs using information from human faces and
retrieval of skin cancer images for diagnosis purposes at Rush-Presbyterian
Medical Center in Chicago.
Goals, Objectives, and Targeted Activities
Our project consists of two parts: a photograph-retrieval system and
a skin cancer detection system.
1. Photograph-retrieval system:
In this project the contents of a picture are modeled by objects,
their attributes values and the relationships between the objects. For
example, in a picture there may be an object 'man' and an object 'woman'.
Attribute values describing the man can be 'young' and 'short', while those
for the woman can be 'middle-aged' and 'tall'. The relationships between
objects can be spatial or of an 'action' type. For example, the man can be
to the left of the woman. This is a spatial relationship. The man can be
shaking hands with the woman, which is an action relationship. Our research
has been concentrating on:
2. Skin-cancer detection system
We make use of two types of information in diagnosing a skin lesion.
They are (a) image features such as the diameter and color variations in a
lesion and (b) the patient history. Our work in this project consists of
Indication of Success
Project Impact
Human Resources
At UIC, Alp Aslandogan and King Liu are continuing as Ph.D.
students; Danny Tran obtained the M.Sc. and is working in a company in
Chicago; Xiangjun Liu obtained the MASON. and is continuing his Ph.D. in
biological science; Adbul Manasrah obtained the MASON. degree and is
working at UIC, and Sean Lynch is continuing his undergraduate studies at
UIC. At WSU, Marcel Jackowski obtained his MASON. and is continuing his
studies as a Ph.D. student; Lang Xu and Linda Cai obtained their MASON.
degrees from WSU and are working at Lucent Technologies in Columbus, Ohio;
Greg Underwood graduated and entered private business; and Fred Smith
graduated and after a period of temporary work is planning to enter
graduate school.
Education and curriculum development at all levels: Image and video databases are taught in a graduate database course. Students who work in this project are trained in both computer vision and databases.
Industry - collaborations, transfer of technology, patents: We are collaborating with Steven Bines, M.D. at Rush Medical Center and SidneyMiller, M.D. at Miami-Valley Hospital on the skin-cancer project. Images of lesi ons obtained at Rush have been transferred electronically to UIC and WSU for analysis.
GPRA Outcome Goals
We as humans understand contents of images and remember similar
scenes in our memory effortlessly. Giving the same capability to computers,
however, requires development of breakthrough methods both in vision and in
database. Effective content-based retrieval of images is achieved only when
vision and database areas join forces to solve many complex problems. In
this project we have developed new methodologies for quantifying different
properties of skin lesions to characterize them. This characterization has
enabled effective indexing and retrieval of skin cancer images. In a
different front, we are developing new methods for locating human faces in
color images and determining topographic relations of humans in the images.
These methods will enable indexing and storage of color images containing
humans, and later on provide efficient means to retrieve the images based
on their contents. The developed vision and database methodologies will
enable more efficient storage and retrieval of image data that will have
implications in many areas such as medicine, remote sensing, industry,
entertainment, and education.
Project References
Area Background
The objectives of this research are: 1. To retrieve image data
effectively and efficiently; 2. To analyze image data automatically and
extract key features which allow image data to be indexed; 3. To
classify image data, for example, for diagnosis of skin lesions.
Area References
Potential Related Projects
A problem of potential interest is in data mining of images where a
huge database of images is given and one is asked to determine structure
among the images and identify images that possess certain properties.
Another problem of potential interest is to characterize events involving
humans in videos by first locating human faces and different parts of the
body and then tracking them in the spatiotemporal domain.
Luis Gravano
Computer Science Department
Columbia University
|
Luis Gravano |
|
Eugene Agichtein, first-year Ph.D. student (http://www.cs.columbia.edu/~eugene)
Keywords
Internet databases, distributed information search and retrieval
Project Summary
The goal of this research project is to help users find the information that they need over the Internet. Unfortunately, Internet information sources vary widely in the types of information and access interfaces they provide. Furthermore, the number of available sources is overwhelming to users. Therefore, using this wealth of resources effectively presents challenging problems. To address these problems, the research goals of the project include: (1) smart query processing over HTML documents, using information like query logs, citations, and user feedback to find the most useful documents for a query; (2) extraction of succinct summaries of information sources with varying capabilities and interfaces, so that the evaluation of user queries can focus on only the most useful information sources; (3) integration of query results from multiple heterogeneous information sources, so that users can query multiple sources and still receive a single, coherent query result; (4) integration of text, relational, and image repositories, so that users can use all available information regardless of its format. The educational objectives include the expansion of the curriculum in databases and information systems at Columbia University. In particular, a new course covers the latest trends in database and information systems. In summary, this project will develop the tools to provide users with seamless and transparent access to the large number and variety of Internet sources. In effect, users will be able to express their information need in simple ways, and receive the relevant data with no further involvement in the processing of their queries.
Goals, Objectives, and Targeted Activities
This year we will continue exploring novel techniques for smart query processing over HTML documents. In particular, we will identify ``non-traditional'' properties of documents to be considered for answering queries (e.g., intended audience and geographical relevance of the documents). We will then define efficient, scalable algorithms for extracting such properties of WWW documents mainly from HTML links and other properties of the documents and the web sites where they originated. To evaluate our results, we will define metrics for our document ranking algorithms by adapting the information-retrieval notion of relevance, and the precision and recall metrics. We will build a WWW search engine that answers queries using these non-traditional document properties, and gathers user feedback (preferably in an automatic way) to further evaluate our ideas.
Another focus of next year's research will be the problem of source discovery over uncooperative, searchable text databases. Such databases do not export any kind of summary of their contents, so we need to resort to other mechanisms to characterize their holdings. We are currently exploring techniques that automatically construct a relatively small set of queries (i.e., not more than a few thousand queries) to issue to these uncooperative databases to summarize their contents. We will then use these summaries to direct user queries just like the GlOSS system does for cooperative text sources.
Indication of Success and Project Impact
Although this project started only five months ago, we have already made progress along a number of the proposed directions:
Project References
Area Background
The Internet has grown dramatically over the past few years. Information sources are available everywhere, both within the internal networks of organizations and on the Internet. Unfortunately, these information sources vary widely in the types of information and access interfaces that they provide. Therefore, using this wealth of resources effectively presents interesting and challenging problems. In effect, users have information needs, and should not be concerned with the format of the available data, or with the interface and access capabilities of the data sources. Thus, users should receive the information that they need even if finding this information requires accessing sources of text documents, relational databases, and image repositories, for example. This project focuses primarily on how to help users find and exploit the information that they need.
Increasingly, users want to issue complex queries across Internet sources to obtain the data they require. Because of the size of the Internet, it is not possible anymore to process such queries in naive ways, e.g., by accessing all the available sources. Thus, we must process queries in a way that scales with the number of sources. Also, sources vary in the type of information objects they contain and in the interface they present to their users. Some sources contain text documents and support simple query models where a query is just a list of keywords. Other sources contain more structured data and provide query interfaces in the style of relational model interfaces. User queries might require accessing sources supporting radically different interfaces and query models. Thus, we must process queries in a way that deals with heterogeneous sources.
A key goal of this project is to provide users with seamless and transparent access to the large number and variety of Internet sources. Users should express their information need, and receive the relevant data with no further involvement in the processing of their queries. To make this scenario possible, the database and information retrieval communities, and the digital library projects around the nation have addressed several pieces of the associated query processing puzzle. However, many other pieces remain tantalizingly open. We are addressing these open issues by building on work from several fields, most notably from the database, information retrieval, and natural language processing fields. We are combining these technologies in fruitful ways to address the problem of processing sophisticated queries over the Internet.
Area References
Principal Investigator: Robert Grossman
Affiliation: University of Illinois at Chicago
Department Name: Dept. of Mathematics, Statistics & Computer Science
Robert Grossman
University of Illinois at Chicago
322 SEO, MC 249
851 S. Morgan Street
Chicago, IL 60607-7045
Phone: (312) 413-2164
Fax : (312) 355-0373
Email: grossman@uic.edu; http://www.lac.uic.edu/~grossman/hp/rlg.html
http://www.lac.uic.edu/m3d2-98.htm
Data mining, managing large data sets, scalability, workshop
The goal of the workshop was to foster interactions among an invited group of computer scientists who are interested in developing scalable systems for managing and mining massive data sets. Data mining is the automatic extraction and discovery of patterns, associations, changes, anomalies, and significant structures in large data sets. Large data sets generated by scientific, engineering, medical and business applications are becoming increasingly common. Developing algorithms and systems to uncover patterns in large data sets is a fundamental computer science challenge.
The goals of the workshop were to: identify promising approaches and technologies for scaling data mining to massive data sets; investigate how data mining can be more tightly integrated with databases, digital libraries, data warehouses, and data archives; identify how data mining services can be integrated into the emerging national computational infrastructure; and to begin the process of developing community data sets, benchmarks, interchange formats, and testbeds appropriate for mining massive data sets. The workshop consisted of survey talks, background talks and technical talks on different aspects of data mining as well as several application talks describing data mining case studies. In addition, there were also white papers presented describing fundamental challenges and issues, proposed research agendas, testbeds, emerging standards, and related matters. Approximately twenty five participants attended the workshop.
This workshop brought together 25 researchers to discuss the current state of data mining. A report of their findings entitled "Data Mining Research: Opportunities and Challenges", has been publicly issued at: http://www.ncdm.uic.edu/dmr-v8-4-52.htm.
This workshop helped to bring together 25 researchers to identify the major research challenges in the field of data mining. The research challenges are arranged into five broad areas: A) improving the scalability of data mining algorithms, B) mining non-vector data, C) mining distributed data, D) improving the ease of use of data mining systems and environments, and E) privacy and security issues for data mining.
The report http://www.ncdm.uic.edu/dmr-v8-4-52.htm made a number of recommendations on fruitful areas to focus future data mining research. The report also describes several success stories in data mining. See report for details.
Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data. That is, data mining attempts to extract knowledge from data.
Data mining differs from traditional statistics in several ways: formal statistical inference is assumption driven in the sense that a hypothesis is formed and validated against the data. Data mining in contrast is discovery driven in the sense that patterns and hypothesis are automatically extracted from data. Said another way, data mining is data driven, while statistics is human driven. The branch of statistics that data mining resembles most is exploratory data analysis, although this field, like most of the rest of statistics, has been focused on data sets far smaller than most that are the target of data mining researchers.
Data mining also differs from traditional statistics in that sometimes the goal is to extract qualitative models which can easily be translated into logical rules or visual representations; in this sense data mining is human centered and is sometimes coupled with human-computer interfaces research.
Data mining is a step in the data mining process, which is an interactive, semi-automated process which begins with raw data. Results of the data mining process may be insights, rules, or predictive models.
The field of data mining draws upon several roots, including statistics, machine learning, databases, and high performance computing.
Any project related to the mining of massive, distributed data sets would be relevant to our work, in general, and would benefit from the findings of the workshop.
Joseph M. Hellerstein
EECS Computer Science Division
UC Berkeley
Joseph M. Hellerstein
EECS Computer Science Division
387 Soda Hall #1776
Berkeley, CA 94720-1776
Phone: (510) 643-4011
Fax: (510) 642-5615
Email: jmh@cs.berkeley.edu
Indexability, Generalized Search Tree, GiST, Index, Abstract Data Type, Object, amdb, performance analysis.
This project addresses two main challenges. The first is to produce a software development environment for easily generating indexing techniques for new complex data, e.g., maps, images, sound, video, sequences, documents, etc., with large database workloads. This is being done by (1) implementing a data structure called a Generalized Search Tree (GiST) in the context of an Object-Relational database management system, and (2) developing a graphical GiST visualization and debugging system. The second challenge is to undertake a rigorous mathematical investigation of how and when one can develop efficient indexes; this is referred to as a Theory of Indexability. Indexability theory is akin to complexity theory, which has guided software developers in understanding how and when one can develop efficient algorithms. The main distinction is that indexability focuses on space/time tradeoffs for indexes, and on the fundamental secondary-storage challenges involved in database indexing. Execution of this research agenda will produce a framework in which data domain experts (e.g., biologists, chemists, earth scientists, computer multimedia researchers, etc.) can (1) identify whether their workloads are indexable, and (2) if so, easily develop efficient indexes for their workloads, integrating these indexes with minimal effort into Object-Relational database management systems. An additional effort is to integrate the development environment into a curriculum for teaching both basic and advanced techniques in indexing and indexability. This curriculum includes the software itself, along with assignments being tested in undergraduate and graduate courses at Berkeley taught by the principal investigator.
In the past year we have integrated our GiST library with the SHORE object repository, and developed an Access Method Debugger called amdb that provides graphical performance analysis and debugging of search trees. Part of the development of amdb involves analysis of the efficacy of a search tree; this in essence uses our indexability framework to compare the clustering in the existing tree with an optimal clustering (which may in turn perform poorly due to the workload's inherent indexability.) We use the HMETIS clustering approximation algorithm to use as an indicator of the indexability of the workload.
Now that an initial version of amdb is stable, we are using it in the curriculum of our graduate database course, to motivate the difficulty of indexing. As a first experiment, we are assigning students the task of using the GiST library and amdb to develop a good index for set-subset queries over set-valued data. Students will be given 2 weeks to work in teams of 2, and there will be a competition to produce the most efficient solution.
We have begun to use amdb to help apply GiST technology to a variety of applications. Our first application is to integration GiST into the image analysis work ("BlobWorld") being done in the UC Berkeley digital library project. Our second target application, which will begin this year, is to integrate GiST as an accelerator for genomics applications like BLAST.
This project has begun to have both academic and industrual impact. A demonstration of amdb was presented at ACM SIGMOD '98. Our initial work on indexability (Hellerstein, Papadimitriou and Koutsoupias, PODS '97) is driving other research groups: new independent results by Koutsoupias and Taylor appeared in ACM PODS '98, and indexability work by Ravi Kanth appeared in ICDT '99. One of the major database vendors has begun the integration of GiST technology into their commercial system, and they are interested in the use of amdb as well.
In the longer term, we believe that indexability analysis in general, and amdb in particular, will provide a significant benefit to the database research community. Over the last number of years there has been an enormous volume of publications proposing new indexes. Most proposals differ in only minor ways, and can easily be implemented in the GiST. Indexability analysis gives researchers and developers tools to robustly evaluate both the difficulty of their task and the efficacy of their solution. We envision a significant clarification of the field as a result of our work, and we also look forward to this work driving the development of effective indexing schemes for currently unsupported workloads. This work has begun in-house at Berkeley, and we are in the process of submitting our results for publication.
The amdb and indexability project has built bridges at Berkeley between the database group and the theory and digital libraries groups. Active collaboration is underway as a result.
Index structures are one of the core components of any information system -- they allow particular data items to be retrieved directly from large datasets, without scanning through all the data. The traditional index structure in database systems is the B+-tree, which works well for range and equality lookups on alphanumeric data. However, much of the data in use today is far more complex than traditional alphanumeric data: witness the "information explosion" on the world-wide web, which includes unusual data such as maps, images, sound, video, sequences (of DNA, stock prices, etc.), documents and so on. For most of this data, the traditional B+-tree approach is not sufficient for queries that are natural to the data types (e.g. "find all videos with explosions", or "find all drum solos".) As a result, there is usually support for only the most naive queries on such data, such as queries based on the textual names of the data objects.
We have developed a framework for solving this problem, by designing and implementing a data structure called a Generalized Search Tree (GiST). The GiST generalizes a large subset of the previously invented indexes into one manageable piece of software, by reducing search trees to their essential properties: a hierarchical decomposition of a dataset, with labels at each level of the hierarchy that describe the data below. Within this basic framework, it is possible to index any set of data, and support arbitrary queries over the data.
The GiST's flexibility provides great power, but it is currently difficult to know how to harness that power for indexing in new applications. In general, data must be carefully placed into the index, and the index nodes carefully labeled, in order to guarantee efficient processing of queries. Moreover, a set of queries may generate conflicting demands on the index, resulting in situations where some queries must suffer if others are to be satisfied quickly. While it is currently very easy to use the GiST software, it is often quite hard to make it work efficiently, and sometimes unclear if it can be made to work efficiently for a particular set of data and queries.
Our work seeks to provide tools -- both actual software tools as well as theoretical "tools" -- to help index developers control and tune indexes so that they are used effectively. The work will result in an index development environment, along with crisp guidelines to describe the possibilities and limitations of the environment. Both the software and the theory are critical: the software enables developers to achieve what is possible, and the theory warns them what is not possible. Both classes of results are missing in indexing research today, and are key to developing quick access for the new workloads emerging in multimedia, complex data analysis, and the world-wide web. In addition to developing these software tools, we are keen to prove their utility by applying them to the challenge of inventing new indexing techniques for emerging applications.
Joseph M. Hellerstein
Michael Stonebraker
EECS Computer Science Division
UC Berkeley
Joseph M. Hellerstein
EECS Computer Science Division
387 Soda Hall #1776
Berkeley, CA 94720-1776
Phone: (510) 643-4011
Fax: (510) 642-5615
Email: jmh@cs.berkeley.edu
Michael Stonebraker
EECS Computer Science Division
387 Soda Hall #1776
Berkeley, CA 94720-1776
Phone: (510) 642-5799
Fax: (510) 642-5615
Email: mike@cs.berkeley.edu
http://control.cs.berkeley.edu
Control, online, aggregation, interaction, data analysis, data mining, database, query processing, GUI, widgets, data visualization.
Despite significant advances in data processing over the years, many data-intensive applications still run in batch mode: they require a long period of time to execute, and during that time they provide neither feedback nor control to the user. This state of affairs holds in a number of important domains, from the desktop (e.g. spreadsheets) to the back office (e.g. decision support) to new high-end applications (e.g. data mining and visualization tools). In each of these domains, batch processing is frustrating or even
unacceptable for serious users. In order to investigate large data sets, users require on-line behavior: they need to get meaningful feedback while the system runs, and they need to be able to act on the feedback by controlling the system while it runs.
With the support of this grant we are developing a set of core technologies for building applications that provide CONTROL behavior, and deploying those technologies in a variety of applications including database query processing, data mining, user-interface widgets and data visualization.
In the parlance of today's data processing systems, the term on-line signifies ``interactive'', ``within the bounds of patience.'' On-line processing is the opposite of batch processing. In the early days of computing, all serious work was done in batch mode --- the COBOL or FORTRAN programmer submitted her ``job'' to the machine operator and busied herself with other tasks; the expectation was that
the computer would be occupied for some time generating a solution. By contrast, newer on-line systems would return an answer while the programmer remained connected to the system. Because early systems
typically required users to wait for the completion of one job before starting another, on-line processing had to have interactive performance.
For today's programmers batch processing as described above is supposed to be a distant memory at worst. Yet there are still a variety of scenarios where computers essentially run in batch mode, taxing the
patience of both programmers and naive users. Most of these scenarios arise when processing significant amounts of data. In database systems this happens in large-scale decision support and data mining applications, which can run for many hours without producing output. In simpler desktop programs like spreadsheets this problem also arises, particularly when a user's data set is larger than the
traditional sizes anticipated by application designers. New advanced applications like data visualization face this problem in a direct way, since they need to provide the ``big picture'' of a large data set in an interactive, browsable fashion.
The objective of the work supported by this grant is to address this problem directly by developing what we call CONTROL: Continuous Output and Navigation Technology with Refinement On-Line. The idea of
CONTROL rests on three basic principles:
Prior to this award, our initial work on the idea of Online Aggregation appeared in SIGMOD '97 and related publications. Demonstrations of early prototypes for online aggregation and online data visualization were shown at SIGMOD '98.
Since receiving funding, our work on an online join algorithm called the Ripple Join has been accepted for publicatioin in SIGMOD '99. Also, related work from our group on a Continuous Association Rule Mining Algorithm (CARMA) was also accepted for publication in SIGMOD '99.
We have successfully implemented prototypes of our online query processing ideas in the Postgres system at Berkeley, and in the Informix Dynamic Server with Universal Data Option. A paper on the implementation issues of the latter has been submitted for publication. Two major database vendors are currently considering the incorporation of online query processing technologies into upcoming releases of their systems, and we have been active in facilitating the technology transfer, running demos and giving talks.
We have begun the implementation of a toolkit of Java widgets that will interact naturally with massive data sets. We plan to use this toolkit to build a spreadsheet interface that can scale up to arbitrary data volumes; plans for this work have already generated significant interest from major desktop software vendors.
Tomasz Imielinski
Haym Hirsh
Department of Computer Science
Rutgers, the State University of New Jersey
110 Frelinghuysen Road
Piscataway, NJ 08854-8019
Tomasz Imielinski
Phone: (732) 445-3546
>Fax : (732) 445-0537
Email: imielins@cs.rutgers.edu
Haym Hirsh
Phone: (732) 445-4176
Fax : (732) 445-0537
Email: hirsh@cs.rutgers.edu
http://www.cs.rutgers.edu/datamining/
Data Mining, Rule Learning, Text Mining
We have developed the concept of a query-based approach to data mining. We introduced a concept of second generation data mining which involves not only rule generation but also rule rule management and rule postprocessing. To support this paradigm, we developed a declarative query language - MSQL, and a C++ API to provide the user with flexible yet powerful means to post-process the large number of rules generated. A prototype second generation data-mining system called Discovery Board was also developed as part of this research. We have also applied the idea of query-based data mining beyond traditional databases, to unstructured textual data. We developed the KDT text mining system, which focuses on the identification and browsing of subcollections that differ in various fashions from other parts of the collection. We have also developed the FACT text mining system, which is able to exploit background knowledge of the domain as it discovers patterns across articles in a collection.
One of our goals in this research was to develop efficient algorithms for query based rule-discovery. We successfully demonstrated this approach by implementing various algorithms for rule discovery, providing users with tools to query, store, and manage the rules generated from data. We have also developed query-based mining tools for collections of unstructured text.
Although at the time of the workshop this project will have just ended, we are continuing with work in this area. For example, we are trying to incorporate other forms of discovery objects (trees, regression or discrimination functions, etc.) into the mining tool's API to allow for cross referencing knowledge generated via different learning techniques. This poses a new challenge since the type of information which must be catalogued for the various learning algorithm differ greatly in form and semantics. We also want to provide users the ability to mine based on earlier mining sessions. This will allow analysts to increase their productivity, and to cross-reference their efforts. A side benefit to this is that answers to new queries can often be derived as a subset of answers from past queries (even if posed by other users), saving time and processing power. In a similar vein, we are also developing heuristics for efficiently "re-mining" the same, or similar data in the face of updates and modifications. The heuristics we are developing involve enabling the system to "learn" about the correlation distribution from the previous mining on the data, and use this information stored persistently to greatly improve the processing time on future iterations of rule-generation on the modified data. For text mining we are also trying to broaden the range of information about the documents to which mining is applied. For collections of technical papers we would like to mine citation patterns, as well as their relationship to the presence of keywords. We also plan to develop keyword-recognition methods tailored to text mining, find keywords that prove useful for mining.
We have developed the concept of second generation data mining, and built a proof-of-concept system -- "Discovery Board" which embodies our approach. Second generation data mining involves not just massive rule-generation from databases, but also querying, storage and management of these rules, and providing primitives for embedding them into applications as first class objects.
We have developed a number of algorithms for efficient rule-generation from large databases, and have performed substantial performance studies to identify the key issues that affects performance and usability of such programs. We have built a complex performance testbed to test different rule induction engines. In the testbed one can generate database instances which reflect predefined statistical data patterns. In this way we can study the impact of data distribution on different rule generation methods.
In addition, we have also formally defined MSQL, the rule querying and rule generation language as an extension to the OQL specification, proposed in the ODMG 2.0 standard. The system is now fully object oriented allowing to apply rule induction engine to any class and any set of methods, both user-defined as well as materialized.
We have also introduced the notions of metamethods (method constructors) which allow the engine to be applied to complex data such as temporal sequences and multi-dimensional data.
We have developed tools for data mining on unstructured text that identify those subcollections that differ from other subcollections of a full collection. We have also developed data mining tools for unstructured text that can exploit background knowledge in the mining process.
The project supported two graduate students during their graduate studies. Several papers and demos at various data-mining conferences were presented in the last three years. A PhD dissertation titled: "Second Generation Data Mining: Concepts and Implementations" was accomplished by a graduate student directly funded from this grant.
As part of the collaboration with Helen Berman from the Chemistry Department on her NSF sponsored Nucleid acid database project, we helped port the system to SGI platform. Preliminary studies were done confirming the hypotheses of researchers about the molecular structure of DNA. We are continually adapting our system to make it more useful to highly domain-specific applications such as these.
The Discovery Board system was also extensively beta-tested at major insurance companies, resulting in a feedback loop, which provided us with real life experiences of people actively mining-using this system.
In one of the trials, a major insurance company that was trying to explore anomalies in their medical claims database. Discovery Board helped them quickly isolate pockets of high cost claims, and typical scenarios within each disease group that would lead to a high cost claim. This allowed the company to take preventive steps in several cases, by either providing pro-active customer care, or more required tests in different age group patients. We also worked with another group within the same insurance company, mining on the customer and policy database. The system helped them identify characteristics of people which are likely to drop out of their health plan. It also helped spot a couple of counties in NJ where the policy dropout rates were much above the norm.
Our industrial partners were very pleased with the results and we are now exploring a colloboration with another insurance company to perform data mining on their web-server log files.
Many government and commercial organizations possess extremely large databases, with sizes often measured in terabytes, containing such information as consumer data, astronomical observations, biological sequences, financial records, census and tax data, etc. The extraction of information from such large databases has become known as database mining or knowledge discovery in databases, and is an area where machine learning techniques must meet performance requirements of very large database systems.
One particular database-mining task, the problem of rule discovery, involves finding sets of conditions on fields of a record that allow one to predict the value of other fields of a record. Rule discovery is viewed as an interactive process with a "human in the loop," an iterative process where the user is not only trying to discover interesting results, but also interesting questions to ask.
T.Imielinski and H.Manilla "Database mining - new frontier", Communications of the ACM, Nov 1996
Lawrence B. Holder and Diane J. Cook, Principal Investigators
Department of Computer Science and Engineering
University of Texas at Arlington
Lawrence B. Holder and Diane J. Cook
Department of Computer Science and Engineering
University of Texas at Arlington
Box 19015, Arlington, TX 76019-0015
Phone: (817) 272-2596 (Holder), (817) 272-3606 (Cook)
Fax: (817) 272-3784
Email: {holder,cook}@cse.uta.edu
URL: http://www-cse.uta.edu/~holder and http://www-cse.uta.edu/~cook
http://cygnus.uta.edu/subdue
Data mining, knowledge discovery, minimum description length principle, inexact graph match, parallel algorithms, scalability.
The main objective of this project is to improve the scalability and effectiveness of knowledge discovery systems in order to handle large, structural databases. First, we are integrating the structural discovery system SUBDUE with one or more attribute-value based discovery systems. Second, we are developing parallel and distributed versions of SUBDUE and the integrated discovery system. Third, we are applying the integrated discovery system to large scientific databases, such as the Brookhaven protein database, and evaluating the results using domain experts. Fourth, we are disseminating source code for serial, parallel and distributed versions of SUBDUE and the integrated discovery system, and publishing the results of this research.
The main goals of the project continue to be the integration of attribute-value based discovery methods into our SUBDUE system, enhancement of the capabilities and efficiency of the system, application of the system to real-world structural databases, and dissemination of the system and results. We are currently concentrating on Inductive Logic Programming (ILP) systems for methods to integrate into SUBDUE, adding concept learning capabilities. We are automating the process of partitioning large databases for more efficient processing in distributed environments. We are applying SUBDUE to several domains, including the Brookhaven Protein Databank, Human Genome Project DNA data, Predictive Toxicology cancer data, U.S. Geological Survey earthquake data and Aircraft Safety Reporting System data. We are collaborating with domain experts to evaluate our results, which will be submitted to relevant scientific conferences and journals.
With the development of SUBDUE fairly stable, we plan to focus primarily on evaluation and dissemination in the coming year. We will evaluate the sensitivity of SUBDUE to it various parameters, compare the quality of results of distributed SUBDUE to the serial version, and make minor refinements based on these results and the needs of specific domains. The version of SUBDUE allowing the inclusion of a negative graph has proven particularly interesting, and we will evaluate it further. Since this version represents a different approach to relational learning, we will compare it to inductive logic programming (ILP) systems. We will continue to apply SUBDUE to real databases looking for interesting and novel patterns. We plan to publish several of these discoveries in the domain-specific literature as a final measure of success for the usefulness of the SUBDUE knowledge discovery system.
Several advances have been made since the last reporting period. First, a major new release of SUBDUE (version 4.1) is now available from our web page (http://cygnus.uta.edu/subdue/). The major enhancements of this new version include substantially faster processing, the use of background knowledge substructures to bias the search, more intelligent handling of numeric graph labels, and support for distributed discovery using the Parallel Virtual Machine (PVM) message-passing library and the Metis graph-partitioning system from University of Minnesota.
In our continued effort to integrate ideas from attribute-value based systems, we have developed a concept-learning version of SUBDUE that can accept a ``negative'' graph. This allows SUBDUE to further bias the search by preferring substructures that not only maximize the compression (minimize the description length) of the positive input graph, but also minimize the compression of the negative graph. This version of SUBDUE has already proven useful in the chemical toxicity domain, where we can input negative graphs (chemicals not causing cancer) as well as positive graphs (chemicals causing cancer).
SUBDUE has been successfully applied to a number of real databases. Shaobing Su (Masters, 1998) applied SUBDUE to the detection of patterns in categories of proteins in the Brookhaven Protein Data Bank (PDB). Ravindra Chittimoori (Masters, 1999 expected) is applying SUBDUE to the Predictive Toxicology Challenge data to identify patterns predictive of cancer-causing chemicals. Jesus Gonzalez (Masters, 1999 expected) is applying SUBDUE to two other real databases: Aviation Safety Reporting System (ASRS) and U. S. Geological Survey Earthquake data. Ron Maglothin (Masters, 1999 expected) is applying SUBDUE to data from the Human Genome Project to detect patterns predictive of the start and end of genes in the human DNA. Results in all these domains have been favorably evaluated by domain experts and have been published or submitted for publication.
This project supports several graduate students and their research results have been included in current course offerings. Masters students Ron Maglothin, Jesus Gonzalez and Ravindra Chittimoori are directly funded under this project. All three plan to receive their degree in May 1999. Two of the students, Mr. Maglothin and Mr. Gonzalez, will continue for their Ph.D. degree. Previous masters students funded through this project (Nataliya Bocharova, Shaobing Su, Gehad Galal, Ramakrishna Kintada and Prasad Parthasarthy) are working in industry.
Results from the parallel and distributed implementations of SUBDUE were integrated into Dr. Cook's advanced Parallel AI Algorithms course that she designed and taught in the spring of 1997. Results from this research have also been included in Dr. Holder's Machine Learning course taught in Fall 1998.
Results from this project were influential in receiving two other awards. Dr. Cook with other departmental faculty PIs received NSF funds to purchase a distributed network of Pentium PCs and DEC Alphas to further investigate high-speed data mining algorithms. In addition, Drs. Cook and Holder received a grant from the Texas board of research to continue this project and apply the results to industrial databases.
We are developing a state-of-the-art computer system to extract knowledge from data in a variety of domains. In addition to enhancing our understanding of automated data analysis, this work has buily connections to other areas of science. Our results in the areas of chemistry and molecular biology have reveal interesting knowledge in these areas that are directly applicable to society. For example, discovering patterns cancer-causing chemicals will help scientists identify the carcinogenicity of new chemicals and understand the cause. Discoveries in protein and DNA data will help out ability to combat disease. The collaborations we have made between computer and other sciences have been beneficial to both in terms of accelerating the discovery of new knowledge. These collaborations have taken place among several researchers from diverse backgrounds and geographic locations (U.S., Mexico and Europe).
With the increasing amount and complexity of today's data, there is an urgent need to accelerate discovery of knowledge in large databases. In response to this need, numerous approaches have been developed for discovering concepts in databases using a linear, attribute-value representation. These approaches address issues of data relevance, missing data, noise, and utilization of domain knowledge. However, much of the data that is collected is structural in nature, or is composed of parts and relations between the parts. Hence, there exists a need to develop scalable tools to analyze and discover concepts in structural databases. Many reported discovery tools are also computationally expensive and cannot scale easily to large databases, especially those containing structural information. Recently, we introduced a method for discovering substructures in structural databases using the minimum description length (MDL) principle. The system is called SUBDUE, and it discovers substructures that compress the original data and represent structural concepts in the data.
Three potential IDM-related projects are the incorporation into existing databases of the ability to store and access discovered knowledge, the reorganization and optimization of databases to support data mining, and the reorganization of specific databases (e.g., NIH databases) based on learned knowledge.
Sushil Jajodia, X. Sean Wang
Department of Information and Software Engineering
George Mason University
Sushil Jajodia or X. Sean Wang
MSN 4A4, 4400 University Drive
Fairfax, VA 22030
Phone: (703) 993-1653 or (703) 993-1662
Fax: (703) 993-1638
jajodia@gmu.edu (http://www.ise.gmu.edu/~csis/faculty/jajodia.html)
xywang@gmu.edu (http://www.ise.gmu.edu/~xywang)
http://www.isse.gmu.edu/~csis/tdb
Temporal Databases, Time Granularities, Semantics, Query Evaluation, Data Mining, Interoperability.
The objective of this project is to provide a flexible framework for supporting multiple time granularities. The targeted use of the framework is for automatic evaluation of user queries and for discovering temporal patterns (i.e., data mining) in an environment where either the user queries or temporal patterns involve granularities that do not match the granularity of the stored data. The basic idea is to add the necessary functionalities so that the database system is able to understand and reason about information involving multiple time granularities. Algorithms for efficiently evaluating user queries and discovering temporal patterns will also be investigated, and an experimental prototype will be built. The impact of this research should be twofold: Firstly, the proposed framework should provide the basic tools for database systems to support flexible user queries, which would free users from the unnecessary tasks of writing complex queries or changing their pattern descriptions to fit the specific granularity used to store temporal data in the database. Secondly, the optimization techniques explored should provide strategies and algorithms for efficient evaluations of queries and discovery of temporal patterns. The proposed experimental work should provide insights into the subject and into the practicality of the proposed theory as well as methods.
We continued investigation of query evaluations and data mining when multiple granularities are involved. We focused on a system architecture study and system prototype construction. We have published our result of the study and set up a WWW demo site as our initial implementation prototype. In the following years, we plan to focus more on the prototype system implementation beyond the demo system, and focus on data mining techniques that involve calendars. In parallel with prototype implementation, optimization techniques will be investigated for query evaluation when multiple granularities are involved.
Our research reports have been accepted for publication in leading technical journals of of the field. Please see the project reference part. In the system implementation aspect, the research has resulted in a working demo system that exhibits our ideas. The project has been highly successful and achieved the research results as stated for the first year in the proposal. The long term goal is to continue the investigation of query evaluation, optimization and data mining techniques. The impact of this research should provide theories and tools for database systems to incorporate multiple granularities, which is important for future database systems and information systems.
The project has been supporting three graduate students (Mr. Changzhou Wang, Ms. Jia-Ling Lin and Mr. Peng Ning). Ms. Lin has graduated with a Ph.D. degree while Mr. Wang has gained his candidacy for Ph.D. The students supported by the grant include a woman. Two graduate level courses at George Mason University directly disseminate the results of this research: Advanced Database Systems and Distributed Database Systems (INFT803 and INFT861).
The technical issues that this project is attacking are at the forefront of database technology development. Both data management and data mining issues involving multiple granularities are not dealt with elsewhere in such a comprehensive manner. The solutions and techniques resulting from this project will server large communities of database users, including government users, electronic commerce users, and private sector users, who deal with information with multiple time granularities on a daily basis. At the same time, the training provided by the project (with respect to Ph.D. student training) has contributed directly to the information technology workforce with a Ph.D. graduate entering the U.S. workforce.
There is a rich variety of time granularities, and users as well as applications often require the flexibility of viewing the same temporal data in terms of different time granularities. The increased awareness of this requirement by the database community is evident from the growing research literature on this subject and from the inclusion of constructs for handling multiple granularities in SQL-92, the relational query language standard, and TSQL2, a temporal extension of SQL-92. In spite of this, the fact remains that whenever there is a difference between the way the information is stored in the database and the way it is required by the users, it is up to the users to understand these differences and specify appropriate operations in their queries to reconcile them. There is a need for the databases to have increased knowledge about the time granularities that users usually deal with in their daily activities, and handle requests, either queries or data mining requests, automatically.
Affiliation & Contact Information
|
Anupam Joshi |
Raghu Krishnapuram |
WWW Page
http://www.cs.umbc.edu/~ajoshi/web-mine/
List of Supported Students and Staff
Liyu Yi, CSM, GRA (effective 9/98), Scott Jiang, UMBC, GRA (effective 1/99) Olfa Nasraoui, CSM, GRA (effective 3/99)
Project Award Information
Keywords
Web Mining, Information Personalization, Robust Fuzzy Clustering, Adaptive Web Sites, Web User Categorization, Web Document Classification, Snippet Analysis, Web Access Patterns, Clickstreams
Project Summary
This is an inter-institutional collaborative project carried out by Anupam Joshi at the University of Maryland, Baltimore County, and Raghu Krishnapuram at the Colorado School of Mines. The goal of this research is to develop scalable robust techniques to model noisy data sets containing an unknown number of overlapping categories, and apply them to create a software tool for web personalization and mining. Personalization has two components: (1) tailoring the content delivered to the user from a web site; and (2) exploring the available web pages and categorizing them. The approach consists of developing new practical clustering algorithms by combining fuzzy methods, robust statistics, with Monte Carlo/bootstrapping techniques to model an unknown number of overlapping sets in the presence outliers. Since many web objects such as URLs, IP addresses, web pages, and snippets cannot be represented by numerical features, new techniques to handle such web objects, as well as to categorize or cluster them by using suitable similarity measures between such objects, are being explored. A software tool for web personalization and mining, which incorporates the algorithms developed into a software architecture, is being created and validated. The results of this project will generate new theoretical results and efficient algorithms for simultaneously estimating the prototypes/profiles of an unknown number of overlapping categories from noisy data sets, as well as a web personalization and mining tool that will be made available on the web. Thus, this project is expected to have a significant impact on the way documents are searched for and delivered. It will directly influence the usefulness and spread of the Internet and WWW, and in general, will contribute to the digital library technology.
Goals, Objectives, and Targeted Activities
Our first year goals are:
In the first few months of the project, we have concentrated on Goals 2-4.
Indication of Success
Our project started on September 1, 1998, and is in its initial stages. In our initial work, we have created a system that analyzes web logs that a server keeps. We recover from this log the "traversal path" of the user through the server. We have defined a non-Euclidean similarity measure between paths. Using this measure, and some new robust fuzzy relational clustering algorithms we have developed, we are able to find "typical" traversal paths through web servers at two different universities. This information can be useful in two ways. One, it can let the site developer know about the associations between the various URLs in their site. Secondly, it can be used to make the web site adaptive in conjunction with state maintaining devices such as cookies. We are also exploring the use of snippets to categorize web documents.
Project Impact
Since the project started only 4 months ago, the impact is in its initial stages. One graduate student was being supported in the fall of 1998, and two more graduate students will be supported in the spring of 1999. Two of these students are working towards their Master's theses and are expected to graduate in May 2000. The other is a Ph.D. student who is expected to graduate in the summer of 1999. Dr. Joshi is teaching a course in the spring of 1999 on Web and Data Mining. The course has proved quite popular, and in its first week is heavily oversubscribed.
GPRA Outcome Goals
Project References
Area Background
The research under this project covers two areas -- web personalization/mining, and robust fuzzy clustering. We briefly describe how we are applying techniques from these two areas in our project.
The idea of "personalizing" the web space for users, in part by "mining" the web documents and the users' traversals of the web space, can trace its descent from the work in information retrieval and filtering. The explosion of the web has brought the personalization problem front and center, not just because of the amount of time wasted in searching for information, but because of the massive traffic surge this has generated in the internet backbone. The problem can be attacked both from the client end by creating browsers that filter incoming information, and also with servers that adapt the information returned based on the user. The latter solution is further attractive since it also cuts down on network traffic. A related topic is that has been recently gaining momentum is the idea that we can learn much about users and customers by tracking and analyzing their clickstreams, which is of great importance in e-commerce. However, web mining in general, and site analysis in particular, may well turn out to be a flop unless proper attention is paid to formalizing the goals, analysis, and evaluation methods.Blindly dumping web object data into a data mining tool is not expected to yield good results. For example, users' access patterns are not entirely fixed, and while there are trends there to be discovered, they are buried in data with significant noise components. Therefore, robust methods, that can deal with significant amounts of noise, are needed. Predictive models (including statistical tools) cannot produce reliable results unless the data represents a significant percentage of the total number of possible (traversal) patterns, which is astronomically large. Thus, methods for reducing the dimensionality of the problem are required. Also, web objects (pages, URLs, etc.) are non numeric in nature, making "distance" measurements and "equality" judgements between them, a prerequisite to grouping of any kind, an issue in itself. Hence, algorithms that can also deal with non-numeric data are desirable. Moreover groups, i.e. clusters (of users, of web pages, of URLs) are not crisp, the same entity can belong to different groups to different degrees. Therefore, fuzzy approaches to deal with this problem need to be explored. These are problems that traditional data mining methods typical do not deal with. Our approach to this problem is to simplify it by categorizing the user space as well as the document space by applying clustering techniques. By keeping track of the memberships of the users, web objects, and traversal patterns in the categories, we hope to achieve a significant reduction in dimensionality and increase the reliability of the results. The clustering methods we employ for categorization have to be necessarily robust and fuzzy (and sometimes relational), to be able to handle large percentages of outliers and overlaps (and sometimes non-numeric objects). They also need to be of low complexity to deal with extremely large data sets.
Area References
This is a relatively nascent area. Besides the web site we maintain for the project(http://www.cs.umbc.edu/~ajoshi/web-mine/), other sites which reflect current work and provide good overviews are U. Washington (http://www.cs.washington.edu/research/adaptive/) , and Simon Fraser University (http://db.cs.sfu.ca/WebMiner/).
Potential Related Projects
To the best of our knowledge, this is the only project in its area supported by IDM. There are some ongoing projects in data mining (P. Smith) and agents for web browsing (S. Gauch) that are possibly relevent to this work.
Richard R. Muntz*, Walter Gekelman**, William H. Jepson***, Walter Karplus*, D. Stott Parker*
(*) Computer Science Department, University of California, Los Angeles
(**) Physics Department, University of California, Los Angeles
(***) Department of Architecture & Urban Design, University of California, Los Angeles
Richard R. Muntz
3732 Boelter Hall
Computer Science Dept., UCLA
Los Angeles, CA 90095-1596
City, State ZIP
Phone: (310) 825-3546
Fax : (310) 825-2273
Email: muntz@cs.ucla.edu
Multimedia Data Server, Real-time I/O, Randomized Data Allocation, 3D Virtual World, Video on Demand, Scientific Visualization.
We have developed a general real-time multimedia data server using parallel disks, that can simultaneously deliver data to heterogeneous applications with real-time requirements, such as 3D interactive virtual worlds, interactive scientific visualizations, video on demand, etc.
The goals for the next year include:
A prototype of the VWDS Data Server supporting simultaneous delivery of MPEG encoded videos and 3D urban simulation city models for multiple users has been implemented and successfully demonstrated at several events. Demonstrations, presentations and talks have been invited at more than 50 events including:
The project has enabled collaborations with the following organizations:
The project has provided research opportunities for several graduate students. Jose Renato Santos graduate with his Ph.D this past year. Chris Mitchell graduated with his Masters degree in CS. Current graduate students working in the project are: Damon Liu, Wai-Man R. Wong, Yasuyuki Sato, and Scott A. Friedman, all in the Ph.D. Program, UCLA CS Department.
A two-quarter course curriculum devoted entirely to real-time Urban Simulation has been developed and taught in UCLA Department of Architecture and Urban Design. Courses in high performance storage systems and multimedia information systems have been developed and taught in the CS Dept.
We address the problem of storing and delivering data to multimedia real-time applications. These applications access large amounts of data that typically do not fit in main memory and need to be continuously retrieved from disks. These continuous media data impose deadlines on the retrieval and delivery of information, which must be satisfied to avoid undesired discontinuities on data visualization by the end user. High utilization of resources is desirable to achieve a good cost/performance ratio. However, high utilization is inconsistent with guaranteed real-time service and statistically varying service requirements.The challenge of real-time data servers is to achieve high utilization of system resources and still be able to guarantee low delay bounds on data delivery.
Most of the work on multimedia storage servers has concentrated on video and audio playback, which has highly predictable access patterns. Once a video playback stream is started, data is sequentially read from the storage system at the playout rate. This predictability is exploited in many video servers that carefully layout data on disk to achieve good load balance and high real-time performance. We, however, consider much more general applications, such as 3D interactive virtual worlds, where the high degree of user interaction makes the workload much less predictable. We have developed and implemented,in our Virtual world Data Server, novel solutions for data storage and retrieval that allows the system to provide statistical real-time guarantees with high utilization of resources, for general workloads with unpredictable access patterns.
Some ideas of future research include:
Principal Investigator: Paul Kantor *
Co-Principal Investigator: Kwong Bor Ng **
* School of Communication, Information and Library Studies, Rutgers University
** School of Information Studies, Florida State University
Contact Information
Paul Kantor
4 Huntington Street, SCILS, New Brunswick, NJ 08901
Phone: (732) 932-1359; Fax: (732) 932-1504
Email: kantor@scils.rutgers.edu ; URL: scils.rutgers.edu/~kantor
Kwong Bor Ng
Louis Shores Building, School of Information Studies, Florida State University
Tallahassee, FL 32306
Phone: (850) 644-6237; Fax : (850) 644-6253
Email: kbng@mailer.fsu.edu ; URL: garnet.acns.fsu.edu/~kbng
WWW PAGE
scils.rutgers.edu/~kbng/SemanticDimensionalityProject.html
List of Supported Students and Staff (optional)
Not yet named
Project Award Information
Award Number: 9812086
Duration: Period of Performance of the entire project, e.g., 9/15/1998 - 2/29/2000.
Title: Semantic Dimensionality and Effective Data Fusion in Information Retrieval
Keywords
Data Fusion, Effectiveness Prediction, Information Retrieval, Meta-search Engines, Semantic Dimensionality
Project Summary
We investigate the effectiveness of data fusion (DF) for information retrieval (IR). DF techniques combine the estimates of relevance or usefulness provided by several different IR schemes, to produce a richer and more refined set of documents for examination by the human seeking information. This problem will become ever more important as information must be found in complex networked environments. It is unlikely that any one system will solve the problem of finding the best and most useful sources and documents. Data fusion, widely used in image processing and signal detection, has been shown in other settings, where the "noise" is largely random, to give substantial performance improvements. This work is based on theories developed by the investigators, and using a large collection of existing data developed at the Text Retrieval Conferences (TREC) at National Institute of Standards and Technology (NIST), to find laws or rules which predict which retrieval schemes should be combined, and how, to provide improved performance.
Goals, Objectives, and Targeted Activities
The raw material consists of ranked lists of documents L(t,s) prepared for each of more than 250 "topics" t, by each of the schemes or systems s participating in TREC in a given year. For each year, and for each topic t, we will compute a generalization of Kendall’s tau coefficient, which is appropriate for lists which may not contain the same entities. The resulting measure z(s,s’) can be used as one of the predictive variables. In addition we have available individual performance measures w(t,s) for every topic-system combination. The results of this research will be predictive models for deciding when two schemes (or more than two) can be effectively used in DF, to improve IR. Research will be conducted by applying discriminant analysis, non-linear clustering techniques, and other statistical methods to identify the form of the function f, and the most powerful predictive variables. Extensive use is made of the "Receiver Operating Characteristic" concept to determine whether one model is absolutely, or only conditionally, more powerful than another.
Indication of Success
According to our theory of Semantic Dimensionality, we tentatively propose two conditions for effective data fusion: (1) The condition of efficacy, which states that fusion of two IR schemes with comparable performance tends to be effective; (2) The condition of dissimilarity, which states that fusion of two dissimilar IR schemes tends to be effective. We use the IR systems participating in the TREC 4 routing task for training a model to predict the effectiveness of data fusion and the IR systems participating in the TREC 5 routing task to test that model. The model asks, "when will fusion perform better than an oracle who uses the best scheme for each pair?" We apply various statistical techniques to fit the model to the training data and use the receiver operating characteristic curve of signal detection theory to represent the power of the resulting models. After training, the prediction methods predict the sign of the effectiveness of data fusion between schemes in the testing set much better than chance (Ng and Kantor 1998). The results support the our theory of semantic dimensionality. We are going to continue our investigation by using a larger data set and more advanced statistical methods.
Project Impact
GPRA Outcome Goals
While the work is significant in its own right, in broadening our understanding of how different schemes for IR relate to one another, it also has theoretical and practical implications reaching beyond the field of IR research. For purposes of theory, this research will provide an extensive critical test of the theory of Semantic Dimensionality, which has been developed by the PI. This theory (Kantor, 1998A, 1998b) posits that problems of machine learning, including the problem of defining an appropriate retrieval scheme for a given problem or purpose, can be thought of as the problem of locating a "best point" in a Euclidean space whose dimension is a characteristic of the problem. We will test these ideas and, should they prove successful, will provide a foundation for their extension to other areas of machine learning and artificial intelligence.
With regard to application, the proposed research has immediate application to improving the methods of document discovery and retrieval used in a host of applications from corporate, to military, to personal. The research will be conducted using test sets from the large (nearly one million documents) TREC collections, and are thus representative of the kinds of schemes which are found in commercial software for searching the Internet, intranets, public databases of full text, and private databases of text.
Project References
Area Background
Data fusion is a relatively new concept. Generally it is an approach which combines data, evidence, or decisions coming from or based on various sources, of different natures, about the same set of objects, in order to increase the quality of decision making under uncertainty about the objects. Generally there are three level of data fusion. On the primary data level, all the information available to the detecting systems is considered together in the fusion process to make an overall estimate. On the attribute level, primary signals detected from the objects by the detecting systems are processed into a set of specific attributes, and decisions about the objects are made according to an optimal decision rule based on all such attributes. On the decision level, each detecting system individually makes its own partial decision about the objects, using its own data, and according to its own criteria, and a final decision is made based on these partial decisions.
Data fusion often involves multiple imperfect sensors and each of the sensors contributes its own estimation to the final decision. In IR, data fusion does not necessarily employ different IR systems, but only different IR schemes. Corresponding to the three basic components of automatic IR, i.e., document representation, query formulation, and matching computation, research in IR has considered three approaches to data fusion: 1. multiple queries; 2. multiple document representations; and 3. multiple retrieval techniques. In each approach, some kind of combination, in contrast to using just one single query, one single document representation or one single retrieval technique, can improve retrieval effectiveness.
Area References
Hillol Kargupta
Assistant Professor
Department of Electrical Engineering and Computer Science
Washington State University
Pullman, WA 99164
Contact Information
Dr. Hillol Kargupta
School of Electrical Engineering and Computer Science
Spokane Street, Pullman, WA 99163
Phone: (509)-335-4106
Fax : (509)-335-3818
Email: hillol@eecs.wsu.edu
http://www.eecs.wsu.edu/~hillol
http://www.eecs.wsu.edu/~hillol/gene_expression.html
List of Supported Students and Staff
Erik Johnson, Graduate student, School of EECS, Washington State University
Gene expression, evolutionary computation, function induction, distributed data mining
Learning a function from data, in other words function induction, plays an important role in machine learning, adaptation, and non-enumerative black-box optimization. Like all these problem domains, the emergence of adaptive living organisms through evolutionary search critically depends on function induction. Natural evolution can be viewed as the problem of evolving systems that can learn to be adaptive and survive the competition for natural selection. If this
perspective is correct then the living organisms must have an efficient mechanism to learn functions from observed data. Indeed, there exists many facts that corroborates this observation. The role of neuronal networks in our brain is now widely recognized to be an important mechanism for learning functions from observed data. This research suggests that there may be an alternate mechanism for function induction in the evolutionary operators of living organisms and gene expression (transformation of DNA to proteins through the construction of mRNAs) probably plays a critical role in this process.
The objective of this proposal is two fold. The first objective is to develop scalable gene expression based algorithms, that can inductively learn the underlying function structure from the observed data set. This research points out that the different steps in gene expression such as transcription, translation, and protein folding can be viewed in the computational ground and artificial evolutionary operators may be developed for scalable function induction. This work will significantly advance the research on gene expression based evolutionary computation.
The second objective of this proposal is to demonstrate the efficacy of gene expression based genetic algorithms for large scale, possibly distributed function induction applications in optimization and data mining. This aspect of the proposed work is divided along two dimensions. The first set of experimentation will be carried on based on a difficult test suite comprised of different large, multi-modal black-box optimization problems. The next phase will focus on testing the efficacy of the algorithm for distributed data mining applications. The PI and his students are developing Collective Data Mining, a well grounded framework for distributed data mining. The developed gene expression based algorithms will be used for collective data mining applications.
The goal of this research project is to develop an efficient scalable evolutionary algorithm for search, optimization, and machine learning by following the underlying mechanism of the natural process of gene expression (DNA-mRNA-Protein). This work will also apply the developed algorithms for distributed data mining. The approach consists of,
Upon successful completion, this work will revolutionize the field of evolutionary adaptive algorithms. It will offer a scalable approach for solving large optimization, machine learning, and data mining problems. Moreover, it will offer a fundamental understanding of the natural gene expression process from the perspective of evolutionery computation.
Indication of Success
Preliminary test results demonstrated linear growth in objective function evaluation with respect to growing problem size (number of search variables).
Function induction plays an important role in machine learning, adaptation, and non-enumerative black-box optimization. Upon successful completion this project will offer an efficient and scalable technique for function induction based on the underlying mechanism in natural gene expression. The main direct impacts of this research are summarized in the following:
Little work has been done that explores the role of gene expression from the perspective of scalable learning and optimization. This work will lay the foundation of the research in this area.
Padhraic Smyth, P.I.
Dennis Kibler and Michael Pazzani
Department of Information and Computer Science
University of California, Irvine
Contact Information
Padhraic Smyth, P.I.
Dennis Kibler and Michael Pazzani
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Phone: (949) 824-2558
Fax : (949) 824-4056
Email: smyth@ics.uci.edu, kibler@ics.uci.edu, pazzani@ics.uci.edu
Homepage: http://www.ics.uci.edu/~smyth, http://www.ics.uci.edu/~kibler, http://www.ics.uci.edu/~pazzani
WWW PAGE
S
Keywords
Data mining, knowledge discovery, machine learning, archive, repository
We are developing an online data repository of large data sets (from the 100 Mbyte to1 Gbyte range). The repository will include high-dimensional data sets as well as data sets of different data types (time series, spatial data, transaction data, and so forth). The primary role of the repository will be as a benchmark testbed to enable researchers in data mining (including computer scientists, statisticians, engineers, and mathematicians) to scale existing and future data analysis algorithms to very large data sets.
The structure of the repository has been finished and we are now engaged in adding data sets from our own collection as well as from outside sources. We expect that we will need to make some minor changes in the design as we deal with the great variety of data sets.
The initial repository design, and documentation requirements for stored data sets have been completed. Much of the work to date has been in designing the documentation requirements: a data archive is only as useful as the documentation that describes the data, and hence, we have invested significant time in carefully planning how our donors will document their data sets. We are now obtaining and installing large, high-profile data sets which are well documented and published in the literature. Now that the initial structure has been developed, data sets are being added regularly. We expect to have a dozen data sets online by March 7. Once these large data sets are made available, the measurement of success will be simple: if researchers adopt these data sets as they have for the Machine Learning Repository, then the project will have been successful.
The existence of a common set of large data sets will facilitate the evaluation and development of large scale data analysis algorithms. Researchers will benefit since they can compare their work with others. Research consumers or users will benefit since they can see the performance of the algorithms on a diverse set of data. This will promote the use of the relevant methods on a wide spectrum of problems. Since this information will be made available on the web, the impact of advances should be observed in a timely manner.
The commercial success of relational databases and the availability of relatively inexpensive sensing, storage, and processing hardware has led to explosive growth in online data storage over the last two decades. In turn, these large datasets have motivated the rapid development of data mining, namely, the search for structure in large volumes of data. In contrast, while science and industry scale up their data gathering activities, traditional data analysis research in statistics and machine learning has been relatively slow to take up the challenge, and much research and published work is still focused on relatively small data sets (such as those in the current UCI repository). This mismatch is understandable, since it is much easier for academic data analysts to work with well-known small data sets, than to actively seek out large, high-dimensional, ``difficult" data sets to work with. While it is clear that (in the long run) large data sets will eventually become commonplace in data-analytic research settings, this is currently occurring rather slowly (as can be seen for example from any of the recent International Conference on Knowledge Discovery (http://www-aig.jpl.nasa.gov/public/kdd98/) and Data Mining proceedings, where most papers are by academic researchers working on relatively small data sets). In this work we propose to accelerate the infusion of large high-dimensional data sets into the data mining research environment.
We are very interested in donations of large well-documented data sets to the archive, particularly for "non-traditional" data-types such as sequences, images, transaction data, and so forth.
Roger King
Richard M. Osborne
Department of Computer Science
University of Colorado-Boulder
Roger King
Department of Computer Science
Engineering Center
Campus Box 430
Boulder, CO 80309-0430
Phone: (303) 492-7398
Fax : (303) 492-2844
Email: roger@cs.colorado.edu
Database, Heterogeneous, Distributed, CORBA, Interoperation, Persistence.
Diplomat: A System for Building and Maintaining Heterogeneous Database Alliances
Start Date: August 15, 1996
Expires: July 31, 1999 (Estimated)
Sanctuary (formerly Diplomat) is a heterogeneous data source evolution environment, currently under development at the University of Colorado (Boulder). The goal of Sanctuary is to support large scale persistent applications by providing a consistent, evolvable persistence layer. Sanctuary allows for the lightweight interconnection and subsequent evolution of the set of heterogeneous databases (e.g., legacy database systems, modern database systems, flat-files, etc.) that typically make up such a persistence layer.
This year the Sanctuary team will continue adding CORBA services and databases with which the Sanctuary environment will interoperate. In addition, we are well on the way toward fully specifying the Common Object Interconnection Language (COIL). COIL is the language that will allow for the rapid definition of connections.
In this year, the Sanctuary team was able to implement hand-coded instances of the Sanctuary environment. This experience was a critical intermediate step and will lead to instances of Sanctuary that are not even partially hand-coded.
The Sanctuary project is directly funding three graduate students and one undergraduate student. In addition, a doctoral thesis will come from the work we have done.
Sanctuary is a heterogeneous database project, meaning that we are concerned with the interoperation of databases that vary in application interface, data model, query language, schema, and/or physical representation. This is a critical need as large companies are constantly reorganizing in our dynamic, global economy. The traditional approach to attacking this problem consists of building large, unwieldy integrated interfaces or schemas that "glue together" multiple, diverse databases. Sanctuary solves the same problem, but in a much more lightweight, dynamic fashion.
|
Center for Research on Information Access |
Center for Research on Information Access |
|
Columbia University |
Columbia University |
|
535 W. 114th Street, MC 1101 |
535 W. 114th Street, MC 1101 |
|
New York, NY 10027 |
New York, NY 10027 |
|
Phone: 212-854-7443 |
Phone: 212-939-7119 |
|
Fax: 212-854-9099 |
Fax: 212-666-0140 |
WWW Page
http://www.columbia.edu/cria/
Keywords
Information retrieval, natural language processing, computational linguistics, grammatical analysis, parsing, document analysis, topic identification, topic detection and tracking.
Project Award Information:
Award Number: IRI-97-12069
Title: Automatic Identification of Significant Topics in Domain-Independent Full Text Documents
Duration: Three years
Dates: September 1997 to August 2000
Project Summary
The goal of this project is to develop a suite of techniques for identifying significant topics in edited documents such as newspaper articles. Significant entities and concepts are most often referred to in text with nominal expressions such as nouns phrases (e.g., computer science) and proper names (e.g., Buffalo Bills). However, achieving even shallow understanding of nominal expressions remains a major blocking point for natural language processing (NLP). Related to this is the question of the role of natural language information in information retrieval (IR) and specifically the role of nominals in information access. An additional goal is to achieve a better understanding of the relative strengths and weaknesses of statistically-based and rule-based natural language processing systems.
We have developed linguistically motivated techniques which can link related nominal expressions, to identify documents or portions of documents that are relevant for a given application or user. To the extent that our techniques are based on linguistically-motivated patterns and not on domain-dependent vocabularies, our patterns apply across to text in any genre or domain. We assume that efficient and effective known statistical techniques will continue to be used for large scale tasks such as analysis of large corpora and retrieving a set of possibly relevant documents from the web. This research is useful for applications such as automatic indexing tasks, either for digital library applications or for direct user manipulation, second stage information retrieval where a subset of a larger corpus has been determined to be potentially relevant, advanced information extraction where important entities in the document must be identified and linked, summarization, and topic detection and tracking.
Goals, Objectives, and Targeted Activities
Indications of Success
We developed a modular system, LinkIT, for building a list of candidate significant topics for each document. The SNAP module takes Simplex Noun Phrases (SNPs) (Wacholder 1998), such as asbestos fiber, 9.8 billion Kent cigarettes, and Department of Energy. SNPs can be contrasted with complex NPs such as 9.8 billion Kent cigarettes sold by the company, where the head is followed by a participial verb. LinkIT then links together expressions that refer to the same higher level concept such as rights in reproduction rights and literary rights. This year we have achieved:
LinkIT is used by the two NSF sponsored STIMULATE projects for the computation of paragraph similarity (Eskin et al. under review), for the analysis of image captions for building an ontology of multimedia objects (Chang et al. under review), and for a distributed data base project.
Project Impact / Human Resources
Your department/institution infrastructure
This project is part of the Columbia University Digital Library program under the Center for Research on Information Access (CRIA). CRIA sponsors projects to develop tools and techniques which are essential for effective user-oriented text analysis and retrieval.
Industry -- collaborations, transfer of technology, patents
We have developed a plan with the electronic publishing division of Columbia University Press (CUP) and with the Columbia Law School for potential use use LinkIT in an intelligent indexing tool.
What activities have been enabled/spawned because of the accomplishments made possible by your award?
Area Background
The NSF is actively involved in funding research to enable increased universal access to the fast-growing body of electronic text. Our research directly addresses the need to find information more easily and more reliably. We are developing a range of innovative methods to improve current methodologies for information retrieval, indexing, extraction, and summarization.
Area References
Program: Information and Data Management
PI : Shahram Latifi, Professor
Dept. of Electrical and Computer Eng.
University of Nevada, Las Vegas
Las Vegas, NV 89154-4026
Office: (702) 895-4016
Fax : (702) 895-4075
Email : latifi@ee.unlv.edu
Graduate Students
Shulan Deng and Jun Zhao
WWW PAGE
http://www.egr.unlv.edu/~latifi/Grants/NSF
Keywords
ITU(CCITT) Group 3/4, Coding scheme, Data compression, Document image analysis, Image processing
Project Award Information:
Award Number: IRI-9616206
Duration: 3 years, 2nd year
Title: Direct Processing of Compressed Data
Project Summary:
Progress in computer and communication technologies allows a large volume of digital information to be exchanged and archived. While the need for data compression is evident, many operations may have to be performed on the uncompressed data. This research investigates the following two problems:
Significance:
Data compression research mainly focuses on compression ratio and quality of images recovered from lossy compression methods. Information/image processing research traditionally deals with raw (uncompressed) data. Our project attempts to bridge between data compression and information processing, and promotes a new way of thinking. According to market research firm Gartner Group Inc, document imaging is one of the merging technologies for the upcoming years. The algorithms developed in this project can be utilized by a new generation of document image analysis systems.
Goals, Objectives, and Targeted Activities
We are currently undertaking the following tasks:
Connected component extraction is an important operation in image processing. The problem of connected component extraction directly in the Group 4 domain is being investigated. The idea is to detect the corner information existing in the Group 4 domain and use this information to develop a new algorithm which is expected to run faster than conventional methods running in the uncompressed domain. Bartneck proposed a method for segmenting and representing connected components using corner types. The input image is scanned line by line with a 2 by 2 window. The contents of the window has 24 states defined by the combination of black and white pixels. There are five types of states: 1)background, 2) object, 3) contour, 4) convex corner, and 5)concave corner. Using the states obtained by the window, the contour of an object is extracted and coded into a string. He has proven that the original object can be reconstructed from the corresponding corner string.
The CCITT G4 compression method encodes an image using the runlength representation of the image and two-dimensional relationships among the runs. Hence we can detect corners and their types by analyzing the codes in a CCITT G4 compressed image file.
The second task being explored is the conversion of raster images into vector forms. This operation is of interest in many applications ranging from cartography to character recognition. A significant part of the problem is the identification of linear or curved strokes in the binary data. Earlier efforts achieve this objective in the uncompressed domain by using different techniques. The future work can be the extension and improvement of one or more of the conventional methods which will result in a vectorization algorithm that will operate directly on the runlength data. The basic data structure used is the line adjacency graph (LAG) whose nodes correspond to dark segments in the run length code. Problems to be investigated include how to reduce the sensitivity of the thinning algorithms, how to do segmentation of touching characters, and how to use contour tracing for discrimination between characters having similar vectorization. In the meantime, we have considered applying the emerging technology, wavelet transform, to document image processing. Most segmentations techniques rely on a priori knowledge or assumptions about the generic document layout structure and textual and graphical attributes. However, in many applications it is desirable to have segmentation methods that do not assume any a priori knowledge about document layout. Wavelet transforms have played an important role in classification of texture and abnormalities, and it is hoped they can be exploited efficiently in document image processing.
Indication of Success
We initially investigated the possibility and efficiency of direct processing of compressed data. After an extensive literature review and examining the CCITT (recently renamed as International Telecommunication Union or ITU) G4 scheme, we developed a new compression scheme for binary document images, referred to as {\it Modified} G4 or MG4. Compared to other compression standards, this new coding scheme not only offers competitive compression, but also provides flexibility in processing of compressed data. Algorithms for operations such as skew detection, rotation, and connected component extraction are derived and implemented in MG4 compressed domain. These algorithms are shown to run faster than their traditional counterparts. The MG4 has not been optimized and we are currently looking at ways to improve its efficiency and simplify its implementation.
Project Impact and Output
Human Resource: This grant has strengthened the Ph.D. program in our department. The following three students have been supported by this project:
Ms. Shulan Deng, Ph.D. student
Mr. Bin Zhu, Ph.D. student
Mr. Jun Zhao, Ph.D student
The students' tuition and fees were fully paid from the grant.
Education: The project has impacted the education within our department in a major way. One graduate course was developed and taught which was mainly based on this project. The grant is also providing interesting and challenging materials for independent research projects and class projects in data compression and image processing. The results of this project were presented in a tutorial given at the IEEE IPCCC'98 and the IEEE local section meeting in November 98. The results of the project have been presented at one international conference and one workshop. The work on MG4 has been submitted to a journal.
Industry: The new scheme, once fully optimized, has the potential to revolutionalize the binary document processing in industry; for instance, the fax machine could be one application for this product.
Project ReferencesArea Background
Processing digital images has become widespread recently due to remarkable advances in hardware and software industry. Such images can be produced by converting photographs, printed text, and other media into digital form. Direct acquisition of data in digital form has also been desirable because of the reliability and precision involved in handling digital information.
The CCITT G3/G4 standard, developed based on run-length data, is widely accepted for binary document compression in general and facsimile transmission in particular. By means of this standard, users can interchange and share information efficiently and conveniently. Traditional algorithms for image processing are based on pixel, which are computationally expensive. This has motivated researchers to consider processing image data in the compressed domain.
The algorithms for run-length-based coding schemes are well understood and proven to be efficient. In the CCITT G4 domain, White Pass modes can be used to detect skew and image matching, while vertical modes can be searched for possible bar codes. Therefore one can see that the features and structures existing in coded data can be utilized and accessed directly to facilitate image operations.
Gregory H. Leazer
Dept. of Information Studies, Graduate School of Education and Information Studies
University of California, Los Angeles
Contact Information
226 GSE&IS Building, Mailbox 951520
Los Angeles, CA 90095-1520
Phone: (310) 206-8135, Fax : (310) 206-4460
Email: gleazer@ucla.edu, URL: http://dlis.gseis.ucla.edu/gleazer/homepage.html
WWW PAGE
Not yet determined
Project Award Information
Keywords
Digital libraries, information retrieval, intertextual networks, metadata and document representation.
Project Summary
New digital information resources necessitate the creation of new tools for resource discovery. Users must negotiate a large number of collections of information, such as new digital libraries and the collections of traditional libraries and museums. The distribution of resources into separate systems prevents optimal searching, and users must search multiple systems sequentially to perform exhaustive searches. A solution to this problem is to provide an integrated control multiple collections. The focus of this project is to investigate user behavior in the selection of diverse information resources from distributed collections using qualitative field work methods. A prototype system which integrates various types of metadata and forms networks of linkages is used to determine how users select resources. The network of linkages will reflect the evolving textual identity of individual resources by forming networks of intertextual associations; an intertextual network includes, for example, a progenitor work and its derivations, such as successive editions and translations. The educational component of this project is to develop a broad interdisciplinary program that integrates various theories in information retrieval.
Goals, Objectives, and Targeted Activities
In the last six months we have:
In the next year we will complete the experimental prototype system. We will also conduct a preliminary evaluation of the system by observing users of the system and assessing their success in completing assigned tasks. We will also develop rules for formulating associations among resources based upon our recent experiences in forming intertextual networks, and compare automatically derived networks with manually created networks in order to assess the accuracy of automatic linking systems. We will also develop two courses on digital libraries and metadata issues.
Indication of Success
The work of locating important local collections of resources and coordinating partner activities has been slower than anticipated; this observation is consistent from the informal remarks made by other digital library research project researchers that generating library content is a time-consuming process. However, we are now clearing hurdles in this area, and expect to develop the experimental prototype soon.
We have made significant progress in verifying the data model of works and items, and forming intertextual networks. One paper was given describing the research area, and three other papers reporting results on the characteristics of intertextual networks are in various stages of the publication process.
Long range results: The findings will also provide for the design of diverse types of information retrieval systems, including search agents that seek out resources among distributed collections, and present an integrated view for user consultation.
Project Impact
GPRA Outcome Goals
The President’s Information Technology Advisory Committee recently stated that information technology will be one of the key factors driving progress in the 21st century. Digital library research, including this project, not only investigates general issues in information technology, but also advances the infrastructure required for scientific research and collaboration. Digital libraries will also play a critical role in educating students of all levels in science and math. This research project is directed to improving digital library services, understanding components in their design, and developing methods for their assessment.
Project References
Area Background
Predating the advent of computers, early information retrieval systems arranged standardized descriptions of works in an order that would express a work’s evolving textual identity over time. Early union catalogs allowed users to locate works without prior knowledge of the existence of a particular library or its collection. Such a system allowed people to search multiple collections simultaneously, and assembled the various "versions" of a particular work, allowing users to select the most appropriate manifestation for their purposes. Although never fully achieved, an attempt was made to create a single coherent view over a growing, evolving mass of materials. Digital libraries today provide substantial promise of enhancing scholarly communication by providing desktop access to the full range of resources required for scientific work. We must surmount several problems to integrating digital library collections today into a single virtual collection. Such problems include the growing diversity of metadata techniques for representing networked resources; the control of evolving textual identity over distributed collections (a problem related to issues in version control); and the evaluation of information retrieval methods in digital libraries, and their impact on long-term user benefits such as student learning.
Area References
Potential Related Projects
I am a co-principal investigator of the Alexandria Digital Earth Prototype (Adept) Project, a proposal currently being reviewed under the Digital Libraries Initiative II program. I am scheduled to work as an evaluator for the project, using evaluative techniques included in this current NSF project. Furthermore, the Adept project is a component of the proposed Interlib project, a collaborative effort with several institutional members, including the California Digital Library, where I will assist on strategic planning efforts.
I am also a Project Director with the National Center for Research on Evaluation, Standards, and Student Testing (CRESST), a national research center funded by the U.S. Department of Education. I am involved in technology-mediated learning assessment programs. CRESST is potentially involved in developing digital library and learning assessment methods.
Contact Information
QIANHONG LIU
Department of Computer and Information Science
New Jersey Institute of Technology
Newark, NJ 07102
Phone: (973) 596-2872
Fax : (973) 252-8228
Email: qhliu@cis.njit.edu
URL: http://www.cis.njit.edu/~qhliu
WWW PAGE
http://logic.njit.edu:60006
Project Award Information
Keywords
Web-Based Query Processing, Online Databases, Information Presentation,
Interactive Query Processing, Multimedia Systems, Web Search Engine
Project Summary
Online networked environments impose new requirements on database query processing, some of which relate to the characteristics and capabilities of the clients and the properties of the network. The goal of this project is to invent and develop new technologies for interactive querying of online databases in support of electronic commerce and electronic transactions. By online databases we refer to search engines and databases containing product information such as electronic catalogs. Online databases allow interactive querying of content. The specific problem addressed in this project is that if the answer set to a query consists of a very large number of items, users may not be able to examine all the answers but rather limit themselves to subsets of answers. Consequently, if a user’s decision is based only on some subset of answers retrieved from the database, certain vendors may end up getting preferential treatment. Thus, online databases used in browsing and purchasing will be forced to develop "fair" information presentation issues. This project aims to study information presentation strategies for such networked environments. Rather than creating a theory of fair information presentation techniques, the current approach assumes that fairness will be captured by contracts between the information provider and the online database system. Such contracts specify how often a given object is to be displayed in response to relevant user queries. In particular, the project assumes that query processing in online databases will be defined with respect to a set of contracts. A test system is being built to experiment with the performance and scalability of online query processing algorithms.
Goals, Objectives, and Targeted Activities
The objective of this project is to design a test system to study new query processing algorithms based on the theory of contracts. For each algorithm we plan to study its theoretical performance with respect to scalability. The nature and expressive power of contracts needs to be carefully defined so that the problem is decidable and tractable, and that the contracts in fact can be enforced. A unique problem in this context is that contract satisfaction needs to be incremental in the sense that at any given time the system should show progress towards contract satisfaction. The test system will function with different query processing algorithms so it can be used to study the various algorithms that we develop in the course of this research or incorporate from other research projects. We are also working towards developing a framework in which general types of contracts can be accommodated and restrictions from the client and the network can be taken into consideration. The target output of this research is a system that will allow testing of query processing algorithms in online databases with respect to scalability, performance, and incremental contract satisfaction.
Indication of Success
In the first half-year of our project, we have built an online system based on preliminary results on query processing with respect to display size of the client. The information providers specify how often they want their objects displayed in response to relevant user queries, with the notion of a relative contract. To service each user’s request, all the answers to a query are computed from the web search engine, and then filtered with respect to the contracts stored in the contract base. A display agent decides which objects are to be displayed, assuming a limited display area on the client. The implementation of the query processing strategies resolves the fairness issues and determines the page layout. The test system will help us to study the theoretical performance of query processing algorithms with respect to scalability for online databases.
Project Impact
Five Master degrees were awarded from the project. The graduated students were employed immediately after their graduation, as full time employees at Lucent Technologies, AT&T, and Panasonic Research Center. Currently there are 6 ongoing MS projects and theses, and one graduate is supported by the grant.
GPRA Outcome Goals
This research will have a potentially large impact on the technologies used in information sources for electronic commerce, including web search engines, online catalogs, and electronic advertising. The current web search engine technology is fundamentally unsuitable for those applications. The result of the project will allow a new wave of search engine technology for the web. On a more general level, just as data processing and financial corporations benefited from relational database technology and transaction processing, we expect enterprises operating in the new information space to benefit from the query processing technologies developed in this project.
Project References
Area Background
The problem of large result sets has been studied in a number of systems such as web search engines, databases used in information retrieval applications, e.g., library and document databases, and various proposed digital library systems. The standard approach in information retrieval research is to assume that the query is incompletely specified, and to allow users to incrementally refine the query until they are satisfied with the result set. Another approach is to return an internsional description of the result set. Intensional answers impart meta information to users which can itself be used to further refine a query. Web search engines return the complete result set but order the elements by using proprietary ordering algorithm. Airlines reservation system such as SABRE sort the result set alphabetically. Preliminary work on query processing in online database has been reported in the context of online advertisement.
Area References
|
Mario Lopez |
|
|
Math and CS Dept |
Math and CS Dept |
|
University of Denver |
University of Denver |
|
2360 S. Gaylord Street |
2360 S. Gaylord Street |
|
Denver, CO 80208-0189 |
Denver, CO 80208-0189 |
|
Phone: (303) 871-2821 |
Phone: (303) 871-3287 |
|
Fax : (303) 871-3010 |
Fax : (303) 871-3010 |
|
Email: leut@cs.du.edu |
Email: mlopez@cs.du.edu |
Multidimensional Data, Indexing, Geometric Techniques, R-trees, Spatial Data
Storage and retrieval of multidimensional data is necessary for many applications. Recent advances in supercomputing and data acquisition necessitate being able to provide efficient retrieval for ever increasing data set sizes. In addition, recent advances in data warehousing and distributed data management provide a need for indexing of distributed high dimensional data. The project will explore and exploit a number of computational geometry techniques in the development of a new generation of efficient techniques for efficient loading, retrieval, and update of large disk-resident multidimensional data for both single machine, parallel machine, and distributed architectures. The resulting indexing techniques will enable higher level database architectures to provide efficient access to larger datasets, thus enabling data base technology to keep up with demands placed upon it from science and business on a whole.
So far the project has made significant improvements to packing of static data sets. Given the increasing number of large static multidimensional data sets from fields such as earth science, weather modeling, and military app