Department of Computer Science Seminar

Title: Automatic Discovery and Classification of Web Sources

Speaker: Anne H. Ngu

Time: 12:30pm-1:30pm, April 1st, 2005

Location: Nueces Conference Room


The World Wide Web provides a vast resource to all types of users. Its resources range from unstructured text to Web-based access to distributed data sources and services. Despite the tremendous progress made by search engines, finding the desired information and using it effectively in a specific application can still be tedious and frustrating. To help users access hundreds of dynamically changing Web data sources, an automated system is needed that can find, classify, describe, and wrap new sources without tedious, low-level coding of source-specific wrappers. A correct classification of any kind of Web source must address both the capability of the source and the conversation/interaction semantics inherent in its design. We propose the service class description (SCD), a meta-data approach to classifying Web sources that takes into account both the capability and the conversational semantics of the source. The ability to discover the hidden interaction pattern of a Web source leads to increased accuracy in the classification process. For concreteness, we have chosen BLAST, a well-known and practically important scientific Web source, to validate our approach. Our results show that an SCD-based approach successfully classifies two-thirds of over 500 BLAST sources with 100% accuracy and two-thirds of bioinformatics keyword search sources with around 80% precision.