Anatomy of a self-learning system for visually detecting hidden text in web pages

This section describes, at an architectural level, what happens inside the system when it analyzes a single web page. The engine is a set of components that correlate textual information from the following inputs:

1) Sets of text obtained by rendering the web page as an image and running text recognition on it
2) Searchable DOM B-Tree fragments obtained by querying the browser
3) Parsed HTML fragments obtained by processing the HTML stream
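
The correlation of these three inputs can be illustrated with a minimal sketch. It assumes each input has already been reduced to a list of text strings, and it treats any text that appears in the DOM or the HTML but not in the recognized image as a hidden-text candidate. The function names and sample strings are hypothetical, and exact matching after normalization is a simplification; real recognizer output would likely need fuzzy matching.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so the three sources compare cleanly."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_hidden_candidates(ocr_texts, dom_texts, html_texts):
    """Return text present in the DOM or HTML but absent from the rendered image."""
    visible = {normalize(t) for t in ocr_texts}
    in_markup = {normalize(t) for t in dom_texts} | {normalize(t) for t in html_texts}
    in_markup.discard("")  # ignore empty fragments
    return sorted(in_markup - visible)


# Hypothetical sample data for one page.
ocr = ["Welcome to our shop", "Contact us"]
dom = ["Welcome to our shop", "Contact us", "cheap pills cheap pills"]
html = ["Welcome  to our shop", "cheap pills cheap pills", "Contact us"]

hidden = find_hidden_candidates(ocr, dom, html)
print(hidden)
```

Here the only string that the markup contains but the image does not show is reported as a candidate.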

Recognition is a multipass process: new, DOM-tweaked image representations are fed back into the system for refinement.
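
The feedback loop might look roughly like the sketch below. The source does not say which DOM tweaks are applied, so the tweak names and the `recognize` stand-in (which simulates rendering plus recognition on a toy page model) are assumptions made purely for illustration. Text that becomes readable only after a tweak is recorded together with the tweak that exposed it.

```python
def recognize(page, applied_tweaks):
    """Stand-in for render + recognition: texts readable under the given tweaks.

    `page` maps each text to the (hypothetical) concealment technique hiding it,
    or None if the text is plainly visible.
    """
    return {text for text, concealment in page.items()
            if concealment is None or concealment in applied_tweaks}


def multipass(page, tweaks):
    """Re-recognize the page after each DOM tweak and track newly revealed text."""
    baseline = recognize(page, set())
    seen = set(baseline)
    revealed = {}  # text -> tweak that first exposed it
    applied = set()
    for tweak in tweaks:
        applied.add(tweak)
        for text in sorted(recognize(page, applied) - seen):
            revealed[text] = tweak
        seen |= set(revealed)
    return baseline, revealed


# Hypothetical page model and tweak names.
page = {
    "Welcome": None,
    "buy now buy now": "low_contrast",
    "keyword stuffing": "tiny_font",
}
baseline, revealed = multipass(page, ["low_contrast", "tiny_font"])
print(baseline, revealed)
```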

The current system is based on Internet Explorer. It maintains an internal knowledge base (KB) of DOM-related inconsistencies (bugs). The KB is consulted at run time to adjust the recognition process and to report results.
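
One way to picture such a knowledge base is as a list of rules, each pairing a condition on a DOM fragment with a correction. The sketch below is an assumption about the shape of the idea, not the system's actual design; the single entry is a made-up example and does not describe a documented Internet Explorer bug.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Quirk:
    quirk_id: str
    description: str
    matches: Callable[[dict], bool]   # does this quirk apply to the fragment?
    adjust: Callable[[dict], dict]    # corrected copy of the fragment


# Hypothetical entry: the DOM claims visibility although the box has zero area.
KNOWLEDGE_BASE = [
    Quirk(
        quirk_id="Q-EXAMPLE-1",
        description="Hypothetical: element reported visible with a zero-area box",
        matches=lambda f: bool(f.get("visible"))
        and f.get("width", 1) * f.get("height", 1) == 0,
        adjust=lambda f: {**f, "visible": False},
    ),
]


def apply_knowledge_base(fragment, kb=KNOWLEDGE_BASE):
    """Apply every matching quirk and return the fragment plus the quirk ids used."""
    applied = []
    for quirk in kb:
        if quirk.matches(fragment):
            fragment = quirk.adjust(fragment)
            applied.append(quirk.quirk_id)
    return fragment, applied


fragment = {"text": "hidden keywords", "visible": True, "width": 0, "height": 12}
adjusted, applied = apply_knowledge_base(fragment)
print(adjusted["visible"], applied)
```

Returning the list of applied quirk ids alongside the adjusted fragment lets the same lookup serve both purposes named above: correcting recognition and reporting which inconsistencies were encountered.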

 

 

Copyright ⓒ [2006] Brightwater Software. All rights reserved.