Anatomy of a self-learning
system for visually detecting hidden
text in web pages
This section gives an architectural
overview of what happens inside
the system while a single web page is
being analyzed. The engine is a set
of components that correlate textual information from the
following inputs:
1) Sets of text obtained by
scanning and recognizing the web page as
an image
2) Searchable DOM B-Tree fragments obtained by
querying the browser
3) Parsed HTML fragments obtained by
processing the HTML stream
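As a rough illustration of how these three inputs might be correlated, the sketch below treats any word that is declared in the DOM or the HTML source but never recognized in the rendered image as a hidden-text candidate. It is a minimal sketch under stated assumptions, not the system's actual engine: the names `TextCollector` and `hidden_candidates` are hypothetical, the recognized words and DOM words are assumed to arrive as plain lists of strings, and word-level set comparison stands in for whatever matching the real components perform.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects lower-cased words from text nodes, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.words = set()
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.update(w.lower() for w in data.split())


def hidden_candidates(ocr_words, dom_words, html_source):
    """Return words declared in the DOM or HTML but absent from the image text.

    ocr_words   -- words recognized from the page rendered as an image (input 1)
    dom_words   -- words obtained by querying the browser's DOM (input 2)
    html_source -- raw HTML stream to be parsed (input 3)
    """
    parser = TextCollector()
    parser.feed(html_source)
    parser.close()
    visible = {w.lower() for w in ocr_words}
    declared = {w.lower() for w in dom_words} | parser.words
    return declared - visible
```

For a page whose source contains white-on-white text, the words inside the styled span would be reported because the image recognizer never sees them, while the DOM and the HTML parser both do.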
The recognition is a multipass
process in which new, DOM-tweaked
image representations are fed back
into the system for refinement.
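The multipass idea can be sketched as a loop that re-renders the page under successive DOM tweaks and records which tweak made a previously unseen word appear. This is only an illustrative sketch: `multipass`, the `recognize` callable, and the tweak identifiers are hypothetical names, and the rendering and recognition steps themselves are assumed to be supplied by the caller.

```python
def multipass(declared_words, recognize, dom_tweaks):
    """Refine recognition over several passes.

    declared_words -- words known from the DOM/HTML side
    recognize      -- hypothetical callable: recognize(tweak) returns the set of
                      words seen in the image rendered with that DOM tweak
                      applied; tweak=None means the untouched page
    dom_tweaks     -- ordered tweak identifiers to try, one per pass
    """
    visible = set(recognize(None))               # pass 0: untouched page
    unresolved = set(declared_words) - visible   # declared but not yet seen
    revealed_by = {}                             # word -> tweak that exposed it

    for tweak in dom_tweaks:
        if not unresolved:
            break                                # nothing left to refine
        seen = set(recognize(tweak))
        for word in unresolved & seen:
            revealed_by[word] = tweak
        unresolved -= seen

    return visible, revealed_by, unresolved
```

Words that end up in `revealed_by` were invisible on the original page but became visible after a tweak, which makes them hidden-text candidates; words left in `unresolved` were never seen under any tweak tried.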
The current system is based on
Internet Explorer. It has an
internal knowledge base of
inconsistencies (bugs) related to the DOM.
This knowledge base is consulted at run time to
adjust the recognition process and
to report results.
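One way such a knowledge base could be consulted at run time is sketched below. Everything here is an assumption made for illustration: the `KnownInconsistency` record, the `consult` function, the dictionary representation of a DOM element, and the single example entry are invented placeholders and do not describe any actual Internet Explorer bug or the system's real data model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class KnownInconsistency:
    """One knowledge-base entry (hypothetical structure)."""
    name: str
    matches: Callable[[Dict[str, str]], bool]  # does this element trigger it?
    adjustment: str                            # hint passed to the recognizer


# Placeholder entry for illustration only, not a documented browser bug.
EXAMPLE_KB: List[KnownInconsistency] = [
    KnownInconsistency(
        name="zero-size-visible",
        matches=lambda e: e.get("visibility") == "visible" and e.get("width") == "0",
        adjustment="treat-as-hidden",
    ),
]


def consult(kb: List[KnownInconsistency], element: Dict[str, str]) -> List[KnownInconsistency]:
    """Return every entry whose predicate matches the given element properties."""
    return [entry for entry in kb if entry.matches(element)]
```

The matching entries can then serve both purposes named above: their `adjustment` hints steer the next recognition pass, and their `name` fields can be included in the reported results.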