A service to detect hidden text on web pages
We have created software that reads and analyzes a page's HTML together with its scripts and stylesheets, "looks" at the page as it appears in a browser, and correlates the two sets of results.

The software is able to detect 100% of known techniques for hiding text. Search engines are not doing this yet, but it is something they are working towards.

Figuring out whether a web page has hidden or barely visible text is a serious challenge. It requires in-depth knowledge of several computer technologies. That's why we're pursuing it! After all, this is something a few PhDs should be able to figure out...
The process
The engine takes quite a while to process a web page: currently anywhere from 30 seconds to 5 minutes, depending on the size and complexity of the page. A machine running the recognition engine is seriously busy, with CPU usage near 100% whenever it is processing a page. Currently, all users share a single engine, so the wait can be lengthy if a lot of requests have been queued.

This is why the service cannot display results immediately after a user submits a URL. All requests are queued and processed on a first-in, first-out (FIFO) basis. Users are notified via email once their URLs have been processed.
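The queueing behaviour described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the names `Request`, `SingleEngineQueue`, `analyze` and `notify` are hypothetical, and the two callables merely stand in for the slow recognition engine and the email sender, neither of which is specified here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class Request:
    url: str
    email: str


class SingleEngineQueue:
    """One shared engine; requests are handled strictly first in, first out."""

    def __init__(self, analyze: Callable[[str], str],
                 notify: Callable[[str, str], None]) -> None:
        self._pending: Deque[Request] = deque()
        self._analyze = analyze  # hypothetical stand-in for the recognition engine
        self._notify = notify    # hypothetical stand-in for the email sender

    def submit(self, url: str, email: str) -> int:
        """Queue a URL and return how many requests are ahead of it."""
        ahead = len(self._pending)
        self._pending.append(Request(url, email))
        return ahead

    def process_next(self) -> bool:
        """Process the oldest request and notify its submitter.

        Returns False when the queue is empty (the engine is idle).
        """
        if not self._pending:
            return False
        request = self._pending.popleft()
        report = self._analyze(request.url)
        self._notify(request.email, report)
        return True
```

Because `submit` returns the number of requests ahead, a front end built along these lines could at least tell the user roughly how long the wait will be, even though the report itself arrives later by email.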
The visibility report
Once a page has been processed, the findings are compiled into a report. The report contains the following sections.
Summary Section | Number of HTML fragments that are searchable but not visible to human viewers
DOM Hidden Fragments | DOM nodes that have been made invisible, disabled, or hidden by scripting means
Hidden Fragments | HTML fragments that have been made completely invisible by non-scripting means
Poor Fragments | HTML fragments that have been made poorly visible by non-scripting means

Details Section
Column | Description
Tag | Tag name of the HTML fragment in question
Preview | Image of the fragment, or the exact DOM reason it is not visible
Guess | Icons marking fragments suspected of being spam
Text | Indexable text
Fragment | HTML in question
Score | Visibility score
Visibility | Good/Partial/Poor/Hidden
Capabilities
Most of the text-hiding techniques described here are detected. Findings are reported in two major categories: DOM-tricks and COLOR/BKG/LAYER-tricks. DOM-tricks are the easiest to detect, and accurate reporting is already in place for them. Other types are reported as invisible or poorly visible.
DOM-tricks | DOM (Removed), DOM (Display:None), DOM (Visibility:Hidden), DOM (Hidden-Ancestor), DOM (Off Screen Spam)
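The DOM-trick labels above lend themselves to a simple illustration. The Python sketch below is not the engine's actual logic; it assumes a hypothetical `Node` type carrying a style dictionary, a parent link and rendered coordinates, and it simplifies CSS (for example, it ignores that a child can override an ancestor's `visibility: hidden`).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    tag: str
    style: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = None
    left: int = 0          # hypothetical rendered position and size, in pixels
    top: int = 0
    width: int = 100
    height: int = 20
    attached: bool = True  # False if a script removed the node from the document


def own_trick(node: Node) -> Optional[str]:
    """Return the DOM-trick label that applies to this node itself, if any."""
    if not node.attached:
        return "DOM (Removed)"
    if node.style.get("display", "").strip().lower() == "none":
        return "DOM (Display:None)"
    if node.style.get("visibility", "").strip().lower() == "hidden":
        return "DOM (Visibility:Hidden)"
    # Entirely to the left of, or above, the visible page area.
    if node.left + node.width <= 0 or node.top + node.height <= 0:
        return "DOM (Off Screen Spam)"
    return None


def classify(node: Node) -> Optional[str]:
    """Label a node, falling back to its ancestors when it is not hidden itself."""
    trick = own_trick(node)
    if trick is not None:
        return trick
    ancestor = node.parent
    while ancestor is not None:
        # Only tricks that also hide descendants count here (a simplification).
        if own_trick(ancestor) in ("DOM (Removed)", "DOM (Display:None)",
                                   "DOM (Visibility:Hidden)"):
            return "DOM (Hidden-Ancestor)"
        ancestor = ancestor.parent
    return None
```

Checks like these need nothing but the DOM, which fits the remark that DOM-tricks are the easiest category to detect.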
How does it work? An overview
Quite a lot goes on behind the scenes.
1) Processing starts with an instance of Internet Explorer navigating to the URL.
2) The page is then printed to an image for visual processing.
3) A DOM snapshot is recorded and broken into distinct searchable fragments, structured as a B-Tree.
4) The optimized, fragmented image is fed into the recognizer, and the results are correlated with the newly acquired searchable fragments to produce the report.
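Step 3 can be approximated in a few lines of Python. This sketch is a stand-in, not the real thing: it parses static HTML with the standard library's `html.parser` rather than taking a DOM snapshot from Internet Explorer, and it returns a flat list of (tag, text) pairs instead of a B-Tree. The names `FragmentCollector` and `searchable_fragments` are made up for the example.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class FragmentCollector(HTMLParser):
    """Collect (enclosing tag, text) pairs for text a search engine could index."""

    SKIPPED = {"script", "style"}                       # content is never page text
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag

    def __init__(self) -> None:
        super().__init__()
        self._open_tags: List[str] = []
        self.fragments: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or not self._open_tags:
            return  # whitespace only, or text outside any element
        if any(tag in self.SKIPPED for tag in self._open_tags):
            return
        self.fragments.append((self._open_tags[-1], text))


def searchable_fragments(html: str) -> List[Tuple[str, str]]:
    collector = FragmentCollector()
    collector.feed(html)
    collector.close()
    return collector.fragments
```

For example, `searchable_fragments("<p>Hello <b>bold</b> world</p>")` yields one fragment for each run of text, tagged with its innermost enclosing element.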

Technologies and techniques employed by the system
- HTML parsing, DOM to B-Tree transformation
- Internet Explorer DOM-related bug knowledge base
- Image enhancing and processing
- Robotic vision and OCR
- Fuzzy text algorithms
- Fancy heuristic-based search algorithm
- Statistics-driven elimination algorithm
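To give a feel for where fuzzy text matching comes in, the Python sketch below compares a fragment's indexable text with the text recognized from the page image. It is only a guess at the general idea, not the system's algorithm: the function name `visibility_score`, the word-by-word comparison and the 0.8 similarity threshold are all assumptions. It uses `difflib.SequenceMatcher` from the standard library so that small OCR mistakes (such as "visib1e" for "visible") still count as a match.

```python
from difflib import SequenceMatcher
from typing import List

SIMILARITY_THRESHOLD = 0.8  # assumed cut-off for "close enough", not a documented value


def _words(text: str) -> List[str]:
    return text.lower().split()


def visibility_score(fragment_text: str, recognized_text: str) -> float:
    """Share of the fragment's words that fuzzily match a recognized word.

    Returns a value between 0.0 (nothing was seen on the rendered page)
    and 1.0 (every word was seen).
    """
    wanted = _words(fragment_text)
    seen = _words(recognized_text)
    if not wanted:
        return 0.0
    matched = 0
    for word in wanted:
        if any(SequenceMatcher(None, word, other).ratio() >= SIMILARITY_THRESHOLD
               for other in seen):
            matched += 1
    return matched / len(wanted)
```

A fragment whose words appear in the HTML but never turn up in the recognized image text would score close to zero here, which is the kind of signal a hidden-text report needs.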
For a more detailed explanation, click here >>