 

 

A service to detect hidden text on web pages

We have created software that reads and analyzes a page's HTML together with its scripts and stylesheets, "looks" at the page as it appears in a browser, and then correlates the two sets of results.

The software is able to detect 100% of known techniques to hide text. This is not something that search engines are doing now, but it's something that they are working towards.
Figuring out whether a web page has hidden or barely visible text is a serious challenge. It requires in-depth knowledge of several computer technologies. That's why we're pursuing it! After all, this is something a few PhDs should be able to figure out...

The process

The engine takes quite a bit of time to process a web page. Currently, it's anywhere from 30 seconds to 5 minutes, depending on the size and complexity of the page. A machine running the recognition engine is seriously busy: CPU usage nears 100% whenever a page is being processed. Currently, all users share a single engine, so the wait can be lengthy if a lot of requests have been queued.

This is why the service cannot display results immediately after a user submits a URL. All requests are queued and processed on a first-in, first-out (FIFO) basis. Users are notified via email once their URLs have been processed.
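
As a rough illustration of the queue-then-notify flow described above, here is a minimal sketch in Python. The names `RequestQueue`, `submit`, `process_next`, `analyze` and `notify` are hypothetical and are not taken from the service itself; the analysis and email steps are passed in as plain callables so the sketch stays self-contained.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    url: str
    email: str


class RequestQueue:
    """Single shared queue: requests are handled strictly in arrival order."""

    def __init__(self):
        self._pending = deque()

    def submit(self, url, email):
        """Queue a URL and return the request's position in the queue."""
        self._pending.append(Request(url, email))
        return len(self._pending)

    def process_next(self, analyze, notify):
        """Process the oldest request, then notify its owner.

        `analyze` and `notify` are assumed callables standing in for the
        recognition engine and the email step.
        """
        if not self._pending:
            return None
        request = self._pending.popleft()  # FIFO: oldest request first
        report = analyze(request.url)
        notify(request.email, report)
        return request
```

Because there is only one engine, a worker would simply call `process_next` in a loop; a user who submits a URL behind ten others waits for all ten to finish first.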

The visibility report

Once a page has been processed, the findings are collected into a report. This is what gets reported.
 
Summary Section: number of HTML fragments that are searchable, but not visible to human viewers

  • DOM Hidden Fragments: DOM nodes that have been made invisible/disabled/hidden by various scripting means
  • Hidden Fragments: HTML fragments that have been made completely invisible by non-scripting means
  • Poor Fragments: HTML fragments that have been made poorly visible by non-scripting means

 

Details Section columns

  • Tag: tag name of the HTML fragment in question
  • Preview: image fragment or exact DOM reason for not being visible
  • Guess: allegedly spammy fragments marked with icons
  • Text: indexable text
  • Fragment: HTML in question
  • Score: visibility score
  • Visibility: Good/Partial/Poor/Hidden
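
To give a feel for how a numeric score could map onto the Good/Partial/Poor/Hidden labels, here is a minimal sketch. It is not the service's scoring method: it swaps in the WCAG contrast ratio between a text color and its background, and the thresholds in `classify` are assumptions chosen purely for illustration.

```python
def _linear(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def luminance(hex_color):
    """Relative luminance of a '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground, background):
    """WCAG contrast ratio: 1.0 (identical colors) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (luminance(foreground), luminance(background)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def classify(foreground, background):
    """Map a contrast ratio onto the report's labels (thresholds are assumed)."""
    ratio = contrast_ratio(foreground, background)
    if ratio < 1.1:
        return "Hidden"
    if ratio < 2.0:
        return "Poor"
    if ratio < 4.5:
        return "Partial"
    return "Good"
```

For example, `classify("#ffffff", "#ffffff")` falls in the Hidden band, while light gray text such as `#cccccc` on white lands in Poor. A score based on rendered pixels, as the service describes, would also catch text hidden behind images or layers, which this color-only sketch cannot.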

Capabilities

Most of the text hiding techniques described here are detected. Reports fall into two major categories: DOM-tricks and COLOR/BKG/LAYER-tricks. DOM-tricks are the easiest to detect, and accurate reporting is in place for them. Other types are reported as invisible or poorly visible.
 
DOM-tricks
DOM (Removed), DOM (Display:None), DOM (Visibility:Hidden), DOM (Hidden-Ancestor), DOM (Off Screen Spam)
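
A minimal sketch of how some of the DOM-trick labels above could be assigned. This is not the service's code: the service inspects the live DOM in a browser after scripts have run, whereas this sketch only reads inline `style` attributes from static, well-formed HTML using Python's standard `html.parser`. The off-screen cutoff of -1000px and all function and class names are assumptions. DOM (Removed) cannot be detected this way and is left out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (lowercased)."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip().lower()
    return decls


def own_reason(decls):
    """Return the DOM-trick label for this element's own style, if any."""
    if decls.get("display") == "none":
        return "DOM (Display:None)"
    if decls.get("visibility") == "hidden":
        return "DOM (Visibility:Hidden)"
    for prop in ("left", "top", "text-indent"):
        value = decls.get(prop, "")
        if value.endswith("px"):
            try:
                if float(value[:-2]) <= -1000:  # assumed off-screen cutoff
                    return "DOM (Off Screen Spam)"
            except ValueError:
                pass
    return None


class HiddenTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # one reason (or None) per open element
        self.findings = []  # (reason, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        reason = own_reason(parse_style(dict(attrs).get("style") or ""))
        if reason is None and any(self.stack):
            reason = "DOM (Hidden-Ancestor)"  # inherited from a hidden parent
        self.stack.append(reason)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1]:
            self.findings.append((self.stack[-1], text))


def find_hidden_text(html):
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

Given `<div style="display:none"><p>cheap pills</p></div>`, the text sits inside the `<p>`, so it is reported as DOM (Hidden-Ancestor) rather than DOM (Display:None). The sketch also ignores stylesheets and the case where a child overrides `visibility` back to `visible`.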

How does it work? An overview

There is quite a bit of stuff going on behind the scenes.

1) Processing starts with an instance of Internet Explorer navigating to the URL
2) The page is then printed to an image for visual processing
3) A DOM snapshot is recorded and broken into distinct searchable fragments, structured as a B-Tree
4) The optimized, fragmented image is fed into the recognizer, and the results are correlated with the newly acquired searchable fragments to produce a report
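
The correlation in step 4 can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual engine: it takes the searchable DOM fragments and the lines of text recognized from the page image as plain lists of strings, and uses `difflib.SequenceMatcher` from the Python standard library as a stand-in fuzzy text match. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return " ".join(text.lower().split())


def best_match(fragment, recognized_lines):
    """Highest similarity (0.0 to 1.0) between a fragment and any recognized line."""
    target = normalize(fragment)
    best = 0.0
    for line in recognized_lines:
        ratio = SequenceMatcher(None, target, normalize(line)).ratio()
        best = max(best, ratio)
    return best


def correlate(fragments, recognized_lines, threshold=0.8):
    """Return (fragment, score, seen_on_screen) for every searchable fragment.

    A fragment that is in the DOM but has no close match in the recognized
    text is a candidate for hidden or poorly visible text.
    """
    report = []
    for fragment in fragments:
        score = best_match(fragment, recognized_lines)
        report.append((fragment, round(score, 2), score >= threshold))
    return report
```

Fuzzy matching matters here because recognized text is rarely exact: "We1come to our store" should still count as a sighting of "Welcome to our store", while a fragment with no counterpart in the image gets flagged.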

 

Technologies and techniques employed by the system

  • HTML parsing, DOM to B-Tree transformation
  • Internet Explorer DOM-related bug knowledge base
  • Image enhancing and processing
  • Robotic vision and OCR
  • Fuzzy text algorithms
  • Fancy heuristic-based search algorithm
  • Statistics-driven elimination algorithm
Copyright ⓒ [2006] Brightwater Software. All rights reserved.