Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar.
It is based on:
- a combination of structural and visual comparison methods embedded in a statistical discriminative model,
- a visual similarity measure designed for Web pages that improves change detection,
- a supervised feature selection method adapted to Web archiving.
We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
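The classification step can be illustrated with a minimal sketch. It assumes a linear SVM trained by plain batch sub-gradient descent on the hinge loss; the kernel, solver, and parameters actually used by Pagelyzer are not stated here, and the function names, the two-component score vectors, and the toy training data are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(samples, labels, epochs=5000, lr=0.01, lam=0.01):
    """Fit w, b by batch sub-gradient descent on the L2-regularised hinge loss.

    samples: list of similarity-score vectors (one per pair of versions).
    labels:  +1 for "similar", -1 for "dissimilar".
    """
    n, dim = len(samples), len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) < 1:   # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, scores):
    """Return +1 (similar) or -1 (dissimilar) for one similarity vector."""
    return 1 if dot(w, scores) + b >= 0 else -1


# Hypothetical training set: [structural score, visual score] per page pair.
train_x = [[0.90, 0.95], [0.85, 0.90], [0.95, 0.80],
           [0.20, 0.30], [0.10, 0.40], [0.30, 0.15]]
train_y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)
print(predict(w, b, [0.90, 0.90]))   # high scores on both components
print(predict(w, b, [0.20, 0.28]))   # low scores on both components
```

In this sketch each pair of versions is reduced to a fixed-length score vector before it reaches the classifier, so the model never sees the pages themselves, only their similarity scores.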
Installation manual can be found here
How does it work?
Step 1: For each URL given as input, Pagelyzer takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.
Step 2: Each page is segmented based on its DOM tree information and its visual rendering. The previous version of the tool, called Marcalizer, used VIPS to segment web pages. We have since developed a new algorithm, called Page-o-Metric, which removes the VIPS restriction of using IE as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned. The details of this approach can be found in the references below.
Step 3: Visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors. A Bag of Words (BoW) representation is used for the images. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files. The structural and visual differences are merged to obtain a similarity vector, as described in the references below.
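As an illustration of the structural part of Step 3, the sketch below computes Jaccard indices over sets of items extracted from two versions of a page and collects them into a vector. The feature names ("links", "images"), the dictionary layout, and the function names are assumptions made for this example; they are not Pagelyzer's actual descriptors or API.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0   # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def structural_similarity_vector(old, new, features=("links", "images")):
    """One Jaccard score per feature, in a fixed order.

    old, new: dicts mapping a feature name to the items found in that
    version of the page (hypothetical layout for this sketch).
    """
    return [jaccard(old[name], new[name]) for name in features]


# Hypothetical items extracted from two successive versions of a page.
version_1 = {
    "links": ["/home", "/news", "/about"],
    "images": ["logo.png", "a.jpg", "b.jpg"],
}
version_2 = {
    "links": ["/home", "/news", "/contact"],
    "images": ["logo.png", "a.jpg", "b.jpg"],
}

print(structural_similarity_vector(version_1, version_2))   # [0.5, 1.0]
```

Here the two versions share two of four distinct links and all three images, which gives the scores 0.5 and 1.0. Visual scores would be appended to the same vector before it is passed to the classifier.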
- D. Cai, S. Yu, J.R. Wen, and W.Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
- M.B. Saad, S. Gançarski, and Z. Pehlivan. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
- Sanoja and S. Gançarski. Yet another Web Page Segmentation Tool. Proceedings of iPRES 2012, Toronto, Canada, 2012.
- D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
- Z. Pehlivan, M.B. Saad, and S. Gançarski. Understanding Web Pages Changes. DEXA 2010: 1-15.
- M. Teva Law, C. Sureda, N. Thome, S. Gançarski, and M. Cord. Structural and Visual Similarity Learning for Web Page Archiving. Workshop CBMI 2012.