Description:	Suite of tools for detecting changes in web pages and their rendering
Homepage:	https://github.com/openplanets/pagelyzer
Function:	Metadata Extraction, Quality Assurance
Content type:	Web

Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar or not.

It is based on:

a combination of structural and visual comparison methods embedded in a statistical discriminative model,
a visual similarity measure designed for Web pages that improves change detection,
a supervised feature selection method adapted to Web archiving.

We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
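The classification step described above can be illustrated with a minimal sketch. It assumes a linear soft-margin SVM trained by plain batch sub-gradient descent; the source does not say which SVM implementation, kernel, or parameters Pagelyzer actually uses, and the training vectors, the function names `train_linear_svm` and `predict`, and the hyperparameters below are all hypothetical.

```python
def train_linear_svm(vectors, labels, lam=0.01, eta=0.1, epochs=2000):
    """Fit a linear soft-margin SVM by batch sub-gradient descent.

    vectors: similarity-score vectors, one per pair of page versions
    labels:  +1 (similar) or -1 (dissimilar) for each vector
    Hyperparameters are illustrative, not taken from Pagelyzer.
    """
    dim = len(vectors[0])
    n = len(vectors)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Gradient of the L2 regularizer (the bias is left unregularized).
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # hinge loss is active for this sample
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (similar) or -1 (dissimilar) for one similarity vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical training data: each vector holds three similarity scores
# in [0, 1] computed between two successive versions of a page.
similar = [[0.95, 0.90, 0.92], [0.88, 0.93, 0.85],
           [0.91, 0.87, 0.96], [0.97, 0.94, 0.90]]
dissimilar = [[0.20, 0.15, 0.30], [0.10, 0.25, 0.12],
              [0.30, 0.22, 0.18], [0.15, 0.08, 0.25]]

vectors = similar + dissimilar
labels = [1] * len(similar) + [-1] * len(dissimilar)

w, b = train_linear_svm(vectors, labels)
print(predict(w, b, [0.99, 0.98, 0.99]))
print(predict(w, b, [0.05, 0.05, 0.05]))
```

In practice an established SVM library would replace the hand-written training loop; the point here is only that the model sees nothing but the vector of similarity scores for each pair of versions.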

Installation manual can be found here

How does it work?

Step 1: For each URL given as input, Pagelyzer takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.

Step 2: Each page is segmented based on its DOM tree information and its visual rendering. In the previous version of the tool, called Marcalizer, VIPS (Cai et al., 2003) was used to segment web pages. We have since developed a new algorithm, called Page-o-Metric, which removes the VIPS restriction of using IE as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned. The details of this approach can be found in Sanoja and Gançarski (2012).

Step 3: Visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors (Lowe, 2004). For image representation, a Bag of Words (BoW) representation is used. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files. The structural and visual differences are merged to obtain a similarity vector according to Law et al. (2012).
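As a rough illustration of Step 3, the sketch below computes one structural score (a Jaccard index over the links of two versions) and one visual score (the overlap of two Bag of Words histograms) and concatenates them into a similarity vector. The choice of links as the compared sets, histogram intersection as the visual measure, the toy 2-D descriptors (real SIFT descriptors have 128 dimensions), and all function names are assumptions made for the example, not Pagelyzer's actual descriptors or API.

```python
import math


def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def bow_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and return the
    normalized histogram of word counts."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda k: math.dist(d, vocabulary[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)


def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms (1.0 means identical)."""
    return sum(min(x, y) for x, y in zip(h1, h2))


# Hypothetical structural data: links found in two versions of a page.
links_v1 = {"/home", "/news", "/about", "/contact"}
links_v2 = {"/home", "/news", "/about", "/jobs"}

# Hypothetical visual data: a 3-word vocabulary and toy 2-D descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
desc_v1 = [(0.1, 0.0), (0.9, 1.0), (0.0, 0.9), (1.0, 0.9)]
desc_v2 = [(0.0, 0.1), (0.1, 0.1), (1.0, 1.0), (0.1, 1.0)]

structural = jaccard_index(links_v1, links_v2)
visual = histogram_intersection(bow_histogram(desc_v1, vocabulary),
                                bow_histogram(desc_v2, vocabulary))

# The merged similarity vector is what a classifier would receive.
similarity_vector = [structural, visual]
print(similarity_vector)  # [0.6, 0.75]
```

Each additional descriptor (for example, one Jaccard index per element type, or one score per color descriptor) would simply add another entry to the vector.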

References

D. Cai, S. Yu, J.R. Wen, and W.Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
M.B. Saad, S. Gançarski, Z. Pehlivan. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
Sanoja, S. Gançarski. Yet another Web Page Segmentation Tool. Proceedings of iPRES 2012, Toronto, Canada, 2012.
D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
Z. Pehlivan, M.B. Saad, S. Gançarski. Understanding Web Pages Changes. DEXA 2010: 1-15.
M. Teva Law, C. Sureda, N. Thome, S. Gançarski, M. Cord. Structural and Visual Similarity Learning for Web Page Archiving. Workshop CBMI 2012.

User Experiences

SP:SO18 Comparing two web page versions for web archiving

Development Activity

Commits

2014-06-30 15:03:44: Updates (commit accc4c16f1dd5e655d9d7efcdc74ad89e12d883e); by zeynep.pehlivan@lip6.fr
2014-06-30 14:59:53: Adding javadoc comments (commit 718945cbac4aaa95657f77120852bd61f10025bb); by zeynep.pehlivan@lip6.fr
2014-06-27 10:42:07: Remove system.out.println to Jpagelyzer (commit 2fa821278a10119c02166331e4b6de39f456182d); by zeynep.pehlivan@lip6.fr
2014-06-26 13:44:14: adding project descriptions (commit 5326c92605b0d66f2e017e5db15fbf858e38551d); by zeynep.pehlivan@lip6.fr
2014-06-23 12:27:26: readme, xml properties (commit 2fc327477651c4ce099899af425710de198d69eb); by asanoja https://github.com/asanoja
