Difference between revisions of "Pagelyzer"
Revision as of 23:15, 24 October 2013
Description
Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar.
It is based on:
- a combination of structural and visual comparison methods embedded in a statistical discriminative model,
- a visual similarity measure designed for Web pages that improves change detection,
- a supervised feature selection method adapted to Web archiving.
We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
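The classification step can be illustrated with a minimal sketch. It assumes a linear SVM whose weights and bias have already been learned; the class name `LinearSvmModel`, the three-score layout, and all numeric values are hypothetical and are not taken from Pagelyzer itself, which may use a different kernel and feature set.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LinearSvmModel:
    """Hypothetical stand-in for a trained linear SVM."""

    weights: Sequence[float]
    bias: float

    def decision_value(self, scores: Sequence[float]) -> float:
        """Signed score w . x + b for one similarity vector."""
        if len(scores) != len(self.weights):
            raise ValueError("similarity vector has the wrong dimension")
        return sum(w * s for w, s in zip(self.weights, scores)) + self.bias

    def is_similar(self, scores: Sequence[float]) -> bool:
        """Two versions are called similar when the score is non-negative."""
        return self.decision_value(scores) >= 0.0


# Made-up parameters; in practice they come from training on labelled pairs.
model = LinearSvmModel(weights=(2.0, 1.5, 1.0), bias=-2.5)

# Each vector holds similarity scores between two versions of a page
# (for example one structural score and two visual scores; assumed layout).
high_scores = (0.9, 0.8, 0.95)
low_scores = (0.2, 0.1, 0.3)

print(model.is_similar(high_scores))
print(model.is_similar(low_scores))
```

The point of the sketch is only that a page pair is reduced to a fixed-length vector of similarity scores before classification, so the classifier never sees the pages themselves.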
Installation manual can be found here
How does it work?
Step 1: For each URL given as input, Pagelyzer takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.
Step 2: Each page is segmented based on its DOM tree information and its visual rendering. In the previous version of the tool, called Marcalizer, VIPS (Cai et al., 2003) was used to segment web pages. We have since developed a new algorithm, called Page-o-Metric, which removes the VIPS restriction of using IE as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned. The details of this approach can be found in the References.
Step 3: Visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors (Lowe, 2004). Images are represented with a Bag of Words (BoW) representation. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files. The structural and visual differences are merged into a similarity vector, following the approach described in the References.
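As a rough illustration of Step 3, the sketch below computes one structural score (a Jaccard index over link sets) and one visual score (a comparison of Bag of Words histograms) and merges them into a similarity vector. The choice of links as the structural feature, the toy 2-D descriptors (real SIFT descriptors have 128 dimensions), the two-word codebook, and the use of cosine similarity are all assumptions made for the example; the source does not specify them.

```python
from math import sqrt
from typing import Iterable, List, Sequence

Vector = Sequence[float]


def jaccard_index(a: Iterable[str], b: Iterable[str]) -> float:
    """|A intersect B| / |A union B|; two empty sets count as identical."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def squared_distance(p: Vector, q: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(p, q))


def bow_histogram(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Count, for each codeword, how many descriptors are nearest to it."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(d, codebook[i]))
        counts[nearest] += 1
    return counts


def cosine_similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(h1, h2))
    norm = sqrt(sum(x * x for x in h1)) * sqrt(sum(y * y for y in h2))
    return dot / norm if norm else 0.0


def similarity_vector(links_a, links_b, desc_a, desc_b, codebook) -> List[float]:
    """Merge one structural and one visual score into a single vector."""
    structural = jaccard_index(links_a, links_b)
    visual = cosine_similarity(bow_histogram(desc_a, codebook),
                               bow_histogram(desc_b, codebook))
    return [structural, visual]


# Toy inputs (all hypothetical).
links_v1 = {"/", "/news", "/about"}
links_v2 = {"/", "/news", "/contact"}
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors_v1 = [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)]
descriptors_v2 = [(1.0, 0.0), (10.0, 9.0), (11.0, 11.0)]

vector = similarity_vector(links_v1, links_v2,
                           descriptors_v1, descriptors_v2, codebook)
print(vector)
```

A vector built this way has one entry per descriptor, which is the form a classifier expects as input.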
References
- D. Cai, S. Yu, J.R. Wen, and W.Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
- M.B. Saad, S. Gançarski, and Z. Pehlivan. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
- Sanoja and S. Gançarski. Yet another Web Page Segmentation Tool. In Proceedings of iPRES 2012, Toronto, Canada, 2012.
- D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
- Z. Pehlivan, M.B. Saad, and S. Gançarski. Understanding Web Pages Changes. DEXA 2010: 1-15.
- M. Teva Law, C. Sureda, N. Thome, S. Gançarski, and M. Cord. Structural and Visual Similarity Learning for Web Page Archiving. Workshop CBMI 2012.
User Experiences
Development Activity
- 2014-06-30 15:03:44: Updates (commit accc4c16f1dd5e655d9d7efcdc74ad89e12d883e, by zeynep.pehlivan@lip6.fr)
- 2014-06-30 14:59:53: Adding javadoc comments (commit 718945cbac4aaa95657f77120852bd61f10025bb, by zeynep.pehlivan@lip6.fr)
- 2014-06-27 10:42:07: Remove system.out.println to Jpagelyzer (commit 2fa821278a10119c02166331e4b6de39f456182d, by zeynep.pehlivan@lip6.fr)
- 2014-06-26 13:44:14: adding project descriptions (commit 5326c92605b0d66f2e017e5db15fbf858e38551d, by zeynep.pehlivan@lip6.fr)
- 2014-06-23 12:27:26: readme, xml properties (commit 2fc327477651c4ce099899af425710de198d69eb, by asanoja https://github.com/asanoja)
- 2014-06-19 10:21:26: Update README.md (commit 716b1190dbd3f3783ec7d8314c05eed9481d65af, by zeynep.pehlivan@lip6.fr)
- 2014-06-19 09:42:31: Update README.md (commit 169fce0b1a930ef69d6c0ba17be37827d646464c, by zeynep.pehlivan@lip6.fr)
- 2014-06-19 09:41:36: Update README.md (commit ec732d3599fa8d406d33d07434143985d126165a, by zeynep.pehlivan@lip6.fr)
- 2014-05-25 17:21:09: pom update for train (commit 3715fd878e0f7140a8d319c92908f32a36074c8a, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-23 14:29:45: Updating -5000 score as a result of structural dissimilarity also to -1 (commit c41679de1208ae2cfbe74814722d407f513aa559, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-22 13:32:16: Updating test cleaning code (commit 413c7321f01fd1647ae0049b45fcb226e4dd748d, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-22 13:26:51: Adding arguments details (commit b6d3548a8bb8eecb5716aa0367eb7c269a1c3a7f, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-22 13:25:41: Removing the iteration limits (commit 49d97e44aaf5332a486a9833f9fe60cc092d9dc2, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-22 13:25:14: Updating URL split function: problem detected with index pages based on (commit 21372bc8a0f31b272df6bba0daaea00a9f0aec3f, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-22 13:21:24: Removing hard coded main function (commit bfd8d9d4ba2a0e1d94f8dae4b4fa31da4676bdd0, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-18 21:34:27: removing hard coded part to clean url (commit 630b996daeba6b34c06880b9391f5dbc9507f0fc, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-18 20:51:29: last pom (I hope) (commit 3dc9ab72a7d2dd425225680d8917bad110c2b79a, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-18 20:42:38: Removing mode option from commandline because all other svm files etc. (commit f3c9b0fb3c1a64322f92fb859ec7b054a0b3a71d, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-18 20:11:45: JPagelyzer: adding seperated capture function to use in train (commit 0b5cb86bb30c5872500e5c0574ba1d767907ac22, by pehlivanz@pehlivanz-SATELLITE-R830)
- 2014-05-18 20:04:47: Removing options related to reading files from disk (commit 2893e20d9e379ce2017ac54d77e99126768f9722, by pehlivanz@pehlivanz-SATELLITE-R830)