Difference between revisions of "Pagelyzer"

From COPTR
Line 46:
  *[http://wiki.opf-labs.org/display/SP/SO18+Comparing+two+web+page+versions+for+web+archiving SP:SO18 Comparing two web page versions for web archiving]

- = Development Activity =
+ = Activity =
- {{Tool_activity
+ == Releases ==
- |releaseFeed=https://github.com/openplanets/pagelyzer/releases.atom
+ <rss>https://github.com/openplanets/pagelyzer/releases.atom</rss>
- |commitFeed=https://github.com/openplanets/pagelyzer/commits/master.atom
+ == Commits ==
- |issueFeed=
+ <rss>https://github.com/openplanets/pagelyzer/commits/master.atom</rss>
- |ohlohID=
- }}

Revision as of 14:45, 27 October 2013

Pagelyzer
Suite of tools for detecting changes in web pages and their rendering
Homepage: https://github.com/openplanets/pagelyzer
Source Code: https://github.com/openplanets/pagelyzer
Cost: Free

Description

Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar or not.

It is based on:

  • a combination of structural and visual comparison methods embedded in a statistical discriminative model,
  • a visual similarity measure designed for Web pages that improves change detection,
  • a supervised feature selection method adapted to Web archiving.

We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
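The training and decision steps described above can be illustrated with a minimal sketch. It is not Pagelyzer's own code: it assumes a linear SVM trained by plain sub-gradient descent on the regularised hinge loss, and the score vectors, labels, function names and hyperparameters are all hypothetical values chosen for illustration.

```python
# Toy similarity-score vectors (hypothetical values): one vector per pair of
# page versions, here read as (structural similarity, visual similarity).
SCORE_VECTORS = [
    (0.95, 0.90), (0.85, 0.80), (0.90, 0.75),   # pairs labelled "similar"
    (0.20, 0.10), (0.30, 0.25), (0.15, 0.35),   # pairs labelled "dissimilar"
]
LABELS = [1, 1, 1, -1, -1, -1]


def train_linear_svm(samples, labels, epochs=500, lr=0.1, lam=0.001):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss.

    Assumption: a linear kernel and these hyperparameters are illustrative
    only; the source does not state which kernel or solver is used.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularisation: shrink the weights slightly at every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move the hyperplane towards the sample.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return 1 (similar) or -1 (dissimilar) for one score vector."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    w, b = train_linear_svm(SCORE_VECTORS, LABELS)
    print(predict(w, b, (0.9, 0.9)), predict(w, b, (0.1, 0.1)))
```

In practice an established SVM library would normally replace the hand-written training loop. The sketch only shows how a vector of similarity scores turns into a similar/dissimilar decision.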

Installation manual can be found here

How does it work?

Step 1: For each URL given as input, Pagelyzer takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.

Step 2: Each page is segmented based on its DOM tree information and its visual rendering. In the previous version of the tool, called Marcalizer, VIPS was used to segment web pages. However, we developed a new algorithm called Page-o-Metric, which removes the VIPS restriction of using IE as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned. Details of this approach can be found in the references below.

Step 3: Visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and also by SIFT descriptors. For image representation, a Bag of Words (BoW) representation is used. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files. The structural and visual differences are merged to obtain a similarity vector, as described in the references below.
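As a rough illustration of the structural side of Step 3, the sketch below computes Jaccard indices over two sets of page elements and appends a visual score to form a similarity vector. Which elements Pagelyzer actually compares, how the visual score is obtained, and how the scores are merged are not stated here, so the link and image sets, the value 0.8 and the simple concatenation are all assumptions.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention used here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def similarity_vector(structural_scores, visual_scores):
    """Merge the scores into one vector (assumption: plain concatenation)."""
    return list(structural_scores) + list(visual_scores)


# Hypothetical elements extracted from two versions of the same page.
links_v1 = {"/home", "/news", "/about"}
links_v2 = {"/home", "/news", "/contact"}
images_v1 = {"logo.png", "banner.jpg"}
images_v2 = {"logo.png", "banner.jpg"}

structural = [
    jaccard_index(links_v1, links_v2),    # 2 shared links out of 4 distinct
    jaccard_index(images_v1, images_v2),  # identical image sets
]
visual = [0.8]  # placeholder for a visual similarity score (hypothetical)

vector = similarity_vector(structural, visual)
```

A vector built this way is the kind of input the trained classifier receives for each pair of page versions.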

References

  • D. Cai, S. Yu, J.R. Wen, and W.Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
  • M.B. Saad, S. Gançarski, Z. Pehlivan. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
  • Sanoja, S. Gançarski. Yet another Web Page Segmentation Tool. Proceedings of iPRES 2012, Toronto, Canada, 2012.
  • D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
  • Z. Pehlivan, M.B. Saad, S. Gançarski. Understanding Web Pages Changes. DEXA 2010: 1-15.
  • M. Teva Law, C. Sureda, N. Thome, S. Gançarski, M. Cord. Structural and Visual Similarity Learning for Web Page Archiving. Workshop CBMI 2012.

User Experiences

Activity

Releases

Commits

2014-06-30 15:03:44
[tag:github.com,2008:Grit::Commit/accc4c16f1dd5e655d9d7efcdc74ad89e12d883e Updates]
by zeynep.pehlivan@lip6.fr
2014-06-30 14:59:53
[tag:github.com,2008:Grit::Commit/718945cbac4aaa95657f77120852bd61f10025bb Adding javadoc comments]
by zeynep.pehlivan@lip6.fr
2014-06-27 10:42:07
[tag:github.com,2008:Grit::Commit/2fa821278a10119c02166331e4b6de39f456182d Remove system.out.println to Jpagelyzer]
by zeynep.pehlivan@lip6.fr
2014-06-26 13:44:14
[tag:github.com,2008:Grit::Commit/5326c92605b0d66f2e017e5db15fbf858e38551d adding project descriptions]
by zeynep.pehlivan@lip6.fr
2014-06-23 12:27:26
[tag:github.com,2008:Grit::Commit/2fc327477651c4ce099899af425710de198d69eb readme, xml properties]
by asanoja https://github.com/asanoja
2014-06-19 10:21:26
[tag:github.com,2008:Grit::Commit/716b1190dbd3f3783ec7d8314c05eed9481d65af Update README.md]
by zeynep.pehlivan@lip6.fr
2014-06-19 09:42:31
[tag:github.com,2008:Grit::Commit/169fce0b1a930ef69d6c0ba17be37827d646464c Update README.md]
by zeynep.pehlivan@lip6.fr
2014-06-19 09:41:36
[tag:github.com,2008:Grit::Commit/ec732d3599fa8d406d33d07434143985d126165a Update README.md]
by zeynep.pehlivan@lip6.fr
2014-05-25 17:21:09
[tag:github.com,2008:Grit::Commit/3715fd878e0f7140a8d319c92908f32a36074c8a pom update for train]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-23 14:29:45
[tag:github.com,2008:Grit::Commit/c41679de1208ae2cfbe74814722d407f513aa559 Updating -5000 score as a result of structural dissimilarity also to -1]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-22 13:32:16
[tag:github.com,2008:Grit::Commit/413c7321f01fd1647ae0049b45fcb226e4dd748d Updating test cleaning code]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-22 13:26:51
[tag:github.com,2008:Grit::Commit/b6d3548a8bb8eecb5716aa0367eb7c269a1c3a7f Adding arguments details]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-22 13:25:41
[tag:github.com,2008:Grit::Commit/49d97e44aaf5332a486a9833f9fe60cc092d9dc2 Removing the iteration limits]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-22 13:25:14
[tag:github.com,2008:Grit::Commit/21372bc8a0f31b272df6bba0daaea00a9f0aec3f Updating URL split function: problem detected with index pages based on]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-22 13:21:24
[tag:github.com,2008:Grit::Commit/bfd8d9d4ba2a0e1d94f8dae4b4fa31da4676bdd0 Removing hard coded main function]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-18 21:34:27
[tag:github.com,2008:Grit::Commit/630b996daeba6b34c06880b9391f5dbc9507f0fc removing hard coded part to clean url]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-18 20:51:29
[tag:github.com,2008:Grit::Commit/3dc9ab72a7d2dd425225680d8917bad110c2b79a last pom (I hope)]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-18 20:42:38
[tag:github.com,2008:Grit::Commit/f3c9b0fb3c1a64322f92fb859ec7b054a0b3a71d Removing mode option from commandline because all other svm files etc.]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-18 20:11:45
[tag:github.com,2008:Grit::Commit/0b5cb86bb30c5872500e5c0574ba1d767907ac22 JPagelyzer: adding seperated capture function to use in train]
by pehlivanz@pehlivanz-SATELLITE-R830
2014-05-18 20:04:47
[tag:github.com,2008:Grit::Commit/2893e20d9e379ce2017ac54d77e99126768f9722 Removing options related to reading files from disk]
by pehlivanz@pehlivanz-SATELLITE-R830