Pagelyzer

{{Infobox tool
|image=Pagelyzer_small.png
|purpose=Suite of tools for detecting changes in web pages and their rendering
|homepage=https://github.com/openplanets/pagelyzer
|function=Metadata Extraction, Quality Assurance
|content=Web
}}
{{Infobox tool details}}
'''Source Code:''' https://github.com/openplanets/pagelyzer<br />
'''License:''' None<br />
'''Cost:''' Free<br />
'''Platform:''' Unix

[[Category:Characterisation]]
= Description =
Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar.
 
A Support Vector Machine (SVM) model is trained on vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, described by their vector of similarity scores, are similar or not. Experiments on real web archives validate this approach.
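As a rough illustration of the classification idea described above, the following sketch trains a tiny linear SVM on similarity vectors. It is not Pagelyzer's actual implementation: the two features (a visual and a structural similarity score), the training values, and all function names are assumptions made for this example, and the training loop is a plain sub-gradient descent on the regularised hinge loss rather than whatever SVM library the tool really uses.

```python
from typing import List, Sequence, Tuple

# Hypothetical similarity vectors: [visual_similarity, structural_similarity].
# Label +1 means "versions are similar", -1 means "versions are dissimilar".
TRAIN_VECTORS: List[List[float]] = [
    [0.90, 0.80], [0.20, 0.30],
    [0.80, 0.90], [0.10, 0.20],
    [0.95, 0.85], [0.30, 0.10],
    [0.85, 0.95], [0.25, 0.15],
]
TRAIN_LABELS: List[int] = [1, -1, 1, -1, 1, -1, 1, -1]


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of the linear model w.x + b for one similarity vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.1,
    reg: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit a linear SVM by sub-gradient descent on the regularised hinge loss."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            if y * decision_value(w, b, x) < 1:
                # Sample is inside the margin: move the hyperplane towards it.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Sample is safely classified: only apply weight decay.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def is_similar(w: Sequence[float], b: float, x: Sequence[float]) -> bool:
    """Classify a pair of page versions from its similarity vector."""
    return decision_value(w, b, x) > 0


if __name__ == "__main__":
    weights, bias = train_linear_svm(TRAIN_VECTORS, TRAIN_LABELS)
    print(is_similar(weights, bias, [0.9, 0.9]))
    print(is_similar(weights, bias, [0.2, 0.2]))
```

A linear model keeps the sketch short; a kernel SVM from an established library would be the more usual choice in practice.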
  
The installation manual can be found [http://wiki.opf-labs.org/download/attachments/12059037/installation+manual-1-pagealyzer.pdf?version=4&modificationDate=1354896471000 here].
  
== How does it work? ==
Step 1: For each URL given as input, Pagelyzer takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.

Step 3: In this step, visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and also by SIFT descriptors. For image representation, a Bag of Words (BoW) representation is used. Structural descriptors are based on Jaccard indices and also on the differences between the Vi-XML files. The structural and visual differences are merged to obtain a similarity vector according to .
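The descriptor merging in Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not Pagelyzer's code: the structural score is a Jaccard index over sets of links, the visual score is a histogram intersection between normalised Bag of Words histograms (the comparison measure is an assumption; the source does not name one), and the visual word IDs are taken as given rather than computed from SIFT descriptors. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Sequence


def jaccard_index(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Jaccard index |A & B| / |A | B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sets are considered identical
    return len(set_a & set_b) / len(union)


def bow_histogram(visual_words: Sequence[int], vocabulary_size: int) -> List[float]:
    """Normalised Bag of Words histogram over visual word IDs 0..vocabulary_size-1."""
    counts = Counter(visual_words)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * vocabulary_size
    return [counts.get(i, 0) / total for i in range(vocabulary_size)]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1] between two normalised histograms."""
    return sum(min(p, q) for p, q in zip(h1, h2))


def similarity_vector(
    links_v1: Iterable[str],
    links_v2: Iterable[str],
    words_v1: Sequence[int],
    words_v2: Sequence[int],
    vocabulary_size: int,
) -> List[float]:
    """Merge one structural and one visual score into a similarity vector."""
    structural = jaccard_index(links_v1, links_v2)
    visual = histogram_intersection(
        bow_histogram(words_v1, vocabulary_size),
        bow_histogram(words_v2, vocabulary_size),
    )
    return [structural, visual]


if __name__ == "__main__":
    vector = similarity_vector(
        ["/a", "/b", "/c"], ["/b", "/c", "/d"],  # links in each version
        [0, 0, 1, 2], [0, 1, 1, 2],              # visual word IDs per snapshot
        vocabulary_size=3,
    )
    print(vector)  # [0.5, 0.75]
```

A vector of this kind is what a trained classifier would receive as input for each pair of page versions.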
  
== References ==
* D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
* Saad M.B., Gançarski S., Pehlivan Z. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
* Sanoja, Gançarski S. "Yet another Web Page Segmentation Tool". Proceedings of iPRES 2012, Toronto, Canada, 2012.
* D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
* Pehlivan Z., Saad M.B., Gançarski S. Understanding Web Pages Changes. ''DEXA 2010: 1-15''.
* M. Teva Law, C. Sureda, N. Thome, S. Gançarski, M. Cord. Structural and Visual Similarity Learning for Web Page Archiving. Workshop CBMI 2012.
 
  
= User Experiences =
*[http://wiki.opf-labs.org/display/SP/SO18+Comparing+two+web+page+versions+for+web+archiving SP:SO18 Comparing two web page versions for web archiving]
  
= Development Activity =
== Releases ==
<rss max=5>https://github.com/openplanets/pagelyzer/releases.atom</rss>
 
== Commits ==
 
<rss max=5>https://github.com/openplanets/pagelyzer/commits/master.atom</rss>
 
