Editing Pagelyzer
Latest revision | Your text | ||
Line 1: | Line 1: | ||
− = Description =
+ === Summary ===
+
+ '''Purpose:''' {excerpt}Suite of tools for detecting changes in web pages and their rendering<br />
+ {excerpt}<br />
+ '''Homepage:''' [[https://github.com/openplanets/pagelyzer https://github.com/openplanets/pagelyzer]]<br />
+ '''Source Code:''' [[https://github.com/openplanets/pagelyzer https://github.com/openplanets/pagelyzer]]<br />
+ '''License:''' None<br />
+ '''Cost:''' Free<br />
+ '''Platform:''' Unix<br />
+ [[Image:http://www.webaddress.logo/Insert_Logo_URL_Here|http://www.webaddress.logo/Insert_Logo_URL_Here]]
+
+ [[Category:Characterisation]]
+
+ === Description ===
+
Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar.
Line 17: | Line 23: | ||
We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.
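The decision step described above can be sketched as follows. This is a minimal illustration, assuming a linear kernel; the weights, bias, and the three-component score vectors are hypothetical values chosen for the example and are not taken from the Pagelyzer model.

```python
from typing import Sequence


def svm_decide(scores: Sequence[float], weights: Sequence[float], bias: float) -> bool:
    """Apply a linear SVM decision function to one vector of similarity scores.

    Returns True ("similar") when w . x + b is non-negative.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    margin = sum(w * s for w, s in zip(weights, scores)) + bias
    return margin >= 0.0


# Hypothetical parameters standing in for a trained model (assumption).
WEIGHTS = [1.5, 2.0, 1.0]
BIAS = -2.5

# Hypothetical similarity-score vectors for two pairs of page versions.
similar_pair = [0.9, 0.8, 0.95]
different_pair = [0.2, 0.1, 0.3]

print(svm_decide(similar_pair, WEIGHTS, BIAS))
print(svm_decide(different_pair, WEIGHTS, BIAS))
```

In practice the weights and bias would come from training on labelled pairs of page versions rather than being written by hand.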
− Installation manual can be found [http://wiki.opf-labs.org/download/attachments/12059037/installation+manual-1-pagealyzer.pdf?version=4&modificationDate=1354896471000 here]
+ Installation manual can be found [http://wiki.opf-labs.org/download/attachments/12059037/installation+manual-1-pagealyzer.pdf?version=4&modificationDate=1354896471000|here]
− == How does it work? ==
+ === How does it work? ===
Step 1: For each URL given as input, it takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.
Line 27: | Line 33: | ||
Step 3: In this step, visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors. A Bag of Words (BoW) representation is used for the images. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files. The structural and visual differences are then merged into a similarity vector.
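As an illustration of the structural part of Step 3, the sketch below computes Jaccard indices and collects them into a vector. Which page elements are compared (links and images here) and the sample values are assumptions made for the example; the source does not specify them.

```python
from typing import Iterable, List


def jaccard_index(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Two empty sets are treated as identical by convention.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical elements extracted from two versions of a page (assumption).
links_v1 = ["/home", "/news", "/about"]
links_v2 = ["/home", "/news", "/contact"]
images_v1 = ["logo.png", "banner.jpg"]
images_v2 = ["logo.png", "banner.jpg"]

# One score per element type, collected into a similarity vector.
similarity_vector: List[float] = [
    jaccard_index(links_v1, links_v2),
    jaccard_index(images_v1, images_v2),
]
print(similarity_vector)
```

A vector like this, extended with the visual scores, is the kind of input the trained classifier would receive.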
+ '''References:'''
+
+ D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
+
+ Saad M.B., Gançarski S., Pehlivan Z. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
+
+ Sanoja, Gançarski S. “Yet another Web Page Segmentation Tool”. Proceedings iPRES 2012. Toronto, Canada, 2012.
+
+ D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
+
+ Pehlivan Z., Saad M.B., Gançarski S. Understanding Web Pages Changes. ''DEXA 2010: 1-15''
+
+ M. Teva Law, C. Sureda, N. Thome, S. Gançarski, M. Cord. Structural and Visual Similarity Learning for Web Page Archiving, Workshop CBMI 2012.
− = User Experiences =
+ === User Experiences and Test Data ===
+ [SP:SO18 Comparing two web page versions for web archiving]
− = Development Activity =
+ === Development Activity ===
+ https://github.com/openplanets/pagelyzer/commits/master.atom