Edit Tool: Pagelyzer

Purpose: Suite of tools for detecting changes in web pages and their rendering

= Description =
Pagelyzer is a tool that compares two versions of a web page and decides whether they are similar. It is based on:
* a combination of structural and visual comparison methods embedded in a statistical discriminative model,
* a visual similarity measure designed for Web pages that improves change detection,
* a supervised feature selection method adapted to Web archiving.

We train a Support Vector Machine model with vectors of similarity scores between successive versions of pages. The trained model then determines whether two versions, defined by their vector of similarity scores, are similar or not. Experiments on real Web archives validate our approach.

The installation manual can be found [http://wiki.opf-labs.org/download/attachments/12059037/installation+manual-1-pagealyzer.pdf?version=4&modificationDate=1354896471000 here].

== How does it work? ==
Step 1: For each URL given as input, the tool takes a screen capture in PNG format and also produces an HTML document with the visual cues integrated, called Decorated HTML. This saves the state of the browser at the moment of capture and decouples the solution from any particular browser.

Step 2: Each page is segmented based on its DOM tree information and its visual rendering. In the previous version of the tool, called Marcalizer, VIPS was used to segment web pages. We have since developed a new algorithm called Page-o-Metric, which removes the VIPS restriction of using IE as the web browser and also improves the precision of visual block extraction and hierarchy construction. At the end of this step, two XML trees, called Vi-XML, representing the segmented web pages are returned. The details of this approach can be found in the references listed below.

Step 3: Visual and structural descriptors are extracted. Images (snapshots) are first described by color descriptors and by SIFT descriptors. A Bag of Words (BoW) representation is used for the images. Structural descriptors are based on Jaccard indices and on the differences between the Vi-XML files. The structural and visual differences are merged to obtain a similarity vector (see References).

== References ==
* D. Cai, S. Yu, J.R. Wen, W.Y. Ma. VIPS: a Vision-based Page Segmentation Algorithm. Technical report, Microsoft Research, 2003.
* M.B. Saad, S. Gançarski, Z. Pehlivan. A Novel Web Archiving Approach based on Visual Pages Analysis. In 9th International Web Archiving Workshop (IWAW), ECDL 2009.
* Sanoja, S. Gançarski. Yet another Web Page Segmentation Tool. Proceedings iPRES 2012, Toronto, Canada, 2012.
* D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60, 2004.
* Z. Pehlivan, M.B. Saad, S. Gançarski. Understanding Web Pages Changes. DEXA 2010: 1-15.
* M. Teva Law, C. Sureda, N. Thome, S. Gançarski, M. Cord. Structural and Visual Similarity Learning for Web Page Archiving. Workshop CBMI 2012.

= User Experiences =
* [http://wiki.opf-labs.org/display/SP/SO18+Comparing+two+web+page+versions+for+web+archiving SP:SO18 Comparing two web page versions for web archiving]

= Development Activity =
== Releases ==
<rss max=5>https://github.com/openplanets/pagelyzer/releases.atom</rss>
== Commits ==
<rss max=5>https://github.com/openplanets/pagelyzer/commits/master.atom</rss>
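= Illustrative sketch =
Step 3 of the description says that the structural descriptors are based on Jaccard indices. The following is a minimal sketch of a Jaccard index between two sets, not Pagelyzer's own code. The function name <code>jaccard_index</code>, the choice of hyperlinks as the compared items, and the sample link sets are all assumptions made for illustration; the page does not state which sets Pagelyzer actually compares.

```python
def jaccard_index(a, b):
    """Return |A intersect B| / |A union B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical link sets extracted from two versions of the same page.
links_v1 = {"/home", "/news", "/about", "/contact"}
links_v2 = {"/home", "/news", "/about", "/jobs"}

# Three shared links out of five distinct links overall.
score = jaccard_index(links_v1, links_v2)
print(score)
```

A score of 1.0 means the two sets are identical and 0.0 means they share nothing. In a pipeline like the one described, several such scores could be placed side by side with the visual scores to form the similarity vector passed to the classifier.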
