
How does the Grapher work?


Inside grapher.js lives the Grapher class. It expects a URL, used as the entry point to the graph, along with various options, including:

  • strict boolean If set to true, the toJSON method will not render unverified links.
  • crawlLimit integer The maximum depth to which a chain of links is crawled without finding a verified link; once a chain exceeds this depth, it is abandoned and crawling of it ceases.
  • stripDeeperLinks boolean If set to true, only the shallowest paths on each domain (including ties for shallowest) are rendered when toJSON is called.

The only option above that actually affects the construction of the graph is crawlLimit. Set it too low and the whole graph may not be rendered; set it too high and the Grapher may do a lot of unnecessary crawling that leads nowhere.

An example initialization is shown below:

var url = "http://premasagar.com",
    options, grapher;

options = {
  strict: true,
  crawlLimit: 3
};

grapher = new Grapher(url, options);

The main graph construction loop

Once the Grapher has been initialized, call its build method to begin the crawling process. Depending on the size of the graph, this may take a minute or so.

This process begins with the build method using the graph's rootUrl (set at the initialization of the Grapher) to initialize a Page object.

This Page has its verified property manually set to true by the build method, as the entry point into the graph is treated as inherently valid.

Once this is done, the Page is added to the Grapher's pages array and the fetchPages method is called for the first time.

Side note: when the build method is first called, you pass it a callback function that will be executed upon completion of the graph's construction. This callback is passed on to the fetchPages method after the initial Page has been added to the pages array.
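
As a minimal sketch of kicking off a crawl (the argument passed to the callback is an assumption; check grapher.js for the actual signature):

// Hedged sketch: assumes the completion callback receives the finished
// graph; the real arguments may differ.
grapher.build(function (graph) {
  // Construction is complete, so toJSON will render the full result.
  console.log(graph.toJSON());
});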

fetchPages

At the core of the fetchPages method is an each loop that goes through every page in the pages array and checks whether any have a status of "unfetched". Any that do are checked to see whether their depth exceeds the Grapher's crawlLimit and, provided it doesn't, their fetch method is called.

If, by the end of the loop, every Page has been fetched, then each is verified one last time before the callback originally passed in by build is executed.
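
A rough sketch of that loop, using the names from the description above (the exact control flow is an approximation, not the actual source):

// Hedged sketch of fetchPages; assumes the options object is stored
// on the instance as this.options.
Grapher.prototype.fetchPages = function (whenGraphIsBuilt) {
  var self = this,
      stillFetching = false;

  this.pages.forEach(function (page) {
    if (page.status === "unfetched" &&
        page.depth <= self.options.crawlLimit) {
      // Within the crawl limit: fetch the page.
      stillFetching = true;
      page.fetch(function () {
        self.whenPageIsFetched(whenGraphIsBuilt);
      });
    } else if (page.status === "fetching") {
      // A fetch is still in flight somewhere.
      stillFetching = true;
    }
  });

  if (!stillFetching) {
    this.verifyPages();   // one last verification pass
    whenGraphIsBuilt();   // the callback originally passed to build
  }
};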

page.fetch

The first thing the fetch method does when called is update the Page's status to "fetching". This informs the Grapher's fetchPages method that the graph is still being built and that the callback passed in by build should not yet be executed.

Next, the method checks the cache to see whether the Page's url has already been crawled; if it hasn't, the scraper is used to acquire it. The scraper then caches any data it scrapes. Regardless of whether the data comes from the cache or the scraper, the same callback is then executed.
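
A hedged sketch of that flow is below; cache.get, cache.set, and scraper.scrape are hypothetical names standing in for whatever grapher.js actually uses:

// Hedged sketch of Page.prototype.fetch; the cache and scraper APIs
// shown here are placeholders, not the real interfaces.
Page.prototype.fetch = function (whenPageIsFetched) {
  var page = this;
  page.status = "fetching";

  cache.get(page.url, function (cachedData) {
    if (cachedData) {
      // Cache hit: reuse previously scraped data.
      page.populate(cachedData, whenPageIsFetched);
    } else {
      // Cache miss: scrape the url, cache the result, then populate.
      scraper.scrape(page.url, function (scrapedData) {
        cache.set(page.url, scrapedData);
        page.populate(scrapedData, whenPageIsFetched);
      });
    }
  });
};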

This callback (populate) does a few very important things, sketched in code after this list.

  1. Firstly, it populates the Page object with the data retrieved from the scraper or cache.
  2. Secondly, it calls the Grapher's verifyPages method, which attempts to verify the page itself as well as any other pages within the Grapher's pages array.
  3. Next, it takes any URLs found on the page that haven't already been crawled and uses them to initialize new Page objects, which are added to the Grapher's pages array ready for crawling.
  4. Next, it updates the page's status from "fetching" to "fetched".
  5. Finally, it executes the whenPageIsFetched callback passed to it from the Grapher's fetchPages method.
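
A hedged sketch of populate, mirroring the five steps above (the grapher back-reference, the alreadyCrawled helper, and the Page constructor arguments are all assumptions):

// Hedged sketch; the actual method body in grapher.js will differ.
Page.prototype.populate = function (data, whenPageIsFetched) {
  var page = this,
      grapher = this.grapher;            // assumed back-reference

  page.title = data.title;               // 1. copy the scraped data
  page.links = data.links;

  grapher.verifyPages();                 // 2. verification pass

  page.links.forEach(function (url) {    // 3. queue uncrawled urls
    if (!grapher.alreadyCrawled(url)) {  //    (hypothetical helper)
      grapher.pages.push(new Page(url, page.depth + 1));
    }
  });

  page.status = "fetched";               // 4. mark this page as done

  whenPageIsFetched();                   // 5. hand control back
};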

The whenPageIsFetched callback checks whether every page in the graph has been fetched. If they have, the callback originally passed in by the build method is executed. If they haven't, it calls fetchPages again, starting the crawling-verification loop anew.
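
Sketched in code (allPagesFetched is a hypothetical helper, and the build callback is assumed to be threaded through as an argument):

// Hedged sketch of whenPageIsFetched.
Grapher.prototype.whenPageIsFetched = function (whenGraphIsBuilt) {
  if (this.allPagesFetched()) {
    // Every page is fetched: execute the callback from build.
    whenGraphIsBuilt();
  } else {
    // Pages remain: restart the crawling-verification loop.
    this.fetchPages(whenGraphIsBuilt);
  }
};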
