Class: CrawlKit

The CrawlKit base class. This is where the magic happens.

Constructor

new CrawlKit([url][, name])

Create a CrawlKit instance.
Parameters:
  • url (String, optional): The start URL. Sets CrawlKit#url.
  • name (String, optional): The instance name of the crawler. Used for logging purposes.
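
A minimal usage sketch, assuming the package is installed and required as `crawlkit` (the URL and instance name are placeholders):

    const CrawlKit = require('crawlkit');

    // start URL plus an optional instance name that shows up in log output
    const crawler = new CrawlKit('http://example.com', 'example-crawler');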

Members

(non-null) browserCookies :Array.<Object>

Getter/setter for the cookies to set within PhantomJS. Each entry should be an object following the PhantomJS cookie spec.
Type:
  • Array.<Object>
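
For example, a single session cookie might be set like this sketch (the field names are assumed to follow the PhantomJS addCookie object; domain and values are placeholders):

    crawler.browserCookies = [{
        name: 'session',
        value: 'abc123',
        domain: 'example.com',
        path: '/',
        httponly: true,
        secure: false,
    }];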

(non-null) concurrency :integer

Getter/setter for the concurrency of the crawler. This controls the number of PhantomJS instances that are spawned and used to work on found websites. Adapt this to the power of your machine. Values under one are set to one.
Type:
  • integer
Default Value:
  • 1 (No concurrency)
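
For example, to allow four PhantomJS instances to work in parallel on the crawler instance from above:

    crawler.concurrency = 4;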

(non-null) followRedirects :boolean

Getter/setter for whether to follow redirects or not. When following redirects, the original page is not processed.
Type:
  • boolean
Default Value:
  • false

(non-null) phantomPageSettings :Object.<String, *>

Getter/setter for the map of settings to pass to an opened page. You can use this, for example, for Basic Authentication. For a list of options, please refer to the PhantomJS documentation. Nested settings can be provided in dot notation as the key, e.g. 'settings.userAgent'.
Type:
  • Object.<String, *>
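
A sketch combining a custom user agent with Basic Authentication, using the dot-notation keys described above (the credentials and user agent string are placeholders, and the setting names are assumed to follow the PhantomJS page settings):

    crawler.phantomPageSettings = {
        'settings.userAgent': 'Mozilla/5.0 (compatible; ExampleCrawler/1.0)',
        'settings.userName': 'user',
        'settings.password': 'secret',
    };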

(non-null) phantomParameters :Object.<String, String>

Getter/setter for the map of parameters to pass to PhantomJS. You can use this, for example, to ignore SSL errors. For a list of parameters, please refer to the PhantomJS documentation.
Type:
  • Object.<String, String>
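
For example, to ignore SSL errors (mirroring PhantomJS' --ignore-ssl-errors command line switch; the exact key format is an assumption):

    crawler.phantomParameters = {
        'ignore-ssl-errors': 'true',
    };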

redirectFilter :function

Getter/setter for the filter that is applied to redirected URLs. With this filter you can prevent a redirect or rewrite it. The filter callback gets two arguments: the first one is the target URL, the second one the source URL. Return false to prevent the redirect. Return a String (URL) to follow the redirect.
Type:
  • function
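
A sketch of a filter that only follows redirects staying on the same host, applied to the crawler instance from above (the URL parsing is plain Node.js and only illustrative):

    const url = require('url');

    crawler.followRedirects = true;
    crawler.redirectFilter = (targetUrl, sourceUrl) => {
        if (url.parse(targetUrl).host !== url.parse(sourceUrl).host) {
            return false; // prevent the redirect
        }
        return targetUrl; // follow it (a rewritten URL could be returned instead)
    };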

(non-null) timeout :integer

Getter/setter for the overall timeout for processing one website (opening the page, evaluating runner and finder functions). The timeout starts fresh for each website. Values under zero are set to zero.
Type:
  • integer
Default Value:
  • 30000 (30 seconds)

(non-null) tries :integer

Getter/setter for the number of tries when a PhantomJS instance crashes on a page or CrawlKit#timeout is hit. When a PhantomJS instance crashes whilst crawling a webpage, that instance is shut down and replaced by a new one. By default, the webpage that failed in this way is re-queued. If the finders and runners did not respond within the defined timeout, they are tried again as well. This member controls how often that re-queueing happens. Values under zero are set to zero.
Type:
  • integer
Default Value:
  • 3 (read: try two more times after the first failure, three times in total)
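
For example, to allow a minute per website and retry a failing page only once:

    crawler.timeout = 60000; // 60 seconds per website
    crawler.tries = 2;       // one retry after the first failure, two tries in total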

url :String

Getter/setter for the start URL of the crawler. This is the URL that will be used as the initial endpoint for the crawler. If the protocol is omitted (e.g. the URL starts with //), the URL is rewritten to use http://.
Type:
  • String

Methods

addRunner(key, runner[, ...runnableParams])

Allows you to add a runner that is executed on each crawled page. The value returned by the runner is added to the overall result. Runners run sequentially on each webpage in the order they were added. If a runner crashes PhantomJS more than CrawlKit#tries times, subsequent runners are not executed.
Parameters:
  • key (String, non-null): The runner identifier. This is also used in the result stream/object.
  • runner (Runner, non-null): The runner instance to use for discovery.
  • runnableParams (*, optional, repeatable): These parameters are passed to the function returned by Runner#getRunnable at evaluation time.
See:
  • For an example see `examples/simple.js`. For an example using parameters, see `examples/advanced.js`.
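
A sketch of adding a runner, given the crawler instance from above. The runner object here is hypothetical and only illustrates the Runner interface referenced above; the getCompanionFiles/getRunnable shape and the window.callPhantom result callback are assumptions, so refer to the linked examples for the authoritative form:

    // hypothetical runner that collects the page title
    const titleRunner = {
        getCompanionFiles: () => [],
        getRunnable: () => function grabTitle() {
            // evaluated inside the crawled page
            window.callPhantom(null, document.title);
        },
    };

    crawler.addRunner('title', titleRunner);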

crawl([shouldStream]) → {Stream|Promise.<Object>}

This method starts the crawling/scraping process.
Parameters:
  • shouldStream (boolean, optional, default: false): Whether to stream the results or use a Promise.
Returns:
By default, a Promise is returned that resolves to the result. If streaming is enabled, a JSON stream of the results is returned.
Type:
  • Stream | Promise.<Object>
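
A sketch of both modes, given the crawler instance from above (assuming the streaming variant returns a readable stream that can be piped):

    // Promise mode (default)
    crawler.crawl()
        .then((result) => console.log(JSON.stringify(result, null, 2)))
        .catch((err) => console.error(err));

    // streaming mode: a JSON stream of the results
    crawler.crawl(true).pipe(process.stdout);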

setFinder(finder[, ...runnableParams])

With this method, a Finder instance can be set for the crawler. A finder is used for link discovery on a website. It runs directly after page load and is optional (e.g. if you only want to work on a single page).
Parameters:
  • finder (Finder, non-null): The finder instance to use for discovery.
  • runnableParams (*, optional, repeatable): These parameters are passed to the function returned by Finder#getRunnable at evaluation time.
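
A sketch of setting a finder, given the crawler instance from above. The finder object is hypothetical and only illustrates the Finder interface referenced above; as with runners, the getCompanionFiles/getRunnable shape and the window.callPhantom callback for reporting found URLs are assumptions:

    // hypothetical finder that reports the href of every anchor on the page
    const anchorFinder = {
        getCompanionFiles: () => [],
        getRunnable: () => function findLinks() {
            // evaluated inside the crawled page
            var urls = Array.prototype.map.call(
                document.querySelectorAll('a[href]'),
                function (a) { return a.href; }
            );
            window.callPhantom(null, urls);
        },
    };

    crawler.setFinder(anchorFinder);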