Constructor
new CrawlKit([url], [name])
Create a CrawlKit instance
Parameters:

Name | Type | Attributes | Description |
---|---|---|---|
url | String | &lt;optional&gt; | The start URL. Sets CrawlKit#url. |
name | String | &lt;optional&gt; | The instance name of the crawler. Used for logging purposes. |
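A minimal instantiation sketch (the require name `crawlkit` and the variable names are assumptions, not taken from this page):

```javascript
// Hedged sketch: the package name 'crawlkit' is an assumption.
// const CrawlKit = require('crawlkit');

const startUrl = 'http://example.com'; // becomes CrawlKit#url
const crawlerName = 'docs-crawler';    // only used for logging

// Both arguments are optional:
// const crawler = new CrawlKit(startUrl, crawlerName);
```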
Members
(non-null) browserCookies :Array.<Object>
Getter/setter for the cookies to set within PhantomJS.
Each entry should be an object following the PhantomJS cookie specification.
Type:
- Array.<Object>
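For illustration, a cookie entry in the shape PhantomJS's `addCookie` expects might look like this (the concrete field values are made up):

```javascript
// One cookie entry per object, following the PhantomJS addCookie format.
const browserCookies = [
  {
    name: 'session',        // cookie name
    value: 'abc123',        // cookie value
    domain: '.example.com', // domain the cookie applies to
    path: '/',              // path the cookie applies to
    httponly: true,         // not readable from page JavaScript
    secure: false,          // also sent over plain HTTP
  },
];

// With a CrawlKit instance:
// crawler.browserCookies = browserCookies;
```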
(non-null) concurrency :integer
Getter/setter for the concurrency of the crawler.
This controls the number of PhantomJS instances that will be spawned
and used to work on found websites. Adapt this to the power of your machine.
Values under one are set to one.
Type:
- integer
- Default Value:
- 1 (No concurrency)
(non-null) followRedirects :boolean
Getter/setter for whether to follow redirects or not.
When following redirects, the original page is not processed.
Type:
- boolean
- Default Value:
- false
(non-null) phantomPageSettings :Object.<String, *>
Getter/setter for the map of settings to pass to an opened page.
You can use this, for example, for HTTP Basic Authentication.
For a list of options, please refer to the PhantomJS documentation.
Nested settings can just be provided in dot notation as the key, e.g. 'settings.userAgent'.
Type:
- Object.<String, *>
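A sketch of a settings map using dot notation, as described above (the credentials are placeholders; the setting names follow PhantomJS's WebPage `settings` object):

```javascript
// Keys use dot notation for nested page settings.
const phantomPageSettings = {
  'settings.userName': 'admin',          // Basic Auth user (placeholder)
  'settings.password': 'secret',         // Basic Auth password (placeholder)
  'settings.userAgent': 'MyCrawler/1.0', // custom user agent
  'settings.loadImages': false,          // skip image downloads
};

// With a CrawlKit instance:
// crawler.phantomPageSettings = phantomPageSettings;
```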
(non-null) phantomParameters :Object.<String, String>
Getter/setter for the map of parameters to pass to PhantomJS.
You can use this, for example, to ignore SSL errors.
For a list of parameters, please refer to the PhantomJS documentation.
Type:
- Object.<String, String>
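A sketch using two well-known PhantomJS command line parameters (values are illustrative):

```javascript
// Each entry maps to a PhantomJS CLI flag, e.g. --ignore-ssl-errors=true
const phantomParameters = {
  'ignore-ssl-errors': 'true', // accept self-signed certificates
  'ssl-protocol': 'any',       // do not pin a specific TLS version
};

// With a CrawlKit instance:
// crawler.phantomParameters = phantomParameters;
```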
redirectFilter :function
Getter/setter for the filter that is applied to redirected URLs.
With this filter you can prevent the redirect or rewrite it.
The filter callback receives two arguments: the first is the target URL,
the second the source URL.
Return false to prevent the redirect. Return a String (URL) to follow the redirect.
Type:
- function
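A sketch of a filter that only follows redirects staying on the original host (the same-host policy is an example, not part of CrawlKit):

```javascript
// Receives the target URL first, the source URL second.
// Return false to prevent the redirect, or a String (URL) to follow it.
function redirectFilter(targetUrl, sourceUrl) {
  const target = new URL(targetUrl, sourceUrl); // resolves relative targets
  const source = new URL(sourceUrl);
  if (target.hostname !== source.hostname) {
    return false; // example policy: never leave the original host
  }
  return target.href; // follow the (possibly rewritten) redirect
}

// With a CrawlKit instance:
// crawler.redirectFilter = redirectFilter;
```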
(non-null) timeout :integer
Getter/setter for the overall timeout for processing one website (opening the page, evaluating runner and finder functions).
The timeout starts fresh for each website.
Values under zero are set to zero.
Type:
- integer
- Default Value:
- 30000 (30 seconds)
(non-null) tries :integer
Getter/setter for the number of tries when a PhantomJS instance crashes on a page
or
CrawlKit#timeout
is hit.
When a PhantomJS instance crashes whilst crawling a webpage, this instance is shutdown
and replaced by a new one. By default the webpage that failed in such a way will be
re-queued.
If the finders and runners did not respond within the defined timeout,
it will be tried to run them again as well.
This member controls how often that re-queueing happens.
Values under zero are set to zero.
Type:
- integer
- Default Value:
- 3 (read: try two more times after the first failure, three times in total)
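Taken together, the numeric members above might be tuned like this (the values are illustrative, not recommendations):

```javascript
// Illustrative tuning of the members documented above.
const tuning = {
  concurrency: 4,        // four parallel PhantomJS instances
  timeout: 60000,        // 60s per website; restarts for each one
  tries: 3,              // three attempts in total per page
  followRedirects: true, // process redirect targets instead of originals
};

// With a CrawlKit instance:
// Object.assign(crawler, tuning);
```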
url :String
Getter/setter for the start URL of the crawler.
This is the URL that will be used as an initial endpoint for the crawler.
If the protocol is omitted (e.g. the URL starts with //), the URL will be rewritten to http://.
Type:
- String
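The documented rewrite of protocol-relative URLs can be mimicked like this (`normalizeStartUrl` is an illustration of the behaviour, not a CrawlKit API):

```javascript
// Mimics the documented behaviour: URLs starting with // get http: prepended.
function normalizeStartUrl(url) {
  return url.startsWith('//') ? 'http:' + url : url;
}
```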
Methods
addRunner(key, runner, …runnableParams)
Allows you to add a runner that is executed on each crawled page.
The returned value of the runner is added to the overall result.
Runners run sequentially on each webpage in the order they were added.
If a runner crashes PhantomJS more than CrawlKit#tries times,
subsequent Runners are not executed.
Parameters:

Name | Type | Attributes | Description |
---|---|---|---|
key | String | | The runner identifier. This is also used in the result stream/object. |
runner | Runner | | The runner instance to use for discovery. |
runnableParams | * | &lt;optional&gt; &lt;repeatable&gt; | These parameters are passed to the function returned by Runner#getRunnable at evaluation time. |
- See:
-
- For an example see `examples/simple.js`. For an example using parameters, see `examples/advanced.js`.
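A minimal runner sketch (`Runner#getRunnable` is referenced above; the `getCompanionFiles` method is an assumption beyond what this page documents). The returned function is evaluated inside the page and reports back through `window.callPhantom`:

```javascript
const titleRunner = {
  // Assumed part of the Runner contract: extra files to inject (none here).
  getCompanionFiles() {
    return [];
  },
  // Referenced above: returns the function evaluated in the page context.
  getRunnable() {
    return function extractTitle() {
      // Inside PhantomJS, report the result back to the crawler.
      window.callPhantom(null, document.title);
    };
  },
};

// With a CrawlKit instance:
// crawler.addRunner('title', titleRunner);
```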
crawl([shouldStream]) → {Stream|Promise.&lt;Object&gt;}
This method starts the crawling/scraping process.
Parameters:

Name | Type | Attributes | Default | Description |
---|---|---|---|---|
shouldStream | boolean | &lt;optional&gt; | false | Whether to stream the results or use a Promise. |
Returns:
By default, a Promise is returned that resolves to the result. If streaming is enabled, a JSON stream of the results is returned instead.
- Type
- Stream | Promise.<Object>
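Usage sketch for both modes (commented out because it needs a configured CrawlKit instance named `crawler`, which is an assumption here):

```javascript
// Promise mode (default):
// crawler.crawl().then(
//   (result) => console.log(JSON.stringify(result, null, 2)),
//   (err) => console.error(err)
// );

// Streaming mode: the returned JSON stream can be piped, e.g. to stdout.
// crawler.crawl(true).pipe(process.stdout);

const shouldStream = false; // the documented default
```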
setFinder(finder, …runnableParams)
Sets a Finder instance for the crawler.
A finder is used for link discovery on a website. It runs directly after page load
and is optional (e.g. if you only want to work on a single page).
Parameters:

Name | Type | Attributes | Description |
---|---|---|---|
finder | Finder | | The finder instance to use for discovery. |
runnableParams | * | &lt;optional&gt; &lt;repeatable&gt; | These parameters are passed to the function returned by Finder#getRunnable at evaluation time. |
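A sketch of a finder that collects anchor hrefs (the object shape follows `Finder#getRunnable` referenced above; the link-collection logic is illustrative):

```javascript
const linkFinder = {
  // Returns the function evaluated in the page after load.
  getRunnable() {
    return function findLinks() {
      var urls = Array.prototype.map.call(
        document.querySelectorAll('a[href]'),
        function (a) { return a.href; }
      );
      // Report the discovered URLs back to the crawler.
      window.callPhantom(null, urls);
    };
  },
};

// With a CrawlKit instance:
// crawler.setFinder(linkFinder);
```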