Constructor
new CrawlKit([url], [name])
Create a CrawlKit instance
Parameters:

Name | Type | Attributes | Description |
---|---|---|---|
url | String | &lt;optional&gt; | The start URL. Sets CrawlKit#url. |
name | String | &lt;optional&gt; | The instance name of the crawler. Used for logging purposes. |
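A minimal instantiation sketch (the require name `crawlkit` and the variable names are assumptions, not taken from this page):

```javascript
// Hedged sketch: the package name 'crawlkit' is an assumption.
// const CrawlKit = require('crawlkit');

const startUrl = 'http://example.com'; // becomes CrawlKit#url
const crawlerName = 'docs-crawler';    // only used for logging

// Both arguments are optional:
// const crawler = new CrawlKit(startUrl, crawlerName);
```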
Members
(non-null) browserCookies :Array.<Object>
Getter/setter for the cookies to set within PhantomJS.
Each entry should be an object following the PhantomJS cookie specification.
Type:
- Array.<Object>
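For illustration, a cookie entry in the shape PhantomJS's `addCookie` expects might look like this (the concrete field values are made up):

```javascript
// One cookie entry per object, following the PhantomJS addCookie format.
const browserCookies = [
  {
    name: 'session',        // cookie name
    value: 'abc123',        // cookie value
    domain: '.example.com', // domain the cookie applies to
    path: '/',              // path the cookie applies to
    httponly: true,         // not readable from page JavaScript
    secure: false,          // also sent over plain HTTP
  },
];

// With a CrawlKit instance:
// crawler.browserCookies = browserCookies;
```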
(non-null) concurrency :integer
Getter/setter for the concurrency of the crawler.
This controls the number of PhantomJS instances that will be spawned
and used to work on found websites. Adapt this to the power of your machine.
Values under one are set to one.
Type:
- integer
- Default Value:
- 1 (No concurrency)
(non-null) followRedirects :boolean
Getter/setter for whether to follow redirects or not.
When following redirects, the original page is not processed.
Type:
- boolean
- Default Value:
- false
(non-null) phantomPageSettings :Object.<String, *>
Getter/setter for the map of settings to pass to an opened page.
You can use this, for example, for HTTP Basic Authentication.
For a list of options, please refer to the PhantomJS documentation.
Nested settings can just be provided in dot notation as the key, e.g. 'settings.userAgent'.
Type:
- Object.<String, *>
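A sketch of a settings map using dot notation, as described above (the credentials are placeholders; the setting names follow PhantomJS's WebPage `settings` object):

```javascript
// Keys use dot notation for nested page settings.
const phantomPageSettings = {
  'settings.userName': 'admin',          // Basic Auth user (placeholder)
  'settings.password': 'secret',         // Basic Auth password (placeholder)
  'settings.userAgent': 'MyCrawler/1.0', // custom user agent
  'settings.loadImages': false,          // skip image downloads
};

// With a CrawlKit instance:
// crawler.phantomPageSettings = phantomPageSettings;
```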
(non-null) phantomParameters :Object.<String, String>
Getter/setter for the map of parameters to pass to PhantomJS.
You can use this, for example, to ignore SSL errors.
For a list of parameters, please refer to the PhantomJS documentation.
Type:
- Object.<String, String>
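A sketch using two well-known PhantomJS command line parameters (values are illustrative):

```javascript
// Each entry maps to a PhantomJS CLI flag, e.g. --ignore-ssl-errors=true
const phantomParameters = {
  'ignore-ssl-errors': 'true', // accept self-signed certificates
  'ssl-protocol': 'any',       // do not pin a specific TLS version
};

// With a CrawlKit instance:
// crawler.phantomParameters = phantomParameters;
```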
redirectFilter :function
Getter/setter for the filter that is applied to redirected URLs.
With this filter you can prevent the redirect or rewrite it.
The filter callback receives two arguments: the first is the target URL,
the second the source URL.
Return false to prevent the redirect. Return a String (URL) to follow the redirect.
Type:
- function
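A sketch of a filter that only follows redirects staying on the original host (the same-host policy is an example, not part of CrawlKit):

```javascript
// Receives the target URL first, the source URL second.
// Return false to prevent the redirect, or a String (URL) to follow it.
function redirectFilter(targetUrl, sourceUrl) {
  const target = new URL(targetUrl, sourceUrl); // resolves relative targets
  const source = new URL(sourceUrl);
  if (target.hostname !== source.hostname) {
    return false; // example policy: never leave the original host
  }
  return target.href; // follow the (possibly rewritten) redirect
}

// With a CrawlKit instance:
// crawler.redirectFilter = redirectFilter;
```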
(non-null) timeout :integer
Getter/setter for the overall timeout for processing one website (opening the page, evaluating runner and finder functions).
The timeout starts fresh for each website.
Values under zero are set to zero.
Type:
- integer
- Default Value:
- 30000 (30 seconds)
(non-null) tries :integer
Getter/setter for the number of tries when a PhantomJS instance crashes on a page
or
CrawlKit#timeout
is hit.
When a PhantomJS instance crashes whilst crawling a webpage, this instance is shutdown
and replaced by a new one. By default the webpage that failed in such a way will be
re-queued.
If the finders and runners did not respond within the defined timeout,
it will be tried to run them again as well.
This member controls how often that re-queueing happens.
Values under zero are set to zero.
Type:
- integer
- Default Value:
- 3 (read: try two more times after the first failure, three times in total)
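Taken together, the numeric members above might be tuned like this (the values are illustrative, not recommendations):

```javascript
// Illustrative tuning of the members documented above.
const tuning = {
  concurrency: 4,        // four parallel PhantomJS instances
  timeout: 60000,        // 60s per website; restarts for each one
  tries: 3,              // three attempts in total per page
  followRedirects: true, // process redirect targets instead of originals
};

// With a CrawlKit instance:
// Object.assign(crawler, tuning);
```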
url :String
Getter/setter for the start URL of the crawler.
This is the URL that will be used as an initial endpoint for the crawler.
If the protocol is omitted (e.g. the URL starts with //), the URL will be rewritten to http://.
Type:
- String
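The documented rewrite of protocol-relative URLs can be mimicked like this (`normalizeStartUrl` is an illustration of the behaviour, not a CrawlKit API):

```javascript
// Mimics the documented behaviour: URLs starting with // get http: prepended.
function normalizeStartUrl(url) {
  return url.startsWith('//') ? 'http:' + url : url;
}
```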
Methods
addRunner(key, runner, …runnableParams)
Allows you to add a runner that is executed on each crawled page.
The returned value of the runner is added to the overall result.
Runners run sequentially on each webpage in the order they were added.
If a runner crashes PhantomJS more than CrawlKit#tries times,
subsequent Runners are not executed.
Parameters:

Name | Type | Attributes | Description |
---|---|---|---|
key | String | | The runner identifier. This is also used in the result stream/object. |
runner | Runner | | The runner instance to use for discovery. |
runnableParams | * | &lt;optional&gt; &lt;repeatable&gt; | These parameters are passed to the function returned by Runner#getRunnable at evaluation time. |
- See:
-
- For an example see `examples/simple.js`. For an example using parameters, see `examples/advanced.js`.
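A minimal runner sketch (`Runner#getRunnable` is referenced above; the `getCompanionFiles` method is an assumption beyond what this page documents). The returned function is evaluated inside the page and reports back through `window.callPhantom`:

```javascript
const titleRunner = {
  // Assumed part of the Runner contract: extra files to inject (none here).
  getCompanionFiles() {
    return [];
  },
  // Referenced above: returns the function evaluated in the page context.
  getRunnable() {
    return function extractTitle() {
      // Inside PhantomJS, report the result back to the crawler.
      window.callPhantom(null, document.title);
    };
  },
};

// With a CrawlKit instance:
// crawler.addRunner('title', titleRunner);
```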
crawl([shouldStream]) → {Stream|Promise.&lt;Object&gt;}
This method starts the crawling/scraping process.
Parameters:

Name | Type | Attributes | Default | Description |
---|---|---|---|---|
shouldStream | boolean | &lt;optional&gt; | false | Whether to stream the results or use a Promise. |
Returns:
By default, a Promise is returned that resolves to the result. If streaming is enabled, a JSON stream of the results is returned instead.
- Type
- Stream | Promise.<Object>
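Usage sketch for both modes (commented out because it needs a configured CrawlKit instance named `crawler`, which is an assumption here):

```javascript
// Promise mode (default):
// crawler.crawl().then(
//   (result) => console.log(JSON.stringify(result, null, 2)),
//   (err) => console.error(err)
// );

// Streaming mode: the returned JSON stream can be piped, e.g. to stdout.
// crawler.crawl(true).pipe(process.stdout);

const shouldStream = false; // the documented default
```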
setFinder(finder, …runnableParams)
Sets a Finder instance for the crawler.
A finder is used for link discovery on a website. It runs directly after page load
and is optional (e.g. if you only want to work on a single page).
Parameters:

Name | Type | Attributes | Description |
---|---|---|---|
finder | Finder | | The finder instance to use for discovery. |
runnableParams | * | &lt;optional&gt; &lt;repeatable&gt; | These parameters are passed to the function returned by Finder#getRunnable at evaluation time. |
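A sketch of a finder that collects anchor hrefs (the object shape follows `Finder#getRunnable` referenced above; the link-collection logic is illustrative):

```javascript
const linkFinder = {
  // Returns the function evaluated in the page after load.
  getRunnable() {
    return function findLinks() {
      var urls = Array.prototype.map.call(
        document.querySelectorAll('a[href]'),
        function (a) { return a.href; }
      );
      // Report the discovered URLs back to the crawler.
      window.callPhantom(null, urls);
    };
  },
};

// With a CrawlKit instance:
// crawler.setFinder(linkFinder);
```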