Represents a static list of URLs to crawl. The URLs can be provided either in code or parsed from a text file hosted on the web. RequestList is used by BasicCrawler, CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler as a source of URLs to crawl.
Each URL is represented using an instance of the Request class. The list can only contain unique URLs. More precisely, it can only contain Request instances with distinct uniqueKey properties. By default, uniqueKey is generated from the URL, but it can also be overridden. To add a single URL to the list multiple times, the corresponding Request objects need to have different uniqueKey properties. You can use the keepDuplicateUrls option to do this for you when initializing the RequestList from sources.
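The uniqueness rule can be illustrated with a minimal sketch. This is a simplified model, not the library's implementation: the helper dedupeSources is hypothetical, and the way keys are derived here (the raw URL by default, a counter suffix when keepDuplicateUrls is set) is an assumption made only for illustration.

```javascript
// Simplified model of uniqueKey-based deduplication (not the library's code).
// `dedupeSources` is a hypothetical helper; the real uniqueKey is generated
// from the URL by the library, which this sketch does not try to reproduce.
function dedupeSources(sources, { keepDuplicateUrls = false } = {}) {
    const seen = new Set();
    const requests = [];
    let counter = 0;

    for (const source of sources) {
        const request = typeof source === 'string' ? { url: source } : { ...source };

        if (!request.uniqueKey) {
            // Assumption: default key is the URL itself; with keepDuplicateUrls
            // a counter suffix makes every generated key distinct.
            request.uniqueKey = keepDuplicateUrls
                ? `${request.url}#${counter++}`
                : request.url;
        }

        if (seen.has(request.uniqueKey)) continue; // same key: not added again
        seen.add(request.uniqueKey);
        requests.push(request);
    }
    return requests;
}

const urls = [
    'http://www.example.com/page-1',
    'http://www.example.com/page-1',
    { url: 'http://www.example.com/page-1', uniqueKey: 'page-1-again' },
];

const unique = dedupeSources(urls);
const all = dedupeSources(urls, { keepDuplicateUrls: true });
console.log(unique.length, all.length); // 2 3
```

In the default mode the second plain string is dropped because its key matches the first, while the object with an explicit uniqueKey is kept.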
RequestList doesn't have a public constructor; you need to create it with the asynchronous RequestList.open function. After the request list is created, no more URLs can be added to it. Unlike RequestQueue, RequestList is static, but it can contain even millions of URLs.
Note that RequestList can be used together with RequestQueue by the same crawler. In such cases, each request from RequestList is enqueued into RequestQueue first and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there is a large number of initial URLs, but more URLs would be added dynamically by the crawler.
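Why routing list items through the queue avoids double processing can be sketched with plain in-memory stand-ins. Everything below is hypothetical (the queue object, the discovered link map, and enqueueing the whole list up front rather than request by request); it only models the idea that a single deduplicating queue sees every URL.

```javascript
// Hypothetical in-memory stand-ins; not the RequestList or RequestQueue API.
const listUrls = ['http://www.example.com/a', 'http://www.example.com/b'];

const queue = {
    seen: new Set(),
    pending: [],
    add(url) {
        if (this.seen.has(url)) return false; // already known: ignored
        this.seen.add(url);
        this.pending.push(url);
        return true;
    },
};

// Step 1: requests from the list are enqueued into the queue first.
for (const url of listUrls) queue.add(url);

// Step 2: the crawler consumes only from the queue and may discover new links.
// Page "a" links to "b" (already in the list) and to "c" (new).
const discovered = {
    'http://www.example.com/a': ['http://www.example.com/b', 'http://www.example.com/c'],
};

const processed = [];
while (queue.pending.length > 0) {
    const url = queue.pending.shift();
    processed.push(url);
    for (const link of discovered[url] || []) queue.add(link);
}

console.log(processed.length); // 3
```

Page "b" is processed once even though it appears both in the initial list and among the discovered links.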
RequestList has an internal state where it stores information about which requests were already handled, which are in progress and which were reclaimed. The state may be automatically persisted to the default KeyValueStore by setting the persistStateKey option so that if the Node.js process is restarted, the crawling can continue where it left off. The automated persisting is launched upon receiving the persistState event that is periodically emitted by EventManager.
The internal state is closely tied to the provided sources (URLs). If the sources change on crawler restart, the state will become corrupted and RequestList will raise an exception. This typically happens when the sources are a list of URLs downloaded from the web. In such a case, use the persistRequestsKey option in conjunction with persistStateKey to make the RequestList store the initial sources in the default key-value store and load them after restart, which will prevent any issues that a live list of URLs might cause.
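The coupling between persisted state and sources can be sketched as follows. This is a simplified model under stated assumptions: the Map stands in for a key-value store, the helpers saveState and restoreState are hypothetical, and the state shape (nextIndex plus nextUniqueKey) is an assumption chosen to show how a mismatch could be detected.

```javascript
// Simplified model: a Map stands in for the key-value store.
const store = new Map();

// Hypothetical helper: remember the position and the key expected at it.
function saveState(key, requests, nextIndex) {
    const next = requests[nextIndex];
    store.set(key, JSON.stringify({
        nextIndex,
        nextUniqueKey: next ? next.uniqueKey : null,
    }));
}

// Hypothetical helper: reject the saved state if the sources no longer match.
function restoreState(key, requests) {
    const raw = store.get(key);
    if (!raw) return { nextIndex: 0, nextUniqueKey: null };
    const state = JSON.parse(raw);
    const next = requests[state.nextIndex];
    const actualKey = next ? next.uniqueKey : null;
    if (actualKey !== state.nextUniqueKey) {
        throw new Error('Persisted state does not match the current sources');
    }
    return state;
}

const toRequests = (list) => list.map((url) => ({ url, uniqueKey: url }));

const original = toRequests([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
    'http://www.example.com/page-3',
]);
saveState('my-state', original, 1);

// Same sources after restart: the position is restored.
const restored = restoreState('my-state', original);

// Sources changed between runs (a live URL list): the mismatch is detected.
const changed = toRequests([
    'http://www.example.com/page-1',
    'http://www.example.com/new-page',
    'http://www.example.com/page-3',
]);
let mismatchDetected = false;
try {
    restoreState('my-state', changed);
} catch (err) {
    mismatchDetected = true;
}

console.log(restored.nextIndex, mismatchDetected); // 1 true
```

Persisting the initial sources alongside the state sidesteps the second case, because the restart then works from the same list that the state was recorded against.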
Basic usage:
const requestList = await RequestList.open('my-request-list', [
    'http://www.example.com/page-1',
    { url: 'http://www.example.com/page-2', method: 'POST', userData: { foo: 'bar' } },
    { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } },
]);
Advanced usage:
const requestList = await RequestList.open(null, [
    // Separate requests
    { url: 'http://www.example.com/page-1', method: 'GET', headers: { ... } },
    { url: 'http://www.example.com/page-2', userData: { foo: 'bar' } },
    // Bulk load of URLs from file `http://www.example.com/my-url-list.txt`
    // Note that all URLs must start with http:// or https://
    { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } },
], {
    // Persist the state to avoid re-crawling, which can lead to data duplication.
    // Keep in mind that the sources have to be immutable or this will throw an error.
    persistStateKey: 'my-state',
});
Index
Methods
- fetchNextRequest
- getState
- handledCount
- isEmpty
- isFinished
- length
- markRequestHandled
- persistState
- reclaimRequest
- open
Methods
fetchNextRequest
- fetchNextRequest(): Promise<null | Request<Dictionary>>
Gets the next Request to process. First, the function gets a request previously reclaimed using the RequestList.reclaimRequest function, if there is any. Otherwise it gets the next request from sources.
The function's Promise resolves to null if there are no more requests to process.
Returns Promise<null | Request<Dictionary>>
getState
- getState(): RequestListState
Returns an object representing the internal state of the RequestList instance. Note that the object's fields can change in future releases.
Returns RequestListState
handledCount
- handledCount(): number
Returns the number of handled requests.
Returns number
isEmpty
- isEmpty(): Promise<boolean>
Resolves to true if the next call to the RequestList.fetchNextRequest function would return null, otherwise it resolves to false. Note that even if the list is empty, there might be some pending requests currently being processed.
Returns Promise<boolean>
isFinished
- isFinished(): Promise<boolean>
Resolves to true if all requests have already been handled and there are no more left.
Returns Promise<boolean>
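The difference between isEmpty and isFinished can be sketched with a small state object. This is a synchronous, simplified model (the real methods return promises), and the field names below are assumptions for illustration, not the actual RequestListState shape.

```javascript
// Simplified, synchronous model of the list state; field names are assumptions.
const state = {
    total: 2,          // number of requests in the list
    nextIndex: 2,      // every request has already been fetched
    reclaimed: [],     // nothing is waiting to be retried
    inProgress: new Set(['http://www.example.com/page-2']), // one still running
    handled: 1,
};

// Empty: nothing left to hand out, neither reclaimed nor fresh requests.
const isEmpty = (s) => s.reclaimed.length === 0 && s.nextIndex >= s.total;

// Finished: empty and nothing still being processed.
const isFinished = (s) => isEmpty(s) && s.inProgress.size === 0;

const whileRunning = [isEmpty(state), isFinished(state)];
console.log(whileRunning); // [ true, false ]

// The last pending request completes.
state.inProgress.clear();
state.handled += 1;

const afterDone = [isEmpty(state), isFinished(state)];
console.log(afterDone); // [ true, true ]
```

A list can therefore be empty while the crawl is still running, which is why a crawler would wait for the finished condition rather than the empty one before shutting down.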
length
- length(): number
Returns the total number of unique requests present in the RequestList.
Returns number
markRequestHandled
- markRequestHandled(request: Request<Dictionary>): Promise<void>
Marks the request as handled after successful processing.
Parameters
request: Request<Dictionary>
Returns Promise<void>
persistState
- persistState(): Promise<void>
Persists the current state of the RequestList into the default KeyValueStore. The state is persisted automatically at regular intervals, but calling this method manually is useful when you want the most current state to be available after you pause or stop fetching its requests, for example after you pause or abort a crawl, or just before a server migration.
Returns Promise<void>
reclaimRequest
- reclaimRequest(request: Request<Dictionary>): Promise<void>
Reclaims the request to the list if its processing failed. The request will become available in the next call to this.fetchNextRequest().
Parameters
request: Request<Dictionary>
Returns Promise<void>
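How fetchNextRequest, reclaimRequest and markRequestHandled interact can be sketched with a tiny in-memory class. TinyRequestList is hypothetical and synchronous (the real methods return promises) and is not the library's implementation; it only models the rule that reclaimed requests are handed out before fresh ones.

```javascript
// Hypothetical, synchronous model of the fetch / reclaim / handled cycle.
class TinyRequestList {
    constructor(urls) {
        this.requests = urls.map((url) => ({ url, uniqueKey: url }));
        this.nextIndex = 0;
        this.reclaimed = [];
        this.inProgress = new Set();
        this.handled = 0;
    }

    // Reclaimed requests take priority over the next request from sources.
    fetchNextRequest() {
        let request = null;
        if (this.reclaimed.length > 0) {
            request = this.reclaimed.shift();
        } else if (this.nextIndex < this.requests.length) {
            request = this.requests[this.nextIndex];
            this.nextIndex += 1;
        }
        if (request) this.inProgress.add(request.uniqueKey);
        return request; // null when nothing is left
    }

    markRequestHandled(request) {
        this.inProgress.delete(request.uniqueKey);
        this.handled += 1;
    }

    reclaimRequest(request) {
        this.inProgress.delete(request.uniqueKey);
        this.reclaimed.push(request);
    }
}

const list = new TinyRequestList([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);

const first = list.fetchNextRequest();  // page-1
list.reclaimRequest(first);             // processing failed, give it back
const retry = list.fetchNextRequest();  // page-1 again, ahead of page-2
list.markRequestHandled(retry);
const second = list.fetchNextRequest(); // page-2

console.log(retry.url === first.url, second.url);
```

A real crawler would typically call reclaimRequest from its error path and markRequestHandled from its success path, so that each fetched request ends in exactly one of the two.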
static open
- open(listNameOrOptions: null | string | RequestListOptions, sources?: RequestListSource[], options?: RequestListOptions): Promise<RequestList>
Opens a request list and returns a promise resolving to an instance of the RequestList class that is already initialized.
RequestList represents a list of URLs to crawl, which is always stored in memory. To enable picking up where it left off after a process restart, the request list sources are persisted to the key-value store at initialization of the list. Then, while crawling, a small state object is regularly persisted to keep track of the crawling status.
For more details and code examples, see the RequestList class.
Example usage:
const sources = [
    'https://www.example.com',
    'https://www.google.com',
    'https://www.bing.com',
];
const requestList = await RequestList.open('my-name', sources);
Parameters
listNameOrOptions: null | string | RequestListOptions
Name of the request list to be opened, or the options object. Setting a name enables the RequestList's state to be persisted in the key-value store. This is useful in case of a restart or migration. Since RequestList is only stored in memory, a restart or migration wipes it clean. Setting a name will enable the RequestList's state to survive those situations and continue where it left off.
The name will be used as a prefix in the key-value store, producing keys such as NAME-REQUEST_LIST_STATE and NAME-REQUEST_LIST_SOURCES.
If null, the list will not be persisted and will only be stored in memory. A process restart will then cause the list to be crawled again from the beginning. We suggest always using a name.
optional sources: RequestListSource[]
An array of sources of URLs for the RequestList. It can be either an array of strings, plain objects that define at least the url property, or an array of Request instances.
IMPORTANT: The sources array will be consumed (left empty) after RequestList initializes. This is a measure to prevent memory leaks in situations when millions of sources are added.
Additionally, the requestsFromUrl property may be used instead of url, which will instruct RequestList to download the source URLs from a given remote location. The URLs will be parsed from the received response. In this case you can limit the URLs using the regex parameter containing a regular expression pattern for URLs to be included.
For details, see RequestListOptions.sources.
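How URLs might be parsed from a downloaded text response and narrowed with a regular expression can be sketched as follows. The helper extractUrls and its default pattern are assumptions made for illustration; the pattern the library applies by default and the exact semantics of its regex parameter may differ.

```javascript
// Hypothetical helper: pull URLs out of a text body with a regular expression.
// The default pattern is an assumption, not the library's own.
function extractUrls(text, regex = /https?:\/\/[^\s"'<>]+/g) {
    return text.match(regex) || [];
}

// Stand-in for the body of a remote file such as my-url-list.txt.
const body = `
http://www.example.com/page-1
this line is not a URL
https://www.example.com/docs/page-2
https://other.example.org/page-3
`;

// No custom pattern: every http(s) URL in the body is picked up.
const allUrls = extractUrls(body);

// Custom pattern: only URLs under /docs/ on www.example.com are included.
const docsOnly = extractUrls(body, /https?:\/\/www\.example\.com\/docs\/[^\s]+/g);

console.log(allUrls.length, docsOnly);
```

Lines that do not match the pattern are ignored in this sketch, which mirrors the note that URLs in the file must start with http:// or https://.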
optional options: RequestListOptions = {}
The RequestList options. Note that the listName parameter supersedes the RequestListOptions.persistStateKey and RequestListOptions.persistRequestsKey options and the sources parameter supersedes the RequestListOptions.sources option.
Returns Promise<RequestList>
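The key naming described for the listName parameter can be sketched with a small helper. The function persistenceKeys is hypothetical, and mapping the STATE key to persistStateKey and the SOURCES key to persistRequestsKey is an assumption based on the option names, not a confirmed detail of the library.

```javascript
// Hypothetical helper: derive the key-value store keys from a list name.
// Assumption: STATE maps to persistStateKey, SOURCES to persistRequestsKey.
function persistenceKeys(listName) {
    if (listName === null) return null; // unnamed list: nothing is persisted
    return {
        persistStateKey: `${listName}-REQUEST_LIST_STATE`,
        persistRequestsKey: `${listName}-REQUEST_LIST_SOURCES`,
    };
}

const named = persistenceKeys('my-name');
const unnamed = persistenceKeys(null);

console.log(named.persistStateKey);    // my-name-REQUEST_LIST_STATE
console.log(named.persistRequestsKey); // my-name-REQUEST_LIST_SOURCES
console.log(unnamed);                  // null
```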