Retry strategy for partners running their own caching tier or using Hosted API
The most common (and ideal) scenario of an API request with a "good" (processable) URL generating a cache miss goes like this:
Request 1) The API returns a transient error (code "999998")
Request 2) The API returns an environment-level response (fetched from a subsequent caching tier) with a short TTL of 30 seconds (indicating that the underlying error is transient)
Request 3) The API returns a page-level response
A cache miss is immediately forwarded to the back-end for crawling and processing. The overall processing time mostly depends on the availability of the publishing host as well as the current request load for this particular domain. In general a URL's content is downloaded in 1 to 3 seconds. The processing itself takes just milliseconds.
If the URL ultimately generates a permanent error (timeout, not found, not enough content for analysis, unsupported language, etc.), the returned TTL is 3 days. In this case the API returns either an environment-level response (if present) or an error code. The TTL is limited to 3 days because we see a large portion of URLs that throw permanent errors become available again. The rationale is that if an ad generated an impression with a referring URL, the corresponding page should be accessible.
As long as the API continues to respond with a short TTL, you should keep retrying, up to 20 times. We recommend honoring the TTL between attempts or, if you need to move faster, waiting at least 5 seconds.
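The retry guidance above can be sketched as a small loop. This is a minimal illustration, not the API's client library: `fetch_with_retry`, the `fetch` callable, and the assumption that every response is a dict with a `ttl` field in seconds are all hypothetical. A TTL of 30 seconds or less is treated as "short".

```python
import time
from typing import Callable, Dict

SHORT_TTL_SECONDS = 30   # short TTL the text associates with transient results
MAX_RETRIES = 20         # retry ceiling recommended above
MIN_WAIT_SECONDS = 5     # shortest recommended pause between attempts


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], Dict],
    honor_ttl: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch(url) until the response carries a long TTL or retries run out.

    `fetch` is a hypothetical callable that performs one API request and
    returns a dict containing at least a "ttl" key (seconds). This response
    shape is an assumption made for the sketch.
    """
    response = fetch(url)
    for _ in range(MAX_RETRIES):
        ttl = response["ttl"]
        if ttl > SHORT_TTL_SECONDS:
            break  # long TTL: page-level result or permanent error, stop retrying
        # Honor the TTL (never below 5 seconds), or use the 5-second minimum
        # when moving faster than the TTL.
        wait = max(ttl, MIN_WAIT_SECONDS) if honor_ttl else MIN_WAIT_SECONDS
        sleep(wait)
        response = fetch(url)
    return response
```

Passing `honor_ttl=False` trades extra requests for lower latency. In both modes the loop stops after 20 retries, so a URL stuck in a crawl queue does not get polled indefinitely.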
A potential crawl frequency restriction for a specific domain might delay the result delivery significantly. In this case the crawl request gets queued and it takes much longer (minutes to even hours) until the queue has freed up. Unfortunately it is not yet possible to return corresponding TTLs depending on the queue size.
Long-tail Filtering
To exclude long-tail inventory (URLs with very few impressions) from processing, send API requests only for URLs that have generated at least 10 impressions. This has several benefits: it prioritizes high-volume URLs, reduces API request volume, uses the internal caching tier more efficiently, and reduces crawl frequency restrictions.
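One way to apply the 10-impression threshold on the partner side is to count impressions per URL and release a URL for an API request only once it reaches the threshold. The sketch below is illustrative: `urls_to_request` and the in-memory counter are assumptions, and a production system would more likely keep the counts in its own caching tier.

```python
from collections import Counter
from typing import Iterable, Iterator

MIN_IMPRESSIONS = 10  # threshold recommended above


def urls_to_request(
    impression_urls: Iterable[str],
    min_impressions: int = MIN_IMPRESSIONS,
) -> Iterator[str]:
    """Yield each URL once, at the moment it reaches the impression threshold."""
    counts: Counter = Counter()
    for url in impression_urls:
        counts[url] += 1
        # Equality (not >=) ensures each URL is yielded only a single time.
        if counts[url] == min_impressions:
            yield url
```

URLs that never reach the threshold are never yielded, so they generate no API requests and no crawls.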
We strongly recommend that you do not apply any normalization, truncation, parameter stripping, desessionizing, or similar transformation to a URL. The only exception is URL-encoding.
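As a brief illustration of "URL-encoding only", the sketch below percent-encodes the full page URL with Python's standard library and leaves everything else untouched, including session parameters and fragments. The helper name `encode_page_url` is hypothetical, and how the encoded value is passed to the API (for example, as a query parameter) is an assumption not specified here.

```python
from urllib.parse import quote


def encode_page_url(page_url: str) -> str:
    """Percent-encode the whole URL without normalizing or stripping anything."""
    # safe="" also encodes "/", ":" and "?" so the value can be embedded
    # as a single parameter; the original URL is fully recoverable.
    return quote(page_url, safe="")
```

Decoding the result with `urllib.parse.unquote` returns the original string unchanged, so no information about the page is lost before the request is sent.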