View Source GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyData (google_api_content_warehouse v0.4.0)
Fetcher -> FetchClient FetchReplyData is the metadata for a reply from a FetchRequest. For metadata + document body, FetchReply is further below. NOTE: FetchReplyData (and FetchReply) is the output interface from Multiverse. Teams outside Multiverse/Trawler should not create fake FetchReplies. Trawler: When adding new fields here, it is recommended that at least the following be rebuilt and pushed: - cron_fetcher_index mapreduces: so that UrlReplyIndex, etc. retain the new fields - tlookup, tlookup_server: want to be able to return the new fields - logviewer, fetchutil: annoying to get back 'tag88:' in results -------------------------- Next Tag: 125 -----------------------
Attributes
-
CrawlTimes
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerCrawlTimes.t
, default:nil
) - -
ThrottleClient
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerThrottleClientData.t
, default:nil
) - If present, Client API will enforce the contained constraints -
DNSHost
(type:String.t
, default:nil
) - Sometimes the hostid and destination IP in the FetchReplyData are not for the hostname in the url. If that's the case DNSHost will be the host that we have used when resolving hostid and DNS. Right now there are two cases: (1) malware team provides a proxy IP:Port to us, so DNSHost will be the proxy IP; and (2) PSS team provides a reference DNS host; so DNSHost will be the reference DNS host. -
HostBucketData
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerHostBucketData.t
, default:nil
) - Data about the host bucket this request is in (if desired) Please talk with Trawler team before considering using this, since what we fill in here is subject to change. -
DownloadTime
(type:integer()
, default:nil
) - The download time for this fetch (ms). This is the RTT time between fetcher and HOPE, note it does not include time from redirects, just initial hop. If you want the sum of the DownloadTime values for all fetches in the redirect chain, then use the DownLoadTime value in the FetchStats. -
partialresponse
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataPartialResponse.t
, default:nil
) - -
ID
(type:String.t
, default:nil
) - Same as the ID of the matching request (used for matching internal fetchclient data in request/reply) -
originalProtocolUrl
(type:String.t
, default:nil
) - If the input url in FetchRequest is Amazon S3 protocol or Apple Itunes protocol, we will translate it into https url and log it as https url. In the meantime we will store the original s3/itunes url in this field. Before sending back to client, the Url will be translated back to s3 and this field will be cleard. -
FlooEgressRegion
(type:String.t
, default:nil
) - If present, fetch was conducted using floonet and this is the location of floonet egress point we used. -
UseHtmlCompressDictionary
(type:boolean()
, default:nil
) - Use the special compression dictionary for uncompressing this. (trawler::kHtmlCompressionDict. Use trawler::FetchReplyUncompressor to uncompress; crawler/trawler/public/fetchreply-util.h) -
Events
(type:list(GoogleApi.ContentWarehouse.V1.Model.TrawlerEvent.t)
, default:nil
) - -
HTTPTrailers
(type:list(GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataHTTPHeader.t)
, default:nil
) - The received HTTP trailers if available. -
webioInfo
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataWebIOInfo.t
, default:nil
) - -
Endpoints
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerTCPIPInfo.t
, default:nil
) - ------- If fetched, the IP from which we fetched, as well as source IP and ports. It is recommended to use trawler::DestinationIP()/HasDestinationIP() accessors, which return a proper IPAddress. -
GeoCrawlEgressRegion
(type:String.t
, default:nil
) - If present, the last hop of the fetch was conducted using floonet and this is the location of floonet egress point. It is different from EgressRegion and FlooEgressREgion because it is a Trawler transparent routing configured in the geo crawl rules(go/da-geo-crawl). -
HSTSInfo
(type:String.t
, default:nil
) - Set to: o HSTS_STATUS_NONE if there was no HSTS policy match for the URL's host. o HSTS_STATUS_AVAILABLE if there was an HSTS policy, but the URL was not rewritten from HTTP to HTTPS because enable_hsts was not set in client capability config. o HSTS_STATUS_REWRITTEN if the HSTS policy was followed and url was rewritten from HTTP to HTTPS. This field only pertains to the current URL fetch and does not explain a redirect's HSTS status. However, FetchReplyData.Redirects have their own HSTSInfo. -
FetchPatternFp
(type:String.t
, default:nil
) - With the introduction of fetch pattern based hostload exceptions, one hostid may have multiple hostload buckets, each with its own hostload. In this case, FetchPatternFp will be set to identify the hostload bucket within the hostid. Note this field is only meaningful for the HostBucketData which is recorded only when the client requests to have as part of reply. However, this field is useful for certain stats gathering, so we choose to always record it if its value is available during the fetch. -
UrlEncoding
(type:integer()
, default:nil
) - Encoding info for the original url itself. Bitfield encoding; see UrlEncoding::{Set,Get}Value in webutil/urlencoding. -
PostData
(type:String.t
, default:nil
) - If the fetch uses HTTP POST, PUT, or PATCH protocol, and WantPostData is true, the POST data will be copied here. This is only for initial hop. If there are redirects, HTTP POST will be changed to GET on subsequent hops, and the PostData will be cleared. There is only one exception, if the HTTP response code to the POST request is 307 (a new code introduced in RFC7321, sec. 6.4.7), we will preserve the request method and the PostData for the next hop. -
TransparentRewrites
(type:list(String.t)
, default:nil
) - If the url got rewriten by transparent rewrites, here it is the series of rewrites it got through. The fetched one is the last -
HopCacheKeyForUpdate
(type:String.t
, default:nil
) - -
GeoCrawlFallback
(type:boolean()
, default:nil
) - Whether we fallback from geo crawl to local crawl during fetch. The fallback could happen in any hops and there can be at most one fallback because once fallback happens, we will not try geo-crawl anymore. -
ReuseInfo
(type:String.t
, default:nil
) - -------- Returns trawler::ReuseInfo with status of IMS/IMF/cache query. Consider using HopReuseInfo instead, which has per-redirect hop detail. If there's URL redirection, this field stores the reuse info of the last hop. For example, if the and URL redirect chain is [URL A] --> [URL B] --> [URL C], this field stores the reuse info of [URL C]. -
TimestampInMS
(type:String.t
, default:nil
) - When this reply came back from fetcher NOTE: TimestampInMS is used for internal debugging. To see when a document was crawled, check CrawlDates. -
CompressedBody
(type:boolean()
, default:nil
) - Is the associated body compressed ? -
HttpResponseHeaders
(type:String.t
, default:nil
) - HTTP headers from the response (initial hop). Trawler does not fill this in; this is intended as a placeholder for crawls like webmirror that fill in and want to track this across redirect hops. -
RedirectSourceFetchId
(type:String.t
, default:nil
) - If this fetch was a result of a redirect, we populate the parent ID here. -
RobotsStatus
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchStatus.t
, default:nil
) - Status of the robots.txt fetch. Currently, this is present if: - Certain robots error cases, such as URL_TIMEOUT-TIMEOUT_ROBOTS or URL_UNREACHABLE-UNREACHABLE_ROBOTS_ERROR. - If WantRobotsBody is set in the FetchParams. -
trafficType
(type:String.t
, default:nil
) - Traffic type of this fetch. -
HostId
(type:String.t
, default:nil
) - If known, the trawler::HostId that identifies the host (initial hop). -
HopRobotsInfo
(type:integer()
, default:nil
) - Extra information in robots.txt for this page (integer: or'ed together of type trawler::RobotsInfo) on a per-hop basis (initial hop) -
TotalFetchedSize
(type:String.t
, default:nil
) - How many raw bytes we read from the connection to the server when we fetched the page. Includes everything: HTTP headers, overhead for HTTP chunked encoding, whatever compressed/uncompressed form (i.e. gzip/deflate accept-encoding) the content was sent in, etc. This is NOT the same as the size of the uncompressed FetchReply::Body - if the webserver used gzip encoding, this value might be much smaller, since it only counts the compressed wire size. To illustrate, think of 3 sizes: 1) TotalFetchedSize - amount Trawler read over the wire from the server. If they used gzip/deflate, this might be 4-5x smaller than the body. 2) UnTruncatedSize/CutoffSize - how big is the full document, after uncompressing any gzip/deflate encoding? If truncated, this is reflected in CutoffSize. 3) FetchReply::Body size - most crawls enable Trawler compression to save storage space (gzip + a google html dictionary). The body size that the end Trawler client sees is post-compression. -
GeoCrawlLocationAttempted
(type:String.t
, default:nil
) - Set only when GeoCrawlFallback is true. Logs the geo crawl location we attempted but failed for this request. -
HttpVersion
(type:String.t
, default:nil
) - Stores the HTTP version we used in the final hop. -
HopCacheKeyForLookup
(type:String.t
, default:nil
) - Returns the cache key used when doing cache lookup/update, on a per-hop basis (initial hop) Note this field will not be set if cache lookup/update is disabled/skipped. -
EgressRegion
(type:String.t
, default:nil
) - If present, the edge region that we have used. -
RobotsInfo
(type:integer()
, default:nil
) - Extra information in robots.txt for this page (ORed together bits from trawler::RobotsInfo). e.g. nosnippet vs. noarchive vs nofollow vs noindex vs disallow Consider using HopRobotsInfo instead, which has per-redirect hop detail. -
HopReuseInfo
(type:String.t
, default:nil
) - Returns trawler::ReuseInfo with status of IMS/IMF/cache query, on a per-hop basis (initial hop) For example, if the URL redirect chain is [URL A] --> [URL B] --> [URL C], this field stores the reuse info of [URL A]. -
RequestorIPAddressPacked
(type:String.t
, default:nil
) - Machine that sent Trawler this request, for logging. An IPAddress object, packed as a string. -
redirects
(type:list(GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataRedirects.t)
, default:nil
) - -
ProtocolVersionFallback
(type:boolean()
, default:nil
) - Whether we fallback from HTTP/2 to HTTP/1.1 during fetch. The fallback could happen in any hops and there can be at most one fallback because once fallback happens, we will not try HTTP/2 anymore. -
BadSSLCertificate
(type:String.t
, default:nil
) - This field, if non-empty, contains the SSL certificate chain from the server. The filed should be serialized SSLCertificateInfo protobuf, although it used to be text format. Hence, one should ideally use trawler::CertificateUtil to check this field and understand in more detail. This field is populated in two cases: (1) something is wrong with the server certificate and we cannot verify the server's identity. In this case the URL most likely won't display in a browser; (2) if you turned on WantSSLCertificateChain in the FetchRequest. In this case the server certificate may be perfectly fine (despite the field name). This is for the initial hop; additional hops are in Redirects group. -
ClientServiceInfo
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerClientServiceInfo.t
, default:nil
) - Some client specific data that's set by client in the FetchRequest, and we just copy it here verbatim. This is similar to ClientInfo that we copy from FetchRequest to FetchReply, but this is copied to FetchReplyData, thus stored in trawler logs so can be useful for debugging cases. -
deliveryReport
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataDeliveryReport.t
, default:nil
) - Transfer operation detailed report. -
Url
(type:String.t
, default:nil
) - The original url in the request we are answering. Even though "optional," url must be filled in on all well-formed replies. Trawler guarantees that it is filled in, and basically every client expects it (CHECKs in some cases). -> Not filling this field in is a bug, if you share this data with other crawls/pipelines. You should expect everybody else to require a url. -
ThrownAwayBytes
(type:String.t
, default:nil
) - Sometimes we throw away content because we cannot store it in the internal buffers. These is how many bytes we have thrown away for this factor. -
HttpRequestHeaders
(type:String.t
, default:nil
) - The HTTP headers we sent to fetch this URL (initial hop). Not normally filled in, unless FetchParams.WantSentHeaders is set. -
LastUrlStatus
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchStatus.t
, default:nil
) - Crawl status of the last url on chain -
protocolresponse
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataProtocolResponse.t
, default:nil
) - -
EligibleGeoCrawlEgressRegion
(type:String.t
, default:nil
) - If present, it means this host might be eligible for geo crawl. However, this does not mean we enable geo-crawl for this request. Check "GeoCrawlEgressRegion" instead to see if this fetch is conducted via geo crawl. -
Status
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchStatus.t
, default:nil
) - Status of the fetch - refers to the final status at the end of the redirect chain. -
RequestorID
(type:String.t
, default:nil
) - RequestorId is the same on as in the request that triggers this reply -- mainly for diagnostics purpose -
TrawlerPrivate
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerTrawlerPrivateFetchReplyData.t
, default:nil
) - For logging only; not present in the actual fetcher response -
crawldates
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataCrawlDates.t
, default:nil
) - -
PolicyData
(type:list(GoogleApi.ContentWarehouse.V1.Model.TrawlerPolicyData.t)
, default:nil
) - Trawler can optionally add a policy label to a FetchReply. Some uses: - "spam" label via trawler_site_info - "roboted:googlebot" label as a signal to crawls supporting multiple useragents that it's not safe to share the fetch replies with googlebot crawls. -
HttpProtocol
(type:String.t
, default:nil
) - The http protocol we send to fetch this URL. This will only be set if the request is using http -
PredictedDownloadTimeMs
(type:integer()
, default:nil
) - This is available only if a fetch results in TIMEOUT_WEB, and we were able to predict, based on content length and bandwidth we were using, how much time (in ms) would be needed to download the entire content. -
fetchstats
(type:GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataFetchStats.t
, default:nil
) - -
RobotsTxt
(type:String.t
, default:nil
) - The robots.txt we used for this URL (initial hop). Not normally filled in unless WantRobotsBody is set. This is mostly for debugging purposes and should not be used for large volumes of traffic.
Summary
Functions
Unwrap a decoded JSON object into its complex fields.
Types
@type t() :: %GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyData{ BadSSLCertificate: String.t() | nil, ClientServiceInfo: GoogleApi.ContentWarehouse.V1.Model.TrawlerClientServiceInfo.t() | nil, CompressedBody: boolean() | nil, CrawlTimes: GoogleApi.ContentWarehouse.V1.Model.TrawlerCrawlTimes.t() | nil, DNSHost: String.t() | nil, DownloadTime: integer() | nil, EgressRegion: String.t() | nil, EligibleGeoCrawlEgressRegion: String.t() | nil, Endpoints: GoogleApi.ContentWarehouse.V1.Model.TrawlerTCPIPInfo.t() | nil, Events: [GoogleApi.ContentWarehouse.V1.Model.TrawlerEvent.t()] | nil, FetchPatternFp: String.t() | nil, FlooEgressRegion: String.t() | nil, GeoCrawlEgressRegion: String.t() | nil, GeoCrawlFallback: boolean() | nil, GeoCrawlLocationAttempted: String.t() | nil, HSTSInfo: String.t() | nil, HTTPTrailers: [GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataHTTPHeader.t()] | nil, HopCacheKeyForLookup: String.t() | nil, HopCacheKeyForUpdate: String.t() | nil, HopReuseInfo: String.t() | nil, HopRobotsInfo: integer() | nil, HostBucketData: GoogleApi.ContentWarehouse.V1.Model.TrawlerHostBucketData.t() | nil, HostId: String.t() | nil, HttpProtocol: String.t() | nil, HttpRequestHeaders: String.t() | nil, HttpResponseHeaders: String.t() | nil, HttpVersion: String.t() | nil, ID: String.t() | nil, LastUrlStatus: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchStatus.t() | nil, PolicyData: [GoogleApi.ContentWarehouse.V1.Model.TrawlerPolicyData.t()] | nil, PostData: String.t() | nil, PredictedDownloadTimeMs: integer() | nil, ProtocolVersionFallback: boolean() | nil, RedirectSourceFetchId: String.t() | nil, RequestorID: String.t() | nil, RequestorIPAddressPacked: String.t() | nil, ReuseInfo: String.t() | nil, RobotsInfo: integer() | nil, RobotsStatus: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchStatus.t() | nil, RobotsTxt: String.t() | nil, Status: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchStatus.t() | nil, ThrottleClient: GoogleApi.ContentWarehouse.V1.Model.TrawlerThrottleClientData.t() | nil, ThrownAwayBytes: String.t() | nil, TimestampInMS: String.t() | nil, TotalFetchedSize: String.t() | nil, TransparentRewrites: [String.t()] | nil, TrawlerPrivate: GoogleApi.ContentWarehouse.V1.Model.TrawlerTrawlerPrivateFetchReplyData.t() | nil, Url: String.t() | nil, UrlEncoding: integer() | nil, UseHtmlCompressDictionary: boolean() | nil, crawldates: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataCrawlDates.t() | nil, deliveryReport: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataDeliveryReport.t() | nil, fetchstats: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataFetchStats.t() | nil, originalProtocolUrl: String.t() | nil, partialresponse: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataPartialResponse.t() | nil, protocolresponse: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataProtocolResponse.t() | nil, redirects: [GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataRedirects.t()] | nil, trafficType: String.t() | nil, webioInfo: GoogleApi.ContentWarehouse.V1.Model.TrawlerFetchReplyDataWebIOInfo.t() | nil }
Functions
Unwrap a decoded JSON object into its complex fields.