PDFs and media files can slip into search results even when we never meant them to. That usually happens because we apply HTML rules to files that are not HTML, and those rules only reach so far.
The fix is straightforward once we know where to look. The X-Robots-Tag header gives us direct control over PDFs, images, videos, and other non-HTML files, so we can block indexing, allow indexing, or tighten how search engines handle each asset.
When we set it up well, we clean up search visibility without guessing. That matters whether we are keeping internal documents out of Google or making sure the right file is the one that ranks. First, let’s look at what the header actually does.
What the X-Robots-Tag Header Does
The X-Robots-Tag is an HTTP response header. That means it travels with the file when the server sends it to a crawler. We use it when the asset itself needs instructions, not the HTML page around it.
That matters because PDFs, images, and videos do not give us an HTML <meta> robots tag. The header fills that gap. Google documents this behavior in its robots meta tag specifications, and its page-level granularity update explains why the header exists in the first place.
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow
Cache-Control: public, max-age=3600
That kind of response tells a crawler what to do with the file before anything is rendered. MDN’s X-Robots-Tag reference is also useful when we want a plain-language recap of the header and its common directives.

The main idea is simple. If the crawler can fetch the file, it can read the header. If it cannot fetch the file, it cannot read the instruction.
How We Implement X-Robots-Tag for PDFs
For PDFs, we usually set the header at the server, CDN, or application layer. The PDF file does not need HTML. It only needs the right response headers when the request is made.
That is why PDF handling feels different from page SEO. If we are used to HTML pages, it helps to compare this with our noindex tag implementation guide, because the goal is similar even though the delivery method is different. On a page, we place a meta tag in the head. On a PDF, we send a header with the file.
The most common setup is simple:
X-Robots-Tag: noindex
If we want to stop the file from appearing in search results, that is usually the cleanest approach. If we also want to reduce link following inside the file, we can add nofollow, although support can vary by crawler and document type. We should test it, not assume it behaves exactly the same everywhere.
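If we manage the server directly, a minimal sketch for Apache with mod_headers enabled might look like this; the .pdf pattern and the directive are examples to adapt, not a universal rule.
<FilesMatch "\.pdf$">
  # Send the noindex instruction with every PDF response (requires mod_headers)
  Header set X-Robots-Tag "noindex"
</FilesMatch>
The same outcome can usually be reached at the CDN or application layer, as long as the header ends up on the response that actually returns the file.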

Here is a quick look at the directives we reach for most often.
| Directive | Best use | What it changes |
|---|---|---|
| noindex | PDFs we do not want in search results | Keeps the file out of the index after crawl |
| nofollow | Files with links we do not want crawled through | Tells supported crawlers not to follow links in the file |
| nosnippet | Assets where we want to limit preview text | Reduces or removes snippets in results |
| indexifembedded | PDFs that are meant to be embedded on a page | Lets the file be indexed when it appears in an approved embed |
The big takeaway is this: the directive needs to match the job. If we want the file removed from search results, noindex is the starting point. If we want the PDF to support a page, not compete with it, we need to be more careful with how the file is exposed.
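For that embed case, Google's documentation shows indexifembedded working together with noindex, so the file's response can carry both:
X-Robots-Tag: noindex, indexifembedded
The PDF then stays out of results at its own URL but can still be indexed through the page that embeds it.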
X-Robots-Tag for Images, Videos, and Other Media
The same header works for other non-HTML assets too. That includes images, videos, and some document formats. This is where the header becomes especially useful, because media files rarely have their own HTML wrapper.
If we run a gallery, media library, or video archive, we often have two separate goals. One is to keep the media file under control. The other is to let the supporting page rank. Those are not the same thing.
For example, an image file may need noindex, but the HTML product page that uses that image may still need to rank. In that case, we control the file, not the page. That is a good fit for modern Google guidance on non-HTML content, and it is one reason the X-Robots-Tag header keeps showing up in technical audits.
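As a sketch of what that looks like at the server layer, again assuming Apache with mod_headers and an example set of extensions:
<FilesMatch "\.(jpe?g|png|gif|mp4|webm)$">
  # Keep the media files themselves out of the index; the HTML pages that use them are unaffected
  Header set X-Robots-Tag "noindex"
</FilesMatch>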

This is also where rendering gets tricky. Search engines do not render a JPG, MP4, or PDF the same way they render a page. They fetch the asset, read the response, and decide what to do next. So if the media file is blocked by auth, hidden behind the wrong rule, or stripped by the CDN, the crawler may never see the header at all.
That is why we treat the file, the page, and the delivery layer as a set. If one part is out of sync, the whole setup gets messy.
How It Fits with Robots.txt, Canonicals, and Crawl Budget
It helps to separate the tools. They solve related problems, but they do not do the same job.
| Tool | Best use | What it does not do |
|---|---|---|
| robots.txt | Stop crawling of private or low-value paths | It does not remove indexed URLs by itself |
| X-Robots-Tag | Control indexing for PDFs, images, videos, and other non-HTML files | It does not block crawling if the file is accessible |
| Canonical tags | Consolidate duplicate versions of a page or file | They do not block indexing on their own |
If we are still shaping crawl access, our robots.txt SEO best practices guide is the right companion piece. robots.txt can keep crawlers out, but it cannot tell them what to do with a file they already found.
Canonicalization is the same kind of separate step. If the same PDF exists at multiple URLs, we need to decide which version is preferred. Our canonical SEO for indexing guide covers the page side of that problem, and the same thinking applies to file libraries. A canonical helps consolidate signals. It does not replace noindex.
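Google supports rel="canonical" as an HTTP Link header on non-HTML files, so the preference can travel with the PDF itself. A sketch with placeholder URLs:
HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <https://www.example.com/downloads/report.pdf>; rel="canonical"
The header form consolidates signals the same way the tag does on a page, and it pairs cleanly with X-Robots-Tag on the copies we do not want indexed.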
This is where crawl budget enters the picture too. A large media library can eat crawl time fast, especially when duplicates or dated files pile up. Our crawl budget optimization strategies guide pairs well with this topic because the more noise we remove, the more likely search engines are to spend time on the assets that matter.
Troubleshooting When Files Still Show Up in Search
If a PDF or media file still appears in search results after we set the header, we usually have a delivery problem, not a search problem.
The first thing we check is the final response. The header has to be on the response that returns the file, not only on a redirect or on the page that links to the file. If a CDN, storage bucket, or application layer strips the header, the crawler never gets the message.
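A quick way to see what a crawler receives is to request the file ourselves and read the headers on the final response; curl works well for this, with a placeholder URL below.
# Fetch headers only and follow redirects to the final response
curl -sIL https://www.example.com/files/report.pdf
If X-Robots-Tag is missing from that output, something in the delivery chain is stripping it before the crawler ever sees it.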
Next, we check access. If robots.txt blocks the crawler before it can fetch the file, it may never read the header at all. That is why blocking and deindexing are different steps. If we want Google to see the instruction, we usually need to allow the crawl first.
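As a concrete example, a rule like the one below (with a placeholder path) would stop compliant crawlers from ever fetching those PDFs, so any X-Robots-Tag on them would go unread.
User-agent: *
Disallow: /downloads/
Removing or narrowing that rule lets the crawler fetch the file and finally see the noindex instruction.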
Then we look for duplicates. A file can live in more than one place, and one copy may still be indexable. If that happens, we need to clean up the extra URLs or point them to the preferred version.
Finally, we give search engines time. Even when the header is correct, cached results can stick around until the next crawl. That is normal. The important part is making sure the live response is right.
A Quick Checklist for PDFs and Media Files
Before we ship a file setup, we usually run through a short list.
- Use noindex on PDFs, images, or videos we do not want in search results.
- Keep the file crawlable if we want search engines to read the header.
- Put the header on the final response, especially after redirects.
- Keep canonical signals aligned when the same file exists at multiple URLs.
- Check the CDN, object storage, and server config after each deployment.
- Review media libraries when crawl activity looks wasteful or uneven.
That checklist keeps us from mixing up crawling, indexing, and duplication. It also makes troubleshooting much easier later, because we know which layer is responsible for which decision.
Conclusion
The X-Robots-Tag header gives us control over files that HTML tags can’t handle well. That makes it one of the cleanest ways to manage PDFs, images, videos, and other non-HTML assets.
If we remember one thing, it’s this: the file has to be crawlable before the crawler can read the instruction. Once we get that part right, we can keep the right assets visible and keep the wrong ones out of search. That is a simple fix with a big payoff.