This article deals with the problem of detecting whether a request to your website was made by Googlebot. Based on that, you can, for example, display different content to Google's crawler than to human visitors. This practice is generally called cloaking, and search engines usually penalize it. But there are exceptions where cloaking is allowed and convenient. One good example is allowing Googlebot to index content that is behind a paywall.
My solution is for the Django framework, but it can be adapted to any web application written in Python. I use several pip packages:
- user-agents together with django-user_agents to detect the type of user agent
- django-ipware to detect the request's IP address
You can install these packages with this command:
pip install pyaml ua-parser user-agents django-user_agents django-ipware
Google recommends verifying its crawlers by performing DNS lookups: a reverse lookup on the requesting IP address, followed by a forward lookup on the returned hostname. My solution follows this recommendation. However, I don't want to perform a DNS lookup for every request coming to my website, which is why I check the agent type first. Here is the solution:
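import socket

from django_user_agents.utils import get_user_agent
from ipware import get_client_ip


def is_google_bot(request):
    # Cheap check first: skip the DNS lookups for anything that is not a bot.
    user_agent = get_user_agent(request)
    if not user_agent.is_bot:
        return False

    ip, _ = get_client_ip(request)
    if ip is None:
        return False

    try:
        # Reverse lookup; gethostbyaddr returns (hostname, aliases, addresses).
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.error):
        return False

    domain_name = ".".join(host.split('.')[1:])
    if domain_name not in ['googlebot.com', 'google.com']:
        return False

    try:
        # Forward lookup must resolve back to the original IP address.
        host_ip = socket.gethostbyname(host)
    except socket.error:
        return False
    return host_ip == ip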
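The hostname check in the function above can also be pulled out into a small pure function, which makes that step testable without any network access. The sketch below is only an illustration: the names `GOOGLE_DOMAINS` and `has_google_domain` are hypothetical, and it uses a suffix match instead of the split used above, so it also accepts hostnames nested more than one level below the domain.

```python
# Hypothetical helper, not part of the original solution.
GOOGLE_DOMAINS = ('googlebot.com', 'google.com')


def has_google_domain(host):
    """Return True if host is a subdomain of one of GOOGLE_DOMAINS."""
    # Normalize: drop a trailing dot (fully qualified form) and ignore case.
    host = host.rstrip('.').lower()
    # Require a leading dot so that 'fakegooglebot.com' does not match.
    return any(host.endswith('.' + domain) for domain in GOOGLE_DOMAINS)
```

Matching on the leading dot matters: a plain `endswith('googlebot.com')` would accept any domain that merely ends with that string.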
You can decide to solve this without the external libraries I've used, but check their code first. The problems they deal with are not as straightforward as you might think at first.
Be aware that the is_google_bot function is going to return False for requests coming from tools such as Google's structured data testing tool, for two reasons:
- the user-agents library does not detect them as bots
- the detected IP address belongs to the user of such a tool, not to the tool itself
Therefore, testing this function can be tricky. Below is the unit test skeleton that I used for testing, but I don't recommend keeping such a test in your codebase because it hardcodes a Google IP address and performs DNS queries.
from django.test import RequestFactory, TestCase

from utils import is_google_bot


class IsGoogleBotTestCase(TestCase):
    def setUp(self):
        self.factory = RequestFactory()
        self.ua_google_bot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

    def create_request(self, ip=None, ua=None):
        request = self.factory.get('/')
        if ua:
            request.META['HTTP_USER_AGENT'] = ua
        if ip:
            request.META['REMOTE_ADDR'] = ip
        return request

    def test_is_google_bot(self):
        # google bot ip address comes from https://support.google.com/webmasters/answer/80553?hl=en
        r = self.create_request('18.104.22.168', self.ua_google_bot)
        self.assertTrue(is_google_bot(r))

        r = self.create_request('127.0.0.1', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36')
        self.assertFalse(is_google_bot(r))
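Since the test above hardcodes an IP address and performs real DNS queries, one alternative is to patch the two socket calls so the test never touches the network. The following is a minimal sketch under stated assumptions, not the code I actually used: `verify_google_ip` is a hypothetical framework-free copy of the DNS part of is_google_bot, `check_with_fake_dns` is a hypothetical test helper, and the IP addresses and hostnames are made up (taken from the 192.0.2.0/24 documentation range).

```python
import socket
from unittest import mock

GOOGLE_DOMAINS = ('googlebot.com', 'google.com')


def verify_google_ip(ip):
    """Reverse lookup, domain check, then forward lookup must match ip."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if ".".join(host.split('.')[1:]) not in GOOGLE_DOMAINS:
        return False
    try:
        return socket.gethostbyname(host) == ip
    except OSError:
        return False


def check_with_fake_dns(ip, reverse_host, forward_ip):
    """Run verify_google_ip with both DNS lookups replaced by canned answers."""
    fake_reverse = mock.patch('socket.gethostbyaddr',
                              return_value=(reverse_host, [], [ip]))
    fake_forward = mock.patch('socket.gethostbyname',
                              return_value=forward_ip)
    with fake_reverse, fake_forward:
        return verify_google_ip(ip)


# Made-up values: matching reverse and forward records are accepted.
print(check_with_fake_dns('192.0.2.1', 'crawl-192-0-2-1.googlebot.com', '192.0.2.1'))
# A hostname outside the allowed domains is rejected.
print(check_with_fake_dns('192.0.2.1', 'host.example.com', '192.0.2.1'))
```

The same patches can be applied inside a Django TestCase around is_google_bot itself, because that function also looks up gethostbyaddr and gethostbyname through the socket module.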