25 May 2018, in IT

Detecting Google bot with Python and Django

This article deals with the problem of detecting that a request to your website was made by Googlebot. Based on that, you can, for example, show different content to the Google crawler than to a human visitor. This practice is generally called cloaking, and search engines usually penalize it. But there are exceptions where cloaking is allowed and convenient. One good example is allowing Googlebot to index content that sits behind a paywall.

My solution is for the Django framework, but it can be adapted to any web application written in Python. I use several pip packages: pyaml, ua-parser, user-agents, django-user_agents and django-ipware.

You can install these packages with this command:

pip install pyaml ua-parser user-agents django-user_agents django-ipware

Google recommends verifying its crawlers by performing DNS lookups, and my solution follows this recommendation. However, I don't want to perform a DNS lookup for every request coming to my website, which is why I check the user agent type first. Here is the solution:

import socket

from django_user_agents.utils import get_user_agent
from ipware import get_client_ip

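def is_google_bot(request):
    # Cheap check first: skip the DNS lookups unless the user agent looks like a bot.
    user_agent = get_user_agent(request)
    if not user_agent.is_bot:
        return False
    ip, _ = get_client_ip(request)
    if ip is None:
        return False
    try:
        # Reverse DNS lookup: IP address -> host name.
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.error):
        return False
    # Drop the first label, e.g. crawl-66-249-66-1.googlebot.com -> googlebot.com.
    domain_name = ".".join(host.split('.')[1:])
    if domain_name not in ['googlebot.com', 'google.com']:
        return False
    try:
        # Forward DNS lookup: the host name must resolve back to the same IP.
        host_ip = socket.gethostbyname(host)
    except socket.error:
        return False
    return host_ip == ip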
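
Checking the user agent first already avoids most DNS lookups. If the remaining ones still turn out to be too many, one further option is to remember the verdict per IP address for a while. The following is only a minimal sketch of that idea and is not part of the solution above: the class name CrawlerVerificationCache, the verify callable and the one-hour default are all assumptions made for illustration.

```python
import time


class CrawlerVerificationCache:
    """Remembers the result of an expensive per-IP check for a limited time.

    Hypothetical helper: `verify` is any callable that takes an IP string
    and returns True/False, for example the DNS part of is_google_bot.
    """

    def __init__(self, verify, ttl_seconds=3600, clock=time.monotonic):
        self._verify = verify
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry can be tested
        self._entries = {}           # ip -> (expires_at, result)

    def is_verified(self, ip):
        now = self._clock()
        cached = self._entries.get(ip)
        if cached is not None and cached[0] > now:
            # Still fresh: answer without calling verify again.
            return cached[1]
        result = bool(self._verify(ip))
        self._entries[ip] = (now + self._ttl, result)
        return result
```

Note that this sketch keeps entries in a plain dictionary that never shrinks and is local to one process, so a real deployment would probably want a bounded or shared cache instead.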

You can decide to solve this without the external libraries I've used, but check the code of these libraries first. The problems they deal with are not as straightforward as you might think at first.
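
To give one example of why this is not straightforward, consider finding the client IP address. Behind a reverse proxy, REMOTE_ADDR is the address of the proxy, while the X-Forwarded-For header can be set to anything by the client. The sketch below is a simplified illustration of that problem only. It is not how django-ipware works, and the function name client_ip_from_meta and the trusted_proxies parameter are assumptions made for this example.

```python
import ipaddress


def client_ip_from_meta(meta, trusted_proxies=0):
    """Pick the client IP from a WSGI/Django-style META dictionary.

    trusted_proxies is the number of proxies you operate in front of the
    application (hypothetical parameter). Each of them appends the address
    of its peer to X-Forwarded-For, so only the last `trusted_proxies`
    entries of the header can be trusted.
    """
    if trusted_proxies <= 0:
        # No proxy of ours in front: ignore the header, it may be spoofed.
        candidate = meta.get("REMOTE_ADDR", "")
    else:
        header = meta.get("HTTP_X_FORWARDED_FOR", "")
        hops = [hop.strip() for hop in header.split(",") if hop.strip()]
        if len(hops) < trusted_proxies:
            return None
        candidate = hops[-trusted_proxies]
    try:
        # Validate and normalize; rejects empty or malformed values.
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None
```

With one trusted proxy, a header such as "66.249.66.1, 198.51.100.2" yields 198.51.100.2, because the first entry was supplied by the client and may be forged.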

Be aware that the is_google_bot function returns False for requests coming from tools such as Google's structured data testing tool, for two reasons:

  • the user-agents library does not detect them as bots
  • the detected IP address belongs to the user of such a tool, not to the tool itself

Therefore, testing this function can be tricky. Below is the skeleton of the unit test I used, but I don't recommend keeping such a test in your codebase, because it hardcodes a Google IP address and performs DNS queries.

from django.test import RequestFactory, TestCase
from utils import is_google_bot

class IsGoogleBotTestCase(TestCase):

    def setUp(self):
        self.factory = RequestFactory()
        self.ua_google_bot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

    def create_request(self, ip=None, ua=None):
        request = self.factory.get('/')
        if ua:
            request.META['HTTP_USER_AGENT'] = ua
        if ip:
            request.META['REMOTE_ADDR'] = ip
        return request

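    def test_is_google_bot(self):
        # The Googlebot IP address comes from
        # https://support.google.com/webmasters/answer/80553?hl=en
        # It is left empty in this skeleton and has to be filled in
        # before the first assertion can pass.
        r = self.create_request('', self.ua_google_bot)
        self.assertTrue(is_google_bot(r))

        r = self.create_request('', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36')
        self.assertFalse(is_google_bot(r))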
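
If you do want a test that can stay in the codebase, one possible approach is to replace the DNS calls with fakes, so that no real Google address and no network access are needed. The sketch below assumes the DNS part of the check has been pulled out into its own function. The names ip_belongs_to_google and check_with_fake_dns are hypothetical, and the addresses come from the reserved documentation range 192.0.2.0/24.

```python
import socket
from unittest import mock

GOOGLE_DOMAINS = ("googlebot.com", "google.com")


def ip_belongs_to_google(ip):
    """DNS part of the check only: reverse lookup, domain test, forward lookup."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        domain_name = ".".join(host.split(".")[1:])
        if domain_name not in GOOGLE_DOMAINS:
            return False
        return socket.gethostbyname(host) == ip
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False


def check_with_fake_dns(ip, reverse_host, forward_ip):
    """Run the check with both DNS lookups replaced by canned answers."""
    fake_reverse = mock.patch(
        "socket.gethostbyaddr", return_value=(reverse_host, [], [ip])
    )
    fake_forward = mock.patch("socket.gethostbyname", return_value=forward_ip)
    with fake_reverse, fake_forward:
        return ip_belongs_to_google(ip)
```

In a Django TestCase, the same mock.patch calls could wrap a call to is_google_bot with a request built by RequestFactory, provided the patch targets match the module where socket is looked up.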
