Web scraping
Even though some sites offer APIs, most websites are designed for human readers and only provide HTML pages formatted for display. If we want a program to extract data from such a site, we have to parse the markup to get the information we need. Web scraping is the technique of using a program to analyze a web page and pull out the data it contains.
There are several ways to fetch a site's content with Python modules:
- Use `urllib`/`urllib2` to create an HTTP request that fetches the web page, and `BeautifulSoup` to parse the HTML
- Use Scrapy (http://scrapy.org) to parse an entire website; it helps to create web spiders
- Use the `requests` module to fetch pages and `lxml` to parse them (a short sketch follows this list)
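The third approach is not demonstrated later in this section, so the following is a minimal sketch of fetching a page with `requests` and pulling the link URLs out of it with `lxml`; the target URL is only a placeholder:

    import requests
    from lxml import html

    # Fetch the page (placeholder URL)
    response = requests.get("http://example.com/")

    # Build an lxml element tree from the raw HTML
    tree = html.fromstring(response.content)

    # Extract every href attribute with an XPath expression
    links = tree.xpath('//a/@href')
    print links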
urllib/urllib2 modules
urllib is a high-level module that allows us to write scripts that work with services over protocols such as HTTP, HTTPS, and FTP.
Useful methods of urllib/urllib2
urllib/urllib2 provide methods for getting resources from URLs, including opening web pages, encoding arguments, creating and manipulating headers, and more. Let's go through some of the most useful ones:
- Open a web page with `urlopen()`. When we pass a URL to `urlopen()`, it returns a file-like object, and we can call its `read()` method to get the page data as a string, as follows:
    import urllib
    url = urllib.urlopen("http://packtpub.com/")
    data = url.read()
    print data
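As an aside, on Python 3 `urlopen()` lives in `urllib.request` (and `urlencode()`, used below, in `urllib.parse`), and `read()` returns bytes rather than a string. A rough equivalent of the fetch above, assuming the page is UTF-8 encoded:

    # Python 3 version of the same fetch
    from urllib.request import urlopen

    url = urlopen("http://packtpub.com/")
    data = url.read().decode("utf-8")  # read() returns bytes; decode assuming UTF-8
    print(data)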
- The next method is `urlencode()`, which handles parameter encoding. It takes a dictionary of fields as input and creates a URL-encoded string of parameters:
    import urllib
    fields = {
        'name': 'Sean',
        'email': 'Sean@example.com'
    }
    parms = urllib.urlencode(fields)
    print parms
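- The output is a URL-encoded query string such as `name=Sean&email=Sean%40example.com`; note how the `@` character becomes `%40`.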
- We can also send requests with parameters. For a GET request, the URL is crafted by appending the URL-encoded parameters to it:
    import urllib
    fields = {
        'name': 'Sean',
        'email': 'Sean@example.com'
    }
    parms = urllib.urlencode(fields)
    u = urllib.urlopen("http://example.com/login?" + parms)
    data = u.read()
    print data
- With the POST request method, the URL-encoded parameters are passed to `urlopen()` as a separate argument:
    import urllib
    fields = {
        'name': 'Sean',
        'email': 'Sean@example.com'
    }
    parms = urllib.urlencode(fields)
    u = urllib.urlopen("http://example.com/login", parms)
    data = u.read()
    print data
- The HTTP response headers can be retrieved with the `info()` method, which returns a dictionary-like object:
    u = urllib.urlopen("http://packtpub.com", parms)
    response_headers = u.info()
    print response_headers
- Printing `response_headers` displays the raw header lines of the response (server, content type, caching directives, and so on).
- We can also use `keys()` to get all the response header keys:
    >>> print response_headers.keys()
    ['via', 'x-country-code', 'age', 'expires', 'server', 'connection', 'cache-control', 'date', 'content-type']
- We can access each entry as follows:
    >>> print response_headers['server']
    nginx/1.4.5
Note
urllib does not support cookies or authentication, and it only supports GET and POST requests. urllib2 is built on top of urllib and adds these and many more features.
- We can get the status code of the response from the `code` attribute:
    u = urllib.urlopen("http://packtpub.com", parms)
    response_code = u.code
    print response_code
- We can modify the request headers with `urllib2` as follows:
    import urllib2

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0'
    }
    request = urllib2.Request("http://packtpub.com/", headers=headers)
    url = urllib2.urlopen(request)
    response = url.read()
- Cookies can be used as follows:
    import urllib
    import urllib2

    fields = {
        'name': 'sean',
        'password': 'password!',
        'login': 'LogIn'
    }
    # Create a custom opener with cookie handling enabled
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    # Create the login request
    request = urllib2.Request("http://example.com/login", urllib.urlencode(fields))
    # Send the login request
    url = opener.open(request)
    response = url.read()

    # Now we can access the private pages with the cookie
    # received from the login request above
    url = opener.open("http://example.com/dashboard")
    response = url.read()
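The note above mentions that `urllib2` also supports authentication. As a minimal sketch, HTTP Basic authentication can be handled with a custom opener; the URL and credentials here are placeholders, and the realm is left as `None` so the credentials apply to any realm at that URL:

    import urllib2

    # Register placeholder credentials for the protected site
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, "http://example.com/", "user", "secret")

    # Build an opener that answers Basic authentication challenges
    auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    opener = urllib2.build_opener(auth_handler)

    response = opener.open("http://example.com/protected")
    print response.read()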
Requests module
We can also use the `requests` module instead of `urllib`/`urllib2`. It is often the better option: it supports the full range of HTTP methods of a REST API, and it simply takes a dictionary of parameters as an argument, with no manual encoding required:
    import requests

    parms = {'name': 'Sean', 'email': 'Sean@example.com'}
    response = requests.get("http://packtpub.com", params=parms)

    # Response
    print response.status_code      # Response code
    print response.headers          # Response headers
    print response.content          # Response content

    # Request
    print response.request.headers  # Headers we sent
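As a rough sketch of how the earlier cookie-based login would look with `requests` (the URLs and form fields are the same placeholders as before), a `Session` object keeps cookies across requests:

    import requests

    fields = {'name': 'sean', 'password': 'password!', 'login': 'LogIn'}

    # A Session persists cookies across requests
    session = requests.Session()
    response = session.post("http://example.com/login", data=fields)

    # The session sends the login cookie automatically
    dashboard = session.get("http://example.com/dashboard")
    print dashboard.content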
Parsing HTML using BeautifulSoup
The preceding modules are only useful for fetching content. If we want to parse the HTML obtained via `urlopen`, we can use the `BeautifulSoup` module. `BeautifulSoup` takes raw HTML or XML and pulls data out of it. To parse a document, we create a `BeautifulSoup` object and feed it the markup; it builds a parse tree that we can then navigate and search. Beautiful Soup 4 works on both Python 2.6+ and Python 3.
The following are some simple examples:
- To prettify the HTML, use the following code:
    from bs4 import BeautifulSoup

    parse = BeautifulSoup('<html><head><title>Title of the page</title></head><body><p id="para1" align="center">This is a paragraph<b>one</b><a href="http://example1.com">Example Link 1</a> </p><p id="para2">This is a paragraph<b>two</b><a href="http://example2.com">Example Link 2</a></p></body> </html>')
    print parse.prettify()
- `prettify()` returns the parsed markup re-indented, with each tag and each piece of text on its own line.
- Some example ways to navigate through the parsed HTML with `BeautifulSoup` are as follows:
    parse.contents[0].name
    >>> u'html'
    parse.contents[0].contents[0].name
    >>> u'head'
    head = parse.contents[0].contents[0]
    head.parent.name
    >>> u'html'
    head.next
    >>> <title>Title of the page</title>
    head.nextSibling.name
    >>> u'body'
    head.nextSibling.contents[0]
    >>> <p id="para1" align="center">This is a paragraph<b>one</b><a href="http://example1.com">Example Link 1</a> </p>
    head.nextSibling.contents[0].nextSibling
    >>> <p id="para2">This is a paragraph<b>two</b><a href="http://example2.com">Example Link 2</a></p>
- Some ways to search through the HTML for tags and properties are as follows:
    parse.find_all('a')
    >>> [<a href="http://example1.com">Example Link 1</a>, <a href="http://example2.com">Example Link 2</a>]
    parse.find(id="para2")
    >>> <p id="para2">This is a paragraph<b>two</b><a href="http://example2.com">Example Link 2</a></p>
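Tag attributes can be read with `get()`, which is what the image download script below relies on. A small sketch against the same parsed document:

    # Pull the href attribute out of every <a> tag
    [a.get('href') for a in parse.find_all('a')]
    >>> [u'http://example1.com', u'http://example2.com']

    # find_all() also accepts an attribute filter
    parse.find_all('p', attrs={'id': 'para2'})
    >>> [<p id="para2">This is a paragraph<b>two</b><a href="http://example2.com">Example Link 2</a></p>]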
Download all images on a page
Now we can write a script to download all images on a page and save them in a specific location:
    # Importing required modules
    import os
    import sys
    import requests
    from bs4 import BeautifulSoup
    import urlparse  # urlparse is renamed to urllib.parse in Python 3

    # Get the page with requests
    response = requests.get('http://www.freeimages.co.uk/galleries/food/breakfast/index.htm')

    # Parse the page with BeautifulSoup
    parse = BeautifulSoup(response.text)

    # Get all image tags
    image_tags = parse.find_all('img')

    # Get the URLs of the images
    images = [url.get('src') for url in image_tags]

    # Exit if no images were found on the page
    if not images:
        sys.exit("Found No Images")

    # Convert relative URLs to absolute URLs, if any
    images = [urlparse.urljoin(response.url, url) for url in images]
    print 'Found %s images' % len(images)

    # Make sure the destination folder exists
    if not os.path.isdir('downloaded'):
        os.makedirs('downloaded')

    # Download the images to the downloaded folder
    for url in images:
        r = requests.get(url)
        f = open('downloaded/%s' % url.split('/')[-1], 'wb')  # binary mode for image data
        f.write(r.content)
        f.close()
        print 'Downloaded %s' % url
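For large images it may be worth streaming each download instead of holding the whole body in memory. A variant of the download loop using requests' `stream=True` and `iter_content()` (the chunk size is an arbitrary choice):

    # Stream each image to disk in chunks instead of buffering it all in memory
    for url in images:
        r = requests.get(url, stream=True)
        with open('downloaded/%s' % url.split('/')[-1], 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
        print 'Downloaded %s' % url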