How do we extract data from a site without any knowledge of it’s API? Python, urllib and BeautifulSoup are commonly described as standard tools for webscraping although they lack the features and rendering qualities of real browsers. There seems to be a race for an ultimate webscraping Python package, although the web is constantly changing and most data are text based.
In the sections below I will cover three different data acquisition strategies. In conclusion I suggest you combine tools and prepare to learn the intricacies of regular expressions.
We will use AutoHotKeys, spreadsheets and Python with packages such as re, requests and csv.
Scraping Bilbasen.dk: AutoHotKeys, Python packages requests, re and csv
You can download script based automation software on this page: autohotkey.com
Once installed you write AutoHotKey scripts in a text editor and save it as a file with postfix .ahk
A webpage can be saved using the syntax UrlDownloadToFile, http://websiteurl, sourcefilename.txt
Just put lines in a file to extract several webpages, then compile and run from a Windows folder using the right button mouse menu. Text files resembles a direct download of html files and are thus a high quality source of information for many purposes. Combine with a spreadsheet and the CONCATENATE text function to generate your ahk script file.
UrlDownloadToFile, http://www.bilbasen.dk, BilbasenFrontPage.txt
In practice we search for a particular brand of car (Volkswagen) on Bilbasen.dk and traverse through search results, which gives us 119 pages to scrape for links to specific cars. The search result pages have very similar structure making it easy to prepare lines for a AutoHotKey script in a spreadsheet.
From html source files we extract data in Python using simple iteration techniques and basic Python packages such as requests and re. The former Python package requests is preferred to urllib packages due to compatibility reasons. Perl like regular expressions are applicable with re.
import re resultpage=’1′
for i in range(2, 119)
print(‘Error reading file Bilbasen’+str(i)+’.txt’)
It can be useful to remove whitespace and other obnoxious characters before processing. This is done with the replace method. In the example above listhtml is a Python object with variables and methods to call. Remember Python is an object oriented scripting language a list of variables and methods are given with the dir(Object_Name) command. Indentation to indicate scope is a very import aspect of Python syntax, so please be careful when you copy-paste Python code. There are many pitfalls for newbies.
We proceed to the actual webscraping process, in which I use simple regular expressions to extract data. Variables for each car is collected in a list and stored in a comma separated file.
import csv Varnames=”manufactorer;model;variant;fuel;year;listingID;listingValue;dealerID;price;avrprice;km;postnr;by;tlf;udstyr” with open(‘vw.csv’, ‘w’) as csvfile: specswriter = csv.writer(csvfile, delimiter=’;’,quotechar=’|’, quoting=csv.QUOTE_MINIMAL) specswriter.writerow([Varnames]) for urlbil in link: carinfo=”” try: url=’http://www.bilbasen.dk’+urlbil html=requests.get(url) source=html.text.replace(“\””,”).replace(“‘”,”).replace(“\s”,”).replace(“\t”,”) keywords=re.search(r'(?<=)',source,re.M) except: print(“Failed to read presentation page”)
try: manufactorer=re.search(r'[^,]*(?=,)’,keywords.group(),re.M).group().replace(” “,”) carinfo=carinfo+manufactorer+’;’ except: print(“Could not retrieve manufactorer”) carinfo=carinfo+”;” try: model=re.search(r'(?<=,)[^,]*(?=,)',keywords.group(),re.M).group().replace(" ",'') carinfo=carinfo+model+’;’ except: print(“Could not retrieve model”) carinfo=carinfo+”;” try: variant=re.search(r'(?<=%s,).*(?=,\sBenzin|,\sDiesel)' % model,keywords.group()… carinfo=carinfo+variant+’;’ except: print(“Could not retrieve variant”) carinfo=carinfo+”;” try: fuel=re.search(r’Benzin|Diesel’,keywords.group(),re.M).group() carinfo=carinfo+fuel+’;’ except: print(“Could not retrieve fuel”) carinfo=carinfo+”;” try: year=re.search(r'(?<=Benzin,|Diesel,)[^,]*(?=,)',keywords.group(),re.M).group()… except: print(“Could not retrieve year”) carinfo=carinfo+”;” try: avrprice=re.search(r'(?<= except: print(“Could not retrieve average price”) carinfo=carinfo+”;” try: Udstyr=re.search(r'(?)(.*?)… carinfo=carinfo+Udstyr+”;” except: print(“Could not retrieve equipment”) carinfo
Scraping Bilzonen.dk: Selenium vs AutoHotkeys
Selenium is marketed as an attractive multi-platform webscraping tool.However, the slow down due to overhead in communication phases may require optimization of Python code, an expensive and a time consuming process vulnerable to software changes and upgrades.
from selenium import webdriver url=’http://www.bilzonen.dk/brugt-bil/2499275/VW-Golf-Plus-2-0-TDi-Sportline-DSG/’ driver=webdriver.Firefox() driver.get(url) htmltext=driver.page_source
There are loads of webpages with tutorials on optimal Selenium development environments. Many are continuously updated, most require local customization of guidelines, software packages and operating system.
A data acquisition comparable to the initial attempt on Bilbasen.dk is obtained if automation is placed outside your Python environment. It may be your most import phase in the initial data extraction process and a faster alternative to Selenium.
Lets generate a list of urls to individual presentation pages in Python from initial websearch files gathered with AHK and Python.
resultpage=’0′ txtfile=open((‘Bilzonen’+resultpage+’.txt’),’r’,encoding=’utf-8′) listhtml=txtfile.read().replace(… link=re.findall(r'(?)’,listhtml,re.M) for i in range(2, 400): resultpage=str((i-1)*12) try: txtfile=open((‘Bilzonen’+resultpage+’.txt’),’r’,encoding=’utf-8′) listhtml=txtfile.read().replace(… txtfile.close() getlinks=re.findall(r'(?)’,listhtml,re.M) link=link+getlinks except: print(‘Error reading file Bilzonen’+str((i-1)*12)+’.txt’)
with open(‘VWpages.csv’, ‘w’) as csvfile: vwswriter = csv.writer(csvfile, delimiter=’;’,quotechar=’|’, quoting=csv.QUOTE_MINIMAL) count=1 for url in link: vwswriter.writerow([“UrlDownloadToFile, http://www.bilzonen.dk”+url+”, VW_”+str(count)+”.txt”])
Copy the generated file to a separate folder, compile and run using the right button menu.
Presentation pages subtracts are obtained within minutes and not hours in a comparison between AHK scripting and Selenium. Furthermore, obtained files contain needed variables in an easy-to-access format. A general rule-of-thumb would be to gather data source files beforehand and continuously outside the Python development environment and stick to the core of Python programming in combination with regular expressions in processing of data.