So a few months ago I started researching how to create a webspider/crawling bot that would basically go out to a specific site (which will remain nameless since I want to keep using the site without being blocked), rip events/dates and store them in a database (mysql). On top of that, I chose to use the python language, which I’ve never used in my life. To take it a little further, I decided to use Django as a web development platform just to see how easy it is.
After searching around the net for webspider and crawler examples, I pieced together a decent looking script that uses a user-agent function to fool the servers into thinking I’m a legit web browser accessing their webpages. This was the first step. Fortunately, many others have gone before me in exploring webcrawlers, so the code was easy to find. So I did what every good developer does: I copied and pasted the function into my script and modified my code to work with it.
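For illustration, a user-agent helper along these lines might look like the sketch below. It is not the function from this project: it assumes Python 3 and the standard library’s urllib.request, and the names build_request and fetch_html, the user-agent string and the example.com URL are all made up for the example.

```python
import urllib.request

# Hypothetical browser-like User-Agent string; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download a page and return its body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_html does.
request = build_request("https://example.com/events")
print(request.get_header("User-agent"))
```

Calling fetch_html("https://example.com/events") would then return the page source as a string, ready to be handed to a parser.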
I then discovered BeautifulSoup. BeautifulSoup lets you “prettify” html as well as parse it, and it handles xml too. Very handy if you need to pull data out of tables or lists. I ran into some issues where I didn’t have the latest and greatest version and it would error out halfway down a page of html, but that was fixed after installing the update. This usually happens when the website’s html is very ugly. Tip: When looking for data that’s in a table or list, do a find first and store the result in a variable. Then do a findAll on that variable to home in on the rows or items of the list or table in question. To work out what to search for, use google’s web browser, chrome, and highlight the area you want to find. Right click on the area and click “inspect element”. Firefox has web developer tools/plugins as well.
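The find-then-findAll tip can be sketched roughly as follows. This is only an illustration on made-up html, not the markup of the real site: it assumes the bs4 package (BeautifulSoup 4), where find_all is the current spelling of findAll, and the table id “events”, the sample rows and the function name extract_events are all invented.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; only the second one holds the data we want.
SAMPLE_HTML = """
<html><body>
<table id="nav"><tr><td>Home</td></tr></table>
<table id="events">
  <tr><th>Event</th><th>Date</th></tr>
  <tr><td>Open Mic Night</td><td>2011-03-04</td></tr>
  <tr><td>Farmers Market</td><td>2011-03-05</td></tr>
</table>
</body></html>
"""


def extract_events(html):
    """Return (name, date) tuples from the table with id="events"."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: find the one table we care about and keep it in a variable.
    table = soup.find("table", id="events")
    if table is None:
        return []

    # Step 2: search only inside that table for its rows and cells.
    events = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:  # the header row has <th> cells, so it is skipped
            name = cells[0].get_text(strip=True)
            date = cells[1].get_text(strip=True)
            events.append((name, date))
    return events


events = extract_events(SAMPLE_HTML)
print(events)
```

Narrowing to one element first keeps the second search from picking up rows from unrelated tables elsewhere on the page, such as the navigation table in the sample.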
Initially, I was storing the results in a text file and a csv file and running a mysql command to import the results directly. After I had the format the way I wanted, I switched to Django’s models, which let you store data without writing sql statements by hand. This cuts down on frustration and on security threats such as sql injection. Django connects using the database info that you put in when you initially set it up.
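The intermediate csv step could look something like the sketch below. It uses only Python’s standard csv module; the column names, the sample events and the function name events_to_csv are assumptions made for the example, not the real data or code.

```python
import csv
import io


def events_to_csv(events):
    """Serialize (name, date) tuples to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "date"])
    writer.writerows(events)
    return buffer.getvalue()


# Made-up sample data; the second name contains a comma on purpose.
sample_events = [
    ("Open Mic Night", "2011-03-04"),
    ("Jazz, Blues & More", "2011-03-05"),
]

csv_text = events_to_csv(sample_events)
print(csv_text)
```

Letting the csv module do the quoting matters once an event name contains a comma, which would otherwise shift the columns during the import. Writing csv_text to a file gives something a bulk import such as MySQL’s LOAD DATA statement can read.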
I must apologize though. I wish I could show you code but this is a project that I want to keep quiet for now. At the moment, I have my script running on a server as a cronjob. I might add some timers to the code so the requests seem more random, but the site might just end up blacklisting me if the bot causes too much of a usage spike (which I don’t think it does). I still haven’t figured out views in Django to return the results that I need. Mostly I haven’t read enough on the Django framework to wrap my head around it yet.
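The timer idea could be as simple as a random pause between page requests. The sketch below is one possible way to do it, not something the script does today: the 2 to 8 second range and the names random_delay and crawl are assumptions, and fetch stands for whatever function downloads a page.

```python
import random
import time


def random_delay(min_seconds=2.0, max_seconds=8.0, sleep=time.sleep):
    """Pause for a random time in [min_seconds, max_seconds] and return it.

    The sleep argument exists so the pause can be swapped out in tests.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random amount between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            random_delay(sleep=sleep)
        pages.append(fetch(url))
    return pages
```

Spreading the requests out this way makes the cronjob run longer, but it flattens whatever usage spike the bot causes.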
As of right now, the bot stores about 3500 entries in the database without hanging. Keep in mind, the site just has to change its structure and my code becomes worthless.
References:
VirtualBox
Ubuntu
Google Chrome
Python
BeautifulSoup
Django
Update: The first time I ran it, I had 3400 entries. This time I had 7400.
