Data Mining Bot Works!

A few months ago I started researching how to create a web spider/crawling bot that would go out to a specific site (which will remain nameless, since I want to keep using the site without being blocked), rip events and dates, and store them in a MySQL database. On top of that, I chose Python, a language I’d never used in my life. To take it a little further, I decided to use Django as the web development platform just to see how easy it is.

After searching around the net for web spider and crawler examples, I pieced together a decent-looking script that uses a user-agent function to fool the servers into thinking I’m a legit web browser accessing their pages. This was the first step. Fortunately, many others have gone before me in exploring web crawlers, so the code was easy to find. So I did what every good developer does: I copied and pasted the function into my script and modified my code to work with it.
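Since I can't share the real function, here is a minimal sketch of the user-agent idea using only Python's standard library. The user-agent string and the names `build_request` and `fetch` are made up for illustration; the function I actually pasted in may have looked quite different.

```python
import urllib.request

# Hypothetical browser-style string; not the one used by the real bot.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-style User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("http://example.com/")` would then return the page's HTML as a string, ready to hand to a parser.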

I then discovered BeautifulSoup. BeautifulSoup lets you “prettify” HTML as well as parse it, and it handles XML too. Very handy if you need to pull data out of tables or lists. I ran into some issues where I didn’t have the latest and greatest version and it would error out halfway down a page of HTML, which usually happens when the website’s HTML is very ugly, but it was fixed after installing the update. Tip: When looking for data that’s in a table or list, do a find and store the result in a variable, then do a findAll on that variable to home in on the list or table in question. To work out what to search for, open the page in Google Chrome, highlight the area you want, right-click it, and click “Inspect element”. Firefox has web developer tools/plugins as well.
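To make the find-then-findAll tip concrete, here is a rough sketch against made-up HTML. The table id `events`, the sample rows, and the function name `extract_events` are all invented for the example, and it assumes the current `bs4` package, where `find_all` is the newer spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page, which stays nameless.
SAMPLE_HTML = """
<html><body>
  <table id="nav"><tr><td>Home</td><td>About</td></tr></table>
  <table id="events">
    <tr><th>Event</th><th>Date</th></tr>
    <tr><td>Open Mic Night</td><td>2010-02-12</td></tr>
    <tr><td>Trivia</td><td>2010-02-15</td></tr>
  </table>
</body></html>
"""


def extract_events(html):
    """Return (name, date) tuples from the table with id="events"."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find the one table we care about and keep it in a variable.
    table = soup.find("table", id="events")
    if table is None:
        return []
    events = []
    # Step 2: search only inside that table, so the nav table is ignored.
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:  # the header row has <th> cells, so it is skipped
            events.append((cells[0], cells[1]))
    return events
```

Searching from the stored `table` rather than from the whole page is what keeps the unrelated navigation table out of the results.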

Initially, I was storing the results in a text file and a CSV file and running a MySQL command to import them directly. Once I had the format the way I wanted, I switched to Django’s models, which let you store data without writing SQL statements. This cuts down on frustration and on security risks such as SQL injection. Django connects using the database settings you entered when you first set it up.
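The CSV stage could be sketched like this with Python's built-in `csv` module. The column names `name` and `event_date`, the sample rows, and the function name `write_events_csv` are assumptions for illustration, not my real schema.

```python
import csv
import io


def write_events_csv(events, out):
    """Write (name, date) tuples to a file-like object as CSV with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "event_date"])  # hypothetical column names
    writer.writerows(events)


# Usage: an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_events_csv(
    [("Open Mic Night", "2010-02-12"), ("Trivia, Round 2", "2010-02-15")],
    buffer,
)
```

Letting the `csv` module do the writing means fields that contain commas get quoted automatically, which matters once the file is handed to a bulk import such as MySQL's `LOAD DATA INFILE`.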

I must apologize, though. I wish I could show you code, but this is a project that I want to keep quiet for now. At the moment, I have my script running on a server as a cron job. I might add some timers to the code so the requests seem more random, but the site might just end up blacklisting me if the bot causes too much of a usage spike (which I don’t think it does). I still haven’t figured out views in Django to return the results that I need. Mostly I haven’t read enough on the Django framework to wrap my head around it yet.
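The random timers I have in mind might look something like the sketch below. The name `polite_pause` and the 2 to 8 second range are placeholders I picked for the example, not values I have tested against the site.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=8.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration between the two bounds and return it."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)  # the sleep function is injectable so tests need not wait
    return delay
```

Calling `polite_pause()` between page requests would spread them out unevenly instead of hitting the server in one tight burst.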

As of right now, the bot stores about 3500 entries in the database without hanging. Keep in mind, the site just has to change its structure and my code becomes worthless.

References:
VirtualBox
Ubuntu
Google Chrome
Python
BeautifulSoup
Django

Update: First time I ran it, I had 3400 entries. This time I had 7400.

iPad Taking on Criticism

No one really knows what the iPad is intended for. It reminds me of… a giant iTouch, if you will. Complaints have taken the internet by storm: no USB, no video out, no multitasking, and a touchscreen keypad, for starters. I would wait until this year’s netbooks hit the shelves, since some are sporting dual-core chips and 720p mini-HDMI out. Here are some videos as seen on digg/reddit/slashdot tech sites. I’m far from interested in this product, but I am not an anti-Apple guy. I personally own an iPhone 3GS, a Mac Mini (instead of the lame Apple TV), and a MacBook Pro. I also run 64-bit Ubuntu 9.10 and 64-bit Windows 7 Ultimate on my main desktop.

What have I been up to?

Most likely I’ve been up to no good. Since my luck has run dry in the job hunt, I’ve taken the initiative to create something that hasn’t been made yet. I can’t tell you what the project is exactly, because I don’t want assholes ripping me off, but I’m probably about 80% done with the short-term goals I’ve set and 20% done with my long-term goals. Maybe I can make some cash off this to pay rent… but I doubt it. Ramblers, keep rambling.

Watch until about 2:20 in.

Video: The Colbert Report, “Move Your Money – Eugene Jarecki” (www.colbertnation.com)

Happy New Year…

This year will be better than the last.

Tech Deal Websites

I’m a consumer whore sometimes when it comes to technology. Here are a few sites that I look at on a daily basis.
SlickDeals
Deal News

Kevin Smith is the man.

Funny Twilight comments.

Miles Fisher – This Must Be The Right Place (cover)

Crazy music video of a guy who is a mix between Christian Bale and Tom Cruise reenacting the entire American Psycho movie in under 5 minutes. I think the video was done very well, but who really cares what I think.

A fresh new look.

After months of this domain sitting empty, I’ve decided to redesign it. I’ve been messing with Photoshop recently and created the banner. Yes, I know the font doesn’t match the rest of the design. It still looks cool. I will be tweaking the site over the next month or two, but I highly doubt that anyone will stumble across it. Now to find some cool plug-ins.