A rip roaring affair

So sometimes you get asked unusual questions. In my inbox this morning was a letter from a nice person called Neyma. I won’t publish all of it, just the bit relevant to this post.

but for now, I have a theoretical question that you may be inclined to
answer -

Is it possible to rip the entire SU database? Can it be done in
one-click or would a bot have to be set? How hard would this be? what
sort of information could we get?
what about for some of the other Social sites?

For Neyma: not all, bot, not easy, stuff, yes.

Now I’m sure my emailer has a perfectly reasonable reason for theorising about wanting to rip the entire StumbleUpon database, but I have a feeling eBay would not be impressed. So I thought I would discuss content scraping, how it’s done, and suggest some ways to prevent it.

Scraping or ripping

Web scraping is when a script is used to store information from a site or feed over HTTP. Now this may sound rather nasty, but by far the largest scraper on the Internet is the Googlebot, which “caches” copies of your site when it crawls. That said, there are many people who think that Google’s caching could be in the nefarious, if not nasty, category. Ripping normally refers to software ripping, which is a byte-by-byte copy and would require server and database access. Given that a hacker with that sort of access could simply copy the files, when most people refer to web ripping they are talking about scraping.

Limitations of scraping

Scraping is limited only to publicly accessible information, so if a human can get it, so can a bot. This means that a bot could scrape any non-password-protected HTML page, any database-generated content that returns results, and probably far more things than you could think of :)

In the above I mentioned passwords, but of course if the bot is programmed to bypass the password then it can get there too. Oh, don’t look so melodramatic, it’s not that bad!

Before we go further I’m going to demonstrate some simple scraping, and a useful PHP script to boot.

Standard Venture Skills disclaimer – This is for demo purposes. Scraping without permission is bad, OK? Very naughty, and it could land you in court! What we are doing won’t, but it could result in your IP being locked out from Google if you abuse it!

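<?php
// Ask Google to convert $amount of $currency into $exchangeIn and scrape the answer.
// Returns the converted amount as a string, or false if nothing could be extracted.
function exchangeRate( $amount, $currency, $exchangeIn )
{
    $googleQuery = $amount . ' ' . $currency . ' in ' . $exchangeIn;
    $googleQuery = urlencode( $googleQuery );
    $ch = curl_init();
    $timeout = 5; // set to zero for no timeout
    curl_setopt( $ch, CURLOPT_URL, 'http://www.google.com/search?q=' . $googleQuery );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 ); // return the page instead of printing it
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    $file_contents = curl_exec( $ch );
    curl_close( $ch );
    if ( $file_contents === false ) {
        return false; // the request itself failed
    }
    $file_contents = strip_tags( $file_contents );
    $matches = array();
    // Capture the run of digits, dots, commas and spaces that follows "= "
    preg_match( '/= (([0-9]|\.|,|\ )*)/', $file_contents, $matches );
    $result = isset( $matches[1] ) ? trim( $matches[1] ) : '';
    return $result !== '' ? $result : false;
}
?>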
exchange_rate.php

Guessed what it does yet? It’s a simple currency converter: it uses Google’s conversion search to take our amount and convert it to whatever currency we want. The code is PHP and uses cURL; you could have used fopen or similar, but I like cURL.
<?php require( 'exchange_rate.php' ); echo '£50 ($' . exchangeRate( 50, 'GBP', 'usd' ) . ')'; ?>

This code is used to call the conversion in your document. In the above example we wish to display £50 and its equivalent in dollars.
It does so by requesting a URL like http://www.google.com/search?hl=en&q=50GBP+in+USD from Google search; this returns a single result, which is easy to strip of content to get the answer.
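
To see the stripping step on its own, without hitting Google at all, here is a rough sketch that applies the same pattern to a string. The helper name extractConvertedAmount() is made up for this example, and the sample text is only an assumption about what the stripped result might look like; the real wording may well differ.

```php
<?php
// Hypothetical helper: the same pattern exchangeRate() uses, applied to text
// that has already been fetched and passed through strip_tags().
function extractConvertedAmount( $text )
{
    $matches = array();
    // Capture the run of digits, dots, commas and spaces that follows "= "
    if ( ! preg_match( '/= (([0-9]|\.|,|\ )*)/', $text, $matches ) ) {
        return false;
    }
    $amount = trim( $matches[1] );
    return $amount !== '' ? $amount : false;
}

// Assumed sample only; the real wording of Google's result may differ.
$sample = '50 British pounds = 99.1850 U.S. dollars';
echo extractConvertedAmount( $sample ), "\n";
```

The pattern simply grabs whatever numeric-looking run follows the first "= " it finds, which is why this sort of scraper breaks the moment the page layout changes.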

RSS Scraping

Now most scrapers don’t actually scrape HTML sites; it’s far too time consuming, with a lot of custom programming needed to get the right data. No, they go for the XML feed. These feeds have been standardised so they can be parsed more easily; the scraper simply parses the feed, caches it, then republishes it on a blog somewhere with a large quantity of adverts. This sort of scraping is how most MFA (made for AdSense) sites are generated. The more complicated part is not the scraping but finding the feeds, and that takes only a few minutes. After all, how many of us use PingShot services to contact feed directories?
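
To show just how little work a standardised feed takes to parse, here is a minimal sketch using PHP’s SimpleXML on an RSS 2.0 string. The function name parseFeedItems() and the sample feed are invented for illustration, and it only handles plain RSS 2.0, not Atom or namespaced extras.

```php
<?php
// Hypothetical helper: pull the title and link out of every <item>
// in an RSS 2.0 document supplied as a string.
function parseFeedItems( $xml )
{
    libxml_use_internal_errors( true ); // keep malformed XML from printing warnings
    $feed = simplexml_load_string( $xml );
    if ( $feed === false || ! isset( $feed->channel ) ) {
        return array(); // not parseable as RSS 2.0
    }
    $items = array();
    foreach ( $feed->channel->item as $item ) {
        $items[] = array(
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        );
    }
    return $items;
}

// Made-up sample feed, for illustration only.
$sampleFeed = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://example.com/first</link></item>
    <item><title>Second post</title><link>http://example.com/second</link></item>
  </channel>
</rss>
XML;

foreach ( parseFeedItems( $sampleFeed ) as $item ) {
    echo $item['title'], ' => ', $item['link'], "\n";
}
```

Compare that with the regular expression needed for the HTML example and you can see why scrapers prefer feeds.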

Is feed scraping a problem?

From an SEO perspective, feed and site scraping present little to no problem. Google is now indexing pages very quickly and is pretty good at sorting out which is the original blog and where the content came from. Duplicate content is not the issue it once was, and so Google just dumps the splog content into the sin bin.

It can be a problem for brand new sites which Google has not yet indexed but whose feeds are already being scraped, which can lead to Google deciding that the publishing blog is the splog. Such scenarios are very rare, though, and pretty easy to solve.
The bigger problem is a reputation management issue: what if someone saw your amazing post on a splog and subscribed to their RSS feed? Or worse, thought that it was so good that “I need to buy some Viagra off that guy!”

Other problems include hotlinking issues with images, but again this is easy to sort.
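
Hotlinking is usually dealt with in the web server configuration, but the idea behind it can be sketched in PHP as a check on the Referer header. The function name isHotlink() and the hostnames below are assumptions for the example, and bear in mind the header is easily faked or missing, so treat this as a deterrent rather than real protection.

```php
<?php
// Hypothetical helper: decide whether an image request looks like a hotlink,
// based on the Referer header the browser sent.
function isHotlink( $referer, $allowedHosts )
{
    if ( $referer === null || $referer === '' ) {
        return false; // many browsers and proxies omit the header, so let these through
    }
    $host = parse_url( $referer, PHP_URL_HOST );
    if ( ! is_string( $host ) ) {
        return true; // a Referer we cannot parse is treated as foreign
    }
    return ! in_array( strtolower( $host ), $allowedHosts, true );
}

$allowed = array( 'example.com', 'www.example.com' ); // assumed: your own hostnames
var_dump( isHotlink( 'http://www.example.com/post', $allowed ) );   // request from your own page
var_dump( isHotlink( 'http://splog.example.net/copy', $allowed ) ); // request from someone else's page
```

A script serving images could call this with the incoming Referer value and send a placeholder image whenever it returns true.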

Fighting back

Hosted Blogs
Bloggers using Blogger or WordPress.com have a much reduced arsenal when it comes to scraping, as you have almost no control over your feed. The simplest method is therefore to sign your posts, something like:

Blog post by Tim Nash at the Venture Skills Blog

You can get more exotic and create a page footer like we have here, for example. The key is to not only name the blog but link to it; if you want to get really clever, link to the home page and, when published, edit the link to be the post permalink. This means that at the very least the splog is providing a link back. Just one small thing: if you’re at WordPress.com and use the more tag, your feed is cut at that point, so you will want to put the link above that line ;) or it won’t be picked up. There is some more pretty generic advice for WordPress.com users at Lorelle’s blog.

Self hosted
The options for self hosted are much better than for hosted blogs: you control your feed and so can add and modify content. Some ideas to consider include:
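
One way to use that control is to have the feed sign every post automatically with a link back to the original. Below is a rough sketch of the idea; signFeedContent() is a made-up name, the permalink is a placeholder, and how you wire it into your blog software’s feed output is left out because it depends entirely on the platform.

```php
<?php
// Hypothetical helper: append an attribution line with a permalink to a post body
// before it is written into the feed.
function signFeedContent( $content, $author, $blogName, $permalink )
{
    $signature = sprintf(
        '<p>Blog post by %s at <a href="%s">%s</a></p>',
        htmlspecialchars( $author, ENT_QUOTES, 'UTF-8' ),
        htmlspecialchars( $permalink, ENT_QUOTES, 'UTF-8' ),
        htmlspecialchars( $blogName, ENT_QUOTES, 'UTF-8' )
    );
    return $content . "\n" . $signature;
}

echo signFeedContent(
    '<p>Post body goes here.</p>',
    'Tim Nash',
    'the Venture Skills Blog',
    'http://example.com/a-rip-roaring-affair' // placeholder permalink
), "\n";
```

Because the signature lives in the feed rather than the post, a scraper republishing the feed carries the link back with it, while readers on the site itself never see it.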

Copyright

If you really feel that your business is being hurt you can go the legal route. I suggest that you don’t contact the site but go straight for the jugular with a cease and desist at the host; most hosting companies do not want the hassle of fighting takedown notices and put up very little fight. The SEOmoz guys did a Whiteboard Friday on takedown notices in the States, and the information is pretty universal. That said, this strategy should really be a last resort: most scrapers do no damage and, apart from a pride thing, are not harming you. They are yet another part of the underside of the web that I’m afraid we will have to put up with.

If anybody else has suggestions for fighting scrapers why don’t you leave a comment ;) and let us all know the secret.


5 Responses to “A rip roaring affair”

  1. webdigity Says:

    Nice post. BTW if you are going to do automated queries from Google be sure to add this line of code :

    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

  2. Rekport Says:

    Good issue and good script.

  3. Tim Nash Says:

    Cheers gents, and Nick, you’re right: sending out a user agent is a smart move, and we covered using user agents previously. Most of my scrapers are currently being iPhones, as sites are slowly modifying their pages to give a much more image-reduced version to mobile devices, which reduces the overheads.

  4. Bobby Revell Says:

    This is a fascinating post! You have me so interested in your ideas and content I went and bought 6 new books yesterday!

    I clicked your subscribe button to make sure I had actually done it. I use bloglines because it’s fast and I like the way it works. When I looked at my account, the over 500 blogs I had previously subscribed to, were gone!

    I had 12 blogs listed I subscribed to. If I just read about some hack, it seems like something weird happens! Go figure! Now I have to go resubscribe to all my regular reads!

  5. Tim Nash Says:

    Thanks Bobby, sorry to hear about your Bloglines problem; I’m sure you will have fun re-finding all those sites.

    By the way, and this is an open call to anyone: if there is a subject that you would like us to cover, then why not drop us a line.

