Distil Networks has released its latest report on web scraping. “The 2016 Economics of Web Scraping” (registration required) highlights how easy web scraping has become. It also claims that 46% of web traffic now comes from bots that steal data from websites by scraping them.
According to Rami Essaid, CEO and co-founder of Distil Networks: “If your content can be viewed on the web, it can be scraped. Not only does web scraping pose a critical challenge to a website’s brand, it can threaten sales and conversions, lower SEO rankings, or undermine the integrity of content that took considerable time and resources to produce. Understanding the pervasive nature of today’s web scraping economy not only raises awareness about this growing challenge, it also allows website owners to take action in the protection of their proprietary information.”
Social media companies struggling with screen scraping
With so many companies looking for content to fill web pages, screen scraping is on the increase. It is not just content that is stolen from websites. LinkedIn has recently launched a lawsuit to identify over 100 fake profiles that are scraping data. These profiles are used to connect to LinkedIn members and steal data such as their connections. Those details are then used by recruitment company HiringSolved to contact LinkedIn members and offer them jobs.
LinkedIn is not the only social media company with this problem. Facebook has been trying to crack down on screen scrapers who capture user IDs. Those IDs are then sold to marketing teams so that they can target the users with ads. They are also used to find out who is interacting with competitors' Facebook pages, which allows a company to directly target a competitor's customer base. The impact on Facebook is that companies no longer need to purchase its advertising. They can go after customers using targeted posts, which are more likely to get a response.
Screen scraping is big business
Distil Networks claims that 2% of online revenue is lost due to web scraping. It is difficult to get an accurate figure on what the global online revenue is. Online statistics company Statista reports that online retail sales in 2015 were more than US$1.5 trillion. 2% of this represents a staggering $30 billion.
However, Distil Networks is talking about all online revenue, which includes money spent on advertising and other things online. In its latest look at global Internet advertising revenue, PWC estimates over $180 billion will be spent in 2016. Companies are using screen scraping to steal content for their sites so that they can increase visitor numbers. If the same 2% loss is reflected in a redirection of advertising, it is costing companies over $3.6 billion in lost revenue.
Distil Networks says that the industries affected the most are real estate, digital publishing, travel, online directories, e-commerce, marketplaces and classified ad sites. For some industries, such as real estate and recruitment, losses can force companies out of business.
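The two loss figures quoted above come from applying Distil Networks' 2% claim to each revenue total. The short Python sketch below reproduces that arithmetic. It treats the "more than" and "over" figures as exact amounts, so the results are lower bounds, and applying the 2% rate to advertising spend is this article's extrapolation rather than a figure from the report. The function and constant names are illustrative only.

```python
from decimal import Decimal

# Distil Networks' claimed share of online revenue lost to web scraping.
LOSS_RATE = Decimal("0.02")

# Revenue figures quoted in the article, in US dollars.
# Both are stated as "more than" / "over", so they are treated as lower bounds.
ONLINE_RETAIL_SALES_2015 = Decimal("1.5e12")   # Statista: over $1.5 trillion
INTERNET_AD_SPEND_2016 = Decimal("180e9")      # PWC: over $180 billion


def estimated_loss(revenue, rate=LOSS_RATE):
    """Return the revenue lost to scraping at the given loss rate."""
    return revenue * rate


def in_billions(amount):
    """Convert a dollar amount to billions of dollars."""
    return amount / Decimal("1e9")


retail_loss = estimated_loss(ONLINE_RETAIL_SALES_2015)
ad_loss = estimated_loss(INTERNET_AD_SPEND_2016)

print(f"Retail sales lost:   ${in_billions(retail_loss):.1f} billion")
print(f"Ad revenue affected: ${in_billions(ad_loss):.1f} billion")
```

Changing `LOSS_RATE` or either revenue constant shows how sensitive these headline numbers are to the underlying assumptions.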
What does it cost and how much are people earning?
The report highlights one particular site: Guru.com. That site offers 1,800 web scraping services, according to the report. The cost of these services varies widely depending on the data that is gathered. It also depends on whether the scraping can be fully automated with bots or needs human intervention.
While some services are free, Distil Networks says that the average web scraping service costs as little as $3.33 per hour. Entire projects average as little as $135. At that price it should come as no surprise that companies are using it regularly. Many of the “customer” lists that are offered for sale on the Internet are created from scraped data. The companies offering the lists can do so at what seems a very low price because their data acquisition costs are so low.
What is interesting are the salaries that Distil Networks says web scrapers can earn. Although some earn as little as $3.33 per hour, others can earn up to $128,000 per year. This sets the average salary at around $58,000, enough to attract a lot of people into the industry.
Challenge or Opportunity
Marketing managers will rub their hands with glee at the thought of accurate targets. Web scraping is not illegal in itself and, with the compute power and analytics available, it will improve in accuracy. What Robocog Inc is doing is more questionable. Where companies are merely collating publicly available information, there is nothing wrong with that. Registering false identities, however, is a greyer area.
Marketing teams need leads, and will seek good leads wherever they can. Until scraping is either made illegal (unlikely) or deemed unethical (possible), it is a battle that companies, especially social media ones, will continue to face. Social media companies need to balance the privacy of the individual against the need to grow user numbers. If they make less information publicly available, they are less likely to grow.
What is clear from this research is that web scraping works and marketing teams should be considering it as a source of leads. As analytics improves, the chances are that cold-call leads are as likely to come from scraping as from historic lists. The more companies look to source leads in this way, the more revenue scraping generates and the greater the investment in the technology.
The cost of web scraping continues to fall. This means that companies need to plan strategies to protect the data on their websites. Currently many only protect IP held on servers. This will have to change to combat the number of bots attacking their websites and to protect their revenue streams.
It will be interesting to see what happens with the LinkedIn court case. If LinkedIn wins, it could lead to a change in tactics and new types of attacks by web scraping apps. At that point many companies will have to decide if having data publicly available is worth it. They will also need to create processes to validate any data on their websites. Best practice says that those processes should be in place today, but in the vast majority of cases they are not.
One solution is to only make data available to authenticated users. The problem is that users don’t like giving up their details to gain access to data. This can hurt website traffic, which is already suffering from the explosion of ad blockers that also block cookies. Sites that rely on advertising revenue are already being hurt by this. Paywalls have proven unpopular, with some national newspapers abandoning them in recent years.
For now, the best most companies can do is to include measures against web scraping in their security policies.
In an earlier version of this article we mistakenly linked the 2014 lawsuit between LinkedIn and Robocog Inc and the recent 2016 lawsuit against unnamed parties. We would like to make it clear that Robocog is not involved in the current lawsuit and continues to abide by the settlement it reached with LinkedIn. Our apologies to Robocog Inc.