In the era of digital transformation, web scraping by artificial intelligence (AI) models like ChatGPT has become a subject of ethical debate and growing concern for businesses, especially in the real estate sector. A recent incident has thrown these ethical considerations into sharp focus.
OpenAI, the creator of ChatGPT, has deployed GPTBot, a web crawler that scrapes content from publicly accessible websites by default, without seeking permission first.
While some argue that such practices can enrich AI training for better predictive accuracy, the lack of an opt-in system creates an ethical quagmire and poses serious risks for businesses.
The release of GPTBot follows earlier criticism of OpenAI for scraping data without permission to train Large Language Models (LLMs) like ChatGPT. To address those concerns, the company updated its privacy policies in April.
Legal Grey Area and Competitive Risks
Given that there are currently no comprehensive laws or regulations stipulating the usage restrictions of data obtained through such scraping activities, you essentially lose control over how that data is subsequently used or manipulated. For example, a third-party platform could theoretically harness this data to build a similar or competitive service, thereby diluting your market share and unique value proposition.
In a sector like real estate, where exclusive listings and market insights are crucial, this could be disastrous. Thus, the absence of legal safeguards creates a competitive disadvantage that may be irrevocable, underscoring the need for regulatory oversight and informed consent in AI data scraping practices.
The primary ethical issue in this scenario is that of consent. Web scraping by AI models is usually an opt-out system, meaning the default is for the bot to scrape your website’s data unless you take active steps to block it. This lacks transparency and forces businesses to take technical measures to protect their information from third parties.
Web scraping bots like GPTBot can potentially access and store sensitive data, including off-market listing data. This poses a risk for real estate agents who may not want that information publicly available, let alone used for AI training purposes.
Other Potential Risks for Real Estate Businesses
Web scraping bots can be resource-intensive, monopolising CPU and bandwidth resources to crawl through websites. This can slow down a site’s performance and impact real users who are attempting to access information or services.
For every additional second your website takes to load, your conversion rate decreases.
If the scraped data is used in an open AI model, competitors could benefit from the unique insights and listings specific to a particular real estate agency. The problem is compounded when AI tools offer sophisticated market analysis built on ingested data that is proprietary and central to your own competitive advantage.
Third-party bots can sometimes access areas of a website that are meant to be hidden or secure. In the case of real estate agencies, this might include off-market listings (or potentially client data if your web host is lacking the appropriate security measures).
What Companies Need to Do
Take Technical Measures
Companies that want to protect their data and stop GPTBot from scraping their website will need to update their robots.txt file or set up a firewall to block these AI bots. Ask your website team for help if you’re not sure how.
OpenAI’s support page for GPTBot explains how to block it from crawling a site. All you need to do is add the following lines to your site’s robots.txt file:
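Per OpenAI’s published GPTBot documentation, the robots.txt block looks like this:

```txt
User-agent: GPTBot
Disallow: /
```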
OpenAI also states that admins can restrict GPTBot’s access to specific areas of a site using Allow and Disallow tokens in robots.txt:
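OpenAI’s documentation gives the following example, where the directory names are placeholders for paths on your own site:

```txt
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
```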
Although you can modify robots.txt in this way, it is not clear whether doing so will completely prevent your content from ending up in training data, particularly anything the bot has already collected.
Adopt an Opt-In Approach
We have taken the measure of blocking GPTBot at the firewall level, which prevents it from accessing any website we host. Clients who wish to opt in and allow the bot to scrape their website can have this block removed on request, making the default opt-in rather than opt-out.
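To illustrate what a server-level block can look like (this is a generic nginx sketch with a placeholder domain, not our exact firewall configuration), a site can refuse any request whose User-Agent header identifies GPTBot:

```nginx
server {
    listen 80;
    server_name example.com;

    # Return 403 Forbidden to requests identifying as GPTBot.
    # A crawler that spoofs its User-Agent will still get through,
    # so treat this as a first line of defence, not a guarantee.
    if ($http_user_agent ~* "GPTBot") {
        return 403;
    }
}
```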
An opt-in approach will give businesses the control and choice over their data. It is more ethically sound and allows companies to better assess the risks and rewards of contributing to AI training.
Conduct Regular Audits
Businesses should regularly audit their website logs to detect any unauthorised scraping and take prompt action.
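As a starting point for such an audit, a short script can scan a standard access log for known AI crawler user-agents and report hit counts. This is a minimal sketch that assumes combined-format logs and an illustrative, non-exhaustive bot list:

```python
import re
from collections import Counter

# User-agent substrings to flag; extend this list as new crawlers appear.
AI_BOTS = ["GPTBot", "CCBot", "anthropic-ai"]

def audit_log_lines(lines):
    """Count requests per flagged bot in combined-format access log lines."""
    hits = Counter()
    for line in lines:
        # The user-agent is the last quoted field in combined log format.
        quoted = re.findall(r'"([^"]*)"', line)
        ua = quoted[-1] if quoted else ""
        for bot in AI_BOTS:
            if bot.lower() in ua.lower():
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [10/Aug/2023:10:00:00 +0000] "GET /listings HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36; compatible; GPTBot/1.0; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/Aug/2023:10:00:01 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0)"',
]
print(dict(audit_log_lines(sample)))  # {'GPTBot': 1}
```

Running a check like this over a day’s logs makes unexpected crawler traffic visible quickly, so you can block it before it consumes significant bandwidth.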
While AI has immense potential to revolutionise real estate, ethical considerations cannot be sidelined.
Consent and data privacy should be at the forefront of AI’s expansive role in today’s digital world. By being aware and taking the necessary steps, businesses can safeguard their interests while still exploring the benefits that AI has to offer.
What do you think?
Are you comfortable allowing AI tools to scrape your website without your knowledge or consent? Leave a comment below.