If the big AI companies keep "stealing" like this, we may stop seeing free websites

2023-07-12 02:02:10

Original source: Bad review

Image source: Generated by Unbounded AI

A few days ago, Google suddenly updated its privacy policy, making it clear that it will use all public data on the Internet to train its own AI model.

In other words, according to the new policy, any information you post publicly on the Internet may be crawled by Google, including but not limited to your posts, keywords you search for, and videos you watch.

Isn't this just running naked across the Internet?

OpenAI had only just been sued over data infringement, and yet Google rushed straight into the line of fire.

At this juncture, the move most likely has everything to do with data charges. If Google doesn't shear this round of free wool now, it may well not get the chance later.

The fight over data has not let up since ChatGPT took off.

Shichao will walk you through the timeline first.

In March of this year, Musk fired the first shot on data charges, declaring that Twitter's API would no longer be free.

Soon afterwards, Reddit, the American counterpart of Baidu Tieba, could not hold back either.

Last month's Reddit "blackout" was a protest against the company's API pricing policy.

When Shichao wrote about this before, he was still guessing whether Reddit would back down in the end.

Judging by what has happened since, most third-party apps have confirmed they are shutting down, and Reddit is determined to charge for data.

Meanwhile, Twitter adjusted its rate limits again: accounts that have not paid for verification can read only 600 posts per day. The stated purpose is likewise to stop bots from scraping user data.
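
To make the idea of a daily read cap concrete, here is a minimal sketch of a per-account quota counter. It is only an illustration: the class name `DailyReadQuota` and its method are hypothetical, and nothing here reflects how Twitter actually implements its limits. The only figure taken from the article is the 600-posts-per-day cap.

```python
from collections import defaultdict
from datetime import date

DAILY_READ_LIMIT = 600  # figure quoted in the article for unverified accounts


class DailyReadQuota:
    """Counts post reads per account per calendar day (hypothetical helper)."""

    def __init__(self, limit=DAILY_READ_LIMIT):
        self.limit = limit
        self._counts = defaultdict(int)  # (account_id, day) -> reads so far

    def try_read(self, account_id, today=None):
        """Return True and count the read if the account is under its limit."""
        day = today or date.today()
        key = (account_id, day)
        if self._counts[key] >= self.limit:
            return False  # quota used up for this day
        self._counts[key] += 1
        return True
```

A cap like this costs an ordinary reader very little, but a bot trying to pull millions of posts would need a huge number of accounts, which is the point of the measure.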

Is data so valuable?

Shichao thinks the blame still lies with **AI.**

For a large AI model to get smarter, it needs a steady stream of data to "feed" on.

Those able to build large models today either have their own data, like Baidu, Alibaba, and Tencent, or crawl other people's data, and here OpenAI gets named.

Because many websites offered open, free APIs, giants such as Microsoft and OpenAI had an opening.

But times have changed. Now that AI has given data new value, platforms holding these bargaining chips are of course unwilling to be freeloaded on.

Even Reddit CEO Steve Huffman put it bluntly: he simply does not want to hand data to the giants for free.

So the lawsuit against OpenAI probably reflects the platforms joining forces to "kill the chicken to scare the monkey", making an example of OpenAI to curb the AI industry's bad habits.

However, it is hard to say whether the law will stand on OpenAI's side this time.

That is because data copyright involves three key questions:

**1. Is the data crawling itself legal?**

**2. Is the data protected by copyright?**

**3. Are works generated from the data protected by copyright?**

Start with the first question. There are only two ways to obtain data: pay for it, or collect what is publicly available on the Internet.

Note, however, that publicly available data is not the same as data authorized for use. It also depends on whether the website has terms that restrict crawlers.

Skipping the copyright owner's consent, or forcibly obtaining data by bypassing a website's restrictions, constitutes the crime of illegally obtaining computer information system data.

So even if OpenAI says it only crawls public websites, whether the crawling itself is legal depends on whether the copyright owner has authorized it.
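
One common machine-readable way for a website to state crawler restrictions is a `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` to check whether a given crawler may fetch a URL. The rules, the bot name `ExampleAIBot`, and the `example.com` URLs are all made up for illustration, and `is_allowed` is a hypothetical helper. Note that `robots.txt` is only a convention: passing this check is not the same as having the copyright owner's authorization, and a site's terms of service may impose further limits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory rules, no network
    return parser.can_fetch(user_agent, url)


# The made-up AI crawler is blocked everywhere; other agents only from /private/.
print(is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleAIBot", "https://example.com/posts/1"))
print(is_allowed(EXAMPLE_ROBOTS_TXT, "SomeBrowser", "https://example.com/posts/1"))
```

A crawler that ignores such rules, or works around technical blocks, is exactly the "bypassing the website restrictions" case described above.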

Second, there is the question of whether the data itself is protected by copyright.

According to US copyright law, if the data used for AI model training falls within the scope of "fair use", it will not constitute infringement.

But the problem lies precisely in "fair use".

The factors weighed in a "fair use" analysis include whether the use is commercial, the nature of the copyrighted work, how much of the work is used, and the effect of the use on the work itself.

In news reporting and academic research, for example, appropriate quotation is perfectly fine.

But can using data on the scale of hundreds of millions of items for AI models and commercial AI software still count as "fair use"?

Finally, there is the copyright question for AI-generated works.

Because the copyright status of training data is unclear, AI-generated content naturally gives rise to copyright disputes. A few days ago, Steam removed a game made with AIGC, citing copyright concerns.

Take AI painting as an example. Image generation amounts to a process of splitting apart and recombining. Although the final result is entirely "new", it still retains some characteristics of the training images.

Whether this counts as infringement, however, is something countries disagree on.

Because the training data belongs to others, the U.S. Copyright Office determined that the works generated by AI are not protected by copyright law, and may even infringe copyright.

The Japanese government takes a very different stance, saying that Japanese law does not protect the copyright of data used for AI training.

At least under the current legal framework, it is difficult to get a unified answer to the above questions.

Since regulation is not yet strong enough, copyright owners have no choice but to act themselves: charge where fees are due, and quickly take back what should be taken back.

▼ Filing from the lawsuit against OpenAI

It is foreseeable that after Twitter and Reddit, more content owners will put up high walls.

For the platforms, this is of course a new way to make money. For the tech giants, the worst case is simply spending more money.

But for the Internet as a whole, it is not a good thing.

The Internet was born with open sharing in its genes. Wikipedia and Twitter, for instance, offered free APIs for years, which made it very convenient for developers to access data.

But if data charges are allowed to roll out like this, it is hard to say what the result will be.

After all, small developers cannot afford huge data fees. If innovation happens only at the giants, isn't that a monopoly, plain and simple?

Most importantly, many websites that are free to view today may require payment later. For ordinary users like us, that is the real critical hit.

In fact, data charges cannot be blamed entirely on the platforms. They have genuinely grown afraid of being "robbed" by the AI giants, and charging is a reluctant act of self-protection.

Although Google has its "privacy policy" to point to this time, the outcome is hard to predict.

So the key question is when the regulatory hammer will fall.

Clarifying data copyright is a hurdle that cannot be avoided in the development of AI, and now, it seems to be also related to the future direction of the Internet.

Will the ship of AI carry us into a more open era, or a more closed one?
