The Wall Street Journal and the New York Post recently filed a lawsuit against Perplexity, alleging that Perplexity is committing copyright infringement and is tarnishing the name of both news outlets. I read through all 42 pages of the complaint and there were some interesting statements made. If this case makes it to a jury trial, it could become case law for similar cases and could change how AI is developed and used. If you'd like to read the entire complaint, the article I linked to gives access to the document.
If you've not used Perplexity before, a good way to describe it would be as an AI search engine. You can use it for free or sign up for a pro account for $20/month. The free search ability is nice, whereas the paid option gives more details and lets you pick the LLM. I recommend giving it a try. Perplexity positions itself as a way to search for good information without getting trash for results.
I recommend watching the video above to get a good idea of what's going on with Google Search. The quality of it has gone to pot - much of which is a problem with how Google manages its product. For example, if you search for reviews of good VPNs, you'll see a pile of results showing "reviews." When you click on nearly any of them, you'll see half a dozen VPNs listed with sloppy reviews that all look similar. When you hover over links to the providers, you can see they're affiliate links. The hyper-optimized search engine optimization (SEO) game has caused people to create these garbage dump sites that take some of the top spots in search results.
This is where Perplexity comes in. I've rarely had issues with getting good sources from them. When I want to research a topic, they'll list 3-6 sources with direct links to articles, while filtering out the trash search results. I've replaced many of my Google searches with Perplexity, since it makes the research process a lot smoother.
The short summary of why the lawsuit was filed is that the WSJ and NYP allege Perplexity has copied entire articles into its database, which are then served up when someone makes a relevant search. I'll be pasting some of the noteworthy statements from the complaint in this letter and doing a breakdown of them.
"Perplexity’s business is fundamentally distinct from that of traditional search engines that also copy a vast amount of content into their indices but do so merely to provide links to the originating sites. In its traditional form, a search engine is a tool for discovery, pointing searchers to websites such as the pages of The Wall Street Journal or the New York Post, where the users can click to find the information and answers they seek. Those clicks in turn provide revenue for content producers. In part because traditional search engines that simply provide hyperlinks promote merely the discovery of copyrighted content, and not its substitution (and commercial monetization of that substitute), copyright holders historically have not sought to prevent or legally challenge the copying by traditional search engine companies. But unlike the business model of a traditional internet search engine, Perplexity’s business model does not drive business toward content creators. To the contrary, it usurps content creators’ monetization opportunities for itself."
This section of the complaint highlights to the court why Perplexity is so different from Google. Later on in the complaint, the plaintiffs talk about how people will get a summary of a topic from Perplexity, rather than click through to read the full article. It's a fair point, since a lot of people often just want a quick summary (call it TikTok syndrome). Where they're going to have a hard time in a trial is convincing the jury that everyone does this. My personal preference with Perplexity is to have it surface good results without me wasting time sifting through Google, then to research those articles myself.
Also in the complaint were multiple examples of alleged copyright infringement. The plaintiffs searched a topic on Perplexity and asked it to provide full text from news articles, which it did, word for word. Using some sentences here and there or maybe a paragraph or two could be seen as reasonable, but Perplexity is going to have a harder time dealing with the full article reproduction and explaining why it isn't an issue.
As far as fair use goes, it's something which has to be proven in court. A person can't just say their use of a video or other content was fair use (e.g. in a DMCA counter-notification on YouTube) and make the claim go away entirely. If the copyright holder wants to pursue further action, both parties end up in court, where the court determines whether the use was fair. There are some circumstances where a platform like YouTube can determine that a claim was made in bad faith and side with the person the claim was made against, but it tends to be cases where DMCA abuse is clear cut.
"Generative AI technology has enormous potential to benefit society, but neither society nor AI companies will benefit if original content creators are harmed."
The plaintiffs make a great point with this statement. If things get to the point where the content creators can't make any money from what they produce, the vast majority of people will stop producing entirely. This would then cause LLMs to run out of new content to train on - it would also turn the internet into a soulless garbage dump if most people stopped making content. Then there's the aspect that people don't appreciate a multi-billion dollar company taking all their content and making a profit from it.
"For several reasons, including because Perplexity generates outputs for users on an individualized basis, it is effectively impossible for original content creators to uncover all of the ways in which their copyrighted works are being repackaged and sold to Perplexity’s users, short of legal proceedings compelling disclosures by Perplexity."
The plaintiffs are going to have a harder time explaining this in court and getting a jury to accept it. The statement here is saying people won't know when their content has been woven into the outputs from Perplexity. The sticking point is going to be that humans in general are like this. We all learn from other people, we adapt that knowledge through our own lens, and then use the remixed knowledge for our own benefit and to pass it on to others.
"Initially, Perplexity attempted to justify its massive illegal copying of copyrighted material by touting repeatedly its “Cited Sources” feature, as though a citation to an original source with a link that Perplexity actively encourages users to “skip” would make up for stealing copyrighted material.
When asked at a technology conference how he will “deal with copyright issues . . .[because he] definitely will take data from the internet from some random website and the [ ] traffic will be reduced from their website,” Srinivas responded: “The way we are thinking about it is we are at least attributing every part of the answer. Where are we getting it from in terms of inline citations as well as the source panel at the top.”"
The statement from the CEO here is going to cause them problems in a jury trial. He acknowledged there are copyright issues with taking content from other sources (though Perplexity will likely make a case for fair use). Because there's a large group of people who use it to find good sources, that gives Perplexity a good angle to approach. I personally wouldn't bother using the service if I didn't get good sources. It would then be no more useful than any other LLM. That said, there's no clear guidance on whether citing a source helps with a fair use defence.
For reference, here are the four factors U.S. courts weigh in a fair use analysis (17 U.S.C. § 107):
The purpose and character of the use, including whether it is commercial or for nonprofit educational purposes. Courts also ask here whether the use is transformative.
The nature of the copyrighted work.
The amount and substantiality of the portion used in relation to the work as a whole.
The effect of the use on the potential market for, or value of, the copyrighted work.
"Upon information and belief, Perplexity’s citations make users less inclined to visit the original content source, because, as Perplexity has boasted, citations make content appear more “Reliable,” allowing Perplexity’s readers to feel more confident that they can “Skip the Links” when they believe they are reading content from credible sources.
This observation is corroborated by Plaintiffs’ experience of detecting virtually no click-through traffic on their websites from “cited sources” links on Perplexity, despite Perplexity receiving approximately 250 million queries per month."
This was an interesting section to read from an SEO analytics perspective. In the discovery process, the plaintiffs and Perplexity are going to have to provide some hard numbers from their ends on how much of the plaintiffs' content is being served, and how many people are clicking through to their sites. The 250 million monthly queries figure doesn't mean much when no one knows how many of those queries used content from the plaintiffs. WSJ and NYP wouldn't have that info, so Perplexity would need to provide it from their end.
The flip side is WSJ and NYP are both going to have to show how many people are clicking through to their articles from Perplexity. This is part of the analytics data they have access to. I use basic analytics on my website, and even I can see where my visitors clicked from.
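As a rough illustration of the kind of referrer tally basic analytics can produce, here's a minimal Python sketch. It assumes you already have a list of referrer URLs pulled from your own server logs or analytics export. The function names and the sample data are hypothetical, and this is not how WSJ, NYP, or Perplexity actually measure traffic.

```python
from collections import Counter
from urllib.parse import urlparse


def count_referrers(referrer_urls):
    """Tally visits by the host name found in each referrer URL."""
    counts = Counter()
    for url in referrer_urls:
        # An empty or malformed referrer has no host; bucket it separately.
        host = urlparse(url).hostname or "(direct or unknown)"
        # Treat "www.example.com" and "example.com" as the same site.
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


def referrer_share(counts, host):
    """Fraction of all recorded visits that arrived from the given host."""
    total = sum(counts.values())
    return counts[host] / total if total else 0.0


# Hypothetical sample data, not real traffic figures.
sample = [
    "https://www.perplexity.ai/search/some-query",
    "https://www.google.com/",
    "",
    "https://www.google.com/search?q=vpn+reviews",
    "https://perplexity.ai/",
]

counts = count_referrers(sample)
for host, visits in counts.most_common():
    print(f"{host}: {visits}")
print(f"Share from perplexity.ai: {referrer_share(counts, 'perplexity.ai'):.0%}")
```

A tally like this only shows visits that actually arrived. It says nothing about how many times a source was shown without being clicked, which is the number only Perplexity could supply.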
"According to Perplexity, the Publishers’ Program was developed “[t]o further support the vital work of media organizations and online creators,” because the company “need[s] to ensure publishers can thrive as Perplexity grows.” The program purports to share an unspecified portion of ad revenue from advertisements that Perplexity plans to host in “coming months.”
While Perplexity is correct in conceding that generative AI is unsustainable without a thriving market for publisher content, the Publishers’ Program is no solution or defense to Perplexity’s infringement. Rather, this program is Perplexity’s naked attempt to dictate unilaterally the terms of a license to owners of original content from whom Perplexity has already taken copyrighted material. In the negotiation of a valid license to copyrighted material, an infringer does not unilaterally dictate the terms of the license to its victims."
This part right here could be a big win for smaller content creators, depending on the outcome of this case. Perplexity isn't the one making the content, and while they are helping with the reach of it, it doesn't have the same effect as someone visiting a creator's YouTube page for example. The deal should make sense to both parties and there should be room for negotiation. Perplexity having all the power to make the terms of the deal isn't appropriate in this case.
Something else noteworthy about this complaint was at the end of the document. On the list of attorneys was William Barr, with an asterisk next to his name, which noted that a pro hac vice motion was going to be filed. Two things to point out here:
First, a pro hac vice motion is a request to the court for an attorney who isn't licensed in the jurisdiction to be allowed to participate in that case. The phrase "pro hac vice" means "for this occasion." A motion being put forth doesn't mean it will be granted.
Second, William Barr is a heavyweight in the world of attorneys. He's worked in the field for decades and was the U.S. attorney general under President George H. W. Bush and President Donald Trump. He's also done a lot of other high-profile work, which has brought him unique experience. When I noticed his name on the list of attorneys in the complaint, that was a signal that WSJ and NYP are really swinging for the fences with this case.
Right before finishing this letter, I noticed an article talking about how OpenAI won a copyright infringement case that was filed against them. The judge said the plaintiffs didn't show enough harm to support their case. I didn't read through all the documents to see how they compare to this case, but here's my observation - the plaintiffs in that case aren't nearly at the same level as WSJ and NYP, and they don't have the same level of attorneys available to them.
If one of the AI companies loses a copyright infringement case, it could be used as case law, where other AI companies have to start walking a fine line to keep themselves out of trouble. One domino falling would start a cascade.
Anyway, that's a wrap for this week. Have a good weekend and I'll see you next Saturday! 🍻