Warning: The following contains a lot of strong opinions and #hottakes, and may contain errors of fact and misstatements of law. Any errors or mistakes are entirely my own, and nothing here should be construed as legal advice.
This week, Variety reported that Perplexity.ai is being sued by Dow Jones & Company and NYP Holding Company (which own the Wall Street Journal and New York Post) over their LLM-enhanced search platform. Perplexity (the search platform) is in the same family of products as Microsoft Bing AI and Google Search Labs. It marries traditional internet search with large language models, generating a composite answer that pairs the linguistic capabilities of LLMs with the up-to-date indexes and relevance ranking of modern search engines.
What started as a Discord bot connected to the Bing search index has generated a great deal of excitement and grown into what many consider a superior “Search+LLM” product. Compared to the competition, I believe Perplexity easily has the best product in the space. Unlike Microsoft Bing AI and Google Search Labs, it isn’t wedged into an existing legacy search app. Compared to SearchGPT, it has lower latency, provides better answers, and serves up companion images and contextual follow-up questions. It also offers “Perplexity Pro” searches, which help you build a more detailed question and then execute multiple queries to return more in-depth answers. With annualized revenue nearing $50M, and a raise of nearly $500M at an $8B valuation reportedly underway, Perplexity’s growth and popularity are making them a darling among GenAI enthusiasts and VCs.
With attention comes scrutiny, and in this case, a legal complaint from prominent news outlets alleging that the technological underpinnings of Perplexity, a RAG system powered by an index of the internet, allows Perplexity to exploit the IP of major publishers, in violation of Copyright and Trademark law.
Perplexity’s Technology
As founder and CEO Aravind Srinivas explains it, Perplexity started by trying to build an enterprise search product that could take natural-language questions and convert them to SQL queries using LLMs. During their first six months, they built a series of Slack and Discord bots using OpenAI’s models to answer their own questions about integrating various technologies, as well as company-building admin questions like how to get employees a healthcare plan. Frustrated by the inaccurate, outdated answers GPT-3 returned, the team’s CTO connected GPT-3 to a RAG system that queried Microsoft Bing’s index, which dramatically improved answer quality.
Since then, Perplexity has improved their tech stack by adding their own crawlers and index, optimizing search-result ranking, and making application-layer UI and UX enhancements, such as streaming follow-up questions and images, for more low-latency user interactivity. Their understanding of both search and large language model technologies, and of how to build an exceptional user experience with them, is first-rate. At the heart of their product is the custom index, providing real-world grounded information to augment the platform’s answers. In interviews, Srinivas repeatedly highlights the effort that goes into crawling and re-crawling websites to get the most relevant and up-to-date information, as well as formatting and structuring source texts to optimize the retrieval stage. I think it’s also likely that the index they’ve built is actually quite small, and that they have curated an exclusive set of extremely high-quality websites to crawl. By cleaning and formatting a smaller amount of indexed content into suitable bite-sized chunks for LLM context windows, they can reduce the effort needed to keep the index current, maintain low latency, and improve reranking quality.
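To make the chunk-index-retrieve pattern concrete, here is a minimal sketch of the retrieval side of a “Search+LLM” pipeline. Everything here is illustrative: Perplexity’s actual crawler, index, and reranker are proprietary and far more sophisticated than the naive word-overlap scoring below.

```python
# Toy sketch of a RAG retrieval pipeline: chunk source documents,
# score chunks against a query, and select the top-k chunks that
# would be placed into an LLM's context window.
import re
from collections import Counter

def chunk(text, max_words=50):
    """Split a source document into bite-sized chunks sized for a context window."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def score(query, passage):
    """Naive lexical-overlap score, standing in for real ranking/reranking."""
    q = Counter(re.findall(r"\w+", query.lower()))
    p = Counter(re.findall(r"\w+", passage.lower()))
    return sum((q & p).values())  # count of shared words

def retrieve(query, index, k=2):
    """Return the top-k chunks; a real system would rerank and format these."""
    return sorted(index, key=lambda c: score(query, c), reverse=True)[:k]
```

The interesting design point the article highlights is that the heavy lifting happens before any of this runs: curating which sites get crawled, and cleaning the text so that `chunk` produces well-formed passages, is what keeps latency low and retrieval quality high.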
The Publishers’ Claims
Dow Jones and NYP allege that Perplexity’s service allows users to, in Perplexity’s own words, “Skip the Links” to original publishers’ websites and still gain access to the content published within them. According to the complaint, Perplexity accomplishes this by reproducing the publishers’ content within a “massive ‘retrieval-augmented generation’ database” that “provides the information to a large language model” capable of summarizing and paraphrasing the original content, or even “reproducing the content verbatim”.
The complaint alleges that Perplexity has committed copyright infringement in two ways: first, by copying the publishers’ content into a RAG index for the purpose of grounding the platform’s responses to user queries, and second, by copying the indexed content into the context window of an LLM in order to generate answers.
Plaintiffs also bring Lanham Act claims for False Designation of Origin and Dilution, citing several examples where Perplexity’s LLM-generated answers include hallucinations: fabricated text that purports to be from the source articles but is not part of the original text. That’s right: even when an LLM’s answer is grounded in source context, the model can still make up answers. Ordinarily, it might be challenging to argue (without pleading in the alternative) that an alleged infringer was both so faithful as to constitute direct copying, while also being so inaccurate as to be dilutive of the brand. The marvel of generative AI makes all things possible.
Search Engine vs. Answer Engine
Perplexity distinguishes their platform from traditional search by referring to it as an “answer engine”. In pulling data from disparate sources and combining it with an LLM, I think it’s fair to say the combination presents both novel capabilities and exposure to new liabilities. Srinivas indicates that Perplexity has their own crawlers and indexes and uses its own page-ranking algorithm, and that part of the gains in relevance and accuracy of citations comes from leveraging LLMs to choose the best citations from the search results fed to the model.
However, what distinguishes Perplexity from a search engine is purpose: traditional search returns a ranked list of results, directing the user to explore original content on a publisher’s website and driving traffic to their page. (Google’s own generative AI and templated search-page results increasingly interfere with this objective, but that is a separate matter.) Whether it’s article content, image thumbnails, or digitized books, the fair use analysis has favored search engines so long as the purpose of the copying ultimately pointed users back to legitimate copies of a work. By contrast, Perplexity’s aggregation and use of the indexed content is much more likely to supplant the original work, diverting page traffic by promising to combine the most relevant parts of each page into its generated answer and largely eliminating the need to visit the cited pages. This is the rationale relied upon in the complaint, and in my view it distinguishes Perplexity’s copying from that of search engine providers.
What about those Terms of Use agreements containing Indemnities?
Last year, during the first OpenAI DevDay, Sam Altman announced that OpenAI was revising their Terms of Service to add an indemnity provision for customers who are alleged to have violated IP laws as a result of their use of OpenAI’s API. This followed in the footsteps of GitHub, Adobe Firefly, and Shutterstock, which implemented similar indemnity provisions for the GenAI features available on their respective platforms. Anthropic, makers of Claude, followed suit within a few weeks with a similar amendment. Both OpenAI’s and Anthropic’s indemnification provisions are full of caveats and carve-outs, so they are of relatively limited value as a practical matter. It was still a nice gesture.
To finally address the original question, “Perplexity got sued. What does that mean for OpenAI and Anthropic?” In short, unless I am missing something very large, Perplexity’s services fall into several of the existing exceptions to the indemnity clauses. Namely, both OpenAI’s and Anthropic’s indemnity provisions have carve-outs for infringement claims that arise from the combination of their services with additional content or services (e.g., RAG systems where the customer has added content to the language model’s context window), which accurately describes Perplexity’s search platform.
But what about Perplexity’s Customers?
This got me thinking: does Perplexity indemnify its customers against allegations of IP infringement? Come to think of it, what do Perplexity’s agreements say anyway? Despite having used their API extensively (along with receiving multiple rate-limit increases and access to beta features), I’ve never actually read the Terms of Service. I’ve just been an admirer of the progress they’ve made in integrating a search index into an LLM API service, and of its capabilities in automating internet search and information retrieval at scale.
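For a sense of what “automating internet search at scale” looks like in practice, here is a hedged sketch of scripting a research question against an OpenAI-style chat-completions API. The endpoint URL and model name are assumptions for illustration only, not taken from Perplexity’s current documentation.

```python
# Hypothetical sketch of automating one grounded research query against
# an OpenAI-compatible chat-completions API. Endpoint and model name
# below are illustrative assumptions, not authoritative.
import json
import urllib.request

API_URL = "https://api.perplexity.ai/chat/completions"  # assumed endpoint

def build_research_request(question, model="sonar"):  # model name is illustrative
    """Assemble the JSON payload for a single citation-grounded question."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer with citations to your sources."},
            {"role": "user", "content": question},
        ],
    }

def ask(question, api_key):
    """POST the request and return the decoded JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_research_request(question)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Loop a function like `ask` over a list of cited studies and you have the kind of scaled research automation described above, which is exactly why the downstream-use restrictions discussed below matter so much.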
I’m not the only admirer in the legal space. Just last week, I attended a trial-techniques lecture on expert witness cross-examination where the lecturer explained how she used Perplexity Pro Enterprise as part of her trial prep. She used it to research studies cited by opposing experts, and to inquire about potential sources of bias, limitations, criticisms, and flaws in a study, including asking whether there were successful attempts to reproduce the experimental results. Claude and GPT-4 are both capable of producing many related questions and arguments based on the search results, drawing on their vast parametric knowledge.
I thought these were fairly solid applications for a tool like Perplexity Pro, because the platform is very good for developing a line of questioning designed to expose the limitations of the expert’s knowledge, not necessarily to contradict them. The list of counterpoints is also subject to curation prior to use (attorney-in-the-loop!), so the technique allows for adequate safeguards. I’d assumed the instructor recommended the Enterprise version based on TOU/TOS that are favorable for customers in highly regulated industries. I was wrong.
In the inference market, I’ve become spoiled by the competition between Anthropic and OpenAI. Both companies want to attract use cases in highly regulated industries, and both have customer-favorable policies with respect to data-handling rights regarding inputs and outputs, privacy, retention, and reserved rights in their Terms of Service. They commit not to train on customer data. They monitor or use customer data only to the extent necessary to deliver the service and enforce applicable terms. They offer zero-retention versions of their respective platforms, and otherwise specify retention and deletion dates for all customer-provided data. They want to be trusted partners, and have written their terms accordingly.
Not so for a lot of other GenAI startups. Among the TOU/TOS agreements I reviewed in preparing this article (including MistralAI, Together.ai, and Groq), Perplexity’s are uniquely awful in terms of the rights and uses they reserve, and similarly in the ownership they claim over outputs (sometimes referred to as “Perplexity Materials” within the terms). This is in part because they have to be awful from a customer perspective to protect the incredibly valuable indexed content their platform delivers (which they do not pay for). The core of Perplexity’s value rests on their index: its comprehensiveness, the quality of the data it contains, and the retrieval and reranking performance that places the most relevant content into LLM context windows to generate outputs.
This is so evident in their agreement language that I wish the Complaint had employed it in its arguments: <Advocacy Hat> “Perplexity believes that their indexing of copyright-protected content falls under Fair Use, so long as they provide citations to the original works. Yet their User Agreements go to tortured lengths to protect their platform’s Outputs containing the Publishers’ content. They go so far as to claim exclusive ownership of Outputs as their own IP, and restrict downstream copying, distribution, or reselling of them by their customers. It is the content that brings commercial value to their service, yet Perplexity believes they can copy it for free.” Something like that. </Advocacy Hat>
Let’s Put on Our Product Hat and Look at the API TOS
This agreement is ridiculous.
API customers can’t do anything with the API responses to the extent that they contain “Perplexity Materials.” Given the near-infinite ways large language models can combine, remix, synthesize, and regurgitate text, good luck understanding which part of a response you can actually use, and which components are restricted from all uses other than “receive and display”. You cannot build a product with this.
Wow, doesn’t 2.2.2 seem rather restrictive? If I can’t take actions beyond receiving and displaying the API responses, what does this service even do? The potential of an API service from Perplexity is that it can automate certain kinds of online research activities, but only if I can take their data, transform it, and still have some rights to use it. What am I going to do with “receive and display”? Put the information into my own brain? Gross.
Meanwhile, Perplexity retains broad rights to downstream uses of any data you provide them, for any business purpose, including aggregating and disclosing it (e.g., selling). The terms already prohibit building a Perplexity competitor. That doesn’t really matter; trying to compete with Perplexity while reselling their own service is competitively a non-starter. But what is left that you can build? Perhaps a decisioning system that researches a subject, draws a conclusion, then discards the original responses? Even then, it could be challenging to ensure the end product doesn’t contain any Perplexity Materials. And if your idea becomes popular, what is stopping Perplexity from cutting you off and taking it?
The DPA clause here is oddly worded, as if they are trying to redefine Personal Information so as to exclude API activity from their own DPA language. I asked Carey Lening, CDPP, about this, and she said she didn’t think the language works the way they want it to. Instead of attempting to transmute personal data into something else through sheer force of will, maybe just require an agreement to be in place to process personal data? Stop being weird.
But What About Perplexity Enterprise?
I don’t understand why Perplexity has three separate agreements when practically every other GenAI company has managed to cover all of their bases with two: a Consumer agreement (the chat platform and any paid versions) and a Commercial agreement (Teams, Enterprise, and API products). But to some extent, I’m glad they did, because OH MY GOD, SOMEHOW THE ENTERPRISE TERMS ARE EVEN WORSE!!!
Ok. There are non-sensitive research use cases that you can still perform under these terms, but I did not see this coming. This is nothing like the enterprise agreements from Anthropic or OpenAI, which are far more restrictive about what the provider may do with your data. I don’t know why an “Enterprise” agreement would grant the provider rights this broadly. This is just the Consumer agreement with a Data Processing Addendum, which covers personal data and personal information. For an enterprise agreement, I would expect language that restricts the vendor’s use of all data or content submitted to the service by the customer.
That last clause about training is silly, because Perplexity is not a model company. They fine-tune open-weights models to improve their performance with the RAG index, but they very likely can do that without customer data. And despite an opt-out toggle in the account settings, customers have no opt-out right in the Terms of Service.
Legal, Financial, and Other Highly Regulated Industries
This isn’t the standard we’ve become used to for legaltech or fintech tools, and it falls far short of the user protections you’d expect from providers like OpenAI and Anthropic. Until they tighten up their customer data protections, Perplexity does not seem fit for legal use cases, particularly in light of what the competition offers.
Conclusions
Perplexity falls into the well-worn category of Silicon Valley disruptors who build first, raise second, and ask questions never. They’ve managed to build a better service than their trillion-dollar counterparts in a fraction of the time. From a functional perspective, I’ve been impressed by their API service and how it (potentially) allows for automating and scaling indexed searches across a broad segment of the internet. I consider it best-in-class.
I hadn’t previously thought through the copyright implications of how their service works. That is my bad. I thought they were using the Google Search API, and would get sued (or maybe acquired) by a tech incumbent before the content folks caught wind. However, they built their own index, which streamlines the analysis considerably.
At a high level of abstraction, they collected a bunch of content from the internet, and now they are charging a monthly subscription fee for access to it. The inclusion of a large language model in the delivery chain (which may grind up, remix, or simply regurgitate the content), in my mind, doesn’t alter the analysis.
I hadn’t considered what Perplexity’s terms of service actually said. As someone who was enthusiastic about their tech and services, that’s also on me. So, as penance, hopefully I’ve shed some light on their terms, allowing others to make a more informed decision about whether to use the platform in the delivery of legal or professional services, or as a developer.