When Chatbots Go Rogue: The Dangers Of Poorly Trained AI

ChatGPT’s Secret Training Data: The Top 50 Books AI Bots Are Reading


GPT-4’s database is ginormous — up to a petabyte, by some accounts. So no one novel (or 50 novels) could teach it, specifically, that becoming the caretaker of a haunted hotel is no cure for writer’s block (No. 49), or that fear is the mind-killer (No. 13). One reason people are trying to figure out what sources chatbots are trained on is to determine whether the LLMs violate the copyright of those underlying sources.

  • Yet, when trained on vast internet datasets, these systems can produce nonsensical or harmful outputs, such as Gemini’s infamous “Please die” response.
  • In fact, it was almost as if it had studied the novel in advance.
  • The Supreme Court let stand lower court rulings that rejected copyright infringement claims.

Davis said libraries also hold “significant amounts of interesting cultural, historical and language data” that’s missing from the past few decades of online commentary that AI chatbots have mostly learned from. Fears of running out of data have also led AI developers to turn to “synthetic” data, generated by the chatbots themselves and of lower quality. Artificial intelligence chatbots, meanwhile, are transforming industries and reshaping interactions, and as their adoption soars, glaring cracks in their design and training are emerging, revealing the potential for major harm from poorly trained systems.

Dan Balaceanu, co-founder of DRUID AI, highlights the need for rigorous testing and fine-tuning, attributing the problem to the varying quality of training data and algorithms used from model to model. In a recent Deloitte survey of companies adopting AI, 40% said data-related challenges, including thoroughly preparing and cleaning data, were among the top concerns hampering their AI initiatives. A separate poll of data scientists found that about 45% of their time is spent on data prep tasks such as loading and cleaning data.


AI chatbots need more books to learn from. These libraries are opening their stacks

Lars Nyman, CMO of CUDO Compute, calls this phenomenon a “mirror reflecting humanity’s internet id” and warns of the rise of “digital snake oil” if companies neglect rigorous testing and ethical oversight. There has been an endless stream of coverage about all of the wonderful things chatbots are going to do for business by automating conversations with customers.



Ari Morcos, who’s worked in the AI industry for nearly a decade, wants to abstract away many of the data prep processes around AI model training, and he’s founded a startup to do just that. A well-architected system will let you train for both conversational and Q&A types of platforms in one system, with multilingual versions generated or updated automatically whenever you create a new scenario or change an existing one.
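To make that concrete, here is a minimal sketch of the “one scenario, many surfaces” idea: a single scenario definition exported as both a Q&A pair and a conversational flow, with per-language variants regenerated whenever the scenario changes. All of the names here (Scenario, localize, the translate callable) are illustrative assumptions, not any real product’s API.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    intent: str
    question: str
    answer: str
    follow_ups: list[str] = field(default_factory=list)

def to_qa_pair(s: Scenario) -> dict:
    # Flat question/answer record for an FAQ-style platform.
    return {"q": s.question, "a": s.answer}

def to_dialog_flow(s: Scenario) -> list[dict]:
    # Multi-turn flow for a conversational platform, from the same source.
    turns = [{"user": s.question, "bot": s.answer}]
    turns += [{"bot": f} for f in s.follow_ups]
    return turns

def localize(s: Scenario, lang: str, translate) -> Scenario:
    # `translate(text, lang)` is a stand-in for any machine-translation
    # service; rerunning this after editing a scenario keeps every
    # language variant in sync automatically.
    return Scenario(
        intent=s.intent,
        question=translate(s.question, lang),
        answer=translate(s.answer, lang),
        follow_ups=[translate(f, lang) for f in s.follow_ups],
    )
```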

ChatGPT’s secret reading list

The issue, as several lawsuits argue, revolves around whether the bots make fair use of the material by transforming it into something new, or whether they just memorize it whole and regurgitate it, without citation or permission. Harvard’s new AI training collection has an estimated 242 billion tokens, an amount that’s hard for humans to fathom but still just a drop of what’s being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos. “A lot of the data that’s been used in AI training has not come from original sources,” said the data initiative’s executive director, Greg Leppert, who is also chief technologist at Harvard’s Berkman Klein Center for Internet & Society. This book collection goes “all the way back to the physical copy that was scanned by the institutions that actually collected those items,” he said.
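For a concrete sense of the unit being counted in those figures, OpenAI’s open-source tiktoken library splits text into tokens the way GPT-4-era models do. A minimal sketch, assuming tiktoken is installed:

```python
# Count tokens with tiktoken's cl100k_base encoding (the one used by
# GPT-4-era OpenAI models).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")
print(len(enc.encode(text)))  # a few dozen tokens for one famous sentence

# Rule of thumb: a token is roughly three-quarters of an English word,
# so 242 billion tokens is on the order of 180 billion words, and Meta's
# 30 trillion tokens is roughly 125 times larger still.
```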

So you’re thinking of implementing a chatbot, like every other company on the planet. Considering IVR is probably the most-hated piece of technology invented in the last 50 years, replicating that process is probably not a great idea.

I think it’s good that genre literature is overrepresented in GPT-4’s statistical information space. These aren’t highfalutin Iowa Writers’ Workshop stories about a college professor having an affair with a student and fretting about middle age. Genre — sci-fi, mystery, romance, horror — is, broadly speaking, more interesting, partly because these books have plots where things actually happen. Bamman’s GPT-4 list is a Borgesian library of episodic connections, cliffhangers, third-act complications, and characters taking arms against seas of troubles (and whales).

Bamman’s team probed GPT-4 with name-cloze questions: what is the proper name that fills in the MASK token in a given passage? The name is exactly one word long, and is a proper name (not a pronoun or any other word). One way to answer the broader question of what a model has read is to look for information that could have come from only one place. When prompted, for example, a GPT-3 writing aid called Sudowrite recognizes the specific sexual practices of a genre of fan-fiction writing called the Omegaverse. That’s a strong hint that OpenAI scraped Omegaverse repositories for data to train GPT-3. The chatbot’s GPT-4 version was amazingly accurate about the Bennet family tree.
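Here is a minimal sketch of that name-cloze probe. The prompt wording adapts the question quoted above; the `complete` callable (any wrapper around an LLM API) and the “reply with the name only” framing are assumptions for illustration, not the researchers’ exact protocol.

```python
# Hypothetical name-cloze membership probe: if the model can restore a
# masked proper name from a distinctive passage, that hints (but does
# not prove) the passage was in its training data.
PROMPT = (
    "What is the proper name that fills in the [MASK] token in the "
    "passage below? This name is exactly one word long, and is a proper "
    "name (not a pronoun or any other word). Reply with the name only.\n\n"
    "{passage}"
)

def name_cloze_probe(passage_with_mask: str, true_name: str, complete) -> bool:
    """Return True if the model recovers the masked name exactly."""
    guess = complete(PROMPT.format(passage=passage_with_mask))
    return guess.strip().strip('."\'').lower() == true_name.lower()

# Example (Pride and Prejudice, hence the Bennet family tree above):
# name_cloze_probe(
#     '"My dear Mr. [MASK]," said his lady to him one day, ...',
#     "Bennet",
#     complete=my_llm_call,  # hypothetical wrapper around any chat API
# )
```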

The Silmarillion. Really?

chatbot training dataset

The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.

Wysa co-founder Aggarwal emphasizes the importance of creating a safe and trustworthy space for users, particularly in sensitive domains like mental health. “Each time an LLM generates a word, there is potential for error, and these errors auto-regress or compound, so when it gets it wrong, it doubles down on that error exponentially,” she says.
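Her point about compounding has a simple back-of-the-envelope illustration. Under the simplifying assumption that each generated token is wrong independently with probability p (real errors are not independent, which is exactly why she says the model “doubles down”), the chance of a fully error-free answer decays exponentially with length:

```python
# Probability that a generation of n tokens contains no errors, assuming
# an independent per-token error rate p (a deliberate oversimplification).
def p_error_free(p: float, n: int) -> float:
    return (1 - p) ** n

for n in (10, 100, 1000):
    print(n, round(p_error_free(0.01, n), 3))
# 10 0.904
# 100 0.366
# 1000 0.0   (about 4.3e-05: long answers almost surely contain errors)
```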


Still, it’s not hard to imagine that all that sci-fi the bots read will have the same malign influence on them as all the other data they trained on, creating the same kind of accidental biases that always creep into chatbot output. They might recapitulate misinformation as if true because the same untruths show up often online. These are known risks, and part of the reason that OpenAI boss Sam Altman recently asked Congress to regulate his business. Massive training datasets are the gateway to powerful AI models — but often, also those models’ downfall.
