As social media continues to blur the lines between news and misinformation, Tom Snyder explores the crucial shift from traditional search engines to AI-driven answers in the digital landscape.
It is well understood that social media algorithms have fueled, and in fact amplified, the spread of misinformation throughout society. Under legal arguments based on the First Amendment and populist messaging about freedom of speech, social media platforms have justified the spread of misinformation and resisted the complex task of editorial filtering that credible journalists practice.
Major platforms like X, TikTok, LinkedIn and Instagram exercise some level of editorial control to protect against easily prosecutable infractions, like excess profanity and pornography. But false, blatantly misleading and libelous content often flows freely across these platforms. The algorithms that decide what scrolls across our screens are optimized for commerce and engagement, delivering content that matches our personal preferences where they intersect with advertiser interests. No amount of obfuscation from Elon Musk changes the fact that X is not a news platform, but rather a hype and entertainment machine.
This is problematic for a society that increasingly turns to social media to gather news.
As the business model behind traditional journalism has broken down, most credible news is trapped behind paywalls, making it inaccessible to the large swaths of society that can't afford access. Most major global news sources charge between $10 and $20 per month for digital access, with a number of them trending even higher. Local news sources are dying out as they are acquired by big media companies that ultimately shut down local operations. [Note: The Guardian is a unique exception, offering a "pay what you can" model that requires no subscription fee but encourages those who can to pay extra to cover for those who cannot.]
If you are like me, then after learning about something new - often through social media - your next step is to search the web for more information. With thorough research, I can begin to understand what is real and what may have been hyperbole or outright falsehood in the initial clickbait reporting. Social media can be an aggregator without being a source of truth. Information on the web, carefully vetted, helps distill the signal from the noise.
What I've been concerned about recently is the evolution of search.
Since the earliest days of Archie and AltaVista, Ask Jeeves and Lycos, "search" has been about matching websites to search terms. Search Engine Optimization, or SEO, became the science of putting content on a website that aligned with the web crawler algorithms behind the search bar - and, more specifically, of gaming Google's algorithm. Google represents 90% of global search, with Bing (3.5%), Baidu (2.5%; mostly China), Yahoo (1.5%) and Yandex (1.5%; Russia) the only other search engines that capture a full percentage point of global search.
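To make concrete what "matching websites to search terms" has traditionally meant, here is a minimal, illustrative sketch (in Python, with made-up page URLs) of the inverted-index idea underlying classic keyword search. Real engines layer far more sophisticated ranking signals on top, but the output is the same in kind: a ranked list of links for the user to click through.

```python
from collections import defaultdict

# Toy corpus: a handful of "web pages" keyed by URL (illustrative URLs only).
pages = {
    "example.com/movies": "local movie theater showtimes and ticket prices",
    "example.com/news": "breaking news headlines from credible journalists",
    "example.com/recipes": "easy dinner recipes and cooking tips",
}

# Build an inverted index: term -> set of pages containing that term.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

def search(query):
    """Rank pages by how many query terms they contain (a crude relevance score)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, set()):
            scores[url] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("movie showtimes"))
# [('example.com/movies', 2)] -- a ranked list of links, not an answer
```

SEO, in this framing, is the practice of shaping a page's text, links and metadata so that it scores well for the queries you care about.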
My workflow for news fact-checking depends heavily on trusting the websites that Google presents to me in response to my search prompts. I suspect that, if we are honest, most of us will agree that we have consciously or unconsciously put tremendous trust in a single tech company as an arbiter of truth. Google's search algorithm - we hope - is filtering out the craziness, lies and hyperbole that are rampant on social media. When Google went public in 2004, its code of conduct included the statement, "Don't be evil. And if you see something that you think isn't right - speak up."
I wrote more than a year ago that I believe search is dead. We are moving from the era of SEO-generated link lists to contextual answering of search prompts by generative AI. It is in Google's best interest to keep users on the Google platform, rather than to let them search and then leave Google for someone else's website. All of the large LLM providers will behave this way, striving to provide all the context a user is looking for directly on their own platforms, so that the provider can continue to capture your data (prompt history) and insert itself into commerce where possible (advertising, purchasing, etc.).
We've seen early stages of this, even in more traditional search. A few years back, if you searched for movie times, your search engine would return the link to a local movie theater as the top result (along with paid-search results, clearly marked as such). Today that same search returns a list of movies and times directly from Google first, and you have to scroll much further down to find the actual theater's website. Google wants to know not only that you are looking for movie information, but also which movie you actually select, and at what location, time and price point. All of this data further trains the AI that helps Google tailor better and better responses to your prompts over time.
More recently, Google and other tools have begun providing AI-generated, contextual responses to search prompts as the top result of a query. Google is pulling information from third-party websites and other data sources to answer any question you may have without requiring (or even suggesting) that you actually visit those third-party websites.
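Conceptually, these AI overviews follow the retrieval-augmented pattern sketched below: fetch snippets from third-party pages, fold them into a prompt, and return a synthesized answer instead of a list of links. The retrieve() and llm() functions here are hypothetical stand-ins for the crawling and model infrastructure, not any vendor's actual API; the point is simply where the user's attention (and traffic) ends up.

```python
def retrieve(query):
    """Hypothetical retrieval step: snippets pulled from third-party sites."""
    return [
        {"source": "example-theater.com", "text": "Showtimes tonight: 7:00 and 9:30."},
        {"source": "example-reviews.com", "text": "The film runs 2 hours 10 minutes."},
    ]

def llm(prompt):
    """Hypothetical LLM call; in practice this would hit a hosted model."""
    return "Tonight's showtimes are 7:00 and 9:30; the film runs about 2h10m."

def answer(query):
    snippets = retrieve(query)
    context = "\n".join(s["text"] for s in snippets)
    # The user sees a synthesized answer; the source sites may never get the visit.
    return llm(f"Answer the question using only this context:\n{context}\n\nQ: {query}")

print(answer("What time is the movie showing tonight?"))
```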
What happens when the search bar is completely replaced with the LLM prompt?
Suddenly my goal of separating facts from embellishment becomes more difficult. I need to put much more trust in whoever has trained the LLM that is generating AI responses to my prompts. Do those algorithms have bias? Are they hard-coded to provide some information and not other information?
Consider the Associated Press, one of the oldest and most respected sources of factual, journalistic information for more than 175 years. If a journalist is using Gemini (Google), Copilot (Microsoft) or ChatGPT (OpenAI) for research, they are benefiting from an LLM trained on the full archive of the Associated Press, as AP has licensed its content to the companies behind those LLMs. Other LLMs like Llama (Meta), Claude (Anthropic), Cohere and Mistral do not have any of that historical data, relying instead only on publicly available information for training.
Some LLM tools, like Perplexity, do a really nice job of providing source links for generative AI responses. Using Perplexity feels a bit like using Wikipedia, where you can stay on-platform, but if you choose to leave for additional fact-checking, you have links at your fingertips. But most of the platforms are black boxes, asking users to put full trust in the response.
Should we trust LLMs?
Over the first two years of the public acceleration of generative AI and LLMs, the US has clearly been in the lead. At this time last year, experts estimated that China was about a year behind the US in LLM sophistication and accuracy. This fall I saw reports claiming China has closed the gap to about five months. Just last week, DeepSeek, a Chinese LLM tailored for code writing, published benchmark data demonstrating better performance than GPT-4 and near-equal performance to GPT-4 Turbo.
The problem is that we know that Chinese LLMs are hard-coded to present results favorable to Chinese propaganda. If you ask Alibaba's primary LLM (Qwen) what happened in Beijing on June 4, 1989, it will not present any information about the Tiananmen Square massacre. The e-commerce giant (China's version of Amazon) is clearly following the government's direction in censoring its LLM.
Hugging Face is the world's biggest platform for AI models. It happens that the default LLM embedded into Hugging Face's chat interface is Qwen2.5-72B-Instruct, another member of the Qwen family of LLMs developed by Alibaba. This particular version does not appear to censor politically charged questions, but are there subtler guardrails built into the tool that are less easily detected?
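For readers who want to probe such guardrails themselves, here is a minimal sketch using the Hugging Face Inference API, assuming you have a valid API token and that the hosted endpoint currently serves Qwen/Qwen2.5-72B-Instruct (model availability and behavior can change without notice). Pairing a politically sensitive question with a neutral control question is one crude way to surface asymmetric guardrails, though it cannot rule out subtler biases.

```python
from huggingface_hub import InferenceClient

# Replace with your own Hugging Face API token; this assumes the hosted
# inference endpoint serves this Qwen model at the time you run it.
client = InferenceClient(model="Qwen/Qwen2.5-72B-Instruct", token="hf_your_token_here")

# A politically sensitive question paired with a neutral control question.
questions = [
    "What happened in Beijing on June 4, 1989?",
    "What happened in Berlin on November 9, 1989?",
]

for q in questions:
    response = client.chat_completion(
        messages=[{"role": "user", "content": q}],
        max_tokens=300,
    )
    print(q)
    print(response.choices[0].message.content)
    print("-" * 40)
```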
The global competition for search was dominated by Google. The competition for capturing LLM prompts and responses is currently led by OpenAI and the various versions of ChatGPT. Early estimates of market share (as reported by ChatGPT…) are:
In the US, the common denominator is that all of the major LLMs are owned by large technology companies. How motivated will those companies be to provide responses that align with their profitability goals? In nations like China, where the government exerts strong control over the AI tools being created, will we see people subtly influenced by propaganda in every prompt response? Will this provoke a competitive response from the EU or US, creating a public AI with our own propaganda in an AI arms race?
For ordinary people like you and me who are simply trying to verify whether a post on social media is true, will we be able to independently vet multiple sources online, or will we only get the information that the LLM provider wants to show us in its own platform's response?
The future looks more and more monolithic, as AI threatens to reduce the vast diversity of websites to a few [they will argue convenient] AI tools. Never has there been a better time to remember that first-person sources are the best source of accurate information. The old-fashioned meeting or phone call will remain critical, even in the presence of more and more powerful AI.