All content on this blog is intended solely for the purpose of learning English and is not meant to infringe on any copyrights. All translations and citations are used for educational and non-commercial purposes. This blog is publicly accessible and free, with no profit being made from its content.
If you believe that any content on this blog infringes on your copyright, please contact me at the following email address, and I will promptly remove the content in question after verification.
Thank you for your understanding and support.
Contact: kechang.dev@gmail.com

AI firms will soon exhaust most of the internet’s data AI 公司将很快用尽互联网中的数据

IMAGE: MIKE HADDAD
Jul 23rd 2024
 
In 2006 Fei-Fei Li, then at the University of Illinois, now at Stanford University, saw how mining the internet might help to transform AI research. Linguistic research had identified 80,000 “noun synonym sets”, or synsets: groups of synonyms that described the same sort of thing. The billions of images on the internet, Dr Li reckoned, must offer hundreds of examples of each synset. Assemble enough of them and you would have an AI training resource far beyond anything the field had ever seen. “A lot of people are paying attention to models,” she said. “Let’s pay attention to data.” The result was ImageNet.
2006 年,当时任职于 Illinois 大学(现任职于 Stanford 大学)的 Fei-Fei Li 看到,挖掘互联网数据或许有助于变革人工智能研究。语言学研究已经确定了 80,000 个“名词同义词集”(synset):即描述同一类事物的同义词组。Li 博士估计,互联网上的数十亿张图片一定能为每个同义词集提供数百个例子。收集到足够多这样的图片,你就会拥有一个远超该领域以往任何资源的人工智能训练资源。“很多人都在关注模型,”她说,“我们应该关注数据。” 结果就是 ImageNet。 synonym n. 同义词 synset n. 同义词集 assemble v. 装配;集合
 
In this series on artificial intelligence
  1. A short history of AI
  2. Mining the net*
 
The internet provided not only the images, but also the resources for labelling them. Once search engines had delivered pictures of what they took to be dogs, cats, chairs or whatever, these images were inspected and annotated by humans recruited through Mechanical Turk, a crowdsourcing service provided by Amazon which allows people to earn money by doing mundane tasks. The result was a database of millions of curated, verified images. It was through using parts of ImageNet for its training that, in 2012, a program called AlexNet demonstrated the remarkable potential of “deep learning”—that is to say, of neural networks with many more layers than had previously been used. This was the beginning of the AI boom, and of a labelling industry designed to provide it with training data.
互联网不仅提供图像,还提供标记图像的资源。一旦搜索引擎提供了它们认为是狗、猫、椅子或其他东西的图片,这些图像就会由通过 Mechanical Turk 招募的人员进行检查和注释。Mechanical Turk 是亚马逊提供的众包服务,允许人们通过完成日常任务来赚钱。最终形成了一个包含数百万张经过筛选和验证的图像的数据库。2012 年,一个名为 AlexNet 的程序通过使用 ImageNet 的部分内容进行训练,展示了“深度学习”的巨大潜力——也就是说,神经网络的层数比以前使用的要多得多。这是人工智能繁荣的开始,也是为其提供训练数据的标注行业的开始。 take to be 认为是;把 … 当作 annotate v. 注释;注解 mundane adj. 世俗的;平凡的 crowdsourcing n. 众包 curate n. 副牧师;助理牧师 | v. 组织;担任馆长 that is to say 也就是说
 
The later development of large language models (LLMs) also depended on internet data, but in a different way. The classic training exercise for an LLM is not predicting what word best describes the contents of an image; it is predicting what a word cut from a piece of text is, on the basis of the other words around it.
后来大语言模型(LLMs)的开发同样依赖互联网数据,但使用方式不同。LLM 的经典训练任务不是预测哪个单词最能描述一张图片的内容,而是根据周围的其他单词,预测从一段文本中被挖掉的那个单词是什么。 on the basis of 基于… ; 在 … 的基础上
 
In this sort of training there is no need for labelled and curated data; the system can blank out words, take guesses and grade its answers in a process known as “self-supervised training”. There is, though, a need for copious data. The more text the system is given to train on, the better it gets. Given that the internet offers hundreds of trillions of words of text, it became to LLMs what aeons of carbon randomly deposited in sediments have been to modern industry: something to be refined into miraculous fuel.
在这种训练中,不需要经过标注和筛选的数据;系统可以自己遮去一些单词、进行猜测,并给自己的答案打分,这个过程被称为“自监督训练”。不过,这需要海量的数据:用于训练的文本越多,系统的效果就越好。鉴于互联网提供了数百万亿个单词的文本,它对 LLMs 来说,就如同亿万年间随机沉积在沉积物中的碳对现代工业一样:可以提炼成神奇的燃料。 blank out v. 遮去;抹掉 copious adj. 大量的;丰富的 aeon n. 漫长的时期;永世;千万年 sediment n. 沉淀物;沉积物 refine v. 精炼;改进 miraculous adj. 奇迹般的;不可思议的;神奇的
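To make the mechanism concrete, here is a minimal sketch of self-supervised, fill-in-the-blank training in pure Python. The toy corpus and counting "model" are hypothetical stand-ins; real LLMs use neural networks, but the loop is the same: blank out a word, guess it from its context, and grade the guess against the hidden word, with no human labels needed.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for internet-scale text (illustrative only).
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count which word tends to appear between each (left, right) neighbour pair.
context_counts = defaultdict(Counter)
for i in range(1, len(corpus) - 1):
    context_counts[(corpus[i - 1], corpus[i + 1])][corpus[i]] += 1

def self_supervised_step(tokens):
    """Blank out one word, guess it from its neighbours, grade the guess.
    The hidden word itself is the label, so no annotation is required."""
    i = random.randrange(1, len(tokens) - 1)
    left_right = (tokens[i - 1], tokens[i + 1])
    guess = context_counts[left_right].most_common(1)[0][0]
    return guess == tokens[i]

hits = sum(self_supervised_step(corpus) for _ in range(1000))
print(f"masked-word accuracy on the toy corpus: {hits / 1000:.2f}")
```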
 
Common Crawl, an archive of much of the open internet including 50bn web pages, became widely used in AI research. Newer models supplemented it with data from more and more sources, such as Books3, a widely used compilation of thousands of books. But the machines’ appetites for text have grown at a rate the internet cannot match. Epoch AI, a research firm, estimates that, by 2028, the stock of high-quality textual data on the internet will all have been used. In the industry this is known as the “data wall”. How to deal with this wall is one of AI’s great looming questions, and perhaps the one most likely to slow its progress.
Common Crawl 是一个囊括了大部分开放互联网、包含 500 亿个网页的档案库,在人工智能研究中得到了广泛应用。较新的模型用越来越多来源的数据对其进行补充,例如 Books3,一个被广泛使用的、由数千本书汇编而成的文集。但机器对文本的胃口增长速度是互联网无法比拟的。研究公司 Epoch AI 估计,到 2028 年,互联网上的高质量文本数据存量将被全部用尽。业内称之为“数据墙”。如何应对这堵墙是人工智能面临的迫在眉睫的重大问题之一,也许还是最有可能减缓其进步的问题。 crawl v. 爬行 supplement n. 补充物;附录;附加费 | v. 补充;增加 compilation n. 编辑;编撰;汇编物 loom v. 迫近;隐约可见 | n. 织布机;若隐若现 looming adj. 逼近的;迫在眉睫的 | v-ing. 迫近
 
Chart source: EpochAI
One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the order in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.
一个解决方案是专注于数据的质量而不是数量。人工智能实验室并不是简单地用整个互联网来训练模型。他们会过滤并编排数据,以最大化模型的学习效果。AI 公司 Databricks 的 Naveen Rao 表示,这是市面上各个 AI 模型之间的“主要区别”。关于世界的“真实信息”显然很重要;大量的“推理”也同样重要。比如,这使得学术教科书尤其有价值。但如何在各数据源之间取得平衡,仍然近乎一门“玄学”。此外,系统接触不同类型数据的先后顺序也很重要。如果把某个主题(比如数学)的所有数据都堆在训练过程的末尾,你的模型可能会精于数学,却忘掉其他一些概念。 differentiator n. 区别;微分器 textbook n. 教材;教科书;课本 what is more phrase. 此外;而且;更有甚者 lump n. 块;肿块;隆起 | v. 混在一起;结块;勉强接受 encounter v. 遭遇;遇到;邂逅 | n. 遭遇
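A toy illustration of why data ordering matters (the topic pools and document names below are hypothetical): compare a schedule that lumps one topic at the end of training with one that interleaves topics throughout.

```python
import random

# Hypothetical pools of training documents, grouped by topic.
pools = {
    "code":  [f"code_doc_{i}" for i in range(4)],
    "prose": [f"prose_doc_{i}" for i in range(4)],
    "maths": [f"maths_doc_{i}" for i in range(4)],
}

def lumped_order(pools):
    """All maths at the end: the model may specialise in maths and
    drift away from concepts it saw only early in training."""
    return pools["code"] + pools["prose"] + pools["maths"]

def interleaved_order(pools, seed=0):
    """Mix topics throughout, so every phase of training sees every topic."""
    docs = [doc for topic_docs in pools.values() for doc in topic_docs]
    random.Random(seed).shuffle(docs)
    return docs

print(lumped_order(pools)[-4:])       # training ends on maths alone
print(interleaved_order(pools)[-4:])  # training ends on a mixture
```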
 
These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.
当数据不仅涉及不同的主题,而且具有不同的形式时,这些考虑会变得更加复杂。部分由于缺乏新的文本数据,像 OpenAI 的 GPT-4o 和 Google 的 Gemini 这些领先模型,现在在自监督学习过程中不仅使用文本,还被放开去学习图像、视频和音频文件。由于视频文件中的数据点非常密集,视频训练是最难的。当前的模型通常只查看一部分帧,以简化处理。 dense adj. 密集的;稠密的;愚钝的
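Frame subsampling is simple to sketch. This hypothetical helper just picks evenly spaced frame indices; real systems vary in how they choose frames, but the principle is the one described above: train on a small, fixed-size subset of a dense video.

```python
def subsample_frames(num_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices so a dense video is represented
    by a small, fixed-size subset of its frames."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# A 10-second clip at 30fps has 300 frames; train on just 8 of them.
print(subsample_frames(300, 8))  # [0, 37, 75, 112, 150, 187, 225, 262]
```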
 
Whatever models are used, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”.
无论使用哪种模型,所有权都越来越被视为一个问题。用于训练 LLMs 的材料通常受版权保护,并且是在未经版权所有者同意、也未向其付费的情况下使用的。一些人工智能模型还会窥探付费墙后面的内容。模型创建者声称,这类行为属于美国版权法中的“合理使用”豁免。他们说,人工智能模型在学习时应该被允许阅读受版权保护的材料,就像人类一样。但正如技术分析师 Benedict Evans 所说,“规模上的差异”可能导致“原则上的差异”。 peep v. 偷看 | n. 窥视 exemption n. 豁免;免除 scale n. 规模;等级;刻度 | v. 改变 … 的大小; 攀登 fall under phrase. 被归入;受到(影响);落在 ... 下面
 

The best labellers can earn up to $100 an hour 最好的数据标注员每小时能赚到 100 美元

Different rights holders are adopting different tactics. Getty Images has sued Stability AI, an image-generation firm, for unauthorised use of its image store. The New York Times has sued OpenAI and Microsoft for copyright infringement of millions of articles. Other papers have struck deals to license their content. News Corp, owner of the Wall Street Journal, signed a deal worth $250m over five years. (The Economist has not taken a position on its relationship with AI firms.) Other sources of text and video are doing the same. Stack Overflow, a coding help-site, Reddit, a social-media site, and X (formerly Twitter) are now charging for access to their content for training.
不同的版权持有者采取了不同的策略。Getty Images 起诉图片生成公司 Stability AI 未经授权使用其图片库。The New York Times 起诉 OpenAI 和微软侵犯其数百万篇文章的版权。其他报纸则达成了内容授权协议。Wall Street Journal 的所有者 News Corp 签署了一份为期五年、价值 2.5 亿美元的协议。(The Economist 尚未就其与人工智能公司的关系表态。)其他文本和视频来源也在做同样的事情。编程帮助网站 Stack Overflow、社交媒体网站 Reddit 和 X(前身为 Twitter)现在都对将其内容用于训练收费。 tactics n. 策略;战术 sue v. 起诉;控告;请求 infringement n. 违反(法规、规章制度等);侵犯(合法权益)
 
The situation differs between jurisdictions. Japan and Israel have a permissive stance to promote their AI industries. The European Union has no generic “fair use” concept, so could prove stricter. Where markets are set up, different types of data will command different prices: models will need access to timely information from the real world to stay up to date.
不同司法管辖区的情况有所不同。Japan 和 Israel 持宽松立场,以促进本国的 AI 产业。European Union 没有通用的“合理使用”概念,因此可能会更加严格。在建立起市场的地方,不同类型的数据将卖出不同的价格:模型需要获得来自现实世界的及时信息,才能保持最新状态。 jurisdiction n. 管辖权;司法权;裁判权;管辖范围 permissive adj. 放任的;纵容的;许可的 stance n. 站姿 generic adj. 一般的;普通的;通用的 set up phrase. 建立;创立;设立;提出 timely adj. 适时的;及时的
 
Model capabilities can also be improved when the version produced by self-supervised learning, known as the pre-trained version, is refined through additional data in post-training. “Supervised fine-tuning”, for example, involves feeding a model question-and-answer pairs collected or handcrafted by humans. This teaches models what good answers look like. “Reinforcement-learning from human feedback” (RLHF), on the other hand, tells them if the answer satisfied the questioner (a subtly different matter).
当通过自监督学习产生的版本(称为预训练版本)在 post-training 中通过额外的数据进行完善时,模型能力也可以得到提高。例如,“Supervised fine-tuning”(监督微调)是向模型提供由人类收集或手工制作的问答对,这教会模型什么是好的答案。另一方面,“从人类反馈中进行强化学习”(RLHF)会告诉模型答案是否让提问者满意(一个略有不同的问题)。 reinforcement n. 增强;强化物 subtly adv. 巧妙地;微妙地;隐约地
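The shape of supervised fine-tuning data is easy to show. The example pairs below are hypothetical, but the pattern is as described: humans collect or handcraft question-and-answer pairs, which are rendered into text the model is then trained on.

```python
# Hypothetical question-answer pairs, of the kind labelling firms collect.
sft_examples = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris."},
    {"question": "Summarise photosynthesis in one sentence.",
     "answer": "Plants use sunlight to turn CO2 and water into sugar and oxygen."},
]

def to_training_text(example):
    """Render a pair into the single string the model is fine-tuned on.
    In practice the loss is usually applied only to the answer tokens."""
    return f"Q: {example['question']}\nA: {example['answer']}"

for ex in sft_examples:
    print(to_training_text(ex), end="\n\n")
```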
 
In RLHF, users give a model feedback on the quality of its outputs, which are then used to tweak the model’s parameters, or “weights”. User interactions with chatbots, such as a thumbs-up or -down, are especially useful for RLHF. This creates what techies call a “data flywheel”, in which more users lead to more data which feeds back into tuning a better model. AI startups are keenly watching what types of questions users ask their models, and then collecting data to tune their models on those topics.
在 RLHF 中,用户就模型输出的质量给出反馈,这些反馈随后被用来调整模型的参数,即“权重”。用户与聊天机器人的交互(例如点赞或点踩)对 RLHF 特别有用。这就形成了技术人员所说的“数据飞轮”:更多的用户带来更多的数据,这些数据又反过来帮助调出更好的模型。人工智能初创公司正敏锐地观察用户向他们的模型提出什么类型的问题,然后收集数据,针对这些主题调整他们的模型。 tweak v. 拧;扭 | n. 拧;扭;微调 parameter n. 参数;参量;决定因素;界限 flywheel n. 飞轮 techie n. 技术人员 keenly adv. 敏锐地;强烈地
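The data-collection side of that flywheel can be sketched in a few lines. The feedback log below is hypothetical; the point is that aggregated thumbs-up rates reveal which topics users ask about and where the model needs more tuning data.

```python
from collections import Counter

# Hypothetical feedback log: (prompt topic, answer id, thumbs up?).
feedback_log = [
    ("tax law", "answer_a", True),
    ("tax law", "answer_b", False),
    ("cooking", "answer_c", True),
    ("tax law", "answer_d", False),
]

def thumbs_up_rate_by_topic(log):
    """Aggregate feedback per topic: low-scoring topics show where to
    collect more tuning data, which improves the model, which draws
    more users and hence more data -- the 'flywheel'."""
    ups, totals = Counter(), Counter()
    for topic, _, thumbs_up in log:
        totals[topic] += 1
        ups[topic] += int(thumbs_up)
    return {topic: ups[topic] / totals[topic] for topic in totals}

print(thumbs_up_rate_by_topic(feedback_log))
# {'tax law': 0.3333333333333333, 'cooking': 1.0}
```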
 

Scale it up 扩大规模

As pre-training data on the internet dry up, post-training is becoming more important. Labelling companies such as Scale AI and Surge AI earn hundreds of millions of dollars a year collecting post-training data. Scale recently raised $1bn on a $14bn valuation. Things have moved on from the Mechanical Turk days: the best labellers earn up to $100 an hour. But, though post-training helps produce better models and is sufficient for many commercial applications, it is ultimately incremental.
随着互联网上 pre-training 数据的枯竭,post-training 变得更加重要。Scale AI 和 Surge AI 等数据标注公司每年通过收集 post-training 数据赚取数亿美元。Scale 最近以 140 亿美元的估值筹集了 10 亿美元。如今的情形已不同于 Mechanical Turk 的时代:最好的标注员每小时最多可赚 100 美元。但是,尽管 post-training 有助于产出更好的模型,并且足以满足许多商业应用的需求,它终究只是渐进式的改进。 dry up phrase. 干涸;枯竭 valuation n. 评估;评价;估值 ultimately adv. 最终;归根结底;根本上 incremental adj. 增长的;增加的
 
Rather than pushing the data wall back bit by bit, another solution would be to jump over it entirely. One approach is to use synthetic data, which are machine-created and therefore limitless. AlphaGo Zero, a model produced by DeepMind, a Google subsidiary, is a good example. The company’s first successful Go-playing model had been trained using data on millions of moves from amateur games. AlphaGo Zero used no pre-existing data. Instead it learned Go by playing 4.9m matches against itself over three days, noting the winning strategies. That “reinforcement learning” taught it how to respond to its opponent’s moves by simulating a large number of possible responses and choosing the one with the best chance of winning.
另一种解决方案不是一点一点地把数据墙向后推,而是干脆完全越过它。一种方法是使用合成数据,这些数据由机器生成,因此是无限的。谷歌子公司 DeepMind 开发的模型 AlphaGo Zero 就是一个很好的例子。该公司第一个成功的围棋对弈模型是使用数百万步业余比赛棋着的数据训练的。AlphaGo Zero 没有使用任何预先存在的数据。相反,它在三天内与自己对弈 490 万局来学习围棋,并记下获胜策略。这种“强化学习”教会了它如何应对对手的棋着:模拟大量可能的应对,并选择胜算最大的一种。 synthetic adj. 合成的;人造的;综合的 subsidiary adj. 辅助的;附属的;次要的 | n. 附属事物;附属公司
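Self-play reinforcement learning can be shown on a much smaller game. The sketch below is a hypothetical toy, using Nim rather than Go and a score table rather than a neural network, but the loop matches the description: play against yourself, note which moves appeared in winning games, and reinforce them.

```python
import random
from collections import defaultdict

# Toy stand-in for Go: Nim (take 1-3 stones; whoever takes the last wins).
# The "model" plays itself and reinforces moves that occurred in winning games.
move_scores = defaultdict(lambda: defaultdict(float))

def choose_move(stones, explore=0.1):
    legal = list(range(1, min(3, stones) + 1))
    if random.random() < explore:
        return random.choice(legal)               # occasional exploration
    return max(legal, key=lambda m: move_scores[stones][m])

for _ in range(20000):                            # self-play matches
    stones, history, player = 15, [], 0
    while stones > 0:
        move = choose_move(stones)
        history.append((player, stones, move))
        stones -= move
        player = 1 - player
    winner = 1 - player                           # whoever took the last stone
    for p, s, m in history:                       # note the winning strategies
        move_scores[s][m] += 1.0 if p == winner else -1.0

# Optimal play leaves a multiple of 4; from 15 stones that means taking 3.
print(choose_move(15, explore=0.0))
```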
 
A similar approach could be used for LLMs writing, say, a maths proof, step-by-step. An LLM might build up an answer by first generating many first steps. A separate “helper” AI, trained on data from human experts to judge quality, would identify which was best and worth building on. Such AI-produced feedback is a form of synthetic data, and can be used to further train the first model. Eventually you might have a higher-quality answer than if the LLM answered in one go, and an improved LLM to boot. This ability to improve the quality of output by taking more time to think is like the slower, deliberative “system 2” thinking in humans, as described in a recent talk by Andrej Karpathy, a co-founder of OpenAI. Currently, LLMs employ “system 1” thinking, generating a response without deliberation, similar to a human’s reflexive response.
类似的方法也可以用于 LLM 的写作任务,比如逐步写出一个数学证明。LLM 可以先生成许多个第一步,以此来构建答案;一个单独的“helper”AI(用人类专家的数据训练来判断质量)会识别出哪一步最好、值得在其基础上继续。这种由 AI 产生的反馈是一种合成数据,可以用来进一步训练第一个模型。最终,你得到的答案可能比 LLM 一次性作答的质量更高,而且还会顺带得到一个改进后的 LLM。这种通过花更多时间思考来提高产出质量的能力,类似于人类较慢的、深思熟虑的“系统 2”思维,正如 OpenAI 联合创始人 Andrej Karpathy 在最近一次演讲中所描述的那样。目前,LLM 采用的是“系统 1”思维,不经深思熟虑就生成回应,类似于人类的条件反射式反应。 reflexive adj. 条件反射的
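A best-of-n loop with a helper scorer is easy to sketch. Everything below is a stub (the candidate steps, the generator and the helper's scoring rule are all hypothetical), but it shows the described pattern: sample many first steps, let a helper pick the best one to build on, and keep the scored pairs as synthetic training data.

```python
import random

# Both models are stubbed: a "generator" proposing first steps of a proof,
# and a "helper" scoring them (standing in for a model trained on expert data).
CANDIDATE_STEPS = [
    "Assume n is even, so n = 2k for some integer k.",
    "The statement is obviously true.",
    "Consider the parity of n and treat each case separately.",
]

def generate_first_steps(n=8):
    """Stand-in for an LLM sampling several candidate first steps."""
    return [random.choice(CANDIDATE_STEPS) for _ in range(n)]

def helper_score(step: str) -> float:
    """Stand-in for a helper model's quality judgment."""
    return 0.0 if "obviously" in step else len(step) / 60.0

candidates = generate_first_steps()
best = max(candidates, key=helper_score)

# The scored pairs are synthetic data that can further train the generator.
synthetic_data = [(step, helper_score(step)) for step in candidates]
print("build on:", best)
```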
 
The difficulty is extending the approach to settings like health care or education. In gaming, there is a clear definition of winning and it is easier to collect data on whether a move is advantageous. Elsewhere it is trickier. Data on what is a “good” decision are typically collected from experts. But that is costly, takes time and is only a patchy solution. And how do you know if a particular expert is correct?
困难在于将这种方法扩展到医疗保健或教育等领域。在游戏中,胜利有明确的定义,收集有关某一举动是否有利的数据更容易。在其他地方,这更棘手。有关什么是“好”决定的数据通常是从专家那里收集的。但这种方法成本高昂,耗时长,而且只是一个不完整的解决方案。你如何知道某位专家是否正确? advantageous adj. 有利的;有好处的 tricky adj. 难办的;棘手的;狡猾的;诡计多端的 patchy adj. 有补丁的;时好时坏的
 
It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives. ■
显然,获取更多数据(无论是从专业来源筛选、合成还是由人类专家提供)是保持人工智能快速发展的关键。就像油田一样,最容易获取的数据储备已经枯竭。现在的挑战是寻找新的数据储备或可持续的替代品。■ cull v. 拣选;采集 sustainable adj. 可容忍的;可支持的;可持续的