Data labeling workers: training AI, replaced by AI

Growth and elimination occur simultaneously.

Author|Ma Hui

Editor|Chestnuts

Image source: Generated by Unbounded AI tool

**Opportunity and destruction coexist, and data labeling practitioners have never felt so conflicted.**

Dai Yan, a 30-year-old from Inner Mongolia, started his own business earlier this year and built an online labeling team of nearly 30 people. Before that, he spent two years doing data annotation on a crowdsourcing platform. By now a "skilled worker", he is both hopeful and nervous about the current situation.

He has been following ChatGPT since the beginning of the year. In the rapid growth of AI company registrations, Dai Yan saw both the explosion of the AI industry and an entrepreneurial opportunity in data labeling. **Tianyancha data shows that in the first quarter of this year alone, 170,000 artificial intelligence-related companies were newly registered, bringing the total to 2.67 million.**

He imagined riding the industry's growth until his company reached 100 people. **But the current situation hardly supports that expectation: the data labeling circle is about to be broken open. Labeling demand, labeling workers and middlemen will all pour in together, and unit prices will fall.**

Just as a construction crew cannot reach the Party A (the client) that actually needs the work done and can only take projects from a contractor, **the pay Dai Yan is offered gets lower each time a project changes hands.** He has turned down labeling projects that would pay him only 30 yuan a day.

At the same time, **Dai Yan faces the awkward realities of the labeling industry: no career advancement, no contract protection, and no recourse when payment is delayed.** He laughs at himself: "We are the data migrant workers of the new era."

But that's not the whole story. **The bigger problem is that automated labeling is eating into the few projects they have.** The AI trained by data labelers like Dai Yan is learning to label data itself under human supervision.

Automated labeling will greatly reduce enterprise costs and has become the most promising direction in the data labeling market.

Dai Yan has to prepare for the possibility that "AI may completely replace people". He has his team working on two kinds of projects at once: teaching-aid annotation, which falls under text annotation, and 3D point cloud annotation. One is text; the other is image and video. His plan is that if one kind of project is taken over by AI, he will immediately move the team to the other.

The team will also have to shrink. Dai Yan has crossed out the 100-person company he once imagined. He believes that in the end he may be able to keep only an experienced team of 20.

**The AI that data labelers trained lets them dream of earning more, while forcing them to plan for being displaced.**

1. Labeling: letting AI open its eyes to see the world

For machines to understand text, speech and pictures the way humans do, people have built a machine learning pipeline: collect images and sounds from the physical world, label and clean the data, then convert it into code and feed it to the machine.

AI scholars argue that by the age of three, a child has "shot" hundreds of millions of pictures through their eyes, repeatedly making sense of the world. So, the thinking goes, as long as enough data is fed into a machine, it too can learn to read words and recognize sentences, and eventually grasp the deeper meaning behind language.

The labeled image dataset ImageNet contains 15 million pictures. It has helped countless AI companies achieve breakthroughs in computer vision, such as face recognition and image search.

To build ImageNet, nearly 50,000 data labelers from 167 countries worked for two and a half years. They all came from the crowdsourcing platform Mechanical Turk (MTurk).

The labeling requirements were very simple. A typical MTurk job was to identify the colors in a photo, classify the animals appearing in an image, or draw boxes around objects and label them: this is a cake, this is a car, this is a cloud, and so on.

Image/Integer Intelligence

The platform's 200,000 part-time workers are spread across Africa and Southeast Asia, where labor costs are low, and have even formed distinctive "data annotation villages". The data they label supports technology companies' exploration of AI.

In China, millions of annotators are spread across second- and third-tier cities in Guizhou, Shanxi, Shandong, Henan and other provinces, and the work is gradually moving into counties with even lower labor costs. They either work through online crowdsourcing platforms or join offline data labeling companies and labeling bases.

Annotation work is divided by scenario into text, image and voice, which help machines learn to read, recognize pictures and understand speech, respectively.

Early annotation projects were concentrated in Internet companies and mainly involved voice and text. Now the work is shifting to self-driving companies, which need labels for 3D scenes captured by lidar, such as point cloud labeling, and to more vertical text and voice labeling: teaching-aid annotation data for education companies' large models, or collated medical data for medical institutions' large models.

As AI enters its 2.0 era, ChatGPT has amazed investors and entrepreneurs. Expectations for AI no longer stop at mechanically recognizing text, voice and images. People also hope that AI can truly understand the connections between things as humans do, recognize subtle differences and the emotions behind actions, and actively sift and collect information.

For example, a self-driving car should be able to tell that the object ahead is an empty plastic bag rather than a stone of similar color and size; a camera beside a swimming pool should not just record what happens by the pool but understand it, and raise an alert when someone is drowning.

All of this still depends on data annotation, and it raises the bar for annotation: more vertical, more accurate, and more economical.

The labeling market's boom started from here.

2. "Too many orders to keep up with"

It is hard to find data that directly shows the surge in new annotation demand, but it is not hard to infer. In the first quarter of 2023 alone, China added 170,000 artificial intelligence companies, and any company that uses AI is bound to need data labeling.

The demand quickly spread to the data labeling market. In the online forum ("post bar") where data annotation practitioners gather, more than a dozen new project recruitment posts appear each day, covering text annotation, topic review, drone sales video annotation, 2D detection rod, 3D point cloud, and text-to-image and video annotation, among others.

A data labeler with many years in the industry has noticed that autonomous-vehicle labeling projects have increased this year. Meanwhile, the wave of vertical large-model startups spawned by the AI 2.0 boom has split the once-declining text labeling business into different tracks and increased demand for niche data labeling.

Driven by this demand, Dai Yan is not the only one setting up a new team to pan for gold. Zhang Wei, from Dongying, Shandong Province, also threw himself into data labeling at the end of last year and grew a small team of more than a dozen people within half a year. With subsidies and support from the local government, Zhang Wei's company not only got a free office; the government also helped connect it with Party A clients.

Orders are plentiful, growing from an initial project worth a little over 100,000 yuan to a latest order of 400,000 yuan. Urgent delivery deadlines have pushed Zhang Wei to recruit labelers more actively: a few days ago he bought six more computers in a single day.

In Zhengzhou, Henan, a data annotation crowdsourcing platform is moving into a two-story office building that can hold 100 people. The company's positioning is written on the signboard at the door and inside the office: "AI artificial intelligence big data research and development base" and "Repeated data cleaning is for making your AI smarter".

"There are more labeling orders than we can finish," the person in charge said.

The relocation ceremony of a data labeling company

Image source/provided by interviewees

Hot money has long since flowed into labeling companies' pockets. According to market data, the share price of Haitian AAC, the leading company, rose by as much as 4 times from March to May this year.

According to 36Kr, since the beginning of this year more than a dozen data labeling platforms at Series B or earlier have collectively seen their valuations rise by nearly 100%. Since the second half of last year, automated labeling companies have been closing new financing rounds one after another.

In September 2022, Borden Intelligence received 10 million yuan in financing; in December, Stardust Data completed a 50 million yuan Series A round, four and a half years after its previous financing in June 2018.

In April 2023, the data labeling solution company "Kaiwang Data" received a new round of strategic financing; in June, the AI data company "Integer Intelligence" received tens of millions in Pre-A round financing.

Full of enthusiasm, they roll out slogans about replacing manual labeling: "Reconstruct data label production", "Automated production line + large-scale manpower", "Break the manual mode of automatic driving labeling".

Clearly, the capital market is once again paying attention to this emerging field.

3. Fiercer competition, stricter standards

The chain of data labeling consists of three parts.

Upstream: data labeling companies with 1 to 150 employees, individual online freelancers and small workshops.

Midstream: data service providers. One type is the intermediary crowdsourcing platform that connects upstream and downstream; the other is the labeling base that an enterprise builds for itself as a stable, long-term investment in the industry.

Downstream: technology companies, industry companies, AI companies and scientific research institutions. Internet companies dominated around 2018; demand has since shifted to car makers and autonomous driving companies.

The industry generally adopts a subcontracting model: the Party A company issues a tender and third-party service providers bid. A winning bidder enters the company's supplier tiers, and core suppliers enjoy priority in choosing tasks and receive more orders.

To qualify as a core supplier, a provider needs a delivery team of at least 30 people, mature order delivery experience, an established training system, and the ability to control delivery quality and quantity. A stable production team ultimately allows a lower quotation, which makes the provider more competitive.

But the price advantage that comes from a well-managed team has been disrupted. "This year's bidding is fierce!" a service provider told "Jiazi Guangnian". "We bid 200 yuan for a project, and some people bid 80 yuan a day."

The project went to the team with the lowest bid, but in the end it came back to the more mature team. "When they couldn't finish it, Party A transferred it back to us, but the price couldn't go back up."

Dai Yan's online team has no direct contact with Party A, so the chaos of multi-level subcontracting and layer-by-layer price cuts in the market puts them under pressure.

Data labeling is a resource-driven industry: whoever can secure cooperation with Party A has the advantage. Dai Yan said that some individuals register a company, falsely claim to have a professional team of 40 to 50 people, and bid at a very low price. After winning the project, they split it into 4 or 5 parts and hand them to different teams. Those teams divide the work further, and a commission is taken at each layer. The middlemen earn the difference, and the piece rate that reaches the data labeling workers gets lower and lower.

As long as someone is willing to take the job, the price keeps spiraling downward.

A price list obtained by "Jiazi Guangnian" shows that, from 2D labeling to 3D laser point cloud labeling, the unit price is generally 0.5 to 1.5 yuan per frame. Dai Yan once received a per-frame price that had been cut by 50%: "It had changed hands at least four or five times."
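The layer-by-layer commission described above compounds. The sketch below is only an illustration of that arithmetic: it assumes every middleman keeps the same share, which the article does not state, and the function names and the uniform rate are hypothetical. The only figures taken from the text are the 50% cut and the "four or five hands".

```python
def price_after_handoffs(base_price: float, commission_rate: float, handoffs: int) -> float:
    """Price left after each of `handoffs` middlemen keeps `commission_rate` of what they receive.

    Assumes a uniform commission per layer (a simplification, not a reported fact).
    """
    return base_price * (1.0 - commission_rate) ** handoffs


def implied_commission_rate(remaining_fraction: float, handoffs: int) -> float:
    """Uniform per-layer commission that leaves `remaining_fraction` of the price after `handoffs` layers."""
    return 1.0 - remaining_fraction ** (1.0 / handoffs)


if __name__ == "__main__":
    # A 50% cut spread evenly over four or five handoffs (hypothetical even split).
    for hands in (4, 5):
        rate = implied_commission_rate(0.5, hands)
        print(f"{hands} handoffs: about {rate:.1%} kept at each layer")
```

Under this assumption, a price is halved if each of four or five intermediaries keeps somewhere between roughly 13% and 16%, so no single layer needs to take a large cut for the labeler's piece rate to collapse.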

**The involution of unit prices directly shrinks labelers' pay.** Dai Yan's team is a mix of part-time and full-time workers. Most members are mothers, college students, freelancers and vocational high school students, and they work 6 hours a day. On that schedule, Dai Yan earned 4,000 to 5,000 yuan a month during the epidemic in 2022.

"If you have a computer and electricity, you can do it." This is a common pitch on data labeling recruitment posters. It was once the industry's most significant advantage. But today that same advantage has dragged the entire industry into involution. Dai Yan's monthly income is now only 2,000 to 3,000 yuan.

While incomes have fallen, workloads have not. On the contrary, data labeling work has become more complex and detailed.

Veteran practitioners miss the annotation market of the Internet era: the per-frame price was three times higher and projects were plentiful. A team of 60 to 70 people could bring in 300,000 yuan a month. "Now the market is full of projects with an output value (the value generated by one person per day) of less than 100 yuan. It used to be several hundred yuan a day," one practitioner said.

Back then, projects were simple and requirements loose. When labeling 2D scenes for autonomous vehicles, for example, a box drawn around a vehicle in the picture was acceptable as long as it enclosed the vehicle.

**But things are different now. "Fit" is Party A's most important acceptance criterion.** "Last year the allowed error was 5-7 mm, and this year it is 3-5 mm. The tolerance keeps getting smaller," Dai Yan said.

Artificial intelligence scholar Andrew Ng (Wu Enda) has repeatedly emphasized that the value of artificial intelligence can only be released with high-quality labeled data. The more high-quality data there is, the faster artificial intelligence will develop.

In autonomous-vehicle labeling data, quality shows up as the degree of fit between the rectangular box and the labeled object. The tighter the fit, the more accurate the algorithm, and the more precisely it can control the vehicle.
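The article does not say how Party A scores "fit". One common measure in computer vision is intersection over union (IoU), the overlap between a drawn box and a reference box divided by the area they cover together. The sketch below uses it purely as an illustration; the `Box` type, the function names and the sample coordinates are assumptions, not the acceptance rule of any company mentioned here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (illustrative type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(drawn: Box, reference: Box) -> float:
    """Intersection over union: 1.0 is a perfect fit, 0.0 is no overlap."""
    overlap_w = max(0.0, min(drawn.x_max, reference.x_max) - max(drawn.x_min, reference.x_min))
    overlap_h = max(0.0, min(drawn.y_max, reference.y_max) - max(drawn.y_min, reference.y_min))
    intersection = overlap_w * overlap_h
    union = area(drawn) + area(reference) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    reference = Box(100, 100, 200, 180)   # hypothetical outline of a vehicle
    loose = Box(90, 90, 215, 195)         # box drawn with generous margins
    tight = Box(99, 100, 201, 181)        # box drawn close to the outline
    print(f"loose box IoU: {iou(loose, reference):.2f}")
    print(f"tight box IoU: {iou(tight, reference):.2f}")
```

A looser box scores lower than a tight one, which is the intuition behind tolerances that shrink from year to year.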

For text annotation, quality is reflected in the correctness of semantic understanding and the accuracy of answers. The higher the accuracy, the smarter the large model being trained.

Skilled hands can deliver data both quickly and well. Dai Yan once let a novice take part in checking whether math solutions written by ChatGPT were complete, whether the logic was correct, and whether the language could be understood by elementary school students. Party A demanded that the 7,500 items the novice labeled be reworked because the accuracy was too low. It took Dai Yan and his colleagues more than ten days to correct them.

Data labeling is less and less a job without a threshold. Complex voice annotation and the production of datasets in medicine, law, finance and other specialized fields require professionals with the relevant subject knowledge.

Taking autonomous-vehicle projects as an example, Dai Yan believes it takes a newcomer 3 months to become proficient in 2D labeling and 4 to 6 months in 3D labeling.

This practice means training the accuracy of box drawing: using the mouse to draw, in one stroke, a rectangular box on the labeling page that exactly covers the labeled object, without crossing its outline, without missing any points, and ideally with no gap at all.

Image/Data annotation experts point out problems in an annotation

But when machines begin to learn on their own and take over labeling from humans, are the skills people spent so long training still meaningful?

4. The replacement crisis

Dai Yan realized AI was closing in while working on an image annotation project some time ago.

It was an old project Dai Yan had been working on for two years: recognizing text in images. Data labelers had to recognize the text in a picture and type it out, at a price of 8 cents per piece. The data Dai Yan labeled was fed into an image recognition model, and the model is now proficient at recognizing text in images. Dai Yan's labeling work was reduced to revision and review. The difficulty has dropped, and so has the unit price.

**AI trained on human labels is replacing human labeling work.** In a research report from the University of Zurich, researchers found in tests that ChatGPT outperformed crowdsourced workers on 15 labeling tasks. **The embedding of large models into crowdsourcing platforms has also sped up.** A subsequent study from the Swiss Federal Institute of Technology in Lausanne (EPFL) found that more than 30% of crowdsourced annotators were already using large models when doing text annotation.

AI is undoubtedly faster and cheaper than manual labor: the researchers said ChatGPT's unit cost is only about 1/20 of MTurk's.

Dai Yan is prepared for this line of business to be replaced by a "more perfect AI" at any time. He has bet his future on self-driving labeling, which demands more skill.

But autonomous driving labeling is also being invaded by AI. Unlike manual box drawing, automatic labeling needs only a built-in large model: once the parameters are set, the rectangular boxes that used to be drawn by hand are generated automatically. The only problem at present is that the generated boxes have quality issues such as crossing object outlines and poor fit, so they must be checked manually one by one.

The efficiency gains have surprised car companies. Ideal (Li Auto) is using large model 2.0 for automatic calibration, which is 1,000 times more efficient than humans. Tesla has been actively pushing automatic labeling: in June 2022 it laid off 200 US employees who labeled videos to improve Tesla's driver assistance system, because its automatic labeling capability had improved greatly. Labeling 10,000 videos of less than 60 seconds each takes a large model only a week, instead of months of manual work.

Lin Qunshu, founder of the AI data company Integer Intelligence, said that more and more car companies and AIGC companies are using large-model products for automatic labeling, and that the company's revenue is rising significantly. Its latest move is to set up a research and development branch in Singapore.

**However, third-party service providers are less optimistic about automated labeling.** The project manager of a crowdsourcing platform in Henan said that more than 60% of labeling requirements cannot be replaced by automated labeling, which can serve only as an auxiliary tool for processing single or specific types of data and improving human efficiency.

The product manager of another data labeling company believes that automatic labeling can only filter simple, basic data and cannot accurately identify objects in complex and ambiguous scenes the way humans can. This is also why the data labeling market is still dominated by autonomous driving labeling data.

However, everyone agrees that data labeling will shift from manpower to technology.

In short, providers will be "squeezed to death" either by peers or by technology. Sitting still is not an option, and third-party data labeling companies are all looking for a way out.

Dai Yan's plan is to keep up with the market, stay vigilant, be ready to cut staff at any time, and at the same time move toward automated labeling tools. The founder of a crowdsourcing platform told peers that in the future companies should not simply pile on manpower but must have research and development capabilities.

What about individuals? The career path circulated in the industry runs from novice labeler to experienced labeler, to labeling project administrator or manager, to data analyst at a Party A company, finally reaching a monthly salary in the tens of thousands.

None of the data labelers Dai Yan knows has followed this path. They have either stayed where they were or quit. The best case is to build a labeling team of one's own, as Dai Yan did, but he does not feel any more at ease.

On one side is the growth in project demand brought by the AI wave; on the other are more chaotic bidding, lower per-capita output value, and rapidly improving AI. The two emotions are intertwined: AI will bring infinite opportunities, and AI will also eliminate "us".

(At the request of the interviewees, the names in the article are all pseudonyms)
