Uber’s Fiber Is A New Distributed AI Model Training Framework

According to VentureBeat, AI researchers at Uber recently posted a paper to arXiv outlining a new platform intended to assist in the creation of distributed AI models. The platform is called Fiber, and it can be used to drive both reinforcement learning and population-based learning tasks. Fiber is designed to make large-scale parallel computation more accessible to non-experts, letting them take advantage of the power of distributed AI algorithms and models.

Fiber has recently been open-sourced on GitHub. It is compatible with Python 3.6 or above and with Kubernetes running on Linux in a cloud environment. According to the team of researchers, the platform can easily scale up to hundreds or thousands of individual machines.

The team of researchers from Uber explains that many of the most recent and relevant advances in artificial intelligence have been driven by larger models and more algorithms that are trained using distributed training techniques. However, creating population-based models and reinforcement learning models remains difficult with distributed training schemes, because such schemes frequently struggle with efficiency and flexibility. Fiber makes the distributed system more reliable and flexible by combining cluster management software with dynamic scaling, letting users move their jobs from one machine to a large number of machines seamlessly.

Fiber is made up of three layers: an API layer, a backend layer, and a cluster layer. The API layer enables users to create things like queues, managers, and processes. The backend layer lets the user create and terminate jobs that are being managed by different clusters, and the cluster layer manages the individual clusters themselves along with their resources, which greatly reduces the number of items that Fiber has to keep tabs on.
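The division of labor between the three layers can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Fiber's actual code: the class names `ClusterLayer`, `Backend`, and `Process` and all of their methods are hypothetical, chosen only to show how an API-level process object could delegate job creation to a backend, which in turn delegates resource tracking to the cluster.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ClusterLayer:
    """Hypothetical stand-in for a cluster manager that tracks running jobs."""
    name: str
    running: Dict[int, str] = field(default_factory=dict)
    _next_id: int = 0

    def launch(self, command: str) -> int:
        # The cluster, not the framework, records what is running where.
        job_id = self._next_id
        self._next_id += 1
        self.running[job_id] = command
        return job_id

    def kill(self, job_id: int) -> None:
        self.running.pop(job_id, None)


class Backend:
    """Hypothetical backend: creates and terminates jobs on one cluster."""

    def __init__(self, cluster: ClusterLayer) -> None:
        self.cluster = cluster

    def create_job(self, command: str) -> int:
        return self.cluster.launch(command)

    def terminate_job(self, job_id: int) -> None:
        self.cluster.kill(job_id)


class Process:
    """Hypothetical API-layer object: behaves like a process, backed by a job."""

    def __init__(self, backend: Backend, command: str) -> None:
        self.backend = backend
        self.command = command
        self.job_id: Optional[int] = None

    def start(self) -> None:
        self.job_id = self.backend.create_job(self.command)

    def terminate(self) -> None:
        if self.job_id is not None:
            self.backend.terminate_job(self.job_id)
            self.job_id = None


cluster = ClusterLayer("local")
worker = Process(Backend(cluster), "python train.py")
worker.start()
print(cluster.running)  # the cluster layer now tracks one job
worker.terminate()
print(cluster.running)  # and none after termination
```

In this sketch the user only ever touches `Process`; swapping in a different cluster would mean swapping the object passed to `Backend`, which is the kind of separation the layered design is meant to provide.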

Fiber enables jobs to be queued and run either on one local machine or remotely on many different machines, utilizing the concept of job-backed processes. Fiber also makes use of containers to ensure things like input data and dependent packages are self-contained. The Fiber framework even includes built-in error handling so that if a worker crashes it can be quickly revived. Fiber is able to do all of this while interacting with cluster managers, letting Fiber apps run as if they were normal apps running on a given computer cluster.
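The crash-recovery idea can be illustrated with a small, single-process simulation. This is a sketch of the general pattern (keep a task pending until it succeeds and hand it to a fresh worker if the first one dies), not a description of how Fiber implements it; `WorkerCrashed`, `run_with_revival`, and `max_attempts` are hypothetical names introduced here.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class WorkerCrashed(Exception):
    """Hypothetical signal that the worker handling a task has died."""


def run_with_revival(
    tasks: List[int],
    worker: Callable[[int], int],
    max_attempts: int = 3,
) -> List[int]:
    """Run every task, re-queuing any task whose worker crashed."""
    pending: Deque[Tuple[int, int]] = deque(enumerate(tasks))
    attempts: Dict[int, int] = {}
    results: Dict[int, int] = {}

    while pending:
        index, task = pending.popleft()
        attempts[index] = attempts.get(index, 0) + 1
        try:
            results[index] = worker(task)
        except WorkerCrashed:
            if attempts[index] >= max_attempts:
                raise  # give up after repeated crashes on the same task
            # "Revive": put the task back so a fresh worker picks it up.
            pending.append((index, task))

    # Return results in the original task order.
    return [results[i] for i in range(len(tasks))]


crashed_once = set()


def flaky_square(x: int) -> int:
    """Simulated worker that crashes the first time it sees the value 3."""
    if x == 3 and x not in crashed_once:
        crashed_once.add(x)
        raise WorkerCrashed(f"worker died on task {x}")
    return x * x


print(run_with_revival([1, 2, 3, 4], flaky_square))  # prints [1, 4, 9, 16]
```

The caller sees a complete, ordered result list even though one worker failed along the way, which is the behavior the built-in error handling is described as providing.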

Experimental results showed that Fiber's average response time was a few milliseconds and that it scaled better than baseline techniques when run with 2,048 processor cores/workers. The time required to complete jobs decreased gradually as the number of workers increased. IPyParallel completed 50 iterations of training in approximately 1,400 seconds, while Fiber was able to complete the same 50 iterations of training in approximately 50 seconds with 512 workers available.
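To put the two timings side by side, the short calculation below works out the implied speedup and per-iteration cost. It uses only the approximate figures quoted above, so the results are rough estimates rather than measured values.

```python
# Approximate figures quoted in the article (50 training iterations each).
ITERATIONS = 50
ipyparallel_seconds = 1400.0
fiber_seconds = 50.0

# Ratio of total run times: how many times faster Fiber finished.
speedup = ipyparallel_seconds / fiber_seconds

print(f"Fiber speedup over IPyParallel: {speedup:.0f}x")                     # 28x
print(f"IPyParallel: {ipyparallel_seconds / ITERATIONS:.0f} s per iteration")  # 28 s
print(f"Fiber: {fiber_seconds / ITERATIONS:.0f} s per iteration")              # 1 s
```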

The coauthors of the Fiber paper explain that Fiber achieves multiple goals, such as dynamically scaling algorithms and making use of large volumes of computing power:

“[Our work shows] that Fiber achieves many goals, including efficiently leveraging a large amount of heterogeneous computing hardware, dynamically scaling algorithms to improve resource usage efficiency, reducing the engineering burden required to make [reinforcement learning] and population-based algorithms work on computer clusters, and quickly adapting to different computing environments to improve research efficiency. We expect it will further enable progress in solving hard [reinforcement learning] problems with [reinforcement learning] algorithms and population-based methods by making it easier to develop these methods and train them at the scales necessary to truly see them shine.”