
An Overview of Ray


One of the reasons we need efficient distributed computing is that we're collecting ever more data, in greater variety, at increasing speed. The storage systems, data processing, and analytics engines that have emerged over the last decade are crucially important to the success of many companies. Interestingly, most "big data" technologies are built for and operated by (data) engineers who are in charge of data collection and processing tasks. The rationale is to free up data scientists to do what they're best at. As a data science practitioner, you might want to focus on training complex machine learning models, running efficient hyperparameter selection, building entirely new and custom models or simulations, or serving your models to showcase them.

At the same time, you may find it inevitable to scale these workloads to a compute cluster. To do that, the distributed system of your choice needs to support all of these fine-grained "big compute" tasks, potentially on specialized hardware. Ideally, it also fits into the big data tool chain you're using and is fast enough to meet your latency requirements. In other words, distributed computing has to be powerful and flexible enough for complex data science workloads, and Ray can help you with that.

Python is likely the most popular language for data science today, and it's certainly the one we find the most useful for our daily work. By now it's over 30 years old, but it still has a growing and active community. The rich [PyData ecosystem](https://pydata.org/) is an essential part of a data scientist's toolbox. How can you make sure to scale out your workloads while still leveraging the tools you need? That's a difficult problem, especially since you can't force a community to simply toss out its toolbox or its programming language. That means distributed computing tools for data science have to be built for their existing community.

Every chapter of this book has an executable notebook that you can run. If you want to run the code while following this chapter, you can run this notebook locally or directly in Colab.


For this chapter you need to install the following dependencies. We'll point out each dependency when you need it later on, but if you're impatient you can install them all right now:

! pip install "ray[rllib, serve, tune]==2.2.0"
! pip install "pyarrow==10.0.0"
! pip install "tensorflow>=2.9.0"
! pip install "transformers>=4.24.0"
! pip install "pygame==2.1.2" "gym==0.25.0"

To import utility files for this chapter on Colab you will also have to clone the repo and copy the code files to the base path of the runtime. You don't need to do this if you run the notebook locally, of course.

!git clone https://github.com/maxpumperla/learning_ray
%cp -r learning_ray/notebooks/* .

What is Ray?

Ray is a flexible distributed computing framework built for the Python data science community, and it is designed to be easy to use and understand. It allows you to efficiently parallelize Python programs on your own computer and to run them on a cluster with little to no modification. Its high-level libraries are easy to configure and can be used together seamlessly, and some of them, such as the reinforcement learning library, have a promising future as standalone projects. While its core is written in C++, Ray has been a Python-first framework from the start and integrates well with many important data science tools. It also has an expanding ecosystem.

Ray is not the first framework for distributed Python, nor will it be the last, but it stands out for its ability to handle custom machine learning tasks with ease. Its various modules work well together, allowing for the flexible execution of complex workloads using familiar Python tools. This book aims to teach you how to use Ray to bring distributed Python to your machine learning workflows.

Programming distributed systems can be challenging because it requires specific skills and experience. While these systems are designed to be efficient and allow users to focus on their tasks, they often have "leaky abstractions" that can make it difficult to get clusters of computers to work as desired. In addition, many software systems require more resources than a single server can provide, and modern systems need to be able to handle failures and offer high availability. This means that applications may need to run on multiple machines or even in different data centers in order to function reliably.

Even if you are not very familiar with machine learning (ML) or artificial intelligence (AI), you have probably heard about recent advances in these fields. Some examples include DeepMind's AlphaFold, a system for solving the protein folding problem, and OpenAI's Codex, which helps software developers with the tedious parts of their job. It is well known that ML systems require a lot of data to be trained and that ML models keep getting larger. OpenAI has shown in its "AI and Compute" analysis that the amount of computing power needed to train AI models has been growing exponentially: measured in petaflop/s-days (a petaflop is a thousand trillion operations per second), the compute used in the largest training runs has doubled roughly every 3.4 months since 2012.

Whereas Moore's Law says that the number of transistors in computer chips doubles roughly every two years, distributed computing can speed up machine learning workloads well beyond that. Distributed computing is often seen as hard, so it would be valuable to have abstractions that let your code run on clusters without having to constantly think about individual machines and how they interact. By focusing specifically on AI workloads, it may be possible to make distributed computing more accessible and efficient.

Researchers at RISELab at UC Berkeley created Ray to speed up their workloads by distributing them. These workloads were flexible in nature and did not fit well into existing frameworks. Ray was also designed to take care of distributing the work, so that researchers could focus on their research without worrying about the specifics of their compute cluster. It was built for high-performance and heterogeneous workloads, and it lets researchers keep using the Python tools they prefer.

Design Philosophy

Ray was created with certain design principles in mind. Its API aims to be simple and broadly applicable, its compute model strives for flexibility, and its system architecture is designed for performance and scalability. Let's look at each of these in turn.

Simplicity and abstraction

Ray's API is not only simple to use, but it is also intuitive and easy to learn, as you will see in Chapter 2. Whether you want to use all the CPU cores on your laptop or leverage all the machines in your cluster, you can do so with minimal changes to your code. Ray handles task distribution and coordination behind the scenes, allowing you to focus on your work rather than worrying about the mechanics of distributed computing. Additionally, the API is very flexible and can easily be integrated with other tools. For example, Ray actors can interact with other distributed Python workloads, making it a useful "glue code" for connecting different systems and frameworks.
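To make this concrete, here is a minimal sketch of Ray's core task API (the `square` function is just an illustration, not part of Ray): an ordinary Python function becomes a distributed task via the `@ray.remote` decorator, and the same code runs on your laptop or on a cluster.

```python
import ray

ray.init()  # starts a local Ray instance; on a cluster you would connect to it instead

@ray.remote
def square(x):
    # An ordinary Python function, turned into a distributed task by the decorator.
    return x * x

# .remote() schedules tasks asynchronously and returns object references;
# ray.get() blocks until the results are available.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```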

Flexibility and heterogeneity

Ray's API is created to allow users to easily write flexible and modular code for artificial intelligence tasks, especially those involving reinforcement learning. As long as the workload can be expressed in Python, it can be distributed using Ray. However, it is important to ensure that sufficient resources are available and to consider what should be distributed. Ray does not impose any limitations on what can be done with it.
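For stateful workloads, Ray complements tasks with actors. The following is a small sketch, using a made-up `Counter` class, of how an arbitrary Python class can be turned into a stateful worker that runs in its own process:

```python
import ray

ray.init(ignore_reinit_error=True)  # reuse a running Ray instance if there is one

@ray.remote
class Counter:
    # An ordinary Python class, run as a stateful Ray actor in its own process.
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()  # starts the actor
# Method calls become remote calls on the actor and execute in submission order.
refs = [counter.increment.remote() for _ in range(3)]
print(ray.get(refs))  # [1, 2, 3]
```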

Ray is able to handle a variety of different computational tasks. For example, when working on a complex simulation, it is common for there to be different steps that take different amounts of time to complete. Some may take a long time while others only take a few milliseconds. Ray is able to efficiently schedule and execute these tasks, even if they need to be run in parallel. Additionally, Ray's framework allows for dynamic execution, which is helpful when subsequent tasks depend on the outcome of an earlier task. Overall, Ray provides flexibility in managing heterogeneous workflows.
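As a rough sketch of such a dynamic workflow (the `simulate` and `aggregate` functions are hypothetical placeholders): passing the object reference returned by one task into another builds the dependency graph on the fly, and Ray schedules the downstream task only once its inputs are ready.

```python
import random
import time

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def simulate(step):
    # Individual steps can take wildly different amounts of time.
    time.sleep(random.uniform(0.001, 0.5))
    return step * 2

@ray.remote
def aggregate(*results):
    return sum(results)

# Object references passed as arguments express dependencies between tasks;
# aggregate() only runs once all simulate() tasks have finished.
refs = [simulate.remote(i) for i in range(10)]
print(ray.get(aggregate.remote(*refs)))  # 90
```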

It is important to be able to adapt your resource usage and Ray allows for the use of different types of hardware. For example, certain tasks may require the use of a GPU while others may perform better on a CPU. Ray gives you the ability to choose the most appropriate hardware for each task.
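In Ray, hardware requirements are declared per task or actor through resource annotations. Here is a brief sketch, assuming a cluster that actually has GPUs available (the function names and bodies are illustrative):

```python
import ray

@ray.remote(num_cpus=2)
def preprocess_on_cpu(batch):
    # Only scheduled on a node with two free CPU cores.
    return [x / 255.0 for x in batch]

@ray.remote(num_gpus=1)
def train_on_gpu(data):
    # Only scheduled where a GPU is available; Ray sets CUDA_VISIBLE_DEVICES
    # so that frameworks like TensorFlow or PyTorch pick up the assigned device.
    ...
```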

Speed and scalability

One of the key features of Ray is its speed. It can handle millions of tasks per second with minimal latency, making it an efficient choice for distributed systems. Additionally, Ray is effective at distributing and scheduling tasks across a compute cluster, and it does so in a way that is fault-tolerant. Its auto-scaler can adjust the number of machines in the cluster to match current demand, which helps to minimize costs and ensure there are enough resources available to run workloads. In the event of failures, Ray is designed to recover quickly, further contributing to its overall speed. While we will delve into the specifics of Ray's architecture later on, for now, let's focus on how it can be used in practice.